a6c03a091e
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
210 lines
8.3 KiB
Markdown
210 lines
8.3 KiB
Markdown
# GNM — Garda Națională de Mediu (Hand-off Plan)
|
||
|
||
**Status:** PARTIAL. Source publishes only aggregate stats. We capture the
|
||
publicly-named violators (the headline cases) — full per-CUI fines history is
|
||
NOT available without an OUG 109/2007 access-to-info request.
|
||
|
||
## Sources investigated (2026-05-10)
|
||
|
||
| Source | URL | Verdict |
|
||
|---|---|---|
|
||
| gnm.ro homepage | https://www.gnm.ro/ | Only links to PDFs + press releases |
|
||
| Annual reports | gnm.ro/rapoarte-si-note-de-activitate/ | Aggregate stats only — `raport_activitate_<an>.pdf` (2012-2024) |
|
||
| Monthly synthesis | gnm.ro/wp-content/.../sinteza_<luna>_<an>.pdf | 3-page PDFs: per-judet TOTALS only, no per-firm rows |
|
||
| Press releases | gnm.ro/noutati/ | ~358 articles, ~10% enforcement, sporadic firm names |
|
||
| RSS feed | gnm.ro/feed/?paged=N | Same articles, structured XML, 36 pages × 10 items |
|
||
| data.gov.ro `q=mediu` | 45 datasets | Air quality, IPPC, SEVESO inventories — **no fines dataset** |
|
||
| ANPM rapoarte | anpm.ro | IPPC/SEVESO only (already covered by other agents) |
|
||
|
||
**Why per-CUI is impossible:** GNM is exempt from the open-data registry
|
||
obligation (OUG 109/2007). They cite "secret de serviciu" + "operatori
|
||
economici personali date" for not publishing the contravention register. The
|
||
only legal path is per-firm FOIA requests.
|
||
|
||
## Schema applied
|
||
|
||
`services/seap-scraper/sql/037_gnm.sql` — three tables in schema `gnm`:
|
||
|
||
| Table | Purpose | Rows after first run |
|
||
|---|---|---|
|
||
| `gnm.comunicate` | Raw archive of every press release (RSS) | **348** |
|
||
| `gnm.amenzi_extrase` | Regex-extracted (firm, fine_lei) tuples | **1** (after dedup) |
|
||
| `gnm.scrape_log` | Run history (mirrors anre/ancom) | 4 |
|
||
|
||
`is_enforcement` flag = 36/348 (10.3%) of articles match the
|
||
`/amenz|sancțiun|sistare|confiscat|sesizare penal/i` filter.
|
||
|
||
## Files added
|
||
|
||
```
|
||
services/seap-scraper/sql/037_gnm.sql (130 lines)
|
||
services/seap-scraper/src/scrape-gnm.ts (~440 lines)
|
||
services/seap-scraper/cron/scrape-gnm.sh ( 90 lines)
|
||
services/seap-scraper/GNM-PLAN.md (this file)
|
||
```
|
||
|
||
## Sample ingest stats
|
||
|
||
First full backfill (2026-05-10):
|
||
|
||
```
|
||
seen=348 inserted=348 updated=0 skipped=0
|
||
enforcement=36 violators=2 → 1 after dedup
|
||
duration=58s
|
||
```
|
||
|
||
After running the Stage-B fuzzy matcher against `firms.entities`:
|
||
|
||
```
|
||
gnm.amenzi_extrase id=1
|
||
contravenient_name = "Retim Ecologic Service SA"
|
||
contravenient_cui = 9112229 (RETIM ECOLOGIC SERVICE SA, jud. BIHOR)
|
||
cui_match_score = 1.0
|
||
suma_lei = 150000
|
||
context = "Depozitul de Deșeuri Nepericuloase Ghizela, operat de
|
||
Retim Ecologic Service SA. Operatorul a fost
|
||
sancționat cu 150.000 lei amendă..."
|
||
```
|
||
|
||
## Realistic yield estimate
|
||
|
||
Press-release named violators per year ≈ 50-200 firms (out of ~5,000 actual
|
||
fines). Coverage = 1-4%. Acceptable trade-off: the firms that appear in press
|
||
releases are the **biggest** offenders (refineries, large landfills, mining
|
||
operators) — exactly the firms most likely to also win SEAP contracts. The
|
||
tail is invisible but the top of the distribution is captured.
|
||
|
||
## Cross-source SQL recipes
|
||
|
||
### 1. Firms with GNM environmental fines that win SEAP construction contracts
|
||
|
||
```sql
|
||
-- Environmental violators winning state contracts.
|
||
-- Construction CPV codes start with 45; mining/extraction CPV 14/77.
|
||
SELECT
|
||
ge.contravenient_cui,
|
||
ge.contravenient_name,
|
||
ge.suma_lei AS gnm_amenda_lei,
|
||
ge.fapta,
|
||
c.titlu AS gnm_articol,
|
||
c.publicat_la AS gnm_data,
|
||
COUNT(DISTINCT a.id) AS seap_contracte_castigate,
|
||
SUM(a.contract_value_lei) AS seap_valoare_totala_lei,
|
||
STRING_AGG(DISTINCT LEFT(a.cpv_code, 2), ',') AS seap_cpv_prefixes
|
||
FROM gnm.amenzi_extrase ge
|
||
JOIN gnm.comunicate c ON ge.comunicat_id = c.id
|
||
LEFT JOIN seap.announcements a ON a.supplier_cui = ge.contravenient_cui
|
||
AND a.cpv_code LIKE '45%' -- construction
|
||
WHERE ge.contravenient_cui IS NOT NULL
|
||
GROUP BY ge.contravenient_cui, ge.contravenient_name, ge.suma_lei, ge.fapta,
|
||
c.titlu, c.publicat_la
|
||
HAVING COUNT(DISTINCT a.id) > 0
|
||
ORDER BY ge.suma_lei DESC NULLS LAST;
|
||
```
|
||
|
||
### 2. EU funds POIM-Mediu beneficiaries with GNM fines (the double-irony)
|
||
|
||
```sql
|
||
-- POIM = Programul Operațional Infrastructură Mare (Mediu axis).
|
||
-- A firm that receives EU money for environmental projects WHILE being fined
|
||
-- by GNM for environmental violations is the headline scandal pattern.
|
||
SELECT
|
||
ge.contravenient_cui,
|
||
ge.contravenient_name,
|
||
ge.suma_lei AS gnm_amenda_lei,
|
||
ge.fapta AS gnm_fapta,
|
||
fb.proiect_titlu AS eu_proiect,
|
||
fb.valoare_eligibila_eur AS eu_valoare_eur,
|
||
fb.program_finantator AS eu_program
|
||
FROM gnm.amenzi_extrase ge
|
||
JOIN fonduri.beneficiar_proiect fb ON fb.beneficiar_cui = ge.contravenient_cui
|
||
WHERE ge.contravenient_cui IS NOT NULL
|
||
AND fb.program_finantator ILIKE '%POIM%' -- or ILIKE '%mediu%' for broader
|
||
ORDER BY fb.valoare_eligibila_eur DESC NULLS LAST;
|
||
```
|
||
|
||
### 3. Top GNM violators sorted by total fines mentioned across press releases
|
||
|
||
```sql
|
||
SELECT
|
||
contravenient_cui,
|
||
MIN(contravenient_name) AS firma,
|
||
COUNT(*) AS nr_mentions,
|
||
SUM(suma_lei) AS total_amenzi_lei,
|
||
STRING_AGG(DISTINCT judet, ', ') AS judete_implicate
|
||
FROM gnm.amenzi_extrase
|
||
WHERE contravenient_cui IS NOT NULL
|
||
GROUP BY contravenient_cui
|
||
ORDER BY total_amenzi_lei DESC NULLS LAST
|
||
LIMIT 50;
|
||
```
|
||
|
||
## Stage-B fuzzy matcher
|
||
|
||
The scraper stores `contravenient_name_norm` but leaves `contravenient_cui`
|
||
NULL. To populate CUIs, run the following after each scrape (idempotent —
|
||
only updates rows where CUI is NULL):
|
||
|
||
```sql
|
||
WITH unmatched AS (
|
||
SELECT id, contravenient_name_norm
|
||
FROM gnm.amenzi_extrase
|
||
WHERE contravenient_cui IS NULL AND contravenient_name_norm IS NOT NULL
|
||
)
|
||
UPDATE gnm.amenzi_extrase a
|
||
SET contravenient_cui = m.cui,
|
||
cui_match_method = 'fuzzy_name',
|
||
cui_match_score = m.score,
|
||
matched_at = now()
|
||
FROM (
|
||
SELECT u.id, f.cui,
|
||
similarity(u.contravenient_name_norm, firms.normalize_company_name(f.name)) AS score
|
||
FROM unmatched u
|
||
CROSS JOIN LATERAL (
|
||
SELECT cui, name
|
||
FROM firms.entities
|
||
WHERE firms.normalize_company_name(name) % u.contravenient_name_norm
|
||
ORDER BY similarity(firms.normalize_company_name(name), u.contravenient_name_norm) DESC
|
||
LIMIT 1
|
||
) f
|
||
) m
|
||
WHERE a.id = m.id AND m.score >= 0.85;
|
||
```
|
||
|
||
## Operational guidance
|
||
|
||
* **Cron schedule:** weekly (Sundays 03:00) — RSS rarely changes, ~5-10 new
|
||
articles per week. Use `SINCE_DAYS=14` for incremental runs after the first
|
||
full backfill.
|
||
* **Rate limits:** gnm.ro returns `RateLimit-Limit: 100/min`, `1000/hr`. We use
|
||
~36 requests per full scrape with 800 ms sleep — well within budget.
|
||
* **Idempotency:** `gnm.comunicate` UPSERTs on `guid` (WordPress post ID,
|
||
immutable). Skip when `raw_hash` unchanged. Re-extraction wipes only the
|
||
child rows for changed articles.
|
||
* **404 on page 36:** harmless — currently 35.8 pages so we 404 on the trailing
|
||
empty fetch. Captured by retry loop, exits cleanly.
|
||
|
||
## Future enhancements (not in this hand-off)
|
||
|
||
1. **OCR of monthly synthesis PDFs** — IF in future they add per-judet tabular
|
||
detail (currently 3 pages, totals only, OCR adds nothing).
|
||
2. **Annual report PDF** has more granular judet × sector breakdowns
|
||
(waste / air / water / biodiversity) — could add a second extractor for
|
||
`gnm.amenzi_per_judet_sector` aggregates.
|
||
3. **Local press archives** (e.g. monitoruldebuzau.ro, focuspress.ro) often
|
||
name specific firms when GNM does press conferences regionally — could
|
||
harvest via a curated whitelist of regional outlets that beat-cover GNM.
|
||
Estimated +50-100 named firms/year. Risk: licensing.
|
||
4. **FOIA submissions** via the `gnm@gnm.ro` legea-544 path — could request
|
||
the contravention register annually. Civic-tech precedent: prefectura.ro
|
||
data was successfully unblocked this way in 2024.
|
||
|
||
## Time spent
|
||
|
||
~75 minutes:
|
||
- 20 min investigation (gnm.ro / data.gov.ro / RSS reconnaissance)
|
||
- 5 min schema design + apply
|
||
- 35 min scraper write + 3 iterations to tune the regex extractors
|
||
- 5 min Stage-B fuzzy match validation
|
||
- 10 min documentation
|