Files
vreau-digital/services/seap-scraper/GNM-PLAN.md
T
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00

210 lines
8.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# GNM — Garda Națională de Mediu (Hand-off Plan)
**Status:** PARTIAL. Source publishes only aggregate stats. We capture the
publicly-named violators (the headline cases) — full per-CUI fines history is
NOT available without an OUG 109/2007 access-to-info request.
## Sources investigated (2026-05-10)
| Source | URL | Verdict |
|---|---|---|
| gnm.ro homepage | https://www.gnm.ro/ | Only links to PDFs + press releases |
| Annual reports | gnm.ro/rapoarte-si-note-de-activitate/ | Aggregate stats only — `raport_activitate_<an>.pdf` (2012-2024) |
| Monthly synthesis | gnm.ro/wp-content/.../sinteza_<luna>_<an>.pdf | 3-page PDFs: per-judet TOTALS only, no per-firm rows |
| Press releases | gnm.ro/noutati/ | ~358 articles, ~10% enforcement, sporadic firm names |
| RSS feed | gnm.ro/feed/?paged=N | Same articles, structured XML, 36 pages × 10 items |
| data.gov.ro `q=mediu` | 45 datasets | Air quality, IPPC, SEVESO inventories — **no fines dataset** |
| ANPM rapoarte | anpm.ro | IPPC/SEVESO only (already covered by other agents) |
**Why per-CUI is impossible:** GNM is exempt from the open-data registry
obligation (OUG 109/2007). They cite "secret de serviciu" + "operatori
economici personali date" for not publishing the contravention register. The
only legal path is per-firm FOIA requests.
## Schema applied
`services/seap-scraper/sql/037_gnm.sql` — three tables in schema `gnm`:
| Table | Purpose | Rows after first run |
|---|---|---|
| `gnm.comunicate` | Raw archive of every press release (RSS) | **348** |
| `gnm.amenzi_extrase` | Regex-extracted (firm, fine_lei) tuples | **1** (after dedup) |
| `gnm.scrape_log` | Run history (mirrors anre/ancom) | 4 |
`is_enforcement` flag = 36/348 (10.3%) of articles match the
`/amenz|sancțiun|sistare|confiscat|sesizare penal/i` filter.
## Files added
```
services/seap-scraper/sql/037_gnm.sql (130 lines)
services/seap-scraper/src/scrape-gnm.ts (~440 lines)
services/seap-scraper/cron/scrape-gnm.sh ( 90 lines)
services/seap-scraper/GNM-PLAN.md (this file)
```
## Sample ingest stats
First full backfill (2026-05-10):
```
seen=348 inserted=348 updated=0 skipped=0
enforcement=36 violators=2 → 1 after dedup
duration=58s
```
After running the Stage-B fuzzy matcher against `firms.entities`:
```
gnm.amenzi_extrase id=1
contravenient_name = "Retim Ecologic Service SA"
contravenient_cui = 9112229 (RETIM ECOLOGIC SERVICE SA, jud. BIHOR)
cui_match_score = 1.0
suma_lei = 150000
context = "Depozitul de Deșeuri Nepericuloase Ghizela, operat de
Retim Ecologic Service SA. Operatorul a fost
sancționat cu 150.000 lei amendă..."
```
## Realistic yield estimate
Press-release named violators per year ≈ 50-200 firms (out of ~5,000 actual
fines). Coverage = 1-4%. Acceptable trade-off: the firms that appear in press
releases are the **biggest** offenders (refineries, large landfills, mining
operators) — exactly the firms most likely to also win SEAP contracts. The
tail is invisible but the top of the distribution is captured.
## Cross-source SQL recipes
### 1. Firms with GNM environmental fines that win SEAP construction contracts
```sql
-- Environmental violators winning state contracts.
-- Construction CPV codes start with 45; mining/extraction CPV 14/77.
SELECT
ge.contravenient_cui,
ge.contravenient_name,
ge.suma_lei AS gnm_amenda_lei,
ge.fapta,
c.titlu AS gnm_articol,
c.publicat_la AS gnm_data,
COUNT(DISTINCT a.id) AS seap_contracte_castigate,
SUM(a.contract_value_lei) AS seap_valoare_totala_lei,
STRING_AGG(DISTINCT LEFT(a.cpv_code, 2), ',') AS seap_cpv_prefixes
FROM gnm.amenzi_extrase ge
JOIN gnm.comunicate c ON ge.comunicat_id = c.id
LEFT JOIN seap.announcements a ON a.supplier_cui = ge.contravenient_cui
AND a.cpv_code LIKE '45%' -- construction
WHERE ge.contravenient_cui IS NOT NULL
GROUP BY ge.contravenient_cui, ge.contravenient_name, ge.suma_lei, ge.fapta,
c.titlu, c.publicat_la
HAVING COUNT(DISTINCT a.id) > 0
ORDER BY ge.suma_lei DESC NULLS LAST;
```
### 2. EU funds POIM-Mediu beneficiaries with GNM fines (the double-irony)
```sql
-- POIM = Programul Operațional Infrastructură Mare (Mediu axis).
-- A firm that receives EU money for environmental projects WHILE being fined
-- by GNM for environmental violations is the headline scandal pattern.
SELECT
ge.contravenient_cui,
ge.contravenient_name,
ge.suma_lei AS gnm_amenda_lei,
ge.fapta AS gnm_fapta,
fb.proiect_titlu AS eu_proiect,
fb.valoare_eligibila_eur AS eu_valoare_eur,
fb.program_finantator AS eu_program
FROM gnm.amenzi_extrase ge
JOIN fonduri.beneficiar_proiect fb ON fb.beneficiar_cui = ge.contravenient_cui
WHERE ge.contravenient_cui IS NOT NULL
AND fb.program_finantator ILIKE '%POIM%' -- or ILIKE '%mediu%' for broader
ORDER BY fb.valoare_eligibila_eur DESC NULLS LAST;
```
### 3. Top GNM violators sorted by total fines mentioned across press releases
```sql
SELECT
contravenient_cui,
MIN(contravenient_name) AS firma,
COUNT(*) AS nr_mentions,
SUM(suma_lei) AS total_amenzi_lei,
STRING_AGG(DISTINCT judet, ', ') AS judete_implicate
FROM gnm.amenzi_extrase
WHERE contravenient_cui IS NOT NULL
GROUP BY contravenient_cui
ORDER BY total_amenzi_lei DESC NULLS LAST
LIMIT 50;
```
## Stage-B fuzzy matcher
The scraper stores `contravenient_name_norm` but leaves `contravenient_cui`
NULL. To populate CUIs, run the following after each scrape (idempotent —
only updates rows where CUI is NULL):
```sql
WITH unmatched AS (
SELECT id, contravenient_name_norm
FROM gnm.amenzi_extrase
WHERE contravenient_cui IS NULL AND contravenient_name_norm IS NOT NULL
)
UPDATE gnm.amenzi_extrase a
SET contravenient_cui = m.cui,
cui_match_method = 'fuzzy_name',
cui_match_score = m.score,
matched_at = now()
FROM (
SELECT u.id, f.cui,
similarity(u.contravenient_name_norm, firms.normalize_company_name(f.name)) AS score
FROM unmatched u
CROSS JOIN LATERAL (
SELECT cui, name
FROM firms.entities
WHERE firms.normalize_company_name(name) % u.contravenient_name_norm
ORDER BY similarity(firms.normalize_company_name(name), u.contravenient_name_norm) DESC
LIMIT 1
) f
) m
WHERE a.id = m.id AND m.score >= 0.85;
```
## Operational guidance
* **Cron schedule:** weekly (Sundays 03:00) — RSS rarely changes, ~5-10 new
articles per week. Use `SINCE_DAYS=14` for incremental runs after the first
full backfill.
* **Rate limits:** gnm.ro returns `RateLimit-Limit: 100/min`, `1000/hr`. We use
~36 requests per full scrape with 800 ms sleep — well within budget.
* **Idempotency:** `gnm.comunicate` UPSERTs on `guid` (WordPress post ID,
immutable). Skip when `raw_hash` unchanged. Re-extraction wipes only the
child rows for changed articles.
* **404 on page 36:** harmless — currently 35.8 pages so we 404 on the trailing
empty fetch. Captured by retry loop, exits cleanly.
## Future enhancements (not in this hand-off)
1. **OCR of monthly synthesis PDFs** — IF in future they add per-judet tabular
detail (currently 3 pages, totals only, OCR adds nothing).
2. **Annual report PDF** has more granular judet × sector breakdowns
(waste / air / water / biodiversity) — could add a second extractor for
`gnm.amenzi_per_judet_sector` aggregates.
3. **Local press archives** (e.g. monitoruldebuzau.ro, focuspress.ro) often
name specific firms when GNM does press conferences regionally — could
harvest via a curated whitelist of regional outlets that beat-cover GNM.
Estimated +50-100 named firms/year. Risk: licensing.
4. **FOIA submissions** via the `gnm@gnm.ro` legea-544 path — could request
the contravention register annually. Civic-tech precedent: prefectura.ro
data was successfully unblocked this way in 2024.
## Time spent
~75 minutes:
- 20 min investigation (gnm.ro / data.gov.ro / RSS reconnaissance)
- 5 min schema design + apply
- 35 min scraper write + 3 iterations to tune the regex extractors
- 5 min Stage-B fuzzy match validation
- 10 min documentation