Files
vreau-digital/chatGPT/data-quality/geocoding-strategy-2026-05-11.md
T
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00

63 lines
4.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Geocoding strategy — firms.entities
Data: 2026-05-11. Sub-agent A2.
## Final coverage
| Source | Rows | Accuracy | Notes |
|---|---:|---|---|
| `geonames_postal` | 2,128,990 | ~100m2km | Exact 5/6-digit RO postal match against geonames RO.zip (firms.postal_codes). |
| `photon` | 839,643 | ~50500m | Komoot Photon OSM geocoder, free-text `adr_full`. Earlier batch (services/seap-scraper/src/geocode-photon.ts). |
| `uat_centroid` | 670,657 | 530km | UAT polygon centroid match by locality+county. |
| `judet_centroid` | 346,675 | 30150km | Median of all postal codes within the judet. Filled the 2026-05-11 gap where `judet_fallback` was tagged but lat/lng never written. |
| `seap_siruta_centroid` | 4,681 | 530km | NEW stub rows for SEAP-only CUIs (not present in ONRC firme dataset) using SIRUTA → gis_uats centroid. |
| `seap_judet_centroid` | 2,497 | 30150km | NEW stub rows for SEAP-only CUIs with city/county data in seap.cui_location. |
| _unmapped_ | 2 | — | Two firms with literally zero address fields. Out of reach. |
**Total: 3,993,143 / 3,993,145 = 100.00 %.**
## Fallback chain (priority order)
For any new row entering firms.entities, apply in this order, stop at first hit:
1. **Postal-code exact match**`firms.postal_codes.postal_code = adr_cod_postal` (5/6 digit). Source = `geonames_postal`.
2. **Postal-code normalized** (strip non-digit), same lookup. (Adds ~9K to the bucket — already covered in current dataset.)
3. **Photon free-text** on `adr_full` (OSM geocoder, requires network — see geocode-photon.ts).
4. **UAT centroid** by `(adr_localitate, adr_judet)``firms.postal_codes` median of matching place_name + county_code, OR `public.gis_uats` polygon centroid.
5. **Judet centroid** — median of all `firms.postal_codes` rows for the normalized judet name (`upper(unaccent(replace(adr_judet,'MUNICIPIUL ','')))`). 42 distinct judet keys cover all of RO + București.
6. **SIRUTA centroid** — for SEAP-mentioned CUIs only, where firms.entities row didn't exist: `seap.announcements.{authority,supplier}_siruta``gis_uats.siruta` centroid (transformed 3844→4326).
7. **City+county from seap.cui_location** → judet centroid fallback (`seap_judet_centroid`).
## Authority / supplier coverage (downstream)
After backfill, JOIN-based coverage from SEAP:
| Bucket | Total distinct CUIs | Geocoded | Pct |
|---|---:|---:|---:|
| authority_cui | 14,617 | 14,119 | 96.6 % |
| supplier_cui | 65,675 | 64,793 | 98.7 % |
Residual: 498 authorities + 882 suppliers (~1,373 unique) — these CUIs appear nowhere with address data (no siruta, no city/county in seap.cui_location, no usable address in any announcement). Most are malformed CUI strings (commas, semicolons, trailing punctuation) — should be cleaned up at SEAP ingestion. Out of scope for geocoding.
## Cross-schema enrichment
- `aaas.firme` — 11 rows total, all 11 have geocoded parent in firms.entities via CUI. No action needed; UI agents JOIN.
- `anre.licente` — 27,275 rows with titular_cui populated, 11,043 distinct. All 11,043 CUIs match a geocoded firm. UI agents JOIN on `firms.entities.cui = anre.licente.titular_cui`.
- `seap.announcements``supplier_address`, `authority_address`, `supplier_siruta`, `authority_siruta` are populated. After this batch, almost every announcement can render on a map via firms.entities lookup.
## Geom integrity
- `firms.entities.geom` (geography 4326) is now 1:1 with lat/lng (12,735 prior mismatches fixed where judet_fallback had stale geom from an older run).
- 2 unmapped firms have NULL on both. PostGIS spatial indexes still valid.
## Forward maintenance
1. Anyone ingesting new firms (ANAF/ONRC weekly refresh) must apply the fallback chain in code before INSERT.
2. The seap_siruta_centroid and seap_judet_centroid stubs should be **upgraded** the moment an ANAF/ONRC record arrives for the same CUI — re-run the chain with the real `adr_full`.
3. If the SEAP CUI hygiene gets fixed (A1's domain), the 1,373 residual can be re-attempted.
4. `judet_centroid` (and the two seap variants) have only `geocode_score = 0.1` and `0.3`. UI clustering should down-weight or hide these at high zoom.
## Queries used
All idempotent UPDATEs filtered on `lat IS NULL`. Centroid sources read from `firms.postal_codes` and `public.gis_uats` (SRID 3844 → 4326). Saved in-line in the agent transcript; the strategy itself is the artifact.