# Geocoding strategy — firms.entities Data: 2026-05-11. Sub-agent A2. ## Final coverage | Source | Rows | Accuracy | Notes | |---|---:|---|---| | `geonames_postal` | 2,128,990 | ~100m–2km | Exact 5/6-digit RO postal match against geonames RO.zip (firms.postal_codes). | | `photon` | 839,643 | ~50–500m | Komoot Photon OSM geocoder, free-text `adr_full`. Earlier batch (services/seap-scraper/src/geocode-photon.ts). | | `uat_centroid` | 670,657 | 5–30km | UAT polygon centroid match by locality+county. | | `judet_centroid` | 346,675 | 30–150km | Median of all postal codes within the judet. Filled the 2026-05-11 gap where `judet_fallback` was tagged but lat/lng never written. | | `seap_siruta_centroid` | 4,681 | 5–30km | NEW stub rows for SEAP-only CUIs (not present in ONRC firme dataset) using SIRUTA → gis_uats centroid. | | `seap_judet_centroid` | 2,497 | 30–150km | NEW stub rows for SEAP-only CUIs with city/county data in seap.cui_location. | | _unmapped_ | 2 | — | Two firms with literally zero address fields. Out of reach. | **Total: 3,993,143 / 3,993,145 = 100.00 %.** ## Fallback chain (priority order) For any new row entering firms.entities, apply in this order, stop at first hit: 1. **Postal-code exact match** → `firms.postal_codes.postal_code = adr_cod_postal` (5/6 digit). Source = `geonames_postal`. 2. **Postal-code normalized** (strip non-digit), same lookup. (Adds ~9K to the bucket — already covered in current dataset.) 3. **Photon free-text** on `adr_full` (OSM geocoder, requires network — see geocode-photon.ts). 4. **UAT centroid** by `(adr_localitate, adr_judet)` → `firms.postal_codes` median of matching place_name + county_code, OR `public.gis_uats` polygon centroid. 5. **Judet centroid** — median of all `firms.postal_codes` rows for the normalized judet name (`upper(unaccent(replace(adr_judet,'MUNICIPIUL ','')))`). 42 distinct judet keys cover all of RO + București. 6. **SIRUTA centroid** — for SEAP-mentioned CUIs only, where firms.entities row didn't exist: `seap.announcements.{authority,supplier}_siruta` → `gis_uats.siruta` centroid (transformed 3844→4326). 7. **City+county from seap.cui_location** → judet centroid fallback (`seap_judet_centroid`). ## Authority / supplier coverage (downstream) After backfill, JOIN-based coverage from SEAP: | Bucket | Total distinct CUIs | Geocoded | Pct | |---|---:|---:|---:| | authority_cui | 14,617 | 14,119 | 96.6 % | | supplier_cui | 65,675 | 64,793 | 98.7 % | Residual: 498 authorities + 882 suppliers (~1,373 unique) — these CUIs appear nowhere with address data (no siruta, no city/county in seap.cui_location, no usable address in any announcement). Most are malformed CUI strings (commas, semicolons, trailing punctuation) — should be cleaned up at SEAP ingestion. Out of scope for geocoding. ## Cross-schema enrichment - `aaas.firme` — 11 rows total, all 11 have geocoded parent in firms.entities via CUI. No action needed; UI agents JOIN. - `anre.licente` — 27,275 rows with titular_cui populated, 11,043 distinct. All 11,043 CUIs match a geocoded firm. UI agents JOIN on `firms.entities.cui = anre.licente.titular_cui`. - `seap.announcements` — `supplier_address`, `authority_address`, `supplier_siruta`, `authority_siruta` are populated. After this batch, almost every announcement can render on a map via firms.entities lookup. ## Geom integrity - `firms.entities.geom` (geography 4326) is now 1:1 with lat/lng (12,735 prior mismatches fixed where judet_fallback had stale geom from an older run). - 2 unmapped firms have NULL on both. PostGIS spatial indexes still valid. ## Forward maintenance 1. Anyone ingesting new firms (ANAF/ONRC weekly refresh) must apply the fallback chain in code before INSERT. 2. The seap_siruta_centroid and seap_judet_centroid stubs should be **upgraded** the moment an ANAF/ONRC record arrives for the same CUI — re-run the chain with the real `adr_full`. 3. If the SEAP CUI hygiene gets fixed (A1's domain), the 1,373 residual can be re-attempted. 4. `judet_centroid` (and the two seap variants) have only `geocode_score = 0.1` and `0.3`. UI clustering should down-weight or hide these at high zoom. ## Queries used All idempotent UPDATEs filtered on `lat IS NULL`. Centroid sources read from `firms.postal_codes` and `public.gis_uats` (SRID 3844 → 4326). Saved in-line in the agent transcript; the strategy itself is the artifact.