Files
vreau-digital/chatGPT/data-quality/geocoding-strategy-2026-05-11.md
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00

4.3 KiB
Raw Permalink Blame History

Geocoding strategy — firms.entities

Data: 2026-05-11. Sub-agent A2.

Final coverage

Source Rows Accuracy Notes
geonames_postal 2,128,990 ~100m2km Exact 5/6-digit RO postal match against geonames RO.zip (firms.postal_codes).
photon 839,643 ~50500m Komoot Photon OSM geocoder, free-text adr_full. Earlier batch (services/seap-scraper/src/geocode-photon.ts).
uat_centroid 670,657 530km UAT polygon centroid match by locality+county.
judet_centroid 346,675 30150km Median of all postal codes within the judet. Filled the 2026-05-11 gap where judet_fallback was tagged but lat/lng never written.
seap_siruta_centroid 4,681 530km NEW stub rows for SEAP-only CUIs (not present in ONRC firme dataset) using SIRUTA → gis_uats centroid.
seap_judet_centroid 2,497 30150km NEW stub rows for SEAP-only CUIs with city/county data in seap.cui_location.
unmapped 2 Two firms with literally zero address fields. Out of reach.

Total: 3,993,143 / 3,993,145 = 100.00 %.

Fallback chain (priority order)

For any new row entering firms.entities, apply in this order, stop at first hit:

  1. Postal-code exact matchfirms.postal_codes.postal_code = adr_cod_postal (5/6 digit). Source = geonames_postal.
  2. Postal-code normalized (strip non-digit), same lookup. (Adds ~9K to the bucket — already covered in current dataset.)
  3. Photon free-text on adr_full (OSM geocoder, requires network — see geocode-photon.ts).
  4. UAT centroid by (adr_localitate, adr_judet)firms.postal_codes median of matching place_name + county_code, OR public.gis_uats polygon centroid.
  5. Judet centroid — median of all firms.postal_codes rows for the normalized judet name (upper(unaccent(replace(adr_judet,'MUNICIPIUL ','')))). 42 distinct judet keys cover all of RO + București.
  6. SIRUTA centroid — for SEAP-mentioned CUIs only, where firms.entities row didn't exist: seap.announcements.{authority,supplier}_sirutagis_uats.siruta centroid (transformed 3844→4326).
  7. City+county from seap.cui_location → judet centroid fallback (seap_judet_centroid).

Authority / supplier coverage (downstream)

After backfill, JOIN-based coverage from SEAP:

Bucket Total distinct CUIs Geocoded Pct
authority_cui 14,617 14,119 96.6 %
supplier_cui 65,675 64,793 98.7 %

Residual: 498 authorities + 882 suppliers (~1,373 unique) — these CUIs appear nowhere with address data (no siruta, no city/county in seap.cui_location, no usable address in any announcement). Most are malformed CUI strings (commas, semicolons, trailing punctuation) — should be cleaned up at SEAP ingestion. Out of scope for geocoding.

Cross-schema enrichment

  • aaas.firme — 11 rows total, all 11 have geocoded parent in firms.entities via CUI. No action needed; UI agents JOIN.
  • anre.licente — 27,275 rows with titular_cui populated, 11,043 distinct. All 11,043 CUIs match a geocoded firm. UI agents JOIN on firms.entities.cui = anre.licente.titular_cui.
  • seap.announcementssupplier_address, authority_address, supplier_siruta, authority_siruta are populated. After this batch, almost every announcement can render on a map via firms.entities lookup.

Geom integrity

  • firms.entities.geom (geography 4326) is now 1:1 with lat/lng (12,735 prior mismatches fixed where judet_fallback had stale geom from an older run).
  • 2 unmapped firms have NULL on both. PostGIS spatial indexes still valid.

Forward maintenance

  1. Anyone ingesting new firms (ANAF/ONRC weekly refresh) must apply the fallback chain in code before INSERT.
  2. The seap_siruta_centroid and seap_judet_centroid stubs should be upgraded the moment an ANAF/ONRC record arrives for the same CUI — re-run the chain with the real adr_full.
  3. If the SEAP CUI hygiene gets fixed (A1's domain), the 1,373 residual can be re-attempted.
  4. judet_centroid (and the two seap variants) have only geocode_score = 0.1 and 0.3. UI clustering should down-weight or hide these at high zoom.

Queries used

All idempotent UPDATEs filtered on lat IS NULL. Centroid sources read from firms.postal_codes and public.gis_uats (SRID 3844 → 4326). Saved in-line in the agent transcript; the strategy itself is the artifact.