initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
This commit is contained in:
@@ -0,0 +1,62 @@
|
||||
# Geocoding strategy — firms.entities
|
||||
|
||||
Data: 2026-05-11. Sub-agent A2.
|
||||
|
||||
## Final coverage
|
||||
|
||||
| Source | Rows | Accuracy | Notes |
|
||||
|---|---:|---|---|
|
||||
| `geonames_postal` | 2,128,990 | ~100m–2km | Exact 5/6-digit RO postal match against geonames RO.zip (firms.postal_codes). |
|
||||
| `photon` | 839,643 | ~50–500m | Komoot Photon OSM geocoder, free-text `adr_full`. Earlier batch (services/seap-scraper/src/geocode-photon.ts). |
|
||||
| `uat_centroid` | 670,657 | 5–30km | UAT polygon centroid match by locality+county. |
|
||||
| `judet_centroid` | 346,675 | 30–150km | Median of all postal codes within the judet. Filled the 2026-05-11 gap where `judet_fallback` was tagged but lat/lng never written. |
|
||||
| `seap_siruta_centroid` | 4,681 | 5–30km | NEW stub rows for SEAP-only CUIs (not present in ONRC firme dataset) using SIRUTA → gis_uats centroid. |
|
||||
| `seap_judet_centroid` | 2,497 | 30–150km | NEW stub rows for SEAP-only CUIs with city/county data in seap.cui_location. |
|
||||
| _unmapped_ | 2 | — | Two firms with literally zero address fields. Out of reach. |
|
||||
|
||||
**Total: 3,993,143 / 3,993,145 = 100.00 %.**
|
||||
|
||||
## Fallback chain (priority order)
|
||||
|
||||
For any new row entering firms.entities, apply in this order, stop at first hit:
|
||||
|
||||
1. **Postal-code exact match** → `firms.postal_codes.postal_code = adr_cod_postal` (5/6 digit). Source = `geonames_postal`.
|
||||
2. **Postal-code normalized** (strip non-digit), same lookup. (Adds ~9K to the bucket — already covered in current dataset.)
|
||||
3. **Photon free-text** on `adr_full` (OSM geocoder, requires network — see geocode-photon.ts).
|
||||
4. **UAT centroid** by `(adr_localitate, adr_judet)` → `firms.postal_codes` median of matching place_name + county_code, OR `public.gis_uats` polygon centroid.
|
||||
5. **Judet centroid** — median of all `firms.postal_codes` rows for the normalized judet name (`upper(unaccent(replace(adr_judet,'MUNICIPIUL ','')))`). 42 distinct judet keys cover all of RO + București.
|
||||
6. **SIRUTA centroid** — for SEAP-mentioned CUIs only, where firms.entities row didn't exist: `seap.announcements.{authority,supplier}_siruta` → `gis_uats.siruta` centroid (transformed 3844→4326).
|
||||
7. **City+county from seap.cui_location** → judet centroid fallback (`seap_judet_centroid`).
|
||||
|
||||
## Authority / supplier coverage (downstream)
|
||||
|
||||
After backfill, JOIN-based coverage from SEAP:
|
||||
|
||||
| Bucket | Total distinct CUIs | Geocoded | Pct |
|
||||
|---|---:|---:|---:|
|
||||
| authority_cui | 14,617 | 14,119 | 96.6 % |
|
||||
| supplier_cui | 65,675 | 64,793 | 98.7 % |
|
||||
|
||||
Residual: 498 authorities + 882 suppliers (~1,373 unique) — these CUIs appear nowhere with address data (no siruta, no city/county in seap.cui_location, no usable address in any announcement). Most are malformed CUI strings (commas, semicolons, trailing punctuation) — should be cleaned up at SEAP ingestion. Out of scope for geocoding.
|
||||
|
||||
## Cross-schema enrichment
|
||||
|
||||
- `aaas.firme` — 11 rows total, all 11 have geocoded parent in firms.entities via CUI. No action needed; UI agents JOIN.
|
||||
- `anre.licente` — 27,275 rows with titular_cui populated, 11,043 distinct. All 11,043 CUIs match a geocoded firm. UI agents JOIN on `firms.entities.cui = anre.licente.titular_cui`.
|
||||
- `seap.announcements` — `supplier_address`, `authority_address`, `supplier_siruta`, `authority_siruta` are populated. After this batch, almost every announcement can render on a map via firms.entities lookup.
|
||||
|
||||
## Geom integrity
|
||||
|
||||
- `firms.entities.geom` (geography 4326) is now 1:1 with lat/lng (12,735 prior mismatches fixed where judet_fallback had stale geom from an older run).
|
||||
- 2 unmapped firms have NULL on both. PostGIS spatial indexes still valid.
|
||||
|
||||
## Forward maintenance
|
||||
|
||||
1. Anyone ingesting new firms (ANAF/ONRC weekly refresh) must apply the fallback chain in code before INSERT.
|
||||
2. The seap_siruta_centroid and seap_judet_centroid stubs should be **upgraded** the moment an ANAF/ONRC record arrives for the same CUI — re-run the chain with the real `adr_full`.
|
||||
3. If the SEAP CUI hygiene gets fixed (A1's domain), the 1,373 residual can be re-attempted.
|
||||
4. `judet_centroid` (and the two seap variants) have only `geocode_score = 0.1` and `0.3`. UI clustering should down-weight or hide these at high zoom.
|
||||
|
||||
## Queries used
|
||||
|
||||
All idempotent UPDATEs filtered on `lat IS NULL`. Centroid sources read from `firms.postal_codes` and `public.gis_uats` (SRID 3844 → 4326). Saved in-line in the agent transcript; the strategy itself is the artifact.
|
||||
Reference in New Issue
Block a user