SEAP Historical Backfill — Notes & Caveats
Backfill ingest of data.gov.ro yearly CKAN dumps into seap.announcements.
This file documents schema variants per year, known data quality issues,
and what was deliberately skipped.
Pipeline
- scripts/import-seap-historical.py: CSV normalizer (delimiter any of `,` `|` `^` `;`; quote character `"` or `|`)
- scripts/import-seap-historical.sh: CSV download + ingest wrapper
- scripts/xlsx-to-csv.py: XLSX (openpyxl) and legacy XLS (xlrd 1.2) → CSV; multi-sheet aware (XLS 65k row limit)
- scripts/import-seap-xlsx.sh: full XLS/XLSX → CSV → ingest pipeline
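A minimal sketch of how a normalizer could guess the delimiter and quote character from a header line, given the candidate characters listed above. This is not the code in import-seap-historical.py; the function names `guess_dialect` and `read_rows` are hypothetical, and the heuristic (a quote character wraps the whole header line, the delimiter is the most frequent remaining candidate) is an assumption.

```python
import csv
import io
from typing import List, Optional, Tuple

DELIMS = [",", "|", "^", ";"]   # delimiter candidates named in the pipeline notes
QUOTES = ['"', "|"]             # quote-character candidates named in the pipeline notes


def guess_dialect(header_line: str) -> Tuple[str, Optional[str]]:
    """Guess (delimiter, quotechar) from a header line (heuristic, assumed)."""
    line = header_line.strip()
    quote = None
    # Assumption: in quoted files the header starts and ends with the quote char,
    # e.g. |DENUMIRE_AC|,|CUI_AC|
    for q in QUOTES:
        if len(line) > 1 and line.startswith(q) and line.endswith(q):
            quote = q
            break
    # The quote character cannot also be the delimiter.
    candidates = [d for d in DELIMS if d != quote]
    delim = max(candidates, key=line.count)
    return delim, quote


def read_rows(text: str) -> List[List[str]]:
    """Parse CSV text using the dialect guessed from its first line."""
    delim, quote = guess_dialect(text.splitlines()[0])
    kwargs = {"delimiter": delim}
    if quote:
        kwargs["quotechar"] = quote
    else:
        kwargs["quoting"] = csv.QUOTE_NONE   # unquoted files such as the ^-delimited ones
    return list(csv.reader(io.StringIO(text), **kwargs))
```

For example, `guess_dialect("|DENUMIRE_AC|,|CUI_AC|")` yields a comma delimiter with `|` as the quote character. An unquoted pipe-delimited file whose header happens to start and end with `|` would be misread by this heuristic.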
Schema variants observed
| Year | Format | Delim | Quote | Header style |
|---|---|---|---|---|
| 2017 | CSV | `^` | none | CamelCase (Castigator, AutoritateContractanta) |
| 2018 T1 | CSV | `^` | none | CamelCase |
| 2018 T2-T4 | XLS | n/a | n/a | UPPER_SNAKE_CASE (CASTIGATOR_CUI, CASTIGAOR_LOCALITATE ← typo) |
| 2019 | XLS | n/a | n/a | UPPER_SNAKE_CASE (same as 2018 T2-T4) |
| 2022 T1 | CSV | `,` | `\|` | UPPER_SNAKE_CASE (e.g. header line starts `\|DENUMIRE_AC\|,\|CUI_AC\|`) |
| 2022 T2-T4 | XLS | n/a | n/a | UPPER_SNAKE_CASE |
| 2023 T1-T2 | XLS | n/a | n/a | Title Case with title row as row 1, real header on row 2 |
| 2023 T3 | CSV | `\|` | `"` | UPPER_SNAKE_CASE (with TIP_LESIGLATIE typo) |
| 2023 T4 | CSV | `,` | `"` | Title Case |
| 2024 | CSV | `,` | `"` | Title Case (standard) |
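The header styles in this table can be folded into one canonical form before column mapping. The sketch below is illustrative only: `canonical`, `find_header_row` and `TYPO_FIXES` are hypothetical names, the corrected spelling `tip_legislatie` is an assumption, and the title-row rule (a title row has fewer than two non-empty cells) is a guess at how the 2023 T1-T2 files look.

```python
import re
import unicodedata

# Header typos noted in the table, mapped to an assumed corrected spelling.
TYPO_FIXES = {
    "castigaor_localitate": "castigator_localitate",
    "tip_lesiglatie": "tip_legislatie",
}


def canonical(header: str) -> str:
    """Fold CamelCase, UPPER_SNAKE_CASE and Title Case headers to lower_snake_case."""
    s = unicodedata.normalize("NFKD", header.strip())
    s = "".join(c for c in s if not unicodedata.combining(c))  # drop diacritics
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)              # split CamelCase words
    s = re.sub(r"[^0-9A-Za-z]+", "_", s).strip("_").lower()    # spaces etc. to "_"
    return TYPO_FIXES.get(s, s)


def find_header_row(rows, min_cells=2):
    """Index of the first row with at least `min_cells` non-empty cells.

    Assumption: a title row placed above the real header has a single
    non-empty cell, so it is skipped.
    """
    for i, row in enumerate(rows):
        if sum(1 for cell in row if str(cell).strip()) >= min_cells:
            return i
    raise ValueError("no header row found")
```

With this, `AutoritateContractanta`, `AUTORITATE_CONTRACTANTA` and `Autoritate Contractanta` all map to the same key, so one column mapping can serve every year.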
Row dedupe: the normalizer uses (type, ref_number) as the primary key with first-row-wins, so per-lot rows of the same announcement collapse to a single row.
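The first-row-wins rule on (type, ref_number) can be sketched as follows. The dict-based row shape and the function name `dedupe_first_wins` are assumptions for illustration, not the normalizer's actual code.

```python
def dedupe_first_wins(rows):
    """Keep only the first row seen for each (type, ref_number) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["type"], row["ref_number"])
        if key in seen:
            continue          # later rows (e.g. further lots) are dropped
        seen.add(key)
        kept.append(row)
    return kept


# Two lots of the same announcement collapse to the first one.
rows = [
    {"type": "contract", "ref_number": "CN100", "lot": 1},
    {"type": "contract", "ref_number": "CN100", "lot": 2},
    {"type": "contract", "ref_number": "CN101", "lot": 1},
]
deduped = dedupe_first_wins(rows)
```

Under this rule, per-lot fields on later rows are discarded rather than aggregated, which matters when reading per-announcement values.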
Known data quality issues
2019 T2 ≡ T3 (data.gov.ro upload error)
Files raport-t-2-2019.xls and raport-t-3-2019.xls are byte-identical and cover an unspecified date range that mixes months from across 2019. The file was loaded first under the T2 source label (5,673 rows); the T3 import then conflicted on the unique constraint for every row. Real Q2 2019 data (Apr-Jun) is missing from the dump.
Workaround: use TED supplement (Jan-Aug 2018 onwards is in TED) or scrape SEAP directly for the missing quarter.
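Byte-identical uploads like the 2019 T2/T3 pair can be caught before ingest by hashing the downloaded files. This is a sketch, not part of the existing scripts; `sha256_of` and `find_duplicate_files` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(paths):
    """Group paths whose contents are byte-identical (same SHA-256 digest)."""
    by_digest = {}
    for path in paths:
        by_digest.setdefault(sha256_of(path), []).append(str(path))
    # Only groups with more than one file are duplicates.
    return [group for group in by_digest.values() if len(group) > 1]
```

Running `find_duplicate_files` over a year's downloads before import would flag the duplicate pair without relying on unique-constraint conflicts during the load.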
2019 anunturi-initiere XLSX files are 1-cell stubs
All anunturiinitiere2019tX.xlsx files on data.gov.ro contain only the header TIP_ANUNT with no data rows. The same applies to the 2018 T2-T4 anunturi-initiere XLSX and 2019 achizitii-directe XLSX files. These appear to be broken uploads and cannot be recovered from CKAN.
2022 T3 contracte missing September
The T3 file (raport-datagov-contracte-t3-2022.xls) only covers Jul-Aug. September contracts are missing.
Date format ambiguity in 2019 XLS
Dates in the 2019 XLS files appear to use DD/MM/YYYY rather than the SEAP-standard MM/DD/YYYY. The MM/DD parser in import-seap-historical.py discards rows where the day is greater than 12 (the day is read as an invalid month), so the data is only partially preserved; in the rows that do parse, day and month are presumably swapped. Consider re-parsing with format detection if accurate 2019 dates are needed.
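One possible form of the format detection mentioned here: scan a whole column and decide between DD/MM and MM/DD from the values that can only be read one way. This is a sketch under assumptions (plain `NN/NN/YYYY` strings with no time component); `detect_day_first` and `parse_date` are hypothetical names, not functions in import-seap-historical.py.

```python
from datetime import date


def detect_day_first(values):
    """Return True for DD/MM/YYYY, False for MM/DD/YYYY, None if undecidable.

    A first field above 12 can only be a day; a second field above 12
    can only be a day as well, which implies month-first order.
    """
    day_first = month_first = False
    for value in values:
        first, second, _year = value.strip().split("/")
        if int(first) > 12:
            day_first = True
        if int(second) > 12:
            month_first = True
    if day_first and not month_first:
        return True
    if month_first and not day_first:
        return False
    return None   # no evidence, or contradictory evidence (mixed formats)


def parse_date(value, day_first):
    """Parse NN/NN/YYYY using the column-level format decision."""
    first, second, year = (int(part) for part in value.strip().split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)
```

Deciding per column rather than per row avoids silently swapping day and month on ambiguous values such as 03/04/2019; a `None` result signals that the file needs manual inspection.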
What was skipped this session
| Dataset | Reason | Estimated row count |
|---|---|---|
| Achizitii directe (cumparari directe) all years | Per task spec — 8M+ row dataset, deferred | ~8,000,000 |
| 2020, 2021 | Per task spec — ministry-only datasets, no CKAN dump | n/a |
| 2017/2018 contracte-subsecvente | Lower priority, can ingest in next session | ~10,000 |
| 2017/2018 invitatii-participare | Low value (intent, not award) | ~5,000 |
| 2018 T2-T4 cumparari-directe XLSX | Skipped per spec | ~3,000,000 |
Current ingest state (post-backfill)
| Year | Rows | Total RON (bln) |
|---|---|---|
| 2017 | 31,271 (contracte 20,478 + initiere 10,793) | 33.20 |
| 2018 | 17,883 (contracte 15,711 + initiere 2,172) | 23.80 |
| 2019 | 16,570 contracte (T1+T2dup+T4) | 36.95 |
| 2022 | 24,677 contracte | 89.99 |
| 2023 | 46,997 (contracte 25,793 + initiere 15,520 + atribuire-fara 5,684) | 187.13 |
| 2024 (PoC) | 750 contracte | 7.33 |
| Total | 138,148 | 378.41 bln RON |
Total seap.announcements table: 781,029 rows.
Next-session work
- 2020 + 2021 gap: TED supplement (https://ted.europa.eu) covers EU-threshold awards for these years. National-only awards likely lost.
- Achizitii directe: 8M rows, separate session; own ingest path with type='da'.
- 2019 Q2: scrape SEAP-WSP backwards or pull from individual seapcerere archives.
- 2018 anunturi-initiere T2-T4: broken on CKAN; ANAP RFE or SEAP-WSP scrape.
- CPV name lookup: cpv_code populated for 2017+; cpv_name needs join via seap.cpv_codes view.