SEAP Historical Backfill — Notes & Caveats

Backfill ingest of data.gov.ro yearly CKAN dumps into seap.announcements. This file documents schema variants per year, known data quality issues, and what was deliberately skipped.

Pipeline

  • scripts/import-seap-historical.py: CSV normalizer (accepts , | ^ or ; as delimiter and " or | as quote character)
  • scripts/import-seap-historical.sh — CSV download + ingest wrapper
  • scripts/xlsx-to-csv.py: XLSX (openpyxl) and legacy XLS (xlrd 1.2) → CSV; multi-sheet aware, because legacy XLS caps each sheet at 65,536 rows and larger exports spill onto further sheets
  • scripts/import-seap-xlsx.sh — full XLS/XLSX → CSV → ingest pipeline
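
The delimiter and quote handling in the normalizer can be illustrated roughly as below. This is a minimal sketch, not the code in import-seap-historical.py: the function names sniff_dialect and read_rows are hypothetical, and it assumes the dialect can be decided from the header line alone.

```python
import csv
import io
from typing import List, Optional, Tuple

DELIMS = [",", "|", "^", ";"]   # delimiters listed in the pipeline notes
QUOTES = ['"', "|"]             # quote characters listed in the pipeline notes


def sniff_dialect(header_line: str) -> Tuple[str, Optional[str]]:
    """Guess (delimiter, quotechar) from a header line (hypothetical helper)."""
    line = header_line.strip()
    # Quoted headers start and end with the quote char; the delimiter is
    # whatever sits between a closing and an opening quote.
    for q in QUOTES:
        if len(line) >= 2 and line.startswith(q) and line.endswith(q):
            for d in DELIMS:
                if d != q and (q + d + q) in line:
                    return d, q
    # Unquoted header: take the most frequent candidate delimiter.
    counts = {d: line.count(d) for d in DELIMS}
    return max(counts, key=counts.get), None


def read_rows(text: str) -> List[List[str]]:
    """Parse CSV text using the dialect sniffed from its first line."""
    delim, quote = sniff_dialect(text.splitlines()[0])
    if quote is None:
        reader = csv.reader(io.StringIO(text), delimiter=delim,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=delim,
                            quotechar=quote)
    return list(reader)
```

For example, read_rows('|DENUMIRE_AC|,|CUI_AC|\n|Primaria X|,|123|\n') yields the header row followed by one data row with the pipe quotes stripped. Sniffing only the header keeps the check cheap, at the cost of misjudging files whose header is atypical.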

Schema variants observed

Year        Format  Delim  Quote  Header style
2017        CSV     ^      none   CamelCase (Castigator, AutoritateContractanta)
2018 T1     CSV     ^      none   CamelCase
2018 T2-T4  XLS     n/a    n/a    UPPER_SNAKE_CASE (CASTIGATOR_CUI, CASTIGAOR_LOCALITATE ← typo)
2019        XLS     n/a    n/a    UPPER_SNAKE_CASE (same as 2018 T2-T4)
2022 T1     CSV     ,      |      UPPER_SNAKE_CASE (e.g. header line starts |DENUMIRE_AC|,|CUI_AC|)
2022 T2-T4  XLS     n/a    n/a    UPPER_SNAKE_CASE
2023 T1-T2  XLS     n/a    n/a    Title Case with title row as row 1, real header on row 2
2023 T3     CSV     |      "      UPPER_SNAKE_CASE (with TIP_LESIGLATIE typo)
2023 T4     CSV     ,      "      Title Case
2024        CSV     ,      "      Title Case (standard)
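
One way to fold the three header styles above into a single key is sketched here. It is an illustration under assumptions, not the normalizer's actual logic: the function name canonical_header, the lowercase snake_case target, and the corrected spellings in TYPO_ALIASES (in particular tip_legislatie) are hypothetical choices.

```python
import re

# Known typos from the source files, mapped to an assumed corrected spelling.
TYPO_ALIASES = {
    "castigaor_localitate": "castigator_localitate",
    "tip_lesiglatie": "tip_legislatie",
}


def canonical_header(name: str) -> str:
    """Map CamelCase, UPPER_SNAKE_CASE or Title Case to lowercase snake_case."""
    s = name.strip()
    # CamelCase: insert "_" at each lowercase/digit -> uppercase boundary.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)
    # Title Case: turn runs of whitespace or hyphens into "_".
    s = re.sub(r"[\s\-]+", "_", s)
    # Collapse repeated underscores, trim the ends, then lowercase.
    s = re.sub(r"_+", "_", s).strip("_").lower()
    return TYPO_ALIASES.get(s, s)
```

With this, AutoritateContractanta becomes autoritate_contractanta and CASTIGAOR_LOCALITATE becomes castigator_localitate. The title row in the 2023 T1-T2 files would still need to be skipped before the header is read; that step is not shown.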

Header dedupe: the normalizer treats (type, ref_number) as the primary key and keeps the first row it sees, so per-lot rows belonging to the same announcement collapse into a single row.
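
The first-row-wins rule can be sketched as follows. The key fields type and ref_number come from the note above; the function name dedupe_first_wins and the use of plain dicts for rows are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def dedupe_first_wins(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first row seen for each (type, ref_number) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["type"], row["ref_number"])
        if key in seen:
            # A later lot of an already-seen announcement: dropped.
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

A consequence worth keeping in mind is that any per-lot value carried only on the second and later rows is lost, since only the first row's fields survive.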

Known data quality issues

2019 T2 ≡ T3 (data.gov.ro upload error)

Files raport-t-2-2019.xls and raport-t-3-2019.xls are byte-identical and contain an unspecified date range mixing months across 2019. The file was loaded first under the T2 source label (5,673 rows); every row of the subsequent T3 import then conflicted on the unique constraint. Real Q2 2019 data (Apr-Jun) is missing from the dump.

Workaround: use TED supplement (Jan-Aug 2018 onwards is in TED) or scrape SEAP directly for the missing quarter.
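
A duplicate upload like this one can be caught before ingest by hashing the downloaded files. The sketch below is an assumed pre-check, not part of the existing scripts; sha256_of and find_duplicates are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, List


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[str]) -> List[List[str]]:
    """Group paths whose contents hash identically; return groups of 2+."""
    by_hash: Dict[str, List[str]] = {}
    for p in paths:
        by_hash.setdefault(sha256_of(p), []).append(str(p))
    return [group for group in by_hash.values() if len(group) > 1]
```

Running find_duplicates over the quarterly files of a year would return one group containing the T2 and T3 paths in the 2019 case, so the second import can be skipped and the gap logged instead of surfacing as constraint conflicts.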

2019 anunturi-initiere XLSX files are 1-cell stubs

All anunturiinitiere2019tX.xlsx files on data.gov.ro contain only the header TIP_ANUNT with no data rows. The same applies to the 2018 T2-T4 anunturi-initiere XLSX files and the 2019 achizitii-directe XLSX files. These appear to be broken uploads, and the data cannot be recovered from CKAN.

2022 T3 contracte missing September

The T3 file (raport-datagov-contracte-t3-2022.xls) only covers Jul-Aug. September contracts are missing.

Date format ambiguity in 2019 XLS

Dates in 2019 XLS files appear to use DD/MM/YYYY rather than the SEAP-standard MM/DD/YYYY. The MM/DD parser in import-seap-historical.py discards rows where day > 12, so the data is only partially preserved; rows that do parse (day ≤ 12) may carry swapped day and month values. Consider re-parsing with format detection if pristine 2019 dates are needed.
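
Format detection of the kind suggested above could look like the following sketch. It assumes dates are plain slash-separated strings with no time component and that one file uses a single convention throughout; detect_day_first and parse_date are hypothetical names, not functions from import-seap-historical.py.

```python
from datetime import date
from typing import Iterable, Optional


def detect_day_first(samples: Iterable[str]) -> Optional[bool]:
    """True for DD/MM/YYYY, False for MM/DD/YYYY, None if undecidable."""
    first_gt12 = second_gt12 = False
    for s in samples:
        parts = s.strip().split("/")
        if len(parts) != 3:
            continue
        try:
            a, b = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        # A value above 12 cannot be a month, which pins down the order.
        first_gt12 = first_gt12 or a > 12
        second_gt12 = second_gt12 or b > 12
    if first_gt12 and not second_gt12:
        return True
    if second_gt12 and not first_gt12:
        return False
    return None  # no evidence, or contradictory evidence


def parse_date(s: str, day_first: bool) -> date:
    """Parse a slash-separated date using the detected field order."""
    a, b, year = (int(p) for p in s.strip().split("/"))
    day, month = (a, b) if day_first else (b, a)
    return date(year, month, day)
```

Detecting once per file and then parsing every row with the same order avoids both the dropped rows and the silent day/month swap. A None result means the file should be reviewed by hand instead of guessed.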

What was skipped this session

Dataset                                          Reason                                               Estimated row count
Achizitii directe (cumparari directe) all years  Per task spec: 8M+ row dataset, deferred             ~8,000,000
2020, 2021                                       Per task spec: ministry-only datasets, no CKAN dump  n/a
2017/2018 contracte-subsecvente                  Lower priority, can ingest in next session           ~10,000
2017/2018 invitatii-participare                  Low value (intent, not award)                        ~5,000
2018 T2-T4 cumparari-directe XLSX                Skipped per spec                                     ~3,000,000

Current ingest state (post-backfill)

Year        Rows                                                                Total RON (bln)
2017        31,271 (contracte 20,478 + initiere 10,793)                         33.20
2018        17,883 (contracte 15,711 + initiere 2,172)                          23.80
2019        16,570 contracte (T1+T2dup+T4)                                      36.95
2022        24,677 contracte                                                    89.99
2023        46,997 (contracte 25,793 + initiere 15,520 + atribuire-fara 5,684)  187.13
2024 (PoC)  750 contracte                                                       7.33
Total       138,148                                                             378.41

Total seap.announcements table: 781,029 rows.

Next-session work

  1. 2020 + 2021 gap: the TED supplement (https://ted.europa.eu) covers EU-threshold awards for these years. National-only awards are likely lost.
  2. Achizitii directe: 8M rows; handle in a separate session with its own ingest path and type='da'.
  3. 2019 Q2: scrape SEAP-WSP backwards or pull from individual seapcerere archives.
  4. 2018 anunturi-initiere T2-T4: broken on CKAN; options are an ANAP RFE or a SEAP-WSP scrape.
  5. CPV name lookup: cpv_code is populated for 2017+; cpv_name needs a join via the seap.cpv_codes view.