vreau-digital/services/seap-scraper/SEAP-HISTORICAL-NOTES.md
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00


# SEAP Historical Backfill — Notes & Caveats
Backfill ingest of data.gov.ro yearly CKAN dumps into `seap.announcements`.
This file documents schema variants per year, known data quality issues,
and what was deliberately skipped.
## Pipeline
- `scripts/import-seap-historical.py` — CSV normalizer (any of `,` `|` `^` `;` delim, `"` or `|` quote)
- `scripts/import-seap-historical.sh` — CSV download + ingest wrapper
- `scripts/xlsx-to-csv.py` — XLSX (openpyxl) **and** XLS legacy (xlrd 1.2) → CSV; multi-sheet aware (XLS 65k row limit)
- `scripts/import-seap-xlsx.sh` — full XLS/XLSX → CSV → ingest pipeline
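
A minimal sketch of how a delimiter and quote character could be guessed from a header line, given the candidates listed above. It is not the code in `scripts/import-seap-historical.py`; `guess_dialect` and `read_rows` are hypothetical helpers. The heuristic is an assumption: a quote character is accepted only when the header line both starts and ends with it.

```python
import csv
import io
from typing import List, Optional, Tuple

DELIMITERS = [",", "|", "^", ";"]  # delimiter candidates named in the pipeline notes
QUOTES = ['"', "|"]                # quote candidates named in the pipeline notes


def guess_dialect(header_line: str) -> Tuple[str, Optional[str]]:
    """Guess (delimiter, quotechar) from the first line of a dump (hypothetical helper)."""
    line = header_line.strip()
    # Assumption: a quoted header line starts and ends with the quote character.
    quote = next(
        (q for q in QUOTES if len(line) > 1 and line.startswith(q) and line.endswith(q)),
        None,
    )
    # The delimiter is the most frequent remaining candidate; ties go to the earliest.
    candidates = [d for d in DELIMITERS if d != quote]
    delim = max(candidates, key=line.count)
    return delim, quote


def read_rows(text: str, delim: str, quote: Optional[str]) -> List[List[str]]:
    """Parse CSV text with the guessed dialect; quoting is disabled when no quote was found."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delim,
        quotechar=quote,
        quoting=csv.QUOTE_MINIMAL if quote else csv.QUOTE_NONE,
    )
    return [row for row in reader]
```

A file whose header is unquoted but whose data rows are quoted would be misread by this heuristic, so a real implementation would likely sample more than one line.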
## Schema variants observed
| Year | Format | Delim | Quote | Header style |
|------|--------|-------|-------|--------------|
| 2017 | CSV | `^` | none | `CamelCase` (`Castigator`, `AutoritateContractanta`) |
| 2018 T1 | CSV | `^` | none | `CamelCase` |
| 2018 T2-T4 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` (`CASTIGATOR_CUI`, `CASTIGAOR_LOCALITATE` ← typo) |
| 2019 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` (same as 2018 T2-T4) |
| 2022 T1 | CSV | `,` | `\|` | `UPPER_SNAKE_CASE` (e.g. header line starts `\|DENUMIRE_AC\|,\|CUI_AC\|`) |
| 2022 T2-T4 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` |
| 2023 T1-T2 | XLS | n/a | n/a | `Title Case` with title row as row 1, real header on row 2 |
| 2023 T3 | CSV | `\|` | `"` | `UPPER_SNAKE_CASE` (with `TIP_LESIGLATIE` typo) |
| 2023 T4 | CSV | `,` | `"` | `Title Case` |
| 2024 | CSV | `,` | `"` | `Title Case` (standard) |
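
The header styles in the table can be folded into one canonical form before column mapping. The sketch below is one way to do that, not the normalizer's actual code: `normalize_header`, `find_header_row`, and the lower-snake-case target are assumptions, and the alias table covers only the two typos noted above. `find_header_row` further assumes that a title row has a single non-empty cell.

```python
import re
import unicodedata

# The two source typos noted in the table, mapped to their corrected spelling.
TYPO_ALIASES = {
    "castigaor_localitate": "castigator_localitate",
    "tip_lesiglatie": "tip_legislatie",
}


def normalize_header(name: str) -> str:
    """Map CamelCase, UPPER_SNAKE_CASE and Title Case headers to lower_snake_case."""
    # Strip diacritics so accented letters survive as their base letter.
    name = unicodedata.normalize("NFKD", name.strip())
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Insert "_" at CamelCase boundaries (lowercase or digit followed by uppercase).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse any run of non-alphanumerics into one "_", then lowercase.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()
    return TYPO_ALIASES.get(name, name)


def find_header_row(rows) -> int:
    """Return the index of the first row with more than one non-empty cell.

    Assumption: a title row (as in 2023 T1-T2) has a single non-empty cell.
    """
    for index, row in enumerate(rows):
        if sum(1 for cell in row if str(cell or "").strip()) > 1:
            return index
    return 0
```

With this mapping, `AutoritateContractanta` becomes `autoritate_contractanta` and `CASTIGAOR_LOCALITATE` becomes `castigator_localitate`.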
Dedupe: the normalizer treats `(type, ref_number)` as the primary key and keeps the first row seen, so per-lot rows of the same announcement collapse to a single row.
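
As an illustration of first-row-wins on `(type, ref_number)`, a minimal sketch follows. It assumes rows are dicts carrying those two keys; `dedupe_first_wins` is a hypothetical name, not a function in the normalizer.

```python
from typing import Dict, Iterable, List


def dedupe_first_wins(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first row seen for each (type, ref_number) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["type"], row["ref_number"])
        if key in seen:
            # Later per-lot rows of the same announcement are dropped.
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

A consequence of this choice is that any lot-level values carried only by the later rows are lost, so the retained row describes the first lot encountered rather than the whole announcement.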
## Known data quality issues
### 2019 T2 ≡ T3 (data.gov.ro upload error)
Files `raport-t-2-2019.xls` and `raport-t-3-2019.xls` are byte-identical and cover an undocumented date range that mixes months from across 2019. The file was loaded first under the `T2` source label (5,673 rows); every row of the subsequent `T3` import then conflicted on the unique constraint. **Real Q2 2019 data (Apr-Jun) is missing from the dump.**
Workaround: use TED supplement (Jan-Aug 2018 onwards is in TED) or scrape SEAP directly for the missing quarter.
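
Duplicate uploads like this one can be caught before ingest by hashing the downloaded files. The sketch below is a suggestion rather than part of the existing scripts; `sha256_of` and `find_identical` are hypothetical helpers.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(paths: Iterable[Path]) -> List[List[Path]]:
    """Group paths whose contents are byte-identical; only groups of 2+ are returned."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[sha256_of(Path(path))].append(Path(path))
    return [group for group in by_digest.values() if len(group) > 1]
```

Running `find_identical` over the downloaded quarterly files before import would flag a pair such as the 2019 T2 and T3 reports, so the second file can be skipped instead of failing on the unique constraint.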
### 2019 anunturi-initiere XLSX files are 1-cell stubs
All `anunturiinitiere2019tX.xlsx` files on data.gov.ro contain only the header `TIP_ANUNT` with no data rows. Same applies to **2018 T2-T4 anunturi-initiere XLSX** and **2019 achizitii-directe XLSX**. These appear to be broken uploads. Cannot recover from CKAN.
### 2022 T3 contracte missing September
The T3 file (`raport-datagov-contracte-t3-2022.xls`) only covers Jul-Aug. September contracts are missing.
### Date format ambiguity in 2019 XLS
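Dates in 2019 XLS files appear to use `DD/MM/YYYY` rather than the SEAP-standard `MM/DD/YYYY`. The MM/DD parser in `import-seap-historical.py` discards rows where day > 12, so only part of the data is preserved. If the files are indeed `DD/MM`, the rows that do parse (day ≤ 12) would carry swapped day and month values unless the two are equal. Consider re-parsing with format detection if pristine 2019 dates are needed.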
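
One possible shape for the format detection mentioned above: scan a sample of date strings per file and decide the order from values that can only be a day (greater than 12). This is a sketch under assumptions, not the current parser. It assumes date-only strings with `/` separators, and `detect_day_first` and `parse_date` are hypothetical names.

```python
from datetime import date, datetime
from typing import Iterable, Optional


def detect_day_first(samples: Iterable[str]) -> Optional[bool]:
    """Return True for DD/MM/YYYY, False for MM/DD/YYYY, None if undecidable."""
    day_first = month_first = False
    for sample in samples:
        parts = sample.strip().split("/")
        if len(parts) != 3:
            continue
        try:
            first, second = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        if first > 12 and second <= 12:
            day_first = True      # first field can only be a day
        elif second > 12 and first <= 12:
            month_first = True    # second field can only be a day
    if day_first and not month_first:
        return True
    if month_first and not day_first:
        return False
    return None  # no evidence, or conflicting evidence within the sample


def parse_date(value: str, day_first: bool) -> date:
    """Parse one date string using the detected field order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(value.strip(), fmt).date()
```

Detecting per file rather than per row matters here: a value such as `03/04/2019` is ambiguous on its own and can only be resolved from other rows in the same file.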
## What was skipped this session
| Dataset | Reason | Estimated row count |
|---------|--------|---------------------|
| Achizitii directe (cumparari directe) all years | Per task spec — 8M+ row dataset, deferred | ~8,000,000 |
| 2020, 2021 | Per task spec — ministry-only datasets, no CKAN dump | n/a |
| 2017/2018 contracte-subsecvente | Lower priority, can ingest in next session | ~10,000 |
| 2017/2018 invitatii-participare | Low value (intent, not award) | ~5,000 |
| 2018 T2-T4 cumparari-directe XLSX | Skipped per spec | ~3,000,000 |
## Current ingest state (post-backfill)
| Year | Rows | Total RON (bln) |
|------|------|-----------------|
| 2017 | 31,271 (contracte 20,478 + initiere 10,793) | 33.20 |
| 2018 | 17,883 (contracte 15,711 + initiere 2,172) | 23.80 |
| 2019 | 16,570 contracte (T1+T2dup+T4) | 36.95 |
| 2022 | 24,677 contracte | 89.99 |
| 2023 | 46,997 (contracte 25,793 + initiere 15,520 + atribuire-fara 5,684) | 187.13 |
| 2024 (PoC) | 750 contracte | 7.33 |
| **Total** | **138,148** | **378.41 bln RON** |
Total `seap.announcements` table: 781,029 rows.
## Next-session work
1. **2020 + 2021 gap** — TED supplement (`https://ted.europa.eu`) covers EU-threshold awards for these years. National-only awards likely lost.
2. **Achizitii directe** — 8M rows, separate session: own ingest path with `type='da'`.
3. **2019 Q2** — scrape SEAP-WSP backwards or pull from individual `seapcerere` archives.
4. **2018 anunturi-initiere T2-T4** — broken on CKAN; ANAP RFE or SEAP-WSP scrape.
5. **CPV name lookup** — cpv_code populated for 2017+; cpv_name needs join via `seap.cpv_codes` view.