initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).

- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:

- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
@@ -0,0 +1,81 @@
# SEAP Historical Backfill: Notes & Caveats

Backfill ingest of data.gov.ro yearly CKAN dumps into `seap.announcements`.
This file documents schema variants per year, known data quality issues,
and what was deliberately skipped.

## Pipeline

- `scripts/import-seap-historical.py`: CSV normalizer (accepts `,`, `|`, `^` or `;` as delimiter and `"` or `|` as quote character)
- `scripts/import-seap-historical.sh`: CSV download + ingest wrapper
- `scripts/xlsx-to-csv.py`: XLSX (openpyxl) **and** legacy XLS (xlrd 1.2) → CSV; multi-sheet aware (XLS caps each sheet at 65,536 rows)
- `scripts/import-seap-xlsx.sh`: full XLS/XLSX → CSV → ingest pipeline

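The delimiter and quote combinations listed for the CSV normalizer could be detected from the header line alone. The following is a minimal sketch of one such heuristic, not the logic actually used in `import-seap-historical.py`; the function names `guess_dialect` and `read_rows` are hypothetical.

```python
import csv
import io
from typing import List, Optional, Tuple

DELIMITERS = (",", "|", "^", ";")
QUOTES = ('"', "|")


def guess_dialect(header_line: str) -> Tuple[str, Optional[str]]:
    """Guess (delimiter, quotechar) from a header line; quotechar is None if unquoted."""
    line = header_line.strip()
    # A fully quoted header starts and ends with the quote character and
    # contains the sequence quote + delimiter + quote between fields.
    for quote in QUOTES:
        for delim in DELIMITERS:
            if delim == quote:
                continue
            if (line.startswith(quote) and line.endswith(quote)
                    and f"{quote}{delim}{quote}" in line):
                return delim, quote
    # Otherwise assume no quoting and take the most frequent candidate delimiter.
    return max(DELIMITERS, key=line.count), None


def read_rows(text: str) -> List[List[str]]:
    """Parse CSV text using the dialect guessed from its first line."""
    delim, quote = guess_dialect(text.splitlines()[0])
    if quote is None:
        reader = csv.reader(io.StringIO(text), delimiter=delim, quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=delim, quotechar=quote)
    return list(reader)
```

This heuristic assumes a header is either fully quoted or not quoted at all. A file that quotes only some fields would fall through to the unquoted branch and need extra handling.
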
## Schema variants observed

| Year | Format | Delimiter | Quote | Header style |
|------|--------|-----------|-------|--------------|
| 2017 | CSV | `^` | none | `CamelCase` (`Castigator`, `AutoritateContractanta`) |
| 2018 T1 | CSV | `^` | none | `CamelCase` |
| 2018 T2-T4 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` (`CASTIGATOR_CUI`, `CASTIGAOR_LOCALITATE` ← typo) |
| 2019 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` (same as 2018 T2-T4) |
| 2022 T1 | CSV | `,` | `\|` | `UPPER_SNAKE_CASE` (e.g. header line starts `\|DENUMIRE_AC\|,\|CUI_AC\|`) |
| 2022 T2-T4 | XLS | n/a | n/a | `UPPER_SNAKE_CASE` |
| 2023 T1-T2 | XLS | n/a | n/a | `Title Case` with title row as row 1, real header on row 2 |
| 2023 T3 | CSV | `\|` | `"` | `UPPER_SNAKE_CASE` (with `TIP_LESIGLATIE` typo) |
| 2023 T4 | CSV | `,` | `"` | `Title Case` |
| 2024 | CSV | `,` | `"` | `Title Case` (standard) |

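The header styles in the table above (CamelCase, UPPER_SNAKE_CASE, Title Case, two known typos, and a title row above the real header) could be reconciled along the following lines. This is a sketch under assumptions: `canonical_header` and `find_header_row` are hypothetical helpers, the corrected spellings in `HEADER_TYPOS` are inferred rather than taken from the normalizer, and non-ASCII characters in headers are simply treated as separators.

```python
import re
from typing import List, Sequence

# Typos listed in the schema table, mapped to an assumed corrected spelling.
HEADER_TYPOS = {
    "CASTIGAOR_LOCALITATE": "CASTIGATOR_LOCALITATE",
    "TIP_LESIGLATIE": "TIP_LEGISLATIE",
}


def canonical_header(name: str) -> str:
    """Map CamelCase, Title Case or UPPER_SNAKE_CASE to UPPER_SNAKE_CASE."""
    name = name.strip()
    # Insert "_" at lower-to-upper boundaries: "AutoritateContractanta" -> "Autoritate_Contractanta".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse spaces and punctuation into single underscores, then upper-case.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").upper()
    return HEADER_TYPOS.get(name, name)


def find_header_row(rows: Sequence[Sequence[object]], min_cells: int = 3) -> int:
    """Return the index of the first row with at least `min_cells` non-empty cells.

    A title row typically fills a single cell, so it is skipped.
    """
    for index, row in enumerate(rows):
        filled = sum(1 for cell in row if cell is not None and str(cell).strip())
        if filled >= min_cells:
            return index
    raise ValueError("no header row found")


def normalize_headers(rows: Sequence[Sequence[object]]) -> List[str]:
    """Locate the header row and return its canonical column names."""
    header = rows[find_header_row(rows)]
    return [canonical_header(str(cell)) for cell in header if cell is not None]
```

The `min_cells` threshold is a guess; a sheet with fewer than three real columns would need a lower value.
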

Dedupe: the normalizer keys rows on `(type, ref_number)` and keeps the first row it sees, so per-lot rows of the same announcement collapse into a single row.

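The first-row-wins rule can be illustrated with a short sketch. Only the key fields `type` and `ref_number` come from these notes; the function name `dedupe_first_wins` and the sample fields `lot` and `value` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_first_wins(rows: Iterable[dict]) -> List[dict]:
    """Keep the first row seen for each (type, ref_number) key, in input order."""
    seen: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["type"], row["ref_number"])
        # setdefault stores the row only if the key has not been seen yet.
        seen.setdefault(key, row)
    return list(seen.values())


# Example: two lots of the same announcement collapse to the first lot's row.
sample = [
    {"type": "contract", "ref_number": "CN1001", "lot": 1, "value": 100.0},
    {"type": "contract", "ref_number": "CN1001", "lot": 2, "value": 250.0},
    {"type": "contract", "ref_number": "CN1002", "lot": 1, "value": 75.0},
]
deduped = dedupe_first_wins(sample)
```

In this sketch the values of later lots are dropped rather than summed, which is what a first-row-wins policy implies.
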
## Known data quality issues

### 2019 T2 ≡ T3 (data.gov.ro upload error)

Files `raport-t-2-2019.xls` and `raport-t-3-2019.xls` are byte-identical and cover an unspecified date range that mixes months from across 2019. The `T2` source label was loaded first (5,673 rows); every row of the subsequent `T3` import conflicted on the unique constraint. **Real Q2 2019 data (Apr-Jun) is missing from the dump.**

Workaround: use the TED supplement (Jan-Aug 2018 onwards is in TED) or scrape SEAP directly for the missing quarter.

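Byte-identical uploads like the T2/T3 pair can be caught before ingest by hashing each downloaded file. The sketch below shows one way to do that with the standard library; it is not part of the existing scripts, and `sha256_of` and `find_identical` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dumps do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files by content hash and return only groups with more than one file."""
    by_hash: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        by_hash[sha256_of(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

Running `find_identical` over a download directory before the import step would flag duplicates up front, so they do not first show up as unique-constraint conflicts.
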
### 2019 anunturi-initiere XLSX files are 1-cell stubs

All `anunturiinitiere2019tX.xlsx` files on data.gov.ro contain only the header `TIP_ANUNT` and no data rows. The same applies to the **2018 T2-T4 anunturi-initiere XLSX** and **2019 achizitii-directe XLSX** files. These appear to be broken uploads, and the data cannot be recovered from CKAN.

### 2022 T3 contracte missing September

The T3 file (`raport-datagov-contracte-t3-2022.xls`) covers only Jul-Aug; September contracts are missing.

### Date format ambiguity in 2019 XLS

Dates in the 2019 XLS files appear to use `DD/MM/YYYY` rather than the SEAP-standard `MM/DD/YYYY`. The MM/DD parser in `import-seap-historical.py` discards rows where day > 12, so the data is only partially preserved. If the files really are `DD/MM`, rows that do parse would carry swapped day and month values. Consider re-parsing with format detection if pristine 2019 dates are needed.

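Format detection for a date column could work by checking which position ever exceeds 12 across the whole file. This is a minimal sketch of that idea, assuming slash-separated dates with an optional time suffix; `detect_day_first` and `parse_date` are hypothetical names, not functions from `import-seap-historical.py`.

```python
from datetime import date
from typing import Iterable, Optional


def detect_day_first(samples: Iterable[str]) -> Optional[bool]:
    """Return True for DD/MM/YYYY, False for MM/DD/YYYY, None if undecidable."""
    first_over_12 = second_over_12 = False
    for sample in samples:
        try:
            # Ignore any time part after whitespace, then split the date itself.
            first, second, _year = sample.strip().split()[0].split("/")
            first_over_12 |= int(first) > 12
            second_over_12 |= int(second) > 12
        except (ValueError, IndexError):
            continue  # skip blanks and malformed values
    if first_over_12 and not second_over_12:
        return True
    if second_over_12 and not first_over_12:
        return False
    return None  # no evidence, or contradictory evidence


def parse_date(value: str, day_first: bool) -> date:
    """Parse a slash-separated date using the detected field order."""
    first, second, year = (int(part) for part in value.strip().split()[0].split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)
```

Detection runs per file rather than per row, since a single value such as `04/07/2019` is valid in both formats. A `None` result means the file needs manual inspection.
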
## What was skipped this session

| Dataset | Reason | Estimated row count |
|---------|--------|---------------------|
| Achizitii directe (cumparari directe), all years | Per task spec: 8M+ row dataset, deferred | ~8,000,000 |
| 2020, 2021 | Per task spec: ministry-only datasets, no CKAN dump | n/a |
| 2017/2018 contracte-subsecvente | Lower priority, can ingest in next session | ~10,000 |
| 2017/2018 invitatii-participare | Low value (intent, not award) | ~5,000 |
| 2018 T2-T4 cumparari-directe XLSX | Skipped per spec | ~3,000,000 |

## Current ingest state (post-backfill)

| Year | Rows | Total RON (bln) |
|------|------|-----------------|
| 2017 | 31,271 (contracte 20,478 + initiere 10,793) | 33.20 |
| 2018 | 17,883 (contracte 15,711 + initiere 2,172) | 23.80 |
| 2019 | 16,570 contracte (T1+T2dup+T4) | 36.95 |
| 2022 | 24,677 contracte | 89.99 |
| 2023 | 46,997 (contracte 25,793 + initiere 15,520 + atribuire-fara 5,684) | 187.13 |
| 2024 (PoC) | 750 contracte | 7.33 |
| **Total** | **138,148** | **378.41** |

Total rows in the `seap.announcements` table: 781,029.

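The per-year breakdowns can be cross-checked against the reported backfill total with a few lines of arithmetic. The sketch below uses only the component counts and the total from the table; the variable names are illustrative.

```python
# Component row counts per year, as listed in the ingest-state table.
COMPONENTS = {
    2017: [20_478, 10_793],         # contracte + initiere
    2018: [15_711, 2_172],          # contracte + initiere
    2019: [16_570],                 # contracte (T1 + T2dup + T4)
    2022: [24_677],                 # contracte
    2023: [25_793, 15_520, 5_684],  # contracte + initiere + atribuire-fara
    2024: [750],                    # contracte (PoC)
}
REPORTED_TOTAL = 138_148

per_year = {year: sum(parts) for year, parts in COMPONENTS.items()}
grand_total = sum(per_year.values())

# Fails loudly if a component count or the reported total is edited inconsistently.
assert grand_total == REPORTED_TOTAL, (grand_total, REPORTED_TOTAL)
```

A check of this kind flags a row count that is edited without updating its breakdown, or the reverse.
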
## Next-session work

1. **2020 + 2021 gap**: the TED supplement (`https://ted.europa.eu`) covers EU-threshold awards for these years. National-only awards are likely lost.
2. **Achizitii directe**: 8M rows, separate session; own ingest path with `type='da'`.
3. **2019 Q2**: scrape SEAP-WSP backwards or pull from individual `seapcerere` archives.
4. **2018 anunturi-initiere T2-T4**: broken on CKAN; ANAP RFE or SEAP-WSP scrape.
5. **CPV name lookup**: `cpv_code` is populated for 2017+; `cpv_name` needs a join via the `seap.cpv_codes` view.