# AFIR Historical Backfill: Plan & Status

## Current state (2026-05-09)

| source_year | rows      | distinct beneficiaries | sum UE (EUR)  | fund  |
|-------------|-----------|------------------------|---------------|-------|
| **2023**    | 474,720   | 320,230                | 1,411,870,796 | FEADR |
| **2024**    | 563,310   | 316,304                | 1,373,722,134 | FEADR |
| **Total**   | 1,037,930 | n/a                    | ~2.79 billion | FEADR |

Schema: `fonduri.afir_plati` (migration `017_fonduri_afir.sql`).
Importer: `cron/import-afir-historical.sh` + `scripts/import-afir-historical.py`.

## Source survey

### AFIR official portal: `https://www.afir.ro/rapoarte/beneficiari-de-fonduri-europene/`

Two complementary pages:

1. **`/date-deschise/`**: only the most recent two years are linked.
   - Currently exposes 2023 + 2024 for **FEADR (xlsx)** and 2023 + 2024 for **FEGA (rar)**.
2. **`/beneficiari-fega-si-feadr/`**: ASP.NET portal at `https://plati.afir.info/Plati/AfisareListaPlatii`. The year selector currently exposes **only 2023 and 2024**. The live query interface holds 3.7M records in total, but there is no programmatic XLSX dump older than 2023.

### data.gov.ro CKAN: searched `q=afir`, `q=fega`, `q=apia`, `q=feadr`

Findings (relevant package IDs only):

| Dataset | Files | Notes |
|---|---|---|
| `Date privind proiectele PNDR` (`a2884dcf-…`) | `proiectepndr2020.csv` (2014-2020), `proiectepndr2013.csv` (2007-2013) | **Project-level, not payment-level.** Useful for joining contracts/projects but does not replace the payment data (plati). Worth ingesting separately. |
| `Contracte AFIR` (`8845aa0d-…`) | `contracte-achizitii-publice-peste-5000-euro-2000.xlsx`, `centralizator-…2021_2022.xlsx` | Procurement contracts >5K EUR run by AFIR itself; not beneficiary payments. Different schema. |
| `Lista Fermierilor Campania APIA 2024` (`39e5465d-…`) | `lista-fermieri-apia-2024.xlsx` | One-off small dataset; APIA campaign list. |
| `Parcele Agricole APIA LPIS 2025` etc. | shapefiles (.zip) | Geographic parcels, not payments. Useful later for map overlays. |

**Conclusion**: data.gov.ro does **not** have `listaplati_2020/2021/2022_*` payment dumps. No other public source for them has been found so far.

### opendata.afir.info

A separate CKAN-style portal (`http://opendata.afir.info/`) lists `ProiectePNDR2020` (53K views), `ProiectePS2027`, and `AchizitiiPrivate2020`. The page itself does not expose direct download URLs without an account login. **Worth investigating in the next session**: it may contain the 2020-2022 payment data behind an export interface.

## Importer architecture

### Pipeline (FEADR XLSX)

```
AFIR XLSX ──curl──▶ satra:/tmp/afir-historical-{YEAR}-{FUND}/
        │
        ▼
openpyxl read_only (skips 9 banner rows)
        │
        ▼
pipe-delimited TSV (RO decimals "12.345,67" → "12345.67")
        │
        ▼
\copy → fonduri.staging_afir
        │
        ▼
DELETE FROM afir_plati WHERE source_year=YEAR (idempotent)
        │
        ▼
INSERT INTO afir_plati (source_year=YEAR, NULLIF + ::numeric casts)
```

### Why pipe delimiter

`Beneficiar` names contain commas (e.g. `"FULOP ZOLTAN, GERGELY"`), and `Obiectiv` contains both commas and quote characters. A pipe delimiter is safer than comma plus quoting, and the loader already replaces any literal `|` in the source text with `/` before serialization.

### Idempotency

`DELETE WHERE source_year = N` runs only on full ingests (not when `LIMIT` is set for smoke tests). Re-running for the same year is safe and produces consistent counts.

### Smoke test mode

```
./import-afir-historical.sh URL YEAR feadr 1000
```

The 4th argument (LIMIT) skips the DELETE step and truncates the TSV to N rows before COPY, so you can validate end to end without trampling production data.

## Next-session work

### 1. FEGA ingest (HIGHEST IMPACT, 30-60 min)

**Volume**: 2,476,897 rows in 2023 alone, ~580 MB CSV inside a 23 MB RAR.

**Source URLs**:

- 2023: `https://www.afir.ro/media/sxcnuvwc/listaplati_2023_fega_corectat.rar`
- 2024: `https://www.afir.ro/media/dqjddti2/lista-plati-beneificiari-fega-2024.rar`

**Schema differences vs FEADR XLSX** (column by column):

| FEADR XLSX (RO header) | FEGA CSV (concatenated header) | Notes |
|---|---|---|
| Numele beneficiarului | `DenumireBeneficiar` | same |
| Numele de familie | `NumeFamilie` | same |
| Denumirea societatii-mama si codul de inregistrare fiscala | `Cui` | **FEGA CSV exposes a real CUI column** (mostly empty for natural persons, populated for SRL/PFA; bonus enrichment vs FEADR XLSX) |
| Localitate | `Localicate` *(typo in source)* | same content |
| Codul masurii/tipului de interventie | `Masura` | same; FEGA codes look like `MICA` / scheme acronyms instead of `M 06` etc. |
| Obiectiv | `ObiectivSpecific` | longer descriptions |
| Data inceperii / Data incheierii | `DataIncepere` / `DataSfarsit` | usually empty |
| Cuantum {Operatiune,Total} {FEGA,FEADR} | same 4 columns | **decimals already in `.` format** (English locale, no comma swap needed) |
| Cuantum aferent operatiunii | `CuantumAferentOperatiune` | same |
| Cuantum total cofinantare beneficiari | `CuantumTotalCofinantareBeneficiar` | same |
| Cuantum total UE Beneficiar | `CuantumtotalUEBenefeciar` *(typo in source)* | same |

**Implementation choices**:

Option A: **augment `afir_plati` with a `tip_fond` discriminator**. Add `ALTER TABLE fonduri.afir_plati ADD COLUMN tip_fond text CHECK (tip_fond IN ('FEADR','FEGA'));` and re-tag existing rows as `'FEADR'`. The importer writes both funds, and downstream queries stay uniform.

Option B: **separate table `fonduri.fega_plati`**. FEGA has a different cardinality (5x rows) and a different measure-code namespace, and some queries naturally separate by fund. However, this duplicates the index/MV maintenance burden.

**Recommendation: Option A**. The schema is identical; the differences are limited to the namespace of the measure codes.
A single discriminator keeps things simple, fits the existing `gin_trgm` name index, and lets the recipe code filter with `WHERE tip_fond='FEGA'` cheaply (add a b-tree index on `tip_fond` if needed).

**FEGA importer changes vs current FEADR script**:

1. Download, then `unrar x` (`unrar` is already installed on satra: `apt install unrar` was run).
2. New Python normalizer `import-afir-historical-fega.py`: reads CSV instead of XLSX, remaps column names, and does *no* RO-decimal swap.
3. Pass a new `FUND=fega` flag so the script writes `tip_fond='FEGA'` and uses the CSV path.
4. **Cui column passthrough**: write directly into the existing `cui` column when non-empty, with `cui_match_method='afir_self_reported'` and `cui_match_score=1.0`. Skip the fuzzy matcher for these rows.

**Volume budget**: 2.48M rows × 2 years = ~5M rows. The same staging table works (TRUNCATE between runs). Postgres COPY at ~100K rows/s gives ~25s per year for COPY, plus ~60s for INSERT. Total ~5 min per year.

### 2. Historical FEADR 2020/2021/2022 (BLOCKED on source)

Status: **not publicly available.** Investigation outcome:

- AFIR `/date-deschise/` page shows only 2023+2024.
- `plati.afir.info` portal shows only 2023+2024.
- data.gov.ro CKAN has no `listaplati_` resources.

**Options to unblock** (in order of cost):

1. **Email AFIR directly**: write to `comunicare@afir.info` and request the historical payment lists for 2020-2022 under Law 544/2001 (FOIA equivalent). They are legally obligated to provide them. Expected: 2-4 week response.
2. **Wayback Machine archive**: check `https://web.archive.org/web/2023*/afir.ro/rapoarte/beneficiari-de-fonduri-europene/date-deschise/` for snapshots that still link to old XLSX files. URLs may still resolve (AFIR media folder is content-addressed: `/media//file.xlsx`).
3. **opendata.afir.info account**: the dataset titles `AchizitiiPrivate2020` and `ProiectePNDR2020` suggest historical exports may live here, but the download interface needs a login. Apply for an open-data access account.

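
The Wayback Machine option can be explored programmatically through the Internet Archive's CDX API, which lists archived captures under a URL prefix. The following is a minimal sketch under stated assumptions, not part of the current importer: the function names, the `www.afir.ro/media/` prefix, and the year range are all illustrative, and the sample capture used in the usage note is made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url_prefix, from_year, to_year):
    """Build a CDX query that lists archived captures under url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",       # match everything below the prefix
        "from": str(from_year),
        "to": str(to_year),
        "output": "json",
        "filter": "statuscode:200",  # only captures that were served successfully
        "collapse": "urlkey",        # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(payload):
    """Turn CDX JSON (first row holds the field names) into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def payment_file_candidates(captures, suffixes=(".xlsx", ".rar")):
    """Keep captures whose original URL looks like a downloadable payment dump."""
    return [c for c in captures if c["original"].lower().endswith(suffixes)]


def snapshot_url(capture):
    """Build the Wayback replay URL for one capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


def fetch_captures(url_prefix, from_year, to_year, timeout=30):
    """Run the query over the network (assumed prefix, e.g. 'www.afir.ro/media/')."""
    with urlopen(build_cdx_url(url_prefix, from_year, to_year), timeout=timeout) as resp:
        return parse_cdx_json(resp.read().decode("utf-8"))
```

A call such as `payment_file_candidates(fetch_captures("www.afir.ro/media/", 2021, 2023))` would return the archived `.xlsx` and `.rar` captures, if any exist; each candidate still has to be opened to confirm it is a payment list for the missing years.
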

**Estimated row counts when obtained**: ~450K-500K per year (extrapolating from 2023 = 475K and 2024 = 563K).

### 3. APIA-specific datasets (LOWER PRIORITY)

`Lista Fermierilor Campania APIA 2024` (small file, ~50K rows expected). This is a *subset* of FEGA payments (only certain campaigns), so once FEGA 2024 is ingested, this dataset is partially redundant. It is worth ingesting into a separate `fonduri.apia_fermieri` table only if it carries the geographic columns (parcel codes) that the FEGA dump lacks.

Geographic LPIS shapefiles (`Parcele Agricole APIA LPIS 2025`, `Categorii de Folosință`) are **map data**, not payment data. Defer them until we add map overlays to the `/achizitii/firma/[cui]` profile pages.

## Files modified/added in this session

- **NEW** `services/seap-scraper/scripts/import-afir-historical.py`: XLSX→TSV normalizer
- **NEW** `services/seap-scraper/cron/import-afir-historical.sh`: orchestrator
- **NEW** `services/seap-scraper/AFIR-HISTORICAL-PLAN.md` (this file)

`fonduri.afir_plati` schema unchanged: no migration. The DELETE+INSERT flow uses the existing table as-is. Adding the `tip_fond` discriminator is a follow-up migration for when the FEGA ingest is implemented.
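
For reference when writing the FEGA normalizer, the two FEADR normalization rules described under "Importer architecture" (RO decimal conversion and replacing literal `|` with `/`) can be sketched as below. This is an illustrative sketch, not the contents of `scripts/import-afir-historical.py`: the function names are hypothetical, cell values are assumed to arrive as strings, and flattening embedded newlines is an added assumption so that each record stays on one line.

```python
from decimal import Decimal


def ro_decimal_to_plain(value):
    """Convert a Romanian-formatted number such as "12.345,67" to "12345.67".

    Empty input stays empty so the SQL side can apply NULLIF before casting.
    """
    text = value.strip()
    if not text:
        return ""
    plain = text.replace(".", "").replace(",", ".")
    Decimal(plain)  # raises decimal.InvalidOperation if the result is not numeric
    return plain


def sanitize_field(value):
    """Make a text cell safe for a pipe-delimited file."""
    text = str(value).replace("|", "/")
    # Assumption: embedded line breaks are flattened to spaces.
    return text.replace("\r", " ").replace("\n", " ").strip()


def to_pipe_row(fields, numeric_indexes):
    """Serialize one source row; numeric_indexes marks the RO-decimal columns."""
    out = []
    for index, field in enumerate(fields):
        text = "" if field is None else sanitize_field(field)
        if index in numeric_indexes:
            text = ro_decimal_to_plain(text)
        out.append(text)
    return "|".join(out)
```

For the FEGA CSV, the same `to_pipe_row` could be called with an empty `numeric_indexes` set, since its amounts are already in `.` format.
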