# AFIR Historical Backfill — Plan & Status

## Current state (2026-05-09)

| source_year | rows      | distinct beneficiaries | sum UE (EUR)  | fund  |
|-------------|-----------|------------------------|---------------|-------|
| **2023**    | 474,720   | 320,230                | 1,411,870,796 | FEADR |
| **2024**    | 563,310   | 316,304                | 1,373,722,134 | FEADR |
| **Total**   | 1,038,030 | n/a                    | ~2.79 bn EUR  | FEADR |

Schema: `fonduri.afir_plati` (migration `017_fonduri_afir.sql`).
Importer: `cron/import-afir-historical.sh` + `scripts/import-afir-historical.py`.

## Source survey

### AFIR official portal — `https://www.afir.ro/rapoarte/beneficiari-de-fonduri-europene/`

Two complementary pages:

1. **`/date-deschise/`** — only the most recent two years are linked.
   - Currently exposes 2023 + 2024 for **FEADR (xlsx)** and 2023 + 2024 for **FEGA (rar)**.
2. **`/beneficiari-fega-si-feadr/`** — ASP.NET portal at
   `https://plati.afir.info/Plati/AfisareListaPlatii`. The year selector
   currently exposes **only 2023 and 2024**. The live query interface reports
   3.7M records in total, but offers no downloadable XLSX dump older than 2023.

### data.gov.ro CKAN — searched `q=afir`, `q=fega`, `q=apia`, `q=feadr`

Findings (relevant package IDs only):

| Dataset | URL | Notes |
|---|---|---|
| `Date privind proiectele PNDR` (`a2884dcf-…`) | `proiectepndr2020.csv` (2014-2020), `proiectepndr2013.csv` (2007-2013) | **Project-level, not payment-level.** Useful for joining contracts/projects but does not replace plati. Worth ingesting separately. |
| `Contracte AFIR` (`8845aa0d-…`) | `contracte-achizitii-publice-peste-5000-euro-2000.xlsx`, `centralizator-…2021_2022.xlsx` | Procurement contracts >5K EUR run by AFIR itself; not beneficiary payments. Different schema. |
| `Lista Fermierilor Campania APIA 2024` (`39e5465d-…`) | `lista-fermieri-apia-2024.xlsx` | One-off small dataset; APIA campaign list. |
| `Parcele Agricole APIA LPIS 2025` etc. | shapefiles (.zip) | Geographic parcels, not payments. Useful later for map overlays. |

**Conclusion**: data.gov.ro does **not** have `listaplati_2020/2021/2022_*` payment dumps. No public source for them was found in this survey.
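To make the survey repeatable, the four searches can be scripted. The sketch below assumes data.gov.ro exposes the standard CKAN action API under `/api/3/action/package_search` and returns the usual `result.results[].resources[]` shape; the helper names and the `listaplati` needle are illustrative, not part of the existing importer. It only builds URLs and filters an already-fetched response, so the HTTP call itself is left to the caller.

```python
from urllib.parse import urlencode

# Assumption: data.gov.ro serves the standard CKAN action API at this path.
CKAN_API = "https://data.gov.ro/api/3/action/package_search"


def build_search_url(query, rows=100):
    """Build a CKAN package_search URL for one query term."""
    return f"{CKAN_API}?{urlencode({'q': query, 'rows': rows})}"


def find_payment_dumps(response, needle="listaplati"):
    """Return (package title, resource URL) pairs whose URL contains `needle`."""
    hits = []
    for package in response.get("result", {}).get("results", []):
        for resource in package.get("resources", []):
            url = resource.get("url") or ""
            if needle in url.lower():
                hits.append((package.get("title"), url))
    return hits


# The same four queries used in the survey above.
SEARCH_URLS = [build_search_url(q) for q in ("afir", "fega", "apia", "feadr")]
```

An empty list from `find_payment_dumps` for every query would reproduce the conclusion above.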

### opendata.afir.info

A separate CKAN-style portal (`http://opendata.afir.info/`) lists `ProiectePNDR2020` (53K views), `ProiectePS2027`, `AchizitiiPrivate2020`. The page itself doesn't expose direct download URLs without account login. **Worth investigating in next session** — it may contain the 2020-2022 payment data behind an export interface.

## Importer architecture

### Pipeline (FEADR XLSX)

```
AFIR XLSX  ──curl──▶  satra:/tmp/afir-historical-{YEAR}-{FUND}/
            │
            ▼
       openpyxl read_only  (skips 9 banner rows)
            │
            ▼
       pipe-delimited TSV  (RO decimals "12.345,67" → "12345.67")
            │
            ▼
       \copy → fonduri.staging_afir
            │
            ▼
       DELETE FROM afir_plati WHERE source_year=YEAR  (idempotent)
            │
            ▼
       INSERT INTO afir_plati (source_year=YEAR, NULLIF + ::numeric casts)
```
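The decimal conversion step in the diagram can be sketched as follows. This is a minimal illustration, not the code in `scripts/import-afir-historical.py`; the function name is hypothetical, and it assumes that cells openpyxl already parsed as numbers should pass through untouched while only text cells carry the Romanian format.

```python
def ro_decimal_to_plain(value):
    """Convert a Romanian-formatted amount such as "12.345,67" to "12345.67".

    Returns "" for empty input so that NULLIF can turn it into NULL later.
    """
    if value is None:
        return ""
    if isinstance(value, (int, float)):
        # Assumption: numeric cells are already locale-free; do not strip dots.
        return str(value)
    text = str(value).strip()
    if not text:
        return ""
    # Drop thousands separators first, then turn the decimal comma into a dot.
    return text.replace(".", "").replace(",", ".")
```

The order of the two replacements matters: swapping the comma first would make the decimal point indistinguishable from a thousands separator.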

### Why pipe delimiter

Beneficiary names contain commas (`"FULOP ZOLTAN, GERGELY"`), and `Obiectiv`
contains both commas and quote characters. Pipe is safer than comma plus quoting,
and the loader already replaces any literal `|` in source text with `/` before serialization.
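A minimal sketch of that serialization rule, with a hypothetical `serialize_row` helper. The pipe-to-slash replacement is the behaviour described above; collapsing embedded newlines and tabs into single spaces is an extra assumption added here so that each record stays on one line.

```python
def serialize_row(cells):
    """Join one row into a pipe-delimited line, neutralising literal pipes."""
    cleaned = []
    for cell in cells:
        text = "" if cell is None else str(cell)
        # Assumption: collapse newlines/tabs so one record is always one line.
        text = " ".join(text.split())
        # Literal pipes in source text would shift columns, so swap them out.
        cleaned.append(text.replace("|", "/"))
    return "|".join(cleaned)


line = serialize_row(["FULOP ZOLTAN, GERGELY", 'Obiectiv "A|B"', None])
```

Commas and quote characters survive unchanged, which is the point of choosing the pipe as the delimiter.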

### Idempotency

`DELETE WHERE source_year = N` runs only on full ingests (not when
`LIMIT` is set for smoke tests). Re-running for the same year is safe and
produces consistent counts.
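The delete-then-insert pattern can be demonstrated in isolation. The sketch below uses an in-memory SQLite table as a stand-in for `fonduri.afir_plati`; the column names other than `source_year` are hypothetical, and wrapping both statements in one transaction is an assumption about how the real script behaves.

```python
import sqlite3


def reload_year(conn, source_year, rows):
    """Replace all rows for one year; running it twice gives the same result."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM afir_plati WHERE source_year = ?", (source_year,))
        conn.executemany(
            "INSERT INTO afir_plati (source_year, beneficiar, suma_ue) VALUES (?, ?, ?)",
            [(source_year, name, amount) for name, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE afir_plati (source_year INTEGER, beneficiar TEXT, suma_ue REAL)")

sample = [("A", 10.0), ("B", 20.5)]
reload_year(conn, 2023, sample)
reload_year(conn, 2023, sample)  # second full run replaces rows, no duplicates
reload_year(conn, 2024, sample)  # other years are untouched by the DELETE

count_2023 = conn.execute(
    "SELECT COUNT(*) FROM afir_plati WHERE source_year = 2023"
).fetchone()[0]
```

Scoping the DELETE to `source_year` is what lets years be reloaded independently.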

### Smoke test mode

```
./import-afir-historical.sh URL YEAR feadr 1000
```

The 4th arg (LIMIT) skips the DELETE step and truncates the TSV to N rows
before COPY, so you can validate end-to-end without trampling production
data.
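The two effects of the LIMIT argument can be expressed as a pair of small helpers. These names are hypothetical and the sketch only mirrors the behaviour described above; it does not reproduce the shell script.

```python
from itertools import islice


def should_delete_existing(limit=None):
    """The DELETE step runs only on full ingests, never in smoke-test mode."""
    return limit is None


def smoke_slice(lines, limit=None):
    """Yield every line for a full ingest, or only the first `limit` lines."""
    return iter(lines) if limit is None else islice(lines, limit)


# Example: a 1000-row smoke test over a larger stream of serialized rows.
preview = list(smoke_slice((f"row-{i}" for i in range(5000)), limit=1000))
```

Using `islice` keeps the truncation lazy, so the full file never has to be normalised for a smoke test.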

## Next-session work

### 1. FEGA ingest (HIGHEST IMPACT, 30-60 min)

**Volume**: 2,476,897 rows in 2023 alone, ~580 MB CSV inside 23 MB RAR.
**Source URLs**:
- 2023: `https://www.afir.ro/media/sxcnuvwc/listaplati_2023_fega_corectat.rar`
- 2024: `https://www.afir.ro/media/dqjddti2/lista-plati-beneificiari-fega-2024.rar`

**Schema differences vs FEADR XLSX** (column-by-column):

| FEADR XLSX (RO header) | FEGA CSV (concatenated header names) | Notes |
|---|---|---|
| Numele beneficiarului | `DenumireBeneficiar` | same |
| Numele de familie | `NumeFamilie` | same |
| Denumirea societatii-mama si codul de inregistrare fiscala | `Cui` | **FEGA CSV exposes a real CUI column** (mostly empty for natural persons, populated for SRL/PFA — bonus enrichment vs FEADR XLSX) |
| Localitate | `Localicate` *(typo in source)* | same content |
| Codul masurii/tipului de interventie | `Masura` | same; FEGA codes look like `MICA` / scheme acronyms instead of `M 06` etc |
| Obiectiv | `ObiectivSpecific` | longer descriptions |
| Data inceperii / Data incheierii | `DataIncepere` / `DataSfarsit` | usually empty |
| Cuantum {Operatiune,Total} {FEGA,FEADR} | same 4 columns | **decimals already in `.` format** (English-locale, no comma swap needed) |
| Cuantum aferent operatiunii | `CuantumAferentOperatiune` | same |
| Cuantum total cofinantare beneficiari | `CuantumTotalCofinantareBeneficiar` | same |
| Cuantum total UE Beneficiar | `CuantumtotalUEBenefeciar` *(typo in source)* | same |
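The table translates into a header remapping for the planned CSV normalizer. In the sketch below the left-hand keys are the FEGA headers listed above (typos included); the right-hand canonical names, the comma delimiter and the function name are all assumptions, and the four `Cuantum {Operatiune,Total} {FEGA,FEADR}` columns are left out because their exact CSV header names are not recorded here.

```python
import csv
import io

# FEGA CSV header -> canonical name (canonical names are hypothetical).
FEGA_TO_CANONICAL = {
    "DenumireBeneficiar": "nume_beneficiar",
    "NumeFamilie": "nume_familie",
    "Cui": "cui",
    "Localicate": "localitate",  # typo is in the source header
    "Masura": "cod_masura",
    "ObiectivSpecific": "obiectiv",
    "DataIncepere": "data_incepere",
    "DataSfarsit": "data_sfarsit",
    "CuantumAferentOperatiune": "cuantum_operatiune",
    "CuantumTotalCofinantareBeneficiar": "cuantum_cofinantare",
    "CuantumtotalUEBenefeciar": "cuantum_total_ue",  # typo is in the source header
}


def remap_fega_rows(handle, delimiter=","):
    """Yield dicts keyed by canonical names; fail fast if a header is missing."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = set(FEGA_TO_CANONICAL) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"FEGA CSV is missing columns: {sorted(missing)}")
    for row in reader:
        yield {
            canonical: (row[source] or "").strip()
            for source, canonical in FEGA_TO_CANONICAL.items()
        }


sample = io.StringIO(
    ",".join(FEGA_TO_CANONICAL) + "\n"
    "ACME SRL,,123,Cluj,MICA,Obj,,,10.50,0,10.50\n"
)
rows = list(remap_fega_rows(sample))
```

Failing on a missing header is a deliberate choice in this sketch: if AFIR fixes one of the typos in a later release, the importer should stop rather than silently load empty columns.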

**Implementation choices**:

Option A — **augment afir_plati with `tip_fond` discriminator**.
Add `ALTER TABLE fonduri.afir_plati ADD COLUMN tip_fond text CHECK (tip_fond IN ('FEADR','FEGA'));`
Re-tag existing rows as `'FEADR'`. Importer writes both. Uniform downstream query.

Option B — **separate table `fonduri.fega_plati`**.
Different cardinality (5x rows), different measure code namespace; some
queries naturally separate. But duplicates the index/MV maintenance burden.

**Recommendation: Option A**. The schemas are nearly identical: the main
differences are the measure-code namespace and the extra `Cui` column, which
maps onto the existing `cui` column. A single discriminator keeps things simple,
fits the existing `gin_trgm` name index, and lets the recipe code filter with
`WHERE tip_fond='FEGA'` cheaply (add a b-tree index on `tip_fond` if needed).

**FEGA importer changes vs current FEADR script**:
1. Download → `unrar x` (`unrar` is already installed on satra via `apt install unrar`).
2. New python normalizer `import-afir-historical-fega.py` — reads CSV not XLSX; column-name remapping; *no* RO-decimal swap.
3. Pass new `FUND=fega` flag → script writes `tip_fond='FEGA'` and uses CSV path.
4. **Cui column passthrough** — write directly into the existing `cui` column
   when non-empty, with `cui_match_method='afir_self_reported'` and
   `cui_match_score=1.0`. Skip fuzzy matcher for these.
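The passthrough rule in step 4 could look like the sketch below. The `cui`, `cui_match_method` and `cui_match_score` names come from the plan above; the function name and the `needs_fuzzy_match` flag are hypothetical and only illustrate how rows might be routed past the fuzzy matcher.

```python
def apply_cui_passthrough(row):
    """Tag rows that carry a self-reported CUI; leave the rest for fuzzy matching."""
    cui = (row.get("cui") or "").strip()
    if cui:
        return {
            **row,
            "cui": cui,
            "cui_match_method": "afir_self_reported",
            "cui_match_score": 1.0,
            "needs_fuzzy_match": False,  # hypothetical routing flag
        }
    # Empty CUI (typical for natural persons): keep it NULL and match later.
    return {
        **row,
        "cui": None,
        "cui_match_method": None,
        "cui_match_score": None,
        "needs_fuzzy_match": True,
    }


tagged = apply_cui_passthrough({"nume_beneficiar": "ACME SRL", "cui": " 123456 "})
untagged = apply_cui_passthrough({"nume_beneficiar": "POPESCU ION", "cui": ""})
```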

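**Volume budget**: 2.48M rows × 2 years ≈ 5M rows. The same staging table
works (TRUNCATE between runs). Postgres COPY at ~100K rows/s takes ~25 s per
year, plus ~60 s for INSERT, so ~1.5 min of database time per year; the ~5 min
per-year total leaves room for download, extraction and normalization.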
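The arithmetic behind the budget, using only the figures quoted above (the 100K rows/s COPY rate and the 60 s INSERT time are the plan's own estimates, not measurements):

```python
ROWS_PER_YEAR = 2_476_897       # FEGA 2023 row count
COPY_ROWS_PER_SECOND = 100_000  # estimated COPY throughput
INSERT_SECONDS = 60             # estimated INSERT time per year

copy_seconds = ROWS_PER_YEAR / COPY_ROWS_PER_SECOND
db_seconds = copy_seconds + INSERT_SECONDS

print(f"COPY: ~{copy_seconds:.0f} s, COPY + INSERT: ~{db_seconds:.0f} s per year")
```

The two database steps account for roughly a minute and a half, so the rest of the per-year figure would have to come from steps that are not itemised here, such as download, `unrar` and CSV normalization.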

### 2. Historical FEADR 2020/2021/2022 (BLOCKED on source)

Status: **not publicly available.**

Investigation outcome:
- AFIR `/date-deschise/` page shows only 2023+2024.
- `plati.afir.info` portal shows only 2023+2024.
- data.gov.ro CKAN has no `listaplati_<year>` resources.

**Options to unblock** (in order of cost):

1. **Email AFIR direct** — `comunicare@afir.info` and request the historical
   payment lists 2020-2022 under Law 544/2001 (FOIA equivalent). AFIR is
   legally obligated to respond. Expected: 2-4 week response.
2. **Wayback Machine archive** — check
   `https://web.archive.org/web/2023*/afir.ro/rapoarte/beneficiari-de-fonduri-europene/date-deschise/`
   for snapshots that still link to old XLSX files. The URLs may still resolve,
   since AFIR media files live under opaque per-file paths (`/media/<hash>/file.xlsx`)
   that do not depend on the listing page.
3. **opendata.afir.info account** — the dataset titles `AchizitiiPrivate2020`,
   `ProiectePNDR2020` suggest historical exports may live here, but the
   download interface needs login. Apply for an open-data access account.
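For option 2, the Wayback check can be scripted against the public CDX index instead of browsing snapshots by hand. The sketch below assumes the CDX endpoint and its `matchType`, `from`, `to`, `filter` and `collapse` parameters behave as publicly documented, and that JSON output has a header row followed by data rows; the helper names are hypothetical. It builds the query and decodes a response but leaves the HTTP request to the caller.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, year_from, year_to):
    """Build a CDX query listing archived URLs under a prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",       # everything below the prefix
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def cdx_rows_to_records(payload):
    """Turn CDX JSON (header row + data rows) into a list of dicts."""
    if not payload:
        return []
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def snapshot_url(timestamp, original):
    """URL of the raw archived file (the `id_` suffix requests unmodified content)."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


query = build_cdx_query("afir.ro/media/", 2020, 2023)
```

Filtering the decoded records for `listaplati` in the `original` field would show whether any 2020-2022 payment files were ever captured.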

**Estimated row counts when obtained**: ~450K-500K per year (extrapolating
from 2023 = 475K and 2024 = 563K).

### 3. APIA-specific datasets (LOWER PRIORITY)

`Lista Fermierilor Campania APIA 2024` (small file, ~50K rows expected).
This is a *subset* of FEGA payments (only certain campaigns), so once FEGA
2024 is ingested, this dataset is partially redundant. Worth ingesting
into a separate `fonduri.apia_fermieri` table only if it carries the
geographic columns (parcel codes) the FEGA dump lacks.

Geographic LPIS shapefiles (`Parcele Agricole APIA LPIS 2025`,
`Categorii de Folosință`) are **map data**, not payment data — defer to
when we add map overlays to `/achizitii/firma/[cui]` profile pages.

## Files modified/added in this session

- **NEW** `services/seap-scraper/scripts/import-afir-historical.py` — XLSX→TSV normalizer
- **NEW** `services/seap-scraper/cron/import-afir-historical.sh` — orchestrator
- **NEW** `services/seap-scraper/AFIR-HISTORICAL-PLAN.md` (this file)

`fonduri.afir_plati` schema unchanged — no migration. The DELETE+INSERT
flow uses the existing table as-is. Adding the `tip_fond` discriminator is
a follow-up migration, to be done when FEGA ingest is implemented.