# AFIR Historical Backfill — Plan & Status
## Current state (2026-05-09)
| source_year | rows | distinct beneficiaries | sum UE (EUR) | fund |
|-------------|---------|------------------------|-----------------|-------|
| **2023** | 474,720 | 320,230 | 1,411,870,796 | FEADR |
| **2024** | 563,310 | 316,304 | 1,373,722,134 | FEADR |
| **Total** | 1,038,030 | — | ~2.79 bn EUR | FEADR |
Schema: `fonduri.afir_plati` (migration `017_fonduri_afir.sql`).
Importer: `cron/import-afir-historical.sh` + `scripts/import-afir-historical.py`.
## Source survey
### AFIR official portal — `https://www.afir.ro/rapoarte/beneficiari-de-fonduri-europene/`
Two complementary pages:
1. **`/date-deschise/`** — only the most recent two years are linked.
- Currently exposes 2023 + 2024 for **FEADR (xlsx)** and 2023 + 2024 for **FEGA (rar)**.
2. **`/beneficiari-fega-si-feadr/`** — ASP.NET portal at
`https://plati.afir.info/Plati/AfisareListaPlatii`. Year selector
currently exposes **only 2023 and 2024**. The live query interface reports
3.7M total records, but there is no programmatic XLSX dump older than 2023.
### data.gov.ro CKAN — searched `q=afir`, `q=fega`, `q=apia`, `q=feadr`
Findings (relevant package IDs only):
| Dataset | URL | Notes |
|---|---|---|
| `Date privind proiectele PNDR` (`a2884dcf-…`) | `proiectepndr2020.csv` (2014-2020), `proiectepndr2013.csv` (2007-2013) | **Project-level, not payment-level.** Useful for joining contracts/projects but does not replace plati. Worth ingesting separately. |
| `Contracte AFIR` (`8845aa0d-…`) | `contracte-achizitii-publice-peste-5000-euro-2000.xlsx`, `centralizator-…2021_2022.xlsx` | Procurement contracts >5K EUR run by AFIR itself; not beneficiary payments. Different schema. |
| `Lista Fermierilor Campania APIA 2024` (`39e5465d-…`) | `lista-fermieri-apia-2024.xlsx` | One-off small dataset; APIA campaign list. |
| `Parcele Agricole APIA LPIS 2025` etc. | shapefiles (.zip) | Geographic parcels, not payments. Useful later for map overlays. |
**Conclusion**: data.gov.ro does **not** have `listaplati_2020/2021/2022_*` payment dumps. No public source for them was found in this survey.
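The survey above could be repeated with a small script. This is a minimal sketch, assuming data.gov.ro exposes the standard CKAN `package_search` action at the usual path; the endpoint constant and the `listaplati_20(20|21|22)` filename pattern are assumptions for illustration, not confirmed names.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the standard CKAN action API path on data.gov.ro.
CKAN_BASE = "https://data.gov.ro/api/3/action/package_search"
QUERIES = ["afir", "fega", "apia", "feadr"]
# Hypothetical filename pattern for the missing 2020-2022 payment dumps.
PAYMENT_DUMP = re.compile(r"listaplati_20(20|21|22)", re.IGNORECASE)


def search_url(query, rows=100):
    """Build a package_search URL for one query term."""
    return f"{CKAN_BASE}?{urlencode({'q': query, 'rows': rows})}"


def payment_resources(response):
    """Yield (package title, resource URL) for resources that look like payment dumps."""
    for package in response["result"]["results"]:
        for resource in package.get("resources", []):
            url = resource.get("url") or ""
            if PAYMENT_DUMP.search(url):
                yield package["title"], url


def survey():
    """Run all four searches. Requires network access; not called here."""
    hits = []
    for query in QUERIES:
        with urlopen(search_url(query), timeout=30) as fh:
            hits.extend(payment_resources(json.load(fh)))
    return hits
```

An empty list from `survey()` would match the conclusion above; a non-empty one would mean a dump has been published since.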
### opendata.afir.info
A separate CKAN-style portal (`http://opendata.afir.info/`) lists `ProiectePNDR2020` (53K views), `ProiectePS2027`, `AchizitiiPrivate2020`. The page itself doesn't expose direct download URLs without account login. **Worth investigating in next session** — it may contain the 2020-2022 payment data behind an export interface.
## Importer architecture
### Pipeline (FEADR XLSX)
```
AFIR XLSX ──curl──▶ satra:/tmp/afir-historical-{YEAR}-{FUND}/
openpyxl read_only (skips 9 banner rows)
pipe-delimited TSV (RO decimals "12.345,67" → "12345.67")
\copy → fonduri.staging_afir
DELETE FROM afir_plati WHERE source_year=YEAR (idempotent)
INSERT INTO afir_plati (source_year=YEAR, NULLIF + ::numeric casts)
```
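The decimal conversion step in the pipeline can be sketched as below. This is an illustration, not the code in `import-afir-historical.py`; the function name is hypothetical, and it assumes amounts arrive as text in Romanian format (`.` as thousands separator, `,` as decimal mark).

```python
def ro_decimal_to_plain(value: str) -> str:
    """Convert a Romanian-formatted number ("12.345,67") to "12345.67".

    Empty or whitespace-only input becomes "" so the SQL side can NULLIF it.
    """
    text = value.strip()
    if not text:
        return ""
    # Drop thousands separators first, then turn the decimal comma into a dot.
    return text.replace(".", "").replace(",", ".")


print(ro_decimal_to_plain("12.345,67"))
```

Cells that openpyxl already returns as numbers would not need this step; the sketch only covers the string case.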
### Why pipe delimiter
Beneficiary names contain commas (`"FULOP ZOLTAN, GERGELY"`), and Obiectiv
contains both `,` and quote characters. Pipe is safer than comma plus quoting,
and the loader already replaces any literal `|` in source text with `/` before serialization.
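A minimal sketch of that serialization step follows. The function name is hypothetical, and flattening embedded newlines to spaces is an added assumption that the text above does not mention.

```python
def to_pipe_line(fields):
    """Join one row into a pipe-delimited line, replacing literal pipes."""
    cleaned = []
    for field in fields:
        text = "" if field is None else str(field)
        # A literal "|" would shift every later column, so swap it for "/".
        # Assumption: embedded line breaks are flattened to spaces.
        text = text.replace("|", "/").replace("\r", " ").replace("\n", " ")
        cleaned.append(text.strip())
    return "|".join(cleaned)


row = ["FULOP ZOLTAN, GERGELY", 'Obiectiv "A|B", x', None]
print(to_pipe_line(row))
```

Commas and quote characters pass through untouched, which is the point of choosing the pipe: no quoting layer is needed on either side of the load.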
### Idempotency
`DELETE WHERE source_year = N` runs only on full ingests (not when
`LIMIT` is set for smoke tests). Re-running for the same year is safe and
produces consistent counts.
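The delete-then-insert pattern for full ingests can be demonstrated with an in-memory SQLite table standing in for Postgres. The three-column table and the helper names are simplified assumptions; the real `fonduri.afir_plati` has more columns and is loaded from the staging table.

```python
import sqlite3


def load_year(conn, year, rows):
    """Replace all rows for one source_year inside a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM afir_plati WHERE source_year = ?", (year,))
        conn.executemany(
            "INSERT INTO afir_plati (source_year, beneficiar, amount) VALUES (?, ?, ?)",
            [(year, name, amount) for name, amount in rows],
        )


def count_year(conn, year):
    cur = conn.execute(
        "SELECT COUNT(*) FROM afir_plati WHERE source_year = ?", (year,)
    )
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE afir_plati (source_year INTEGER, beneficiar TEXT, amount REAL)"
)
rows_2023 = [("A", 10.0), ("B", 20.0), ("C", 30.0)]
load_year(conn, 2023, rows_2023)
load_year(conn, 2023, rows_2023)  # re-run for the same year: no duplicates
print(count_year(conn, 2023))
```

Because the DELETE is scoped to one `source_year`, reloading 2023 leaves rows for other years untouched.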
### Smoke test mode
```
./import-afir-historical.sh URL YEAR feadr 1000
```
The 4th arg (LIMIT) skips the DELETE step and truncates the TSV to N rows
before COPY, so you can validate end-to-end without trampling production
data.
## Next-session work
### 1. FEGA ingest (HIGHEST IMPACT, 30-60 min)
**Volume**: 2,476,897 rows in 2023 alone, ~580 MB CSV inside 23 MB RAR.
**Source URLs**:
- 2023: `https://www.afir.ro/media/sxcnuvwc/listaplati_2023_fega_corectat.rar`
- 2024: `https://www.afir.ro/media/dqjddti2/lista-plati-beneificiari-fega-2024.rar`
**Schema differences vs FEADR XLSX** (column-by-column):
| FEADR XLSX (RO header) | FEGA CSV (concat header) | Notes |
|---|---|---|
| Numele beneficiarului | `DenumireBeneficiar` | same |
| Numele de familie | `NumeFamilie` | same |
| Denumirea societatii-mama si codul de inregistrare fiscala | `Cui` | **FEGA CSV exposes a real CUI column** (mostly empty for natural persons, populated for SRL/PFA — bonus enrichment vs FEADR XLSX) |
| Localitate | `Localicate` *(typo in source)* | same content |
| Codul masurii/tipului de interventie | `Masura` | same; FEGA codes look like `MICA` / scheme acronyms instead of `M 06` etc |
| Obiectiv | `ObiectivSpecific` | longer descriptions |
| Data inceperii / Data incheierii | `DataIncepere` / `DataSfarsit` | usually empty |
| Cuantum {Operatiune,Total} {FEGA,FEADR} | same 4 columns | **decimals already in `.` format** (English-locale, no comma swap needed) |
| Cuantum aferent operatiunii | `CuantumAferentOperatiune` | same |
| Cuantum total cofinantare beneficiari | `CuantumTotalCofinantareBeneficiar` | same |
| Cuantum total UE Beneficiar | `CuantumtotalUEBenefeciar` *(typo in source)* | same |
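The column remapping in the table could be expressed as a lookup applied while reading the CSV. This is a sketch under assumptions: the right-hand target names are hypothetical (the real `fonduri.afir_plati` column names are not listed here), the CSV delimiter is assumed to be a comma, and the four per-fund Cuantum columns are left out because their exact CSV header names are not given.

```python
import csv
import io

# FEGA CSV header -> target column. Target names are hypothetical.
FEGA_HEADER_MAP = {
    "DenumireBeneficiar": "beneficiar",
    "NumeFamilie": "nume_familie",
    "Cui": "cui",
    "Localicate": "localitate",  # typo is in the source header
    "Masura": "masura",
    "ObiectivSpecific": "obiectiv",
    "DataIncepere": "data_incepere",
    "DataSfarsit": "data_sfarsit",
    "CuantumAferentOperatiune": "cuantum_operatiune",
    "CuantumTotalCofinantareBeneficiar": "cuantum_cofinantare",
    "CuantumtotalUEBenefeciar": "cuantum_total_ue",  # typo is in the source header
}


def remap_fega_rows(csv_text, delimiter=","):
    """Yield dicts keyed by target column name, failing fast on header drift."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    missing = set(FEGA_HEADER_MAP) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"unexpected FEGA header, missing: {sorted(missing)}")
    for row in reader:
        yield {
            target: (row[source] or "").strip()
            for source, target in FEGA_HEADER_MAP.items()
        }
```

Checking the header up front matters here because the source already contains misspelled column names; a silent correction upstream would otherwise load empty columns.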
**Implementation choices**:
Option A — **augment afir_plati with `tip_fond` discriminator**.
Add `ALTER TABLE fonduri.afir_plati ADD COLUMN tip_fond text CHECK (tip_fond IN ('FEADR','FEGA'));`
Re-tag existing rows as `'FEADR'`. Importer writes both. Uniform downstream query.
Option B — **separate table `fonduri.fega_plati`**.
Different cardinality (5x rows), different measure code namespace; some
queries naturally separate. But duplicates the index/MV maintenance burden.
**Recommendation: Option A**. The target schema is identical; the differences
are limited to the measure-code namespace. A single discriminator keeps things
simple, fits the existing `gin_trgm` name index, and lets the recipe code do
`WHERE tip_fond='FEGA'` cheaply (b-tree on tip_fond if needed).
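The Option A flow (add the column, re-tag existing rows, write tagged FEGA rows) can be tried end to end against an in-memory SQLite table. SQLite is only a stand-in for Postgres here, and the two-column starting table is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE afir_plati (source_year INTEGER, beneficiar TEXT)")
conn.execute("INSERT INTO afir_plati VALUES (2023, 'EXISTING FEADR ROW')")

# Step 1: add the discriminator. Existing rows get NULL, which passes the CHECK.
conn.execute(
    "ALTER TABLE afir_plati ADD COLUMN tip_fond TEXT "
    "CHECK (tip_fond IN ('FEADR', 'FEGA'))"
)
# Step 2: re-tag everything already loaded as FEADR.
conn.execute("UPDATE afir_plati SET tip_fond = 'FEADR' WHERE tip_fond IS NULL")
# Step 3: new FEGA rows carry their own tag.
conn.execute("INSERT INTO afir_plati VALUES (2023, 'NEW FEGA ROW', 'FEGA')")


def rows_for(fund):
    """Names of all rows tagged with the given fund."""
    cur = conn.execute(
        "SELECT beneficiar FROM afir_plati WHERE tip_fond = ? ORDER BY beneficiar",
        (fund,),
    )
    return [r[0] for r in cur]


def rejects(fund):
    """True if the CHECK constraint refuses this tip_fond value."""
    try:
        conn.execute("INSERT INTO afir_plati VALUES (2023, 'BAD', ?)", (fund,))
    except sqlite3.IntegrityError:
        return True
    return False


print(rows_for("FEADR"), rows_for("FEGA"), rejects("EAFRD"))
```

In Postgres the same three steps would go into the follow-up migration, optionally followed by `SET NOT NULL` once every row is tagged.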
**FEGA importer changes vs current FEADR script**:
1. Download → `unrar x` (`unrar` is already installed on satra via `apt install unrar`).
2. New Python normalizer `import-afir-historical-fega.py`, which reads CSV instead of XLSX, remaps column names, and applies *no* RO-decimal swap.
3. Pass new `FUND=fega` flag → script writes `tip_fond='FEGA'` and uses CSV path.
4. **Cui column passthrough** — write directly into the existing `cui` column
when non-empty, with `cui_match_method='afir_self_reported'` and
`cui_match_score=1.0`. Skip fuzzy matcher for these.
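Step 4 could look roughly like the helper below. The function name is hypothetical, and stripping a leading `RO` prefix and requiring an all-digit value are assumptions about how the source formats its `Cui` column.

```python
def resolve_cui(raw_cui):
    """Return match metadata for a self-reported CUI, or None if absent.

    None tells the caller to fall back to the fuzzy matcher.
    """
    text = (raw_cui or "").strip().upper()
    # Assumption: some values may carry the "RO" VAT prefix.
    if text.startswith("RO"):
        text = text[2:].strip()
    if not text.isdigit():
        return None
    return {
        "cui": text,
        "cui_match_method": "afir_self_reported",
        "cui_match_score": 1.0,
    }


print(resolve_cui(" 12345678 "))
print(resolve_cui(""))
```

Rows for natural persons, where the column is mostly empty, return `None` and continue through the existing matching path.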
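**Volume budget**: 2.48M rows × 2 years = ~5M rows, assuming 2024 is close
to 2023 in size. Same staging table works (TRUNCATE between runs). Postgres
COPY @ ~100K rows/s → ~25s/year for COPY, plus ~60s for INSERT, so ~85s of
database time per year. Budget ~5 min per year end to end; the remainder is
presumably download, extraction, and normalization.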
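The database-side part of that estimate is simple arithmetic, reproduced here from the figures in this plan (the throughput and INSERT time are the plan's own rough guesses, not measurements):

```python
ROWS_PER_YEAR = 2_476_897    # FEGA 2023 row count from this plan
COPY_ROWS_PER_SEC = 100_000  # rough COPY throughput assumed in this plan
INSERT_SECONDS = 60          # rough INSERT estimate from this plan

copy_seconds = ROWS_PER_YEAR / COPY_ROWS_PER_SEC
db_seconds = copy_seconds + INSERT_SECONDS
print(f"COPY ~{copy_seconds:.0f}s, COPY+INSERT ~{db_seconds:.0f}s per year")
```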
### 2. Historical FEADR 2020/2021/2022 (BLOCKED on source)
Status: **not publicly available.**
Investigation outcome:
- AFIR `/date-deschise/` page shows only 2023+2024.
- `plati.afir.info` portal shows only 2023+2024.
- data.gov.ro CKAN has no `listaplati_<year>` resources.
**Options to unblock** (in order of cost):
1. **Email AFIR directly** at `comunicare@afir.info` and request the historical
payment lists for 2020-2022 under Law 544/2001 (FOIA equivalent). They are
legally obligated to provide them. Expected: 2-4 week response.
2. **Wayback Machine archive** — check
`https://web.archive.org/web/2023*/afir.ro/rapoarte/beneficiari-de-fonduri-europene/date-deschise/`
for snapshots that still link to old XLSX files. URLs may still resolve
(AFIR media folder is content-addressed: `/media/<hash>/file.xlsx`).
3. **opendata.afir.info account** — the dataset titles `AchizitiiPrivate2020`,
`ProiectePNDR2020` suggest historical exports may live here, but the
download interface needs login. Apply for an open-data access account.
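For option 2, the Wayback Machine's CDX server can list archived captures under a URL prefix, which is quicker than browsing snapshots by hand. This is a sketch: the `afir.ro/media/` prefix and the year range are illustrative choices, and the network call is defined but not executed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url_prefix, year_from, year_to):
    """Build a CDX query for successful captures under a URL prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "urlkey",  # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def archived_spreadsheets(rows):
    """Pick .xlsx/.rar captures out of a CDX JSON result (first row is the header)."""
    if not rows:
        return []
    header, *captures = rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [
        f"https://web.archive.org/web/{c[ts]}/{c[orig]}"
        for c in captures
        if c[orig].lower().endswith((".xlsx", ".rar"))
    ]


def fetch(url_prefix="afir.ro/media/", year_from=2021, year_to=2023):
    """Network call; not executed here."""
    with urlopen(cdx_query_url(url_prefix, year_from, year_to), timeout=60) as fh:
        return archived_spreadsheets(json.load(fh))
```

Any hit is a replayable archive URL, so it stays usable even if the original `/media/` link no longer resolves.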
**Estimated row counts when obtained**: ~450K-500K per year (extrapolating
from 2023 = 475K and 2024 = 563K).
### 3. APIA-specific datasets (LOWER PRIORITY)
`Lista Fermierilor Campania APIA 2024` (small file, ~50K rows expected).
This is a *subset* of FEGA payments (only certain campaigns), so once FEGA
2024 is ingested, this dataset is partially redundant. Worth ingesting
into a separate `fonduri.apia_fermieri` table only if it carries the
geographic columns (parcel codes) the FEGA dump lacks.
Geographic LPIS shapefiles (`Parcele Agricole APIA LPIS 2025`,
`Categorii de Folosință`) are **map data**, not payment data — defer to
when we add map overlays to /achizitii/firma/[cui] profile pages.
## Files modified/added in this session
- **NEW** `services/seap-scraper/scripts/import-afir-historical.py` — XLSX→TSV normalizer
- **NEW** `services/seap-scraper/cron/import-afir-historical.sh` — orchestrator
- **NEW** `services/seap-scraper/AFIR-HISTORICAL-PLAN.md` (this file)
`fonduri.afir_plati` schema unchanged — no migration. The DELETE+INSERT
flow uses the existing table as-is. Adding `tip_fond` discriminator is
a follow-up migration when FEGA ingest is implemented.