Files
vreau-digital/services/seap-scraper/CNAS-PLAN.md
T
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00

218 lines
10 KiB
Markdown

# CNAS — Casa Națională de Asigurări de Sănătate — Ingest Plan
Lista furnizorilor de servicii medicale aflați în relație contractuală cu CAS-urile județene.
## v1 status (2026-05-10)
**Schema applied:** `services/seap-scraper/sql/031_cnas.sql` (3 tables + 1 MV)
**Scraper:** `services/seap-scraper/src/scrape-cnas.ts`
**Wrapper:** `services/seap-scraper/cron/scrape-cnas.sh`
**First-pass yield:** 36,183 rows / 12,392 distinct provider names from **46 PDFs successfully parsed** (61 furnizor PDFs registered, 14 with non-tabular layout).
### What v1 captures
The CNAS WordPress media library at `cnas.ro/wp-content/uploads/` exposes ~70-90 furnizor-related PDFs (CAS Bihor, CAS Bacău, CAS Gorj, CAS Arad upload most heavily; rest of counties don't use this central library). Discoverable via `cnas.ro/wp-json/wp/v2/media` REST API (no auth, no rate limit).
Working categories with >100 rows extracted:
- `medicina_dentara` — 361 rows from FURNIZORI-IN-CONTRACT-AMBULATORIU-DE-SPECIALITATE-MEDICINA-DENTARA-2024
- `medicina_familie` — 488 rows total (mostly CAS Bihor)
- `dispozitive_medicale` — 268 rows
- `farmacie` — 119 rows
- `ambulatoriu_clinic` — 99 rows
- `recuperare_medicala` — 61 rows
- 4,300+ rows each from 7 historical 2022 "Nr-furnizori-testare" PDFs (national snapshots, ~10K distinct lines)
### Investigation findings
The CNAS source ecosystem is **mid-migration** between 3 layers:
1. **NEW — `cas.cnas.ro/casXX`** (Angular SPA, 42 county sub-instances). Uses Blazor admin/api at `/admin/api/{home-content,menu-items,provider-map,pharmacy-report,dental-report,…}`. Routes via `X-Instance-Key` HTTP header. **As of 2026-05, all data endpoints return `[]` or 500 — the migration hasn't loaded provider lists yet.** Watch script (see Phase 2 below) recommended.
2. **CENTRAL — `cnas.ro/wp-content/uploads/`** (WordPress media library). 4,180 files total, ~70 furnizor PDFs. **THIS IS WHAT v1 INGESTS.** Updated weekly-ish.
3. **OLD — `www.cnas.ro/casXX/page/lista-furnizori-*.html`** (pre-migration WP). All 301-redirect to dead stubs on `cnas.ro/casXX/`. **Effectively removed.** Archived content recoverable via Wayback CDX (`web.archive.org/cdx/search/cdx?url=cas.cnas.ro/casXX&matchType=domain`).
## Phase 2 — Improve parser (effort: 2-3h)
Parser misses ~25% of files due to non-tabular layouts. Fixes needed:
### "no_table" failures (14 files)
These have valid data but unusual layouts:
| File | Issue | Approach |
|---|---|---|
| `Lista-furnizori-testare-genetica-2024-2025_all.pdf` (4 pages) | First column is "Casa de asigurări" (judet header), nr_crt is implicit | Per-page re-parse: detect judet headers (`BIHOR`, `CLUJ`), assign to all rows below until next header |
| `Lista-furnizori-tumori-solide-maligne-martie-2025.pdf` (1 page) | Same as above — judet-grouped | Same |
| `Lista-furnizori-radioterapie-2024.pdf` | Same | Same |
| `Lista-furnizori-testare-hematologie-maligna-2024.pdf` | Same | Same |
| `FURNIZORI-INGRIJIRI-PALIATIVE-INCEPAND-CU-01.07.2023-2.pdf` | Header row says "Bacau" — county is in *header*, not column. Plus row#1 leading on the right column | Detect "CAS \w+" or "JUDET" in header text; skip first 5 lines; rows start with bare number followed by `[A-Z]` |
| `FURNIZORI-MEDICINA-DENTARA-LA-29-11-2024.pdf` | Multi-column page layout (2 columns side-by-side) | Use `pdftotext -table` instead of `-layout`, OR split page mid-x via `pdftotext -x ... -W ...` |
| `FURNIZORI-stomato-in-contract-la-1-noiembrie-2024.pdf` | Same as above | Same |
| `Valori-de-contract-furnizori-PNS-13.11.2024.pdf` | "Valori" files have name + sum, not provider lists | Reclassify or skip via filename regex `Valori-` |
| `CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf` | PDF text is image-based (scanned) — pdftotext returns empty | Add OCR via tesseract: `pdftotext` if empty → `tesseract -l ron` |
| `2024_SITE_FURNIZORI-SERVICII-PARACLINICE-09.2024.xlsx` | XLSX format unsupported | Add `xlsx` parsing via `xlsx` npm package or `gnumeric ssconvert` to CSV |
Drop-in fixes that recover 80% of these in <1h:
1. Reclassify `Valori-` filenames as `parse_status='not_provider_list'` (skip).
2. Detect `LISTA FURNIZORILOR ... CASA ... DE SANATATE A JUDETULUI [A-Z]+` header at top of page → set document.judet from header.
3. Add per-page judet detection for testare-genetica-style files.
4. Handle 2-column-per-page layouts by running `pdftotext -W $((width/2))` twice with different `-x`.
### "other" tip cleanup (34K rows)
The 7 "Nr-furnizori-testare" 2022 PDFs were each parsed at ~4,300 lines each — many of those rows are **duplicates of the same providers** plus some **garbage** (e.g. `name="SRL"`, empty sediu). These dominate the dataset. Two options:
**Option A (recommended):** Mark these documents as `parse_status='superseded'` since 2024-2025 lists cover the same providers. Cuts dataset to ~1,900 high-quality rows.
**Option B:** Deduplicate by name+email post-ingest into a `cnas.furnizori_clean` table.
## Phase 3 — Per-county SPA harvest (effort: 4-6h, deferred)
Once `cas.cnas.ro/casXX` data goes live (no clear timeline; check monthly):
```ts
// poc-cas-cnas-watch.ts
for (const judet of ['casmb', 'cascj', 'casbn', /* 42 total */]) {
const r = await fetch(`https://cas.cnas.ro/admin/api/home-content`, {
headers: { 'X-Instance-Key': judet }
});
// Currently always returns: {"data":null,"message":"Sequence contains no elements.","isSucces":false}
// When this turns into a real payload, the SPA will have working endpoints.
}
```
Confirmed working endpoints (return JSON when populated):
- `admin/api/home-content` (header: `X-Instance-Key: <slug>`)
- `admin/api/menu-items`
- `admin/api/get-content?slug=<page-slug>`
- `admin/api/get-pages/<slug>` (page tree)
- `public/api/provider-map`, `public/api/pharmacy-report`, `public/api/dental-report`, `public/api/paraclinic-report`, `public/api/recuperare-report` (per-tip plurals — pagination via `?skip=&take=`)
## Phase 4 — CUI matching (effort: 1-2h)
Mirror `match-cui-anre.sh` pattern. CNAS provider names are messy (CMI prefixes, doctor titles, abbreviated SRL etc.). Strategy:
```ts
// services/seap-scraper/src/match-cui-cnas.ts
// 1. UPDATE cnas.furnizori SET name_norm = firms.normalize_company_name(name)
// 2. Try exact match: WHERE firms.entities.name_norm = cnas.furnizori.name_norm
// 3. Try trgm fuzzy with judet constraint (when judet known)
// 4. Mark cui_match_method ('exact_norm' | 'trgm_judet' | 'trgm_unique' | 'unmatched')
```
Expected match rate: 50-70% for SRL/SA-form providers; 5-15% for CMI (cabinete medicale individuale, often unregistered firms).
## Phase 5 — Cross-source recipes (drafted SQL)
### Recipe 1: "Furnizori medicali CNAS care apar și ca furnizori SEAP la CPV 33.* / 85.*"
```sql
WITH cnas_cui AS (
SELECT DISTINCT cui FROM cnas.furnizori WHERE cui IS NOT NULL
),
seap_med AS (
SELECT DISTINCT a.supplier_cui AS cui, COUNT(*) AS nr_castiguri,
SUM(a.value_eur) AS total_eur
FROM seap.announcements a
WHERE (a.cpv_code LIKE '33%' OR a.cpv_code LIKE '85%')
AND a.supplier_cui IS NOT NULL
GROUP BY a.supplier_cui
)
SELECT c.cui, e.name, sm.nr_castiguri, sm.total_eur,
array_agg(DISTINCT cf.tip_serviciu) AS tipuri_cnas
FROM cnas_cui c
JOIN seap_med sm USING (cui)
JOIN firms.entities e ON e.cui = c.cui
JOIN cnas.furnizori cf USING (cui)
GROUP BY c.cui, e.name, sm.nr_castiguri, sm.total_eur
ORDER BY sm.total_eur DESC NULLS LAST
LIMIT 100;
```
### Recipe 2: "Spitale CNAS care au datorii ANAF" — red flag
```sql
SELECT DISTINCT
cf.cui, e.name, cf.judet,
cf.tip_serviciu,
ad.sume_datorate_buget_general_consolidat AS datorii_total
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
JOIN anaf_datornici.datornic ad ON ad.cui = cf.cui
WHERE cf.tip_serviciu IN ('spital','clinic','ambulatoriu_clinic')
AND ad.sume_datorate_buget_general_consolidat > 100000
ORDER BY datorii_total DESC;
```
### Recipe 3: "Furnizori CNAS care primesc fonduri EU (POIM-Sănătate)" — EU-linked
```sql
SELECT DISTINCT
cf.cui, e.name, cf.tip_serviciu,
fp.titlu_proiect, fp.valoare_totala_eligibila
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
JOIN fonduri.proiect_v2 fp ON fp.beneficiar_cui = cf.cui
WHERE fp.titlu_proiect ILIKE '%sanatate%' OR fp.programul_operational ILIKE '%POIM%'
ORDER BY fp.valoare_totala_eligibila DESC;
```
### Recipe 4: "Spitale CNAS cu zero contracte SEAP" — anomaly
Hospitals contracted with state insurance but never appearing as SEAP suppliers/buyers:
```sql
SELECT cf.cui, e.name, cf.judet
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
WHERE cf.tip_serviciu = 'spital'
AND NOT EXISTS (
SELECT 1 FROM seap.announcements a
WHERE a.supplier_cui = cf.cui OR a.buyer_cui = cf.cui
)
ORDER BY e.name;
```
## Operational
```sh
# Smoke (5 docs, ~30s)
sudo LIMIT=5 /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
# Full ingest (61 docs, ~3 min, idempotent)
sudo /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
# Just refresh document catalog without re-parsing
sudo MODE=metadata-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
# Re-parse existing pending/failed only
sudo MODE=parse-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
# Cron suggested: weekly (CNAS uploads ~5-15 files/month)
# 0 5 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
```
## Remaining county sites — handoff list
When `cas.cnas.ro/casXX` SPA goes live, all 42 sub-instances follow the same URL pattern:
```
casab Alba casdj Dolj casnt Neamt
casag Argeș casgj Gorj casot Olt
casar Arad casgl Galați casph Prahova
casbc Bacău casgr Giurgiu cassb Sibiu
casbh Bihor cashd Hunedoara cassj Sălaj
casbn Bistrița-N. cashr Harghita cassv Suceava
casbr Brăila casif Ilfov casts Teleorman ?
casbt Botoșani casil Ialomița castl Tulcea
casbv Brașov casis Iași castm Timiș
casbz Buzău casmb București castr Teleorman ?
cascj Cluj casmh Mehedinți casvl Vâlcea
cascl Călărași casmm Maramureș casvn Vrancea
cascs Caraș-Severin casms Mureș casvs Vaslui
casct Constanța cassam Satu Mare casaopsnaj (Apărare/Ord. publică)
cascv Covasna
casdb Dâmbovița
```
Total: 43 sub-sites including `casaopsnaj`. v1 ingests 0 of these directly (relies on central WP catalog only).