# CNAS — Casa Națională de Asigurări de Sănătate — Ingest Plan Lista furnizorilor de servicii medicale aflați în relație contractuală cu CAS-urile județene. ## v1 status (2026-05-10) **Schema applied:** `services/seap-scraper/sql/031_cnas.sql` (3 tables + 1 MV) **Scraper:** `services/seap-scraper/src/scrape-cnas.ts` **Wrapper:** `services/seap-scraper/cron/scrape-cnas.sh` **First-pass yield:** 36,183 rows / 12,392 distinct provider names from **46 PDFs successfully parsed** (61 furnizor PDFs registered, 14 with non-tabular layout). ### What v1 captures The CNAS WordPress media library at `cnas.ro/wp-content/uploads/` exposes ~70-90 furnizor-related PDFs (CAS Bihor, CAS Bacău, CAS Gorj, CAS Arad upload most heavily; rest of counties don't use this central library). Discoverable via `cnas.ro/wp-json/wp/v2/media` REST API (no auth, no rate limit). Working categories with >100 rows extracted: - `medicina_dentara` — 361 rows from FURNIZORI-IN-CONTRACT-AMBULATORIU-DE-SPECIALITATE-MEDICINA-DENTARA-2024 - `medicina_familie` — 488 rows total (mostly CAS Bihor) - `dispozitive_medicale` — 268 rows - `farmacie` — 119 rows - `ambulatoriu_clinic` — 99 rows - `recuperare_medicala` — 61 rows - 4,300+ rows each from 7 historical 2022 "Nr-furnizori-testare" PDFs (national snapshots, ~10K distinct lines) ### Investigation findings The CNAS source ecosystem is **mid-migration** between 3 layers: 1. **NEW — `cas.cnas.ro/casXX`** (Angular SPA, 42 county sub-instances). Uses Blazor admin/api at `/admin/api/{home-content,menu-items,provider-map,pharmacy-report,dental-report,…}`. Routes via `X-Instance-Key` HTTP header. **As of 2026-05, all data endpoints return `[]` or 500 — the migration hasn't loaded provider lists yet.** Watch script (see Phase 2 below) recommended. 2. **CENTRAL — `cnas.ro/wp-content/uploads/`** (WordPress media library). 4,180 files total, ~70 furnizor PDFs. **THIS IS WHAT v1 INGESTS.** Updated weekly-ish. 3. **OLD — `www.cnas.ro/casXX/page/lista-furnizori-*.html`** (pre-migration WP). All 301-redirect to dead stubs on `cnas.ro/casXX/`. **Effectively removed.** Archived content recoverable via Wayback CDX (`web.archive.org/cdx/search/cdx?url=cas.cnas.ro/casXX&matchType=domain`). ## Phase 2 — Improve parser (effort: 2-3h) Parser misses ~25% of files due to non-tabular layouts. Fixes needed: ### "no_table" failures (14 files) These have valid data but unusual layouts: | File | Issue | Approach | |---|---|---| | `Lista-furnizori-testare-genetica-2024-2025_all.pdf` (4 pages) | First column is "Casa de asigurări" (judet header), nr_crt is implicit | Per-page re-parse: detect judet headers (`BIHOR`, `CLUJ`), assign to all rows below until next header | | `Lista-furnizori-tumori-solide-maligne-martie-2025.pdf` (1 page) | Same as above — judet-grouped | Same | | `Lista-furnizori-radioterapie-2024.pdf` | Same | Same | | `Lista-furnizori-testare-hematologie-maligna-2024.pdf` | Same | Same | | `FURNIZORI-INGRIJIRI-PALIATIVE-INCEPAND-CU-01.07.2023-2.pdf` | Header row says "Bacau" — county is in *header*, not column. Plus row#1 leading on the right column | Detect "CAS \w+" or "JUDET" in header text; skip first 5 lines; rows start with bare number followed by `[A-Z]` | | `FURNIZORI-MEDICINA-DENTARA-LA-29-11-2024.pdf` | Multi-column page layout (2 columns side-by-side) | Use `pdftotext -table` instead of `-layout`, OR split page mid-x via `pdftotext -x ... -W ...` | | `FURNIZORI-stomato-in-contract-la-1-noiembrie-2024.pdf` | Same as above | Same | | `Valori-de-contract-furnizori-PNS-13.11.2024.pdf` | "Valori" files have name + sum, not provider lists | Reclassify or skip via filename regex `Valori-` | | `CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf` | PDF text is image-based (scanned) — pdftotext returns empty | Add OCR via tesseract: `pdftotext` if empty → `tesseract -l ron` | | `2024_SITE_FURNIZORI-SERVICII-PARACLINICE-09.2024.xlsx` | XLSX format unsupported | Add `xlsx` parsing via `xlsx` npm package or `gnumeric ssconvert` to CSV | Drop-in fixes that recover 80% of these in <1h: 1. Reclassify `Valori-` filenames as `parse_status='not_provider_list'` (skip). 2. Detect `LISTA FURNIZORILOR ... CASA ... DE SANATATE A JUDETULUI [A-Z]+` header at top of page → set document.judet from header. 3. Add per-page judet detection for testare-genetica-style files. 4. Handle 2-column-per-page layouts by running `pdftotext -W $((width/2))` twice with different `-x`. ### "other" tip cleanup (34K rows) The 7 "Nr-furnizori-testare" 2022 PDFs were each parsed at ~4,300 lines each — many of those rows are **duplicates of the same providers** plus some **garbage** (e.g. `name="SRL"`, empty sediu). These dominate the dataset. Two options: **Option A (recommended):** Mark these documents as `parse_status='superseded'` since 2024-2025 lists cover the same providers. Cuts dataset to ~1,900 high-quality rows. **Option B:** Deduplicate by name+email post-ingest into a `cnas.furnizori_clean` table. ## Phase 3 — Per-county SPA harvest (effort: 4-6h, deferred) Once `cas.cnas.ro/casXX` data goes live (no clear timeline; check monthly): ```ts // poc-cas-cnas-watch.ts for (const judet of ['casmb', 'cascj', 'casbn', /* 42 total */]) { const r = await fetch(`https://cas.cnas.ro/admin/api/home-content`, { headers: { 'X-Instance-Key': judet } }); // Currently always returns: {"data":null,"message":"Sequence contains no elements.","isSucces":false} // When this turns into a real payload, the SPA will have working endpoints. } ``` Confirmed working endpoints (return JSON when populated): - `admin/api/home-content` (header: `X-Instance-Key: `) - `admin/api/menu-items` - `admin/api/get-content?slug=` - `admin/api/get-pages/` (page tree) - `public/api/provider-map`, `public/api/pharmacy-report`, `public/api/dental-report`, `public/api/paraclinic-report`, `public/api/recuperare-report` (per-tip plurals — pagination via `?skip=&take=`) ## Phase 4 — CUI matching (effort: 1-2h) Mirror `match-cui-anre.sh` pattern. CNAS provider names are messy (CMI prefixes, doctor titles, abbreviated SRL etc.). Strategy: ```ts // services/seap-scraper/src/match-cui-cnas.ts // 1. UPDATE cnas.furnizori SET name_norm = firms.normalize_company_name(name) // 2. Try exact match: WHERE firms.entities.name_norm = cnas.furnizori.name_norm // 3. Try trgm fuzzy with judet constraint (when judet known) // 4. Mark cui_match_method ('exact_norm' | 'trgm_judet' | 'trgm_unique' | 'unmatched') ``` Expected match rate: 50-70% for SRL/SA-form providers; 5-15% for CMI (cabinete medicale individuale, often unregistered firms). ## Phase 5 — Cross-source recipes (drafted SQL) ### Recipe 1: "Furnizori medicali CNAS care apar și ca furnizori SEAP la CPV 33.* / 85.*" ```sql WITH cnas_cui AS ( SELECT DISTINCT cui FROM cnas.furnizori WHERE cui IS NOT NULL ), seap_med AS ( SELECT DISTINCT a.supplier_cui AS cui, COUNT(*) AS nr_castiguri, SUM(a.value_eur) AS total_eur FROM seap.announcements a WHERE (a.cpv_code LIKE '33%' OR a.cpv_code LIKE '85%') AND a.supplier_cui IS NOT NULL GROUP BY a.supplier_cui ) SELECT c.cui, e.name, sm.nr_castiguri, sm.total_eur, array_agg(DISTINCT cf.tip_serviciu) AS tipuri_cnas FROM cnas_cui c JOIN seap_med sm USING (cui) JOIN firms.entities e ON e.cui = c.cui JOIN cnas.furnizori cf USING (cui) GROUP BY c.cui, e.name, sm.nr_castiguri, sm.total_eur ORDER BY sm.total_eur DESC NULLS LAST LIMIT 100; ``` ### Recipe 2: "Spitale CNAS care au datorii ANAF" — red flag ```sql SELECT DISTINCT cf.cui, e.name, cf.judet, cf.tip_serviciu, ad.sume_datorate_buget_general_consolidat AS datorii_total FROM cnas.furnizori cf JOIN firms.entities e ON e.cui = cf.cui JOIN anaf_datornici.datornic ad ON ad.cui = cf.cui WHERE cf.tip_serviciu IN ('spital','clinic','ambulatoriu_clinic') AND ad.sume_datorate_buget_general_consolidat > 100000 ORDER BY datorii_total DESC; ``` ### Recipe 3: "Furnizori CNAS care primesc fonduri EU (POIM-Sănătate)" — EU-linked ```sql SELECT DISTINCT cf.cui, e.name, cf.tip_serviciu, fp.titlu_proiect, fp.valoare_totala_eligibila FROM cnas.furnizori cf JOIN firms.entities e ON e.cui = cf.cui JOIN fonduri.proiect_v2 fp ON fp.beneficiar_cui = cf.cui WHERE fp.titlu_proiect ILIKE '%sanatate%' OR fp.programul_operational ILIKE '%POIM%' ORDER BY fp.valoare_totala_eligibila DESC; ``` ### Recipe 4: "Spitale CNAS cu zero contracte SEAP" — anomaly Hospitals contracted with state insurance but never appearing as SEAP suppliers/buyers: ```sql SELECT cf.cui, e.name, cf.judet FROM cnas.furnizori cf JOIN firms.entities e ON e.cui = cf.cui WHERE cf.tip_serviciu = 'spital' AND NOT EXISTS ( SELECT 1 FROM seap.announcements a WHERE a.supplier_cui = cf.cui OR a.buyer_cui = cf.cui ) ORDER BY e.name; ``` ## Operational ```sh # Smoke (5 docs, ~30s) sudo LIMIT=5 /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh # Full ingest (61 docs, ~3 min, idempotent) sudo /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh # Just refresh document catalog without re-parsing sudo MODE=metadata-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh # Re-parse existing pending/failed only sudo MODE=parse-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh # Cron suggested: weekly (CNAS uploads ~5-15 files/month) # 0 5 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh ``` ## Remaining county sites — handoff list When `cas.cnas.ro/casXX` SPA goes live, all 42 sub-instances follow the same URL pattern: ``` casab Alba casdj Dolj casnt Neamt casag Argeș casgj Gorj casot Olt casar Arad casgl Galați casph Prahova casbc Bacău casgr Giurgiu cassb Sibiu casbh Bihor cashd Hunedoara cassj Sălaj casbn Bistrița-N. cashr Harghita cassv Suceava casbr Brăila casif Ilfov casts Teleorman ? casbt Botoșani casil Ialomița castl Tulcea casbv Brașov casis Iași castm Timiș casbz Buzău casmb București castr Teleorman ? cascj Cluj casmh Mehedinți casvl Vâlcea cascl Călărași casmm Maramureș casvn Vrancea cascs Caraș-Severin casms Mureș casvs Vaslui casct Constanța cassam Satu Mare casaopsnaj (Apărare/Ord. publică) cascv Covasna casdb Dâmbovița ``` Total: 43 sub-sites including `casaopsnaj`. v1 ingests 0 of these directly (relies on central WP catalog only).