vreau-digital/services/seap-scraper/CNAS-PLAN.md

CNAS — Casa Națională de Asigurări de Sănătate — Ingest Plan

List of medical service providers under contract with the county health insurance houses (CAS).

v1 status (2026-05-10)

  • Schema applied: services/seap-scraper/sql/031_cnas.sql (3 tables + 1 MV)
  • Scraper: services/seap-scraper/src/scrape-cnas.ts
  • Wrapper: services/seap-scraper/cron/scrape-cnas.sh
  • First-pass yield: 36,183 rows / 12,392 distinct provider names from 46 PDFs successfully parsed (61 furnizor PDFs registered, 14 with non-tabular layout).

What v1 captures

The CNAS WordPress media library at cnas.ro/wp-content/uploads/ exposes ~70-90 furnizor-related PDFs. CAS Bihor, CAS Bacău, CAS Gorj and CAS Arad upload most heavily; the remaining counties do not use this central library. The files are discoverable via the cnas.ro/wp-json/wp/v2/media REST API (no auth, no rate limit).
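A minimal sketch of the discovery step, assuming the standard WordPress REST API collection parameters (search, mime_type, per_page, page) and the X-WP-TotalPages response header. The function names, the search term and the "furnizor" URL filter are illustrative assumptions, not taken from scrape-cnas.ts.

```typescript
// Fields read from a WordPress media item (the API returns many more).
interface WpMediaItem {
  id: number;
  source_url: string;
  mime_type: string;
  title: { rendered: string };
}

// Minimal fetch-like signature, so the sketch type-checks without DOM typings.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
  json(): Promise<unknown>;
}>;

const MEDIA_ENDPOINT = 'https://cnas.ro/wp-json/wp/v2/media';

// Hypothetical filter: keep PDFs whose URL mentions "furnizor".
function selectFurnizorPdfs(items: WpMediaItem[]): WpMediaItem[] {
  return items.filter(
    (it) => it.mime_type === 'application/pdf' && /furnizor/i.test(it.source_url),
  );
}

// Walks every result page; X-WP-TotalPages says when to stop.
async function discoverFurnizorPdfs(fetchFn: FetchLike): Promise<WpMediaItem[]> {
  const found: WpMediaItem[] = [];
  let totalPages = 1;
  for (let page = 1; page <= totalPages; page++) {
    const url =
      `${MEDIA_ENDPOINT}?search=furnizori&mime_type=application/pdf&per_page=100&page=${page}`;
    const res = await fetchFn(url);
    if (!res.ok) throw new Error(`media API returned HTTP ${res.status}`);
    totalPages = Number(res.headers.get('X-WP-TotalPages') ?? '1');
    found.push(...selectFurnizorPdfs((await res.json()) as WpMediaItem[]));
  }
  return found;
}
```

On Node 18+ the global fetch can be passed directly, e.g. discoverFurnizorPdfs(fetch). Keeping the filter separate from the network loop makes it possible to adjust the filename heuristic without hitting the API.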

Working categories, with rows extracted per category:

  • medicina_dentara — 361 rows from FURNIZORI-IN-CONTRACT-AMBULATORIU-DE-SPECIALITATE-MEDICINA-DENTARA-2024
  • medicina_familie — 488 rows total (mostly CAS Bihor)
  • dispozitive_medicale — 268 rows
  • farmacie — 119 rows
  • ambulatoriu_clinic — 99 rows
  • recuperare_medicala — 61 rows
  • 4,300+ rows each from 7 historical 2022 "Nr-furnizori-testare" PDFs (national snapshots, ~10K distinct lines)

Investigation findings

The CNAS source ecosystem is mid-migration between 3 layers:

  1. NEW: cas.cnas.ro/casXX (Angular SPA, 42 county sub-instances). Uses Blazor admin/api at /admin/api/{home-content,menu-items,provider-map,pharmacy-report,dental-report,…}. Routes via the X-Instance-Key HTTP header. As of 2026-05, all data endpoints return [] or 500: the migration hasn't loaded provider lists yet. A watch script (see Phase 3 below) is recommended.
  2. CENTRAL: cnas.ro/wp-content/uploads/ (WordPress media library). 4,180 files total, ~70 furnizor PDFs. THIS IS WHAT v1 INGESTS. Updated roughly weekly.
  3. OLD: www.cnas.ro/casXX/page/lista-furnizori-*.html (pre-migration WP). All 301-redirect to dead stubs on cnas.ro/casXX/. Effectively removed. Archived content is recoverable via Wayback CDX (web.archive.org/cdx/search/cdx?url=cas.cnas.ro/casXX&matchType=domain).
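To illustrate the Wayback recovery path, here is a small sketch that builds the CDX query for one county slug and turns the JSON response into replay URLs. It assumes the public CDX server parameters output=json, fl, filter and collapse, and the id_ replay modifier for raw content; the function names are hypothetical.

```typescript
// One archived capture, as selected by the `fl` parameter below.
interface Capture {
  timestamp: string; // YYYYMMDDhhmmss
  original: string; // URL as captured
}

// Builds a CDX query for one county slug (e.g. "casbh").
function buildCdxUrl(slug: string): string {
  const params: Array<[string, string]> = [
    ['url', `cas.cnas.ro/${slug}`],
    ['matchType', 'domain'],
    ['output', 'json'],
    ['fl', 'timestamp,original'],
    ['filter', 'statuscode:200'], // only captures that returned HTTP 200
    ['collapse', 'urlkey'], // one row per distinct URL
  ];
  const query = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join('&');
  return `https://web.archive.org/cdx/search/cdx?${query}`;
}

// With output=json the first row holds the field names; the rest are captures.
function parseCdxRows(rows: string[][]): Capture[] {
  return rows.slice(1).map((row) => ({
    timestamp: row[0] ?? '',
    original: row[1] ?? '',
  }));
}

// Replay URL for a capture; the "id_" suffix requests the unmodified archived bytes.
function replayUrl(c: Capture): string {
  return `https://web.archive.org/web/${c.timestamp}id_/${c.original}`;
}
```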

Phase 2 — Improve parser (effort: 2-3h)

Parser misses ~25% of files due to non-tabular layouts. Fixes needed:

"no_table" failures (14 files)

These have valid data but unusual layouts:

| File | Issue | Approach |
| --- | --- | --- |
| Lista-furnizori-testare-genetica-2024-2025_all.pdf (4 pages) | First column is "Casa de asigurări" (judet header), nr_crt is implicit | Per-page re-parse: detect judet headers (BIHOR, CLUJ), assign to all rows below until next header |
| Lista-furnizori-tumori-solide-maligne-martie-2025.pdf (1 page) | Same as above (judet-grouped) | Same |
| Lista-furnizori-radioterapie-2024.pdf | Same | Same |
| Lista-furnizori-testare-hematologie-maligna-2024.pdf | Same | Same |
| FURNIZORI-INGRIJIRI-PALIATIVE-INCEPAND-CU-01.07.2023-2.pdf | Header row says "Bacau": county is in header, not column. Plus row#1 leading on the right column | Detect "CAS \w+" or "JUDET" in header text; skip first 5 lines; rows start with bare number followed by [A-Z] |
| FURNIZORI-MEDICINA-DENTARA-LA-29-11-2024.pdf | Multi-column page layout (2 columns side-by-side) | Use pdftotext -table instead of -layout, OR split page mid-x via pdftotext -x ... -W ... |
| FURNIZORI-stomato-in-contract-la-1-noiembrie-2024.pdf | Same as above | Same |
| Valori-de-contract-furnizori-PNS-13.11.2024.pdf | "Valori" files have name + sum, not provider lists | Reclassify or skip via filename regex Valori- |
| CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf | PDF text is image-based (scanned): pdftotext returns empty | Add OCR via tesseract: pdftotext if empty → tesseract -l ron |
| 2024_SITE_FURNIZORI-SERVICII-PARACLINICE-09.2024.xlsx | XLSX format unsupported | Add xlsx parsing via xlsx npm package or gnumeric ssconvert to CSV |
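The judet-grouped approach from the first rows of the table can be sketched as follows. This is an illustration only: it assumes the extracted text puts each judet header on a line of its own, and it takes the set of known judet names as a parameter so that uppercase provider names are not mistaken for headers. GroupedRow and assignJudet are hypothetical names.

```typescript
// A data line together with the judet header it was found under.
interface GroupedRow {
  judet: string | null; // null when a row appears before any header
  text: string;
}

// Carries the most recent judet header down to every row below it,
// until the next header line is seen.
function assignJudet(lines: string[], knownJudete: ReadonlySet<string>): GroupedRow[] {
  const rows: GroupedRow[] = [];
  let current: string | null = null;
  for (const raw of lines) {
    const line = raw.trim();
    if (line === '') continue; // skip blank lines left by pdftotext
    const key = line.toUpperCase();
    if (knownJudete.has(key)) {
      current = key; // header line: switch group, emit nothing
      continue;
    }
    rows.push({ judet: current, text: line });
  }
  return rows;
}
```

Calling assignJudet once per page, as the table suggests, keeps a header from one page leaking into the next only if the caller chooses to carry it over.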

Drop-in fixes that recover 80% of these in <1h:

  1. Reclassify Valori- filenames as parse_status='not_provider_list' (skip).
  2. Detect LISTA FURNIZORILOR ... CASA ... DE SANATATE A JUDETULUI [A-Z]+ header at top of page → set document.judet from header.
  3. Add per-page judet detection for testare-genetica-style files.
  4. Handle 2-column-per-page layouts by running pdftotext -W $((width/2)) twice with different -x.
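Fixes 1, 2 and 4 above are small enough to sketch as pure functions. The names are hypothetical, and 'pending' as the default status is an assumption. The header regex follows the pattern quoted in fix 2, so it only captures single-word county names. The crop arguments use pdftotext's -x, -y, -W and -H options, which are expressed in pixels at the default 72 DPI and therefore match PDF points.

```typescript
type ParseStatus = 'pending' | 'not_provider_list';

// Fix 1: "Valori-" files list contract values, not providers.
function initialStatus(filename: string): ParseStatus {
  return /^Valori-/i.test(filename) ? 'not_provider_list' : 'pending';
}

// Fix 2: read the county from a page-top header such as
// "LISTA FURNIZORILOR ... CASA ... DE SANATATE A JUDETULUI BACAU".
function judetFromHeader(pageText: string): string | null {
  const head = pageText.split('\n').slice(0, 10).join(' ');
  const m =
    /LISTA\s+FURNIZORILOR.*?CASA.*?DE\s+SANATATE\s+A\s+JUDETULUI\s+([A-Z]+)/i.exec(head);
  return m?.[1]?.toUpperCase() ?? null;
}

// Fix 4: argument lists for two pdftotext runs, one per half of the page.
// The trailing "-" sends the extracted text to stdout.
function halfPageArgs(file: string, widthPt: number, heightPt: number): string[][] {
  const half = Math.floor(widthPt / 2);
  const height = Math.ceil(heightPt);
  return [0, half].map((x) => [
    '-layout',
    '-x', String(x),
    '-y', '0',
    '-W', String(half),
    '-H', String(height),
    file,
    '-',
  ]);
}
```

Each argument list can be handed to child_process.execFile('pdftotext', args), and the two outputs concatenated left half first.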

"other" tip cleanup (34K rows)

The 7 "Nr-furnizori-testare" 2022 PDFs were parsed at ~4,300 lines each. Many of those rows are duplicates of the same providers, plus some garbage (e.g. name="SRL", empty sediu). These dominate the dataset. Two options:

Option A (recommended): Mark these documents as parse_status='superseded' since 2024-2025 lists cover the same providers. Cuts dataset to ~1,900 high-quality rows.

Option B: Deduplicate by name+email post-ingest into a cnas.furnizori_clean table.
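Option B could look roughly like this when prototyped outside the database. It is a sketch under assumptions: the Furnizor shape, the garbage rule (a name that is empty or only a legal-form token) and the tie-break (prefer the copy that has a sediu) are illustrative choices, not rules taken from the existing schema.

```typescript
interface Furnizor {
  name: string;
  email: string | null;
  sediu: string | null;
}

// Lowercase, strip diacritics and collapse whitespace so near-identical
// rows share the same key.
function dedupeKey(f: Furnizor): string {
  const norm = (s: string): string =>
    s
      .normalize('NFD')
      .replace(/[\u0300-\u036f]/g, '')
      .toLowerCase()
      .replace(/\s+/g, ' ')
      .trim();
  return `${norm(f.name)}|${norm(f.email ?? '')}`;
}

// Rows whose name is empty or only a legal-form token carry no information.
function isGarbage(f: Furnizor): boolean {
  return /^(srl|sa|s\.r\.l\.?|s\.a\.?)?$/i.test(f.name.trim());
}

// Keeps one row per name+email key, preferring a copy that has a sediu.
function dedupe(rows: Furnizor[]): Furnizor[] {
  const seen = new Map<string, Furnizor>();
  for (const row of rows) {
    if (isGarbage(row)) continue;
    const key = dedupeKey(row);
    const prev = seen.get(key);
    if (!prev || (!prev.sediu && row.sediu)) seen.set(key, row);
  }
  return Array.from(seen.values());
}
```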

Phase 3 — Per-county SPA harvest (effort: 4-6h, deferred)

Once cas.cnas.ro/casXX data goes live (no clear timeline; check monthly):

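// poc-cas-cnas-watch.ts
// Polls each county instance and logs the ones whose API has started returning data.
const instances = ['casmb', 'cascj', 'casbn' /* 42 total */];
for (const judet of instances) {
  const r = await fetch('https://cas.cnas.ro/admin/api/home-content', {
    headers: { 'X-Instance-Key': judet },
  });
  // Currently always returns: {"data":null,"message":"Sequence contains no elements.","isSucces":false}
  // When this turns into a real payload, the SPA will have working endpoints.
  const body = await r.json().catch(() => null); // 500 responses may not be JSON
  if (body?.data != null) console.log(`${judet}: live (HTTP ${r.status})`);
}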

Endpoints confirmed to exist (they return JSON once populated):

  • admin/api/home-content (header: X-Instance-Key: <slug>)
  • admin/api/menu-items
  • admin/api/get-content?slug=<page-slug>
  • admin/api/get-pages/<slug> (page tree)
  • public/api/provider-map, public/api/pharmacy-report, public/api/dental-report, public/api/paraclinic-report, public/api/recuperare-report (per-tip plurals — pagination via ?skip=&take=)
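Once the report endpoints return data, the ?skip=&take= pagination could be driven by a loop like the one below. This is a sketch under assumptions: each page is assumed to be a plain JSON array, a page shorter than take is assumed to mark the end of the data, and the base URL follows the public/api paths listed above. The page loader is injected so the loop can be exercised without network access.

```typescript
// Loads one page; supplied by the caller (typically a fetch wrapper).
type PageLoader<T> = (skip: number, take: number) => Promise<T[]>;

// Reads pages until one comes back shorter than `take`.
// maxPages guards against an endpoint that never returns a short page.
async function fetchAllPages<T>(
  load: PageLoader<T>,
  take = 100,
  maxPages = 1000,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const batch = await load(page * take, take);
    all.push(...batch);
    if (batch.length < take) break;
  }
  return all;
}

// Builds the request URL for one page of a public report endpoint.
function reportUrl(report: string, skip: number, take: number): string {
  return `https://cas.cnas.ro/public/api/${report}?skip=${skip}&take=${take}`;
}
```

A real loader would fetch reportUrl('dental-report', skip, take) with the X-Instance-Key header set to the county slug and return the parsed array.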

Phase 4 — CUI matching (effort: 1-2h)

Mirror the match-cui-anre.sh pattern. CNAS provider names are messy (CMI prefixes, doctor titles, abbreviated SRL, etc.). Strategy:

// services/seap-scraper/src/match-cui-cnas.ts
// 1. UPDATE cnas.furnizori SET name_norm = firms.normalize_company_name(name)
// 2. Try exact match: WHERE firms.entities.name_norm = cnas.furnizori.name_norm
// 3. Try trgm fuzzy with judet constraint (when judet known)
// 4. Mark cui_match_method ('exact_norm' | 'trgm_judet' | 'trgm_unique' | 'unmatched')

Expected match rate: 50-70% for SRL/SA-form providers; 5-15% for CMI (cabinete medicale individuale, i.e. individual medical practices, which are often not registered as firms).
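The cascade in the strategy comments can be prototyped in memory before it is written as SQL. Everything here is an assumption made for illustration: normalizeName is a stand-in for firms.normalize_company_name (whose real rules are not shown in this plan), similarity approximates pg_trgm's trigram similarity, and the 0.6 threshold is arbitrary.

```typescript
type MatchMethod = 'exact_norm' | 'trgm_judet' | 'trgm_unique' | 'unmatched';
interface Entity { cui: string; nameNorm: string; judet: string | null }
interface Match { cui: string | null; method: MatchMethod }

const STOP_TOKENS = new Set(['SC', 'SRL', 'SA', 'CMI', 'DR']);

// Stand-in normalizer: strip diacritics, punctuation and legal-form tokens.
function normalizeName(name: string): string {
  return name
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    .toUpperCase()
    .replace(/\./g, '')
    .replace(/[^A-Z0-9]+/g, ' ')
    .trim()
    .split(' ')
    .filter((t) => t !== '' && !STOP_TOKENS.has(t))
    .join(' ');
}

// Trigram set with pg_trgm-style padding: two spaces before each word, one after.
function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (const word of s.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) out.add(padded.slice(i, i + 3));
  }
  return out;
}

// Shared trigrams divided by the size of the union (Jaccard similarity).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => { if (tb.has(t)) shared++; });
  return shared / (ta.size + tb.size - shared);
}

// Returns the single candidate, or null when there are zero or several.
function only<T>(xs: T[]): T | null {
  return xs.length === 1 ? xs[0] ?? null : null;
}

// Steps 2-4 of the strategy: exact match, fuzzy within judet, fuzzy if unique.
function matchCui(name: string, judet: string | null, entities: Entity[], threshold = 0.6): Match {
  const norm = normalizeName(name);
  const exact = only(entities.filter((e) => e.nameNorm === norm));
  if (exact) return { cui: exact.cui, method: 'exact_norm' };
  const fuzzy = entities.filter((e) => similarity(e.nameNorm, norm) >= threshold);
  if (judet !== null) {
    const local = only(fuzzy.filter((e) => e.judet === judet));
    if (local) return { cui: local.cui, method: 'trgm_judet' };
  }
  const unique = only(fuzzy);
  if (unique) return { cui: unique.cui, method: 'trgm_unique' };
  return { cui: null, method: 'unmatched' };
}
```

Accepting a fuzzy hit only when it is the single candidate trades recall for precision, which seems the safer default when a wrong CUI would feed the cross-source recipes.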

Phase 5 — Cross-source recipes (drafted SQL)

Recipe 1: "CNAS medical providers that also appear as SEAP suppliers under CPV 33.* / 85.*"

WITH cnas_cui AS (
  SELECT DISTINCT cui FROM cnas.furnizori WHERE cui IS NOT NULL
),
seap_med AS (
  SELECT a.supplier_cui AS cui, COUNT(*) AS nr_castiguri,  -- GROUP BY already yields one row per supplier
         SUM(a.value_eur) AS total_eur
  FROM seap.announcements a
  WHERE (a.cpv_code LIKE '33%' OR a.cpv_code LIKE '85%')
    AND a.supplier_cui IS NOT NULL
  GROUP BY a.supplier_cui
)
SELECT c.cui, e.name, sm.nr_castiguri, sm.total_eur,
       array_agg(DISTINCT cf.tip_serviciu) AS tipuri_cnas
FROM cnas_cui c
JOIN seap_med sm USING (cui)
JOIN firms.entities e ON e.cui = c.cui
JOIN cnas.furnizori cf ON cf.cui = c.cui  -- USING (cui) fails here: cui is ambiguous once e.cui is in scope
GROUP BY c.cui, e.name, sm.nr_castiguri, sm.total_eur
ORDER BY sm.total_eur DESC NULLS LAST
LIMIT 100;

Recipe 2: "CNAS hospitals with ANAF debts" (red flag)

SELECT DISTINCT
  cf.cui, e.name, cf.judet,
  cf.tip_serviciu,
  ad.sume_datorate_buget_general_consolidat AS datorii_total
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
JOIN anaf_datornici.datornic ad ON ad.cui = cf.cui
WHERE cf.tip_serviciu IN ('spital','clinic','ambulatoriu_clinic')
  AND ad.sume_datorate_buget_general_consolidat > 100000
ORDER BY datorii_total DESC;

Recipe 3: "CNAS providers that receive EU funds (POIM-Sănătate)" (EU-linked)

SELECT DISTINCT
  cf.cui, e.name, cf.tip_serviciu,
  fp.titlu_proiect, fp.valoare_totala_eligibila
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
JOIN fonduri.proiect_v2 fp ON fp.beneficiar_cui = cf.cui
WHERE fp.titlu_proiect ILIKE '%sanatate%'
   OR fp.titlu_proiect ILIKE '%sănătate%'  -- titles written with diacritics
   OR fp.programul_operational ILIKE '%POIM%'
ORDER BY fp.valoare_totala_eligibila DESC;

Recipe 4: "CNAS hospitals with zero SEAP contracts" (anomaly)

Hospitals contracted with state insurance but never appearing as SEAP suppliers/buyers:

SELECT cf.cui, e.name, cf.judet
FROM cnas.furnizori cf
JOIN firms.entities e ON e.cui = cf.cui
WHERE cf.tip_serviciu = 'spital'
  AND NOT EXISTS (
    SELECT 1 FROM seap.announcements a
    WHERE a.supplier_cui = cf.cui OR a.buyer_cui = cf.cui
  )
ORDER BY e.name;

Operational

# Smoke (5 docs, ~30s)
sudo LIMIT=5 /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh

# Full ingest (61 docs, ~3 min, idempotent)
sudo /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh

# Just refresh document catalog without re-parsing
sudo MODE=metadata-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh

# Re-parse existing pending/failed only
sudo MODE=parse-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh

# Cron suggested: weekly (CNAS uploads ~5-15 files/month)
# 0 5 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh

Remaining county sites — handoff list

When cas.cnas.ro/casXX SPA goes live, all 42 sub-instances follow the same URL pattern:

casab    Alba             casdj    Dolj             casnt    Neamț
casag    Argeș            casgj    Gorj             casot    Olt
casar    Arad             casgl    Galați           casph    Prahova
casbc    Bacău            casgr    Giurgiu          cassb    Sibiu
casbh    Bihor            cashd    Hunedoara        cassj    Sălaj
casbn    Bistrița-N.      cashr    Harghita         cassv    Suceava
casbr    Brăila           casif    Ilfov            casts    Teleorman ?
casbt    Botoșani         casil    Ialomița         castl    Tulcea
casbv    Brașov           casis    Iași             castm    Timiș
casbz    Buzău            casmb    București        castr    Teleorman ?
cascj    Cluj             casmh    Mehedinți        casvl    Vâlcea
cascl    Călărași         casmm    Maramureș        casvn    Vrancea
cascs    Caraș-Severin    casms    Mureș            casvs    Vaslui
casct    Constanța        cassam   Satu Mare        casaopsnaj  (Apărare/Ord. publică)
cascv    Covasna
casdb    Dâmbovița

Total: 43 sub-sites including casaopsnaj. v1 ingests 0 of these directly (relies on central WP catalog only).