a6c03a091e
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
218 lines
10 KiB
Markdown
218 lines
10 KiB
Markdown
# CNAS — Casa Națională de Asigurări de Sănătate — Ingest Plan
|
|
|
|
Lista furnizorilor de servicii medicale aflați în relație contractuală cu CAS-urile județene.
|
|
|
|
## v1 status (2026-05-10)
|
|
|
|
**Schema applied:** `services/seap-scraper/sql/031_cnas.sql` (3 tables + 1 MV)
|
|
**Scraper:** `services/seap-scraper/src/scrape-cnas.ts`
|
|
**Wrapper:** `services/seap-scraper/cron/scrape-cnas.sh`
|
|
**First-pass yield:** 36,183 rows / 12,392 distinct provider names from **46 PDFs successfully parsed** (61 furnizor PDFs registered, 14 with non-tabular layout).
|
|
|
|
### What v1 captures
|
|
|
|
The CNAS WordPress media library at `cnas.ro/wp-content/uploads/` exposes ~70-90 furnizor-related PDFs (CAS Bihor, CAS Bacău, CAS Gorj, CAS Arad upload most heavily; rest of counties don't use this central library). Discoverable via `cnas.ro/wp-json/wp/v2/media` REST API (no auth, no rate limit).
|
|
|
|
Working categories with >100 rows extracted:
|
|
- `medicina_dentara` — 361 rows from FURNIZORI-IN-CONTRACT-AMBULATORIU-DE-SPECIALITATE-MEDICINA-DENTARA-2024
|
|
- `medicina_familie` — 488 rows total (mostly CAS Bihor)
|
|
- `dispozitive_medicale` — 268 rows
|
|
- `farmacie` — 119 rows
|
|
- `ambulatoriu_clinic` — 99 rows
|
|
- `recuperare_medicala` — 61 rows
|
|
- 4,300+ rows each from 7 historical 2022 "Nr-furnizori-testare" PDFs (national snapshots, ~10K distinct lines)
|
|
|
|
### Investigation findings
|
|
|
|
The CNAS source ecosystem is **mid-migration** between 3 layers:
|
|
|
|
1. **NEW — `cas.cnas.ro/casXX`** (Angular SPA, 42 county sub-instances). Uses Blazor admin/api at `/admin/api/{home-content,menu-items,provider-map,pharmacy-report,dental-report,…}`. Routes via `X-Instance-Key` HTTP header. **As of 2026-05, all data endpoints return `[]` or 500 — the migration hasn't loaded provider lists yet.** Watch script (see Phase 2 below) recommended.
|
|
2. **CENTRAL — `cnas.ro/wp-content/uploads/`** (WordPress media library). 4,180 files total, ~70 furnizor PDFs. **THIS IS WHAT v1 INGESTS.** Updated weekly-ish.
|
|
3. **OLD — `www.cnas.ro/casXX/page/lista-furnizori-*.html`** (pre-migration WP). All 301-redirect to dead stubs on `cnas.ro/casXX/`. **Effectively removed.** Archived content recoverable via Wayback CDX (`web.archive.org/cdx/search/cdx?url=cas.cnas.ro/casXX&matchType=domain`).
|
|
|
|
## Phase 2 — Improve parser (effort: 2-3h)
|
|
|
|
Parser misses ~25% of files due to non-tabular layouts. Fixes needed:
|
|
|
|
### "no_table" failures (14 files)
|
|
|
|
These have valid data but unusual layouts:
|
|
|
|
| File | Issue | Approach |
|
|
|---|---|---|
|
|
| `Lista-furnizori-testare-genetica-2024-2025_all.pdf` (4 pages) | First column is "Casa de asigurări" (judet header), nr_crt is implicit | Per-page re-parse: detect judet headers (`BIHOR`, `CLUJ`), assign to all rows below until next header |
|
|
| `Lista-furnizori-tumori-solide-maligne-martie-2025.pdf` (1 page) | Same as above — judet-grouped | Same |
|
|
| `Lista-furnizori-radioterapie-2024.pdf` | Same | Same |
|
|
| `Lista-furnizori-testare-hematologie-maligna-2024.pdf` | Same | Same |
|
|
| `FURNIZORI-INGRIJIRI-PALIATIVE-INCEPAND-CU-01.07.2023-2.pdf` | Header row says "Bacau" — county is in *header*, not column. Plus row#1 leading on the right column | Detect "CAS \w+" or "JUDET" in header text; skip first 5 lines; rows start with bare number followed by `[A-Z]` |
|
|
| `FURNIZORI-MEDICINA-DENTARA-LA-29-11-2024.pdf` | Multi-column page layout (2 columns side-by-side) | Use `pdftotext -table` instead of `-layout`, OR split page mid-x via `pdftotext -x ... -W ...` |
|
|
| `FURNIZORI-stomato-in-contract-la-1-noiembrie-2024.pdf` | Same as above | Same |
|
|
| `Valori-de-contract-furnizori-PNS-13.11.2024.pdf` | "Valori" files have name + sum, not provider lists | Reclassify or skip via filename regex `Valori-` |
|
|
| `CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf` | PDF text is image-based (scanned) — pdftotext returns empty | Add OCR via tesseract: `pdftotext` if empty → `tesseract -l ron` |
|
|
| `2024_SITE_FURNIZORI-SERVICII-PARACLINICE-09.2024.xlsx` | XLSX format unsupported | Add `xlsx` parsing via `xlsx` npm package or `gnumeric ssconvert` to CSV |
|
|
|
|
Drop-in fixes that recover 80% of these in <1h:
|
|
1. Reclassify `Valori-` filenames as `parse_status='not_provider_list'` (skip).
|
|
2. Detect `LISTA FURNIZORILOR ... CASA ... DE SANATATE A JUDETULUI [A-Z]+` header at top of page → set document.judet from header.
|
|
3. Add per-page judet detection for testare-genetica-style files.
|
|
4. Handle 2-column-per-page layouts by running `pdftotext -W $((width/2))` twice with different `-x`.
|
|
|
|
### "other" tip cleanup (34K rows)
|
|
|
|
The 7 "Nr-furnizori-testare" 2022 PDFs were each parsed at ~4,300 lines each — many of those rows are **duplicates of the same providers** plus some **garbage** (e.g. `name="SRL"`, empty sediu). These dominate the dataset. Two options:
|
|
|
|
**Option A (recommended):** Mark these documents as `parse_status='superseded'` since 2024-2025 lists cover the same providers. Cuts dataset to ~1,900 high-quality rows.
|
|
|
|
**Option B:** Deduplicate by name+email post-ingest into a `cnas.furnizori_clean` table.
|
|
|
|
## Phase 3 — Per-county SPA harvest (effort: 4-6h, deferred)
|
|
|
|
Once `cas.cnas.ro/casXX` data goes live (no clear timeline; check monthly):
|
|
|
|
```ts
|
|
// poc-cas-cnas-watch.ts
|
|
for (const judet of ['casmb', 'cascj', 'casbn', /* 42 total */]) {
|
|
const r = await fetch(`https://cas.cnas.ro/admin/api/home-content`, {
|
|
headers: { 'X-Instance-Key': judet }
|
|
});
|
|
// Currently always returns: {"data":null,"message":"Sequence contains no elements.","isSucces":false}
|
|
// When this turns into a real payload, the SPA will have working endpoints.
|
|
}
|
|
```
|
|
|
|
Confirmed working endpoints (return JSON when populated):
|
|
- `admin/api/home-content` (header: `X-Instance-Key: <slug>`)
|
|
- `admin/api/menu-items`
|
|
- `admin/api/get-content?slug=<page-slug>`
|
|
- `admin/api/get-pages/<slug>` (page tree)
|
|
- `public/api/provider-map`, `public/api/pharmacy-report`, `public/api/dental-report`, `public/api/paraclinic-report`, `public/api/recuperare-report` (per-tip plurals — pagination via `?skip=&take=`)
|
|
|
|
## Phase 4 — CUI matching (effort: 1-2h)
|
|
|
|
Mirror `match-cui-anre.sh` pattern. CNAS provider names are messy (CMI prefixes, doctor titles, abbreviated SRL etc.). Strategy:
|
|
|
|
```ts
|
|
// services/seap-scraper/src/match-cui-cnas.ts
|
|
// 1. UPDATE cnas.furnizori SET name_norm = firms.normalize_company_name(name)
|
|
// 2. Try exact match: WHERE firms.entities.name_norm = cnas.furnizori.name_norm
|
|
// 3. Try trgm fuzzy with judet constraint (when judet known)
|
|
// 4. Mark cui_match_method ('exact_norm' | 'trgm_judet' | 'trgm_unique' | 'unmatched')
|
|
```
|
|
|
|
Expected match rate: 50-70% for SRL/SA-form providers; 5-15% for CMI (cabinete medicale individuale, often unregistered firms).
|
|
|
|
## Phase 5 — Cross-source recipes (drafted SQL)
|
|
|
|
### Recipe 1: "Furnizori medicali CNAS care apar și ca furnizori SEAP la CPV 33.* / 85.*"
|
|
|
|
```sql
|
|
WITH cnas_cui AS (
|
|
SELECT DISTINCT cui FROM cnas.furnizori WHERE cui IS NOT NULL
|
|
),
|
|
seap_med AS (
|
|
SELECT DISTINCT a.supplier_cui AS cui, COUNT(*) AS nr_castiguri,
|
|
SUM(a.value_eur) AS total_eur
|
|
FROM seap.announcements a
|
|
WHERE (a.cpv_code LIKE '33%' OR a.cpv_code LIKE '85%')
|
|
AND a.supplier_cui IS NOT NULL
|
|
GROUP BY a.supplier_cui
|
|
)
|
|
SELECT c.cui, e.name, sm.nr_castiguri, sm.total_eur,
|
|
array_agg(DISTINCT cf.tip_serviciu) AS tipuri_cnas
|
|
FROM cnas_cui c
|
|
JOIN seap_med sm USING (cui)
|
|
JOIN firms.entities e ON e.cui = c.cui
|
|
JOIN cnas.furnizori cf USING (cui)
|
|
GROUP BY c.cui, e.name, sm.nr_castiguri, sm.total_eur
|
|
ORDER BY sm.total_eur DESC NULLS LAST
|
|
LIMIT 100;
|
|
```
|
|
|
|
### Recipe 2: "Spitale CNAS care au datorii ANAF" — red flag
|
|
|
|
```sql
|
|
SELECT DISTINCT
|
|
cf.cui, e.name, cf.judet,
|
|
cf.tip_serviciu,
|
|
ad.sume_datorate_buget_general_consolidat AS datorii_total
|
|
FROM cnas.furnizori cf
|
|
JOIN firms.entities e ON e.cui = cf.cui
|
|
JOIN anaf_datornici.datornic ad ON ad.cui = cf.cui
|
|
WHERE cf.tip_serviciu IN ('spital','clinic','ambulatoriu_clinic')
|
|
AND ad.sume_datorate_buget_general_consolidat > 100000
|
|
ORDER BY datorii_total DESC;
|
|
```
|
|
|
|
### Recipe 3: "Furnizori CNAS care primesc fonduri EU (POIM-Sănătate)" — EU-linked
|
|
|
|
```sql
|
|
SELECT DISTINCT
|
|
cf.cui, e.name, cf.tip_serviciu,
|
|
fp.titlu_proiect, fp.valoare_totala_eligibila
|
|
FROM cnas.furnizori cf
|
|
JOIN firms.entities e ON e.cui = cf.cui
|
|
JOIN fonduri.proiect_v2 fp ON fp.beneficiar_cui = cf.cui
|
|
WHERE fp.titlu_proiect ILIKE '%sanatate%' OR fp.programul_operational ILIKE '%POIM%'
|
|
ORDER BY fp.valoare_totala_eligibila DESC;
|
|
```
|
|
|
|
### Recipe 4: "Spitale CNAS cu zero contracte SEAP" — anomaly
|
|
|
|
Hospitals contracted with state insurance but never appearing as SEAP suppliers/buyers:
|
|
|
|
```sql
|
|
SELECT cf.cui, e.name, cf.judet
|
|
FROM cnas.furnizori cf
|
|
JOIN firms.entities e ON e.cui = cf.cui
|
|
WHERE cf.tip_serviciu = 'spital'
|
|
AND NOT EXISTS (
|
|
SELECT 1 FROM seap.announcements a
|
|
WHERE a.supplier_cui = cf.cui OR a.buyer_cui = cf.cui
|
|
)
|
|
ORDER BY e.name;
|
|
```
|
|
|
|
## Operational
|
|
|
|
```sh
|
|
# Smoke (5 docs, ~30s)
|
|
sudo LIMIT=5 /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
|
|
|
|
# Full ingest (61 docs, ~3 min, idempotent)
|
|
sudo /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
|
|
|
|
# Just refresh document catalog without re-parsing
|
|
sudo MODE=metadata-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
|
|
|
|
# Re-parse existing pending/failed only
|
|
sudo MODE=parse-only /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
|
|
|
|
# Cron suggested: weekly (CNAS uploads ~5-15 files/month)
|
|
# 0 5 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-cnas.sh
|
|
```
|
|
|
|
## Remaining county sites — handoff list
|
|
|
|
When `cas.cnas.ro/casXX` SPA goes live, all 42 sub-instances follow the same URL pattern:
|
|
|
|
```
|
|
casab Alba casdj Dolj casnt Neamt
|
|
casag Argeș casgj Gorj casot Olt
|
|
casar Arad casgl Galați casph Prahova
|
|
casbc Bacău casgr Giurgiu cassb Sibiu
|
|
casbh Bihor cashd Hunedoara cassj Sălaj
|
|
casbn Bistrița-N. cashr Harghita cassv Suceava
|
|
casbr Brăila casif Ilfov casts Teleorman ?
|
|
casbt Botoșani casil Ialomița castl Tulcea
|
|
casbv Brașov casis Iași castm Timiș
|
|
casbz Buzău casmb București castr Teleorman ?
|
|
cascj Cluj casmh Mehedinți casvl Vâlcea
|
|
cascl Călărași casmm Maramureș casvn Vrancea
|
|
cascs Caraș-Severin casms Mureș casvs Vaslui
|
|
casct Constanța cassam Satu Mare casaopsnaj (Apărare/Ord. publică)
|
|
cascv Covasna
|
|
casdb Dâmbovița
|
|
```
|
|
|
|
Total: 43 sub-sites including `casaopsnaj`. v1 ingests 0 of these directly (relies on central WP catalog only).
|