# CNAS Phase 2: Layout B parser handoff

State at 2026-05-11 (after C4 partial fix):

- 14 PDFs were stuck at `parse_status='no_table'`.
- Commit `bfa0b69` relaxed the `nr_crt` regex from `\s{2,}` to `\s+` (guarded by a Romanian capital letter). This recovers the ~3-5 of the 14 PDFs that use Layout A (numbered rows).
- The remaining ~9-11 PDFs use **Layout B** (grouped by judet, i.e. county, with no row numbers) and need a separate parser path, which this handoff describes.

## Layout B specimens

Tested via `pdftotext -layout`:

| ID | URL | Tip | Rows visible |
|----|-----|-----|--------------|
| 1 | `Lista-furnizori-testare-genetica-2024-2025_all.pdf` | testare_genetica | ~15 |
| 2 | `Lista-furnizori-tumori-solide-maligne-martie-2025.pdf` | oncologie | ~15 |
| 14 | `Valori-de-contract-furnizori-PNS-13.11.2024.pdf` | pns | unknown |
| 15 | `CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf` | pns | small (single CAS) |
| 44 | `Valori-de-contract-pentru-furnizorii-de-servicii-medicale-de-consultatiii-de-urgenta-…` | urgenta_transport | unknown |
| 46 | `FURNIZORI-SERVICII-ASISTENTA-MEDICALA-PRIMARA-ADMISI-IN-SESIUNEA-CONTRACTARE-NOV-2024-PENTRU-SITE-1.pdf` | medicina_familie | unknown |
| 56 | `Lista-furnizori-radioterapie-2024.pdf` | radioterapie | small |
| 57 | `Lista-furnizori-testare-hematologie-maligna-2024.pdf` | oncologie | small |
| 58 | `Lista-furnizori-tumori-solide-maligne-2024.pdf` | oncologie | small |

## Layout B shape (sample from testare_genetica)

```
BIHOR
                    SC Resident Laboratory SRL      Oradea, Str.…          email    phone    DA
CLUJ
                    Institutul Oncologic …          Cluj-Napoca…           email    phone    DA  DA  DA
                    Centrul Medical Unirea S.R.L    Punct de lucru…        email    phone    DA  DA  DA
BUCUREȘTI
                    Personal Genetics SRL           București sector 1…    email    phone    DA
```

Key signals:

- Single-word ALL-CAPS judet on its own line (left-aligned, ~4-12 chars).
- Provider rows are indented to a fixed column (~20 chars left margin).
- Multi-line addresses with continuation rows.
- Trailing DA/NU columns indicate which test panel or service the provider (furnizor) is contracted for. The column count varies by PDF type: sometimes 1 column, sometimes 7+.

## Recommended approach (~3-5h)

1. **Add a 2nd parser** `parseProviderTextJudetGrouped(text, hints)`, invoked only when `parseProviderText` returns 0 rows AND `tip_serviciu IN ('oncologie','testare_genetica','radioterapie','pns','medicina_familie')`. Specimen 44 is `urgenta_transport`, which this list does not cover; add it if that PDF is in scope.
2. **State machine**: track `currentJudet`. When a line matches `^\s*([A-ZĂÂÎȘȚ]{3,15})\s*$`, update `currentJudet` (leading whitespace is optional because the header is left-aligned in the specimens). The character class must also cover cedilla variants such as `BUCUREŞTI`; unaccented `BUCURESTI` already matches. When the next line is indented and non-empty, treat it as the start of a row.
3. **Row assembly**: gather lines until the next judet header, the next blank-line block, or the next provider name (heuristic: the line starts with a capital letter and does not start with `Str.` / `Mun.` / `sector` / `nr.` / a city name).
4. **Column extraction**: split by `\s{3,}` like the existing parser, with col 0 = name, col 1 = address, col 2 = email, col 3 = phone, cols 4+ = DA/NU flags. Capture the flags either in a structured JSON field (requires a schema migration, see below) or as comma-separated text in the existing `specialitate` column.
5. **Judet override**: when a judet is detected in the PDF body, it overrides the filename-derived judet in `cnas.furnizori` for each row.

## Schema-change consideration

To preserve the DA/NU flag matrix, either add a `specialitate_jsonb` column to `cnas.furnizori` or reuse the existing `specialitate` text column with a serialized string like `"panel_1:DA,panel_2:DA,panel_3:NU"`. The existing column suffices for v1 if we encode the flags as text.

## Testing

Cache the 9-11 PDFs locally (`/tmp/cnas-pdfs/`) and run the parser unit-style. For each PDF, the expected row count is roughly the number of email addresses in the body. The `@gmail|yahoo|ro|com` pattern is shorthand; anchor it on `@` so that bare `ro` / `com` substrings are not counted. At 15-50 hits per PDF on average, the estimated total is 200-500 additional providers.
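Steps 2-4 of the recommended approach, the text encoding of the flags, and the email-count sanity check could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the intended implementation: every name except `parseProviderTextJudetGrouped` is hypothetical, the `hints` argument is omitted, and the row-start heuristic is simplified to "the line contains an email address" instead of the capital-letter / address-prefix heuristic from step 3. Trailing DA/NU tokens are peeled off before the `\s{3,}` split, on the assumption that flag columns may be separated by fewer than three spaces.

```typescript
// Sketch only: every name except parseProviderTextJudetGrouped is hypothetical.
interface ProviderRow {
  judet: string;
  name: string;
  address: string;
  email: string | null;
  phone: string | null;
  flags: string[]; // trailing DA/NU values, in column order
}

// A judet header is a single ALL-CAPS word alone on its line.
// Both comma-below (Ș, Ț) and cedilla (Ş, Ţ) letters are accepted.
const JUDET_RE = /^\s*([A-ZĂÂÎȘȚŞŢ]{3,15})\s*$/;
// One or more DA/NU tokens at the end of a line, separated by any whitespace.
const TRAILING_FLAGS_RE = /((?:\s+(?:DA|NU))+)\s*$/;
// Loose email matcher anchored on "@".
const EMAIL_RE = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g;

function parseProviderTextJudetGrouped(text: string): ProviderRow[] {
  const rows: ProviderRow[] = [];
  let currentJudet: string | null = null;
  let current: ProviderRow | null = null;

  for (const rawLine of text.split(/\r?\n/)) {
    if (rawLine.trim() === "") {
      current = null; // a blank line closes the row being assembled
      continue;
    }
    const header = JUDET_RE.exec(rawLine);
    if (header) {
      currentJudet = header[1] ?? null;
      current = null;
      continue;
    }
    if (currentJudet === null) continue; // preamble before the first judet

    // Peel the DA/NU flags off the end, then split the rest into columns.
    const flagMatch = TRAILING_FLAGS_RE.exec(rawLine);
    const flags = flagMatch ? (flagMatch[1] ?? "").trim().split(/\s+/) : [];
    const body = flagMatch ? rawLine.slice(0, flagMatch.index) : rawLine;
    const cols = body.trim().split(/\s{3,}/);

    if (cols.some((c) => c.includes("@"))) {
      // Simplified row-start heuristic: a line with an email starts a row.
      current = {
        judet: currentJudet,
        name: cols[0] ?? "",
        address: cols[1] ?? "",
        email: cols[2] ?? null,
        phone: cols[3] ?? null,
        flags,
      };
      rows.push(current);
    } else if (current) {
      // Anything else is treated as an address continuation line.
      current.address = `${current.address} ${cols.join(" ")}`.trim();
    }
  }
  return rows;
}

// Text encoding of the flag matrix for the existing `specialitate` column.
function encodeFlags(flags: string[]): string {
  return flags.map((f, i) => `panel_${i + 1}:${f}`).join(",");
}

// Expected-row-count estimate: number of email addresses in the PDF text.
function countEmailHits(text: string): number {
  return (text.match(EMAIL_RE) || []).length;
}
```

Because a row is only opened on a line that contains an email address, the parsed row count and `countEmailHits` should agree on well-formed input; a mismatch points at lines where the heuristic needs work. Note that a provider-name fragment consisting of a single ALL-CAPS word on its own line (for example `SRL`) would be misread as a judet header, so a whitelist of judet names may be a safer match than the regex alone.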
## Defer reason

3-5h of work for an estimated 200-500 rows, which is roughly 1% of the current `cnas.furnizori` size of 36k (200-500 / 36,000 ≈ 0.6-1.4%). Lower ROI than the WSP timezone fix (restores the daily cron entirely) or ANRE electricieni (zero → ~101k rows).