initial: split from gov-agreg — vreau.digital standalone platform

Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00
commit a6c03a091e
352 changed files with 75295 additions and 0 deletions
@@ -0,0 +1,84 @@
+# CNAS Phase 2 — Layout B parser handoff
+
+State as of 2026-05-11 (after the partial C4 fix):
+- 14 PDFs were stuck at `parse_status='no_table'`.
+- Commit `bfa0b69` relaxed the `nr_crt` regex from `\s{2,}` to `\s+` (guarded
+  by a Romanian capital letter). This recovers ~3-5 of the 14 PDFs that use
+  Layout A (numbered rows).
+- The remaining ~9-11 PDFs use **Layout B** (grouped by judet, i.e. county,
+  with no row numbers) and need a separate parser path, which this handoff
+  describes.
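To make the Layout A relaxation concrete, here is a minimal sketch of what "`\s{2,}` to `\s+`, guarded by a Romanian capital letter" could look like. The two patterns below are hypothetical stand-ins, not the regex from commit `bfa0b69`; the names `NR_CRT_STRICT`, `NR_CRT_RELAXED` and `matchNrCrt` are assumptions made for illustration.

```typescript
// Hypothetical "before" pattern: row number followed by 2+ spaces.
const NR_CRT_STRICT = /^\s*(\d{1,4})\.?\s{2,}(?=\S)/;

// Hypothetical "after" pattern: a single space is enough, but only when the
// next character is a capital letter (Romanian diacritics included), so that
// stray numbers inside addresses are less likely to be read as row numbers.
const NR_CRT_RELAXED = /^\s*(\d{1,4})\.?\s+(?=[A-ZĂÂÎȘȚ])/;

// Returns the row number if the line starts a numbered row, otherwise null.
function matchNrCrt(line: string, pattern: RegExp): number | null {
  const m = pattern.exec(line);
  return m ? Number(m[1]) : null;
}
```

With these stand-ins, a line like `  12 Spitalul ...` (one space after the number) is rejected by the strict pattern and accepted by the relaxed one, while `  12 str. ...` is rejected by both.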
+
+## Layout B specimens
+
+Tested via `pdftotext -layout`:
+
+| ID | PDF file name | `tip_serviciu` | Rows visible |
+|----|-----|-----|--------------|
+| 1 | `Lista-furnizori-testare-genetica-2024-2025_all.pdf` | testare_genetica | ~15 |
+| 2 | `Lista-furnizori-tumori-solide-maligne-martie-2025.pdf` | oncologie | ~15 |
+| 14 | `Valori-de-contract-furnizori-PNS-13.11.2024.pdf` | pns | unknown |
+| 15 | `CAS-GORJ-Lista-furnizori-in-contract-PNS-01.01.2024.pdf` | pns | small (single CAS) |
+| 44 | `Valori-de-contract-pentru-furnizorii-de-servicii-medicale-de-consultatiii-de-urgenta-…` | urgenta_transport | unknown |
+| 46 | `FURNIZORI-SERVICII-ASISTENTA-MEDICALA-PRIMARA-ADMISI-IN-SESIUNEA-CONTRACTARE-NOV-2024-PENTRU-SITE-1.pdf` | medicina_familie | unknown |
+| 56 | `Lista-furnizori-radioterapie-2024.pdf` | radioterapie | small |
+| 57 | `Lista-furnizori-testare-hematologie-maligna-2024.pdf` | oncologie | small |
+| 58 | `Lista-furnizori-tumori-solide-maligne-2024.pdf` | oncologie | small |
+
+## Layout B shape (sample from testare_genetica)
+
+```
+    BIHOR
+                    SC Resident Laboratory SRL   Oradea, Str.…       email   phone   DA
+    CLUJ
+                    Institutul Oncologic …       Cluj-Napoca…        email   phone   DA  DA  DA
+                    Centrul Medical Unirea S.R.L Punct de lucru…     email   phone   DA  DA  DA
+    BUCUREȘTI
+                    Personal Genetics SRL        București sector 1… email   phone   DA
+```
+
+Key signals:
+- Single-word ALL-CAPS judet on its own line (shallow indent, ~4-12 chars).
+- Provider rows are indented to a fixed column (~20 chars left margin).
+- Multi-line addresses with continuation rows.
+- Trailing DA/NU columns indicate which test panel / service the furnizor
+  is contracted for (varies by PDF type — sometimes 1 column, sometimes 7+).
+
+## Recommended approach (~3-5h)
+
+1. **Add a 2nd parser** `parseProviderTextJudetGrouped(text, hints)` invoked
+   only when `parseProviderText` returns 0 rows AND `tip_serviciu IN
+   ('oncologie','testare_genetica','radioterapie','pns','medicina_familie')`.
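+2. **State machine**: track `currentJudet`; when a line matches
+   `^\s+([A-ZĂÂÎȘȚ]{3,15})\s*$`, update `currentJudet`. The class should also
+   accept spelling variants: `BUCURESTI` already matches, but the cedilla form
+   `BUCUREŞTI` needs `Ş`/`Ţ` added to the class, and multi-word names such as
+   `SATU MARE` would need a wider pattern. When the next line is indented and
+   non-empty, treat it as the start of a row.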
+3. **Row assembly**: gather lines until the next judet header, the next
+   blank-line block, or the next provider name (heuristic: the line starts
+   with a capital letter and does not start with `Str.` / `Mun.` / `sector` /
+   `nr.` / a city name).
+4. **Column extraction**: split by `\s{3,}` like the existing parser, with
+   the fixed mapping col 0 = name, col 1 = address, col 2 = email,
+   col 3 = phone, cols 4+ = DA/NU flags. Either capture the flags in a
+   structured `specialitate_jsonb` field (needs a schema migration) or
+   collapse them into comma-separated text in the existing `specialitate`
+   column.
+5. **Judet override**: when judet is detected from PDF body, override the
+   filename-derived judet in cnas.furnizori per-row.
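Steps 2-4 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the final parser: the `JudetRow` shape and the `CONTINUATION` prefix list are hypothetical, the `hints` argument and the city-name check are omitted, and the header class is widened with the cedilla forms `Ş`/`Ţ`. Trailing flags are re-split on any whitespace because `DA  DA  DA` is separated by fewer than three spaces.

```typescript
// Hypothetical row shape; the real cnas.furnizori columns may differ.
interface JudetRow {
  judet: string;
  name: string;
  address: string;
  email: string;
  phone: string;
  flags: string[]; // trailing DA/NU cells, in column order
}

// Judet header: one ALL-CAPS word alone on an indented line.
// Includes both comma-below and cedilla forms of S/T.
const JUDET_HEADER = /^\s+([A-ZĂÂÎȘȚŞŢ]{3,15})\s*$/;

// Address-continuation prefixes from step 3 (city names not handled here).
const CONTINUATION = /^(Str\.|Mun\.|sector\b|nr\.)/i;

function parseProviderTextJudetGrouped(text: string): JudetRow[] {
  const rows: JudetRow[] = [];
  let currentJudet: string | null = null;
  let current: JudetRow | null = null;

  for (const line of text.split("\n")) {
    if (line.trim() === "") {
      current = null; // a blank line closes the open row
      continue;
    }
    const header = JUDET_HEADER.exec(line);
    if (header) {
      currentJudet = header[1] ?? null;
      current = null;
      continue;
    }
    if (currentJudet === null) continue; // preamble before the first judet

    const cells = line.trim().split(/\s{3,}/);
    const first = cells[0] ?? "";
    if (current !== null && (cells.length === 1 || CONTINUATION.test(first))) {
      // Continuation line: fold it into the address of the open row.
      current.address = `${current.address} ${cells.join(" ")}`.trim();
      continue;
    }
    current = {
      judet: currentJudet,
      name: first,
      address: cells[1] ?? "",
      email: cells[2] ?? "",
      phone: cells[3] ?? "",
      flags: cells.slice(4).join(" ").split(/\s+/)
        .filter((c) => c === "DA" || c === "NU"),
    };
    rows.push(current);
  }
  return rows;
}
```

Known gaps in this sketch: a name and address separated by a single space (as in the `Centrul Medical Unirea S.R.L Punct de lucru…` sample line) land in the same cell, and a one-word ALL-CAPS continuation line would be misread as a judet header.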
+
+## Schema-change consideration
+
+To preserve the DA/NU flag matrix, add a `specialitate_jsonb` column to
+`cnas.furnizori` (or reuse the existing `specialitate` text column with a
+serialized string like `"panel_1:DA,panel_2:DA,panel_3:NU"`). The existing
+column suffices for v1 if we encode the flags as text.
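The text encoding for v1 could look like the sketch below. The `panel_N` keys follow the example string above; the helper names `encodeFlags` and `decodeFlags` are assumptions, and positional keys are only a placeholder until real panel names are known per PDF type.

```typescript
// Serialize positional DA/NU flags as "panel_1:DA,panel_2:NU,...".
function encodeFlags(flags: string[]): string {
  return flags.map((flag, i) => `panel_${i + 1}:${flag}`).join(",");
}

// Parse the serialized string back into a key -> boolean map (DA = true).
function decodeFlags(encoded: string): Record<string, boolean> {
  const out: Record<string, boolean> = {};
  if (encoded === "") return out;
  for (const pair of encoded.split(",")) {
    const [key, value] = pair.split(":");
    if (key && value) out[key] = value === "DA";
  }
  return out;
}
```

Because the decoded map is already key/value shaped, the same data could later be moved into a `specialitate_jsonb` column without re-parsing the PDFs.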
+
+## Testing
+
+Cache the 9-11 PDFs locally (`/tmp/cnas-pdfs/`) and run the parser
+unit-style. For each PDF, the expected row count is roughly the number of
+email addresses in the body (hits such as `@gmail`, `@yahoo`, `.ro`, `.com`),
+15-50 per PDF on average, for an estimated total of 200-500 additional
+providers.
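The email-count baseline can be computed with a few lines. This is a rough sketch: the `EMAIL` pattern is a generic approximation rather than a full address validator, and `estimateRowCount` is a hypothetical helper name.

```typescript
// Loose email matcher: local part, "@", domain with at least one dot.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Approximate the expected provider count as the number of email hits
// in the `pdftotext -layout` output of one PDF.
function estimateRowCount(text: string): number {
  return (text.match(EMAIL) ?? []).length;
}
```

Comparing `estimateRowCount(text)` with the number of rows the new parser returns gives a cheap per-PDF sanity check. Providers that list two addresses, or none, will skew the estimate, so treat it as a ballpark rather than an exact target.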
+
+## Defer reason
+
+3-5h of work for an estimated 200-500 rows (roughly 1% of the current
+`cnas.furnizori` size, which is 36k). Lower ROI than the WSP timezone fix
+(restores daily cron entirely) or ANRE electricieni (zero → ~101k rows).