# Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap
Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.
## Stage 1 — DONE in this session
What was built:
- `services/seap-scraper/sql/035_curteacont.sql`: schema
  - `curteacont.rapoarte` (PK `slug_id` = sha1(category|slug))
  - `curteacont.scrape_runs` (one row per CLI invocation)
- `services/seap-scraper/src/scrape-curteacont.ts`: listing-page walker
  - Three sources: `financiar`, `conformitate`, `performanta`
  - Parses title → `audit_year`, `doc_number`, `doc_date`, `audited_entity_name`
  - Detects follow-up reports (title prefix `Follow-up`)
  - Reads `<time datetime>` → `publication_date`
  - Idempotent UPSERT on `slug_id`
- `services/seap-scraper/cron/scrape-curteacont.sh`: Infisical → `docker run --env-file`
  wrapper. Mirrors `scrape-anre.sh`. `NODE_TLS_REJECT_UNAUTHORIZED=0` is
  required (CdC serves an intermediate CA chain that Node's CA bundle doesn't trust).
Stage 1 ingest stats (2026-05-10):
| category | universe | ingested | parse rate (entity+doc_date) |
|-------------|----------|----------|-------------------------------|
| financiar | ~1,890 | 500 | 100% |
| conformitate| ~2,580 | 500 | TBD (similar pattern) |
| performanta | ~135 | 133 | 100% |
| **total** | **~4,605** | **1,133** | — |
Speed: ~25s per 500 reports (gentle 600ms delay between pages).
## Page-count reference (verified by probing 2026-05-10)
```
financiar ~127 pages × 15 = ~1,890 reports (last page=127 had 14)
conformitate ~173 pages × 15 = ~2,580 reports (last page=173 had 14)
performanta 9 pages × 15 = ~135 reports (last page=9 had 13)
```
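
The `~` totals above are rounded down to full pages. As a cross-check, the exact count implied by the probe is (pages - 1) × 15 plus the size of the last page. A minimal sketch of that arithmetic follows; the helper name `exactCount` is illustrative and not part of the scraper.

```typescript
// Exact report count implied by a paginated listing:
// every page except the last is full, the last page holds `lastPageRows`.
const PAGE_SIZE = 15; // rows per listing page, per the probe above

function exactCount(pages: number, lastPageRows: number): number {
  return (pages - 1) * PAGE_SIZE + lastPageRows;
}

const financiar = exactCount(127, 14);    // 126 * 15 + 14 = 1904
const conformitate = exactCount(173, 14); // 172 * 15 + 14 = 2594
const performanta = exactCount(9, 13);    //   8 * 15 + 13 = 133
const total = financiar + conformitate + performanta; // 4631
```

The `performanta` result (133) equals the ingested count in the Stage 1 table. The other two land slightly above the rounded figures (~1,890 and ~2,580).
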
Run a full backfill:
```bash
sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
```
Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.
## Stage 2 — TODO (next session, ~6-10h focused work)
Goal: resolve numeric `download_id`, mirror PDFs, parse first 3 pages, fuzzy-match `audited_entity_cui`.
### 2.1 — Resolve `download_id` from detail pages (~2h)
For each row with `download_id IS NULL`:
1. Fetch `detail_url`.
2. Regex `/rapoarte-audit/downloads/(\d+)` → `download_id`.
3. Regex `\(([0-9,]+) (KB|MB|GB)\)` next to download anchor → `pdf_size_bytes`.
4. UPSERT.
Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as
`scrape-curteacont-resolve.ts --batch=100`. Idempotent on `slug_id`.
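
Steps 2 and 3 can be sketched as a pure function over the detail-page HTML. This is a rough sketch, not the planned `scrape-curteacont-resolve.ts`: the names `resolveDownload` and `ResolvedDownload` are hypothetical, the sample HTML is made up, and two assumptions are baked in. The comma in the size is read as a decimal separator, and 1 KB is taken as 1024 bytes. For simplicity it takes the first size match anywhere in the page rather than the one next to the download anchor.

```typescript
// Hypothetical result shape for one detail page.
interface ResolvedDownload {
  downloadId: number | null;
  pdfSizeBytes: number | null;
}

// ASSUMPTION: binary units (1 KB = 1024 bytes).
const UNIT_BYTES: { [unit: string]: number } = {
  KB: 1024,
  MB: 1024 * 1024,
  GB: 1024 * 1024 * 1024,
};

function resolveDownload(html: string): ResolvedDownload {
  // The two regexes are the ones listed in steps 2 and 3.
  const idMatch = /\/rapoarte-audit\/downloads\/(\d+)/.exec(html);
  const sizeMatch = /\(([0-9,]+) (KB|MB|GB)\)/.exec(html);

  let pdfSizeBytes: number | null = null;
  if (sizeMatch) {
    // ASSUMPTION: "2,5 MB" means 2.5 MB (comma as decimal separator).
    const value = parseFloat(sizeMatch[1].replace(",", "."));
    pdfSizeBytes = Math.round(value * UNIT_BYTES[sizeMatch[2]]);
  }

  return {
    downloadId: idMatch ? parseInt(idMatch[1], 10) : null,
    pdfSizeBytes: pdfSizeBytes,
  };
}

// Made-up fragment, only to show the call shape.
const sampleHtml = '<a href="/rapoarte-audit/downloads/12345">PDF</a> (2,5 MB)';
const resolved = resolveDownload(sampleHtml);
```

If the site turns out to use the comma as a thousands separator, only the `replace` call needs to change.
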
### 2.2 — Mirror PDFs to satra disk (~3-4h, optional)
- Path: `/opt/vreaudigital/data/cdc/{category}/{download_id}.pdf`
- Skip if `pdf_path IS NOT NULL` AND file exists.
- Average size: ~2-3 MB → roughly 9-14 GB total for the full corpus (~4,600 reports).
- Update `pdf_path` after successful download.
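
The path template and the skip rule above can be expressed as two small pure functions. This is only a sketch: `RaportRow`, `pdfTargetPath`, and `needsDownload` are hypothetical names, the row shape is assumed from the columns mentioned in this plan, and the file-existence check is passed in so the logic stays testable without touching disk.

```typescript
// Hypothetical row shape; column names follow the plan.
interface RaportRow {
  category: string;
  download_id: number | null;
  pdf_path: string | null;
}

const CDC_ROOT = "/opt/vreaudigital/data/cdc";

// Builds /opt/vreaudigital/data/cdc/{category}/{download_id}.pdf
function pdfTargetPath(row: RaportRow): string | null {
  if (row.download_id === null) return null; // Stage 2.1 has not resolved it yet
  return CDC_ROOT + "/" + row.category + "/" + row.download_id + ".pdf";
}

// Skip rule: pdf_path is set AND the file exists on disk.
function needsDownload(
  row: RaportRow,
  fileExists: (path: string) => boolean
): boolean {
  if (row.download_id === null) return false; // nothing to fetch yet
  if (row.pdf_path !== null && fileExists(row.pdf_path)) return false;
  return true;
}
```

A row whose `pdf_path` is set but whose file has gone missing is downloaded again, which matches the "AND file exists" condition.
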
### 2.3 — PDF first-page abstract + findings count (~2-3h)
- Use `pdftotext` (poppler) — already on satra. Faster than pdfminer.
- Read first 3 pages → `summary` (cleaned, dehyphenated text, 4-8 KB).
- Count occurrences of "constatare", "abateri", "deficiență" → `findings_count`.
- Some reports have a "Sinteza constatărilor" section — cheap regex to find it.
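
The clean-up and counting steps could look roughly like the sketch below, applied to whatever `pdftotext` returns for the first 3 pages. `cleanText` and `countFindings` are hypothetical names. Matching here is plain case-insensitive substring counting, which is an assumption about what "count occurrences" should mean.

```typescript
// Keywords from the plan above.
const FINDING_KEYWORDS = ["constatare", "abateri", "deficiență"];

// Join words hyphenated across a line break ("defici-\nență" -> "deficiență"),
// then collapse every whitespace run into a single space.
function cleanText(raw: string): string {
  return raw
    .replace(/([A-Za-zĂÂÎȘȚăâîșț])-\s*\n\s*([A-Za-zĂÂÎȘȚăâîșț])/g, "$1$2")
    .replace(/\s+/g, " ")
    .trim();
}

// Total number of keyword occurrences, case-insensitive.
function countFindings(text: string): number {
  const haystack = text.toLowerCase();
  let total = 0;
  for (const keyword of FINDING_KEYWORDS) {
    // split() yields one more piece than there are occurrences
    total += haystack.split(keyword).length - 1;
  }
  return total;
}
```

Two caveats with this approach: a genuinely hyphenated compound that happens to break at the line end gets merged, and PDFs that encode `ț`/`ș` with the older cedilla code points would not match the keyword list without an extra normalization step.
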
### 2.4 — CUI fuzzy match against `firms.entities` (~2h)
- We already have `services/seap-scraper/src/matching/cui-matcher.ts`
  (commit f3477e2, "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id]
  profile page"). Reuse it.
- Input: `audited_entity_name` (already populated by Stage 1).
- Strategy:
  1. Exact match against `firms.entities.denumire`: high confidence.
  2. Trigram similarity (`pg_trgm`, index already exists) for top-3 candidates,
     then UAT-aware ranking (UATC = comună / commune, UATM = municipiu / municipality,
     UATO = oraș / town, UATJ = județ / county). Most CdC entities are UATs, so
     this step is high-leverage.
  3. Fallback: store the best similarity score and leave `audited_entity_cui`
     NULL if the score is below 0.6.
- Update `audited_entity_cui`.
- Expect a 70-80% match rate on the first pass; manual cleanup later.
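
The decision logic of steps 1 and 3 can be sketched as below. This is not the existing `cui-matcher.ts`, whose API is not shown here: `Candidate`, `MatchResult`, and `decideMatch` are hypothetical, the candidates are assumed to arrive with a similarity score already computed (for example by `pg_trgm`), and "exact" is interpreted as equal after trimming, whitespace collapsing, and upper-casing. The UAT-aware ranking of step 2 is left out.

```typescript
interface Candidate {
  cui: string;
  denumire: string;
  similarity: number; // 0..1, computed upstream
}

interface MatchResult {
  cui: string | null; // value to write into audited_entity_cui
  score: number;      // best similarity seen, kept even when cui stays NULL
  method: "exact" | "trigram" | "none";
}

const MIN_SIMILARITY = 0.6; // threshold from step 3

// ASSUMPTION: names are compared after trimming and upper-casing.
function normalize(name: string): string {
  return name.trim().replace(/\s+/g, " ").toUpperCase();
}

function decideMatch(auditedName: string, candidates: Candidate[]): MatchResult {
  const target = normalize(auditedName);
  let best: Candidate | null = null;
  for (const c of candidates) {
    if (normalize(c.denumire) === target) {
      return { cui: c.cui, score: 1, method: "exact" };
    }
    if (best === null || c.similarity > best.similarity) best = c;
  }
  if (best === null) return { cui: null, score: 0, method: "none" };
  if (best.similarity < MIN_SIMILARITY) {
    return { cui: null, score: best.similarity, method: "none" };
  }
  return { cui: best.cui, score: best.similarity, method: "trigram" };
}
```

Keeping the score on rejected matches makes the later manual cleanup easier, since near-misses just under 0.6 can be reviewed first.
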
## 3. Cross-source recipes (draft SQL)
These queries depend on Stage 2 data (`audited_entity_cui` populated). They show
the strategic value of the CdC ingest: per-CUI audit history crossed with SEAP awards.
### Recipe A — "Top authorities audited N times in 5 years"
Repeat-audit signal: agencies audited many times in a short window typically
have persistent issues. Useful for the "Profil autoritate" (authority profile) page.
```sql
SELECT
  r.audited_entity_cui,
  fe.denumire,
  count(*) AS audit_count_5y,
  count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
  count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
  max(r.publication_date) AS last_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
  AND r.publication_date > now() - interval '5 years'
GROUP BY r.audited_entity_cui, fe.denumire
HAVING count(*) >= 3
ORDER BY audit_count_5y DESC, last_audit DESC
LIMIT 50;
```
### Recipe B — "Hospitals audited after a SEAP award" (parallel to CNAS)
Match SEAP contracts at hospitals against CdC audits published after the award.
An audit that follows an award is a possible red flag that the procurement drew audit attention.
```sql
WITH hospital_seap AS (
  -- Assumes one row per CUI in cnas.spitale_furnizori; duplicates there would repeat awards.
  SELECT
    s.contracting_authority_cui AS cui,
    s.contracting_authority_name AS denumire,
    s.id AS seap_id,
    s.award_date,
    s.contract_value
  FROM seap.announcements s
  JOIN cnas.spitale_furnizori cf ON cf.cui = s.contracting_authority_cui
  WHERE s.award_date > now() - interval '5 years'
),
-- Aggregate awards per hospital BEFORE joining audits. Joining first would
-- repeat each award once per audit and inflate sum(contract_value).
hospital_totals AS (
  SELECT
    cui,
    denumire,
    count(DISTINCT seap_id) AS seap_awards,
    sum(contract_value) AS total_value_ron,
    min(award_date) AS first_award_date
  FROM hospital_seap
  GROUP BY cui, denumire
)
SELECT
  ht.cui,
  ht.denumire,
  ht.seap_awards,
  ht.total_value_ron,
  -- An audit published after the earliest award is "after award" for at least one contract.
  count(DISTINCT r.slug_id) AS audits_after_award,
  array_agg(DISTINCT r.audit_type) AS audit_types
FROM hospital_totals ht
JOIN curteacont.rapoarte r
  ON r.audited_entity_cui = ht.cui
 AND r.publication_date > ht.first_award_date
GROUP BY ht.cui, ht.denumire, ht.seap_awards, ht.total_value_ron
ORDER BY audits_after_award DESC, total_value_ron DESC
LIMIT 50;
```
### Recipe C — "Authorities with follow-up audits" (persistent issues)
Follow-up reports = CdC came back to verify whether earlier findings were
remediated. Existence of follow-ups means the original audit had material
issues. Cross-link to financial dependency on state contracts.
```sql
SELECT
  r.audited_entity_cui,
  fe.denumire,
  fe.judet,
  count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
  count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
  array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
  -- Cross-source: SEAP wins in same window
  (SELECT count(*) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
  (SELECT sum(contract_value) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
LIMIT 50;
```
## 4. Operational notes
- **TLS bypass**: `NODE_TLS_REJECT_UNAUTHORIZED=0` is set in the cron wrapper.
  It is required because curteadeconturi.ro serves an intermediate CA chain that
  Node's bundled CA store doesn't trust. The certificate is valid out of the box
  elsewhere (browsers and Linux `ca-certificates` trust it). Same workaround as
  `scrape-anre.sh`.
- **Gentle pacing**: 600ms between page fetches. The site is on shared infra and
  no rate-limit headers were observed. Stay polite.
- **Stable IDs**: Slugs are stable (we verified 7 historical IDs in scope).
  The PK `slug_id = sha1(category|slug)` is deterministic, so re-runs upsert the
  same row. It does not survive a slug rename: if CdC ever changes URLs, the
  report would be re-inserted as "new" (acceptable trade-off).
- **Cron suggestion**: weekly. New audits drip in at ~5-15/day on financiar.
  `45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh`
## 5. Files
- `services/seap-scraper/sql/035_curteacont.sql`
- `services/seap-scraper/src/scrape-curteacont.ts`
- `services/seap-scraper/cron/scrape-curteacont.sh`
- `services/seap-scraper/CURTEACONT-PLAN.md` (this file)