Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
8.4 KiB
Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap
Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.
Stage 1 — DONE in this session
What was built:
services/seap-scraper/sql/035_curteacont.sql— schema:curteacont.rapoarte(PKslug_id= sha1(category|slug))curteacont.scrape_runs(one row per CLI invocation)
services/seap-scraper/src/scrape-curteacont.ts— listing-page walker:- Three sources:
financiar,conformitate,performanta - Parses title →
audit_year,doc_number,doc_date,audited_entity_name - Detects follow-up reports (title prefix
Follow-up) - Reads
<time datetime>→publication_date - Idempotent UPSERT on
slug_id
- Three sources:
services/seap-scraper/cron/scrape-curteacont.sh— Infisical → docker run --env-file wrapper. Mirrorsscrape-anre.sh. NODE_TLS_REJECT_UNAUTHORIZED=0 required (CdC serves an intermediate CA chain node's bundle doesn't trust).
Stage 1 ingest stats (2026-05-10):
| category | universe | ingested | parse rate (entity+doc_date) |
|---|---|---|---|
| financiar | ~1,890 | 500 | 100% |
| conformitate | ~2,580 | 500 | TBD (similar pattern) |
| performanta | ~135 | 133 | 100% |
| total | ~4,605 | 1,133 | — |
Speed: ~25s per 500 reports (gentle 600ms delay between pages).
Page-count reference (verified by probing 2026-05-10)
financiar ~127 pages × 15 = ~1,890 reports (last page=127 had 14)
conformitate ~173 pages × 15 = ~2,580 reports (last page=173 had 14)
performanta 9 pages × 15 = ~135 reports (last page=9 had 13)
Run a full backfill:
sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.
Stage 2 — TODO (next session, ~6-10h focused work)
Goal: resolve numeric download_id, mirror PDFs, parse first 3 pages, fuzzy-match audited_entity_cui.
2.1 — Resolve download_id from detail pages (~2h)
For each row with download_id IS NULL:
- Fetch
detail_url. - Regex
/rapoarte-audit/downloads/(\d+)→download_id. - Regex
\(([0-9,]+) (KB|MB|GB)\)next to download anchor →pdf_size_bytes. - UPSERT.
Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as
scrape-curteacont-resolve.ts --batch=100. Idempotent on slug_id.
2.2 — Mirror PDFs to satra disk (~3-4h, optional)
- Path:
/opt/vreaudigital/data/cdc/{category}/{download_id}.pdf - Skip if
pdf_path IS NOT NULLAND file exists. - Average size: ~2-3 MB → ~12-15 GB total for full corpus.
- Update
pdf_pathafter successful download.
2.3 — PDF first-page abstract + findings count (~2-3h)
- Use
pdftotext(poppler) — already on satra. Faster than pdfminer. - Read first 3 pages →
summary(cleaned, dehyphenated text, 4-8 KB). - Count occurrences of "constatare", "abateri", "deficiență" →
findings_count. - Some reports have a "Sinteza constatărilor" section — cheap regex to find it.
2.4 — CUI fuzzy match against firms.entities (~2h)
- We already have
services/seap-scraper/src/matching/cui-matcher.ts(commit f3477e2 — "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id] profile page"). Reuse it. - Input:
audited_entity_name(already populated by Stage 1). - Strategy:
- Exact match against
firms.entities.denumire— high confidence. - Trigram similarity (
pg_trgm, index already exists) for top-3 candidates, then UAT-aware ranking (UATC = comună, UATM = municipiu, UATO = oraș, UATJ = județ). Most CdC entities are UATs — this is high-leverage. - Fallback: store best-similarity score + leave NULL if < 0.6.
- Exact match against
- Update
audited_entity_cui. - Expect 70-80% match rate on first pass; manual cleanup later.
3. Cross-source recipe drafts (draft SQL)
These SQLs reference Stage 2 data (audited_entity_cui populated). They give
the strategic value of CdC ingest — per-CUI audit history × SEAP awards.
Recipe A — "Top autorități audited de N ori în 5 ani"
Repeat-audit signal: agencies audited many times in a short window typically have persistent issues. Powerful for the "Profil autoritate" page.
SELECT
r.audited_entity_cui,
fe.denumire,
count(*) AS audit_count_5y,
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
max(r.publication_date) AS last_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
AND r.publication_date > now() - interval '5 years'
GROUP BY r.audited_entity_cui, fe.denumire
HAVING count(*) >= 3
ORDER BY audit_count_5y DESC, last_audit DESC
LIMIT 50;
Recipe B — "Spitale audited POST SEAP award" (paralelă cu CNAS)
Match SEAP contracts at hospitals against CdC audits issued AFTER award. A red-flag indicator that the procurement raised audit attention.
WITH hospital_seap AS (
SELECT
s.contracting_authority_cui AS cui,
s.contracting_authority_name AS denumire,
s.id AS seap_id,
s.award_date,
s.contract_value
FROM seap.announcements s
JOIN cnas.spitale_furnizori cf ON cf.cui = s.contracting_authority_cui
WHERE s.award_date > now() - interval '5 years'
)
SELECT
hs.cui,
hs.denumire,
count(DISTINCT hs.seap_id) AS seap_awards,
sum(hs.contract_value) AS total_value_ron,
count(DISTINCT r.slug_id) FILTER (
WHERE r.publication_date > hs.award_date
) AS audits_after_award,
array_agg(DISTINCT r.audit_type) FILTER (WHERE r.publication_date > hs.award_date) AS audit_types
FROM hospital_seap hs
LEFT JOIN curteacont.rapoarte r ON r.audited_entity_cui = hs.cui
GROUP BY hs.cui, hs.denumire
HAVING count(DISTINCT r.slug_id) FILTER (WHERE r.publication_date > hs.award_date) > 0
ORDER BY audits_after_award DESC, total_value_ron DESC
LIMIT 50;
Recipe C — "Autorități cu audit follow-up — probleme persistente"
Follow-up reports = CdC came back to verify whether earlier findings were remediated. Existence of follow-ups means the original audit had material issues. Cross-link to financial dependency on state contracts.
SELECT
r.audited_entity_cui,
fe.denumire,
fe.judet,
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
-- Cross-source: SEAP wins in same window
(SELECT count(*) FROM seap.announcements s
WHERE s.contracting_authority_cui = r.audited_entity_cui
AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
(SELECT sum(contract_value) FROM seap.announcements s
WHERE s.contracting_authority_cui = r.audited_entity_cui
AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
LIMIT 50;
4. Operational notes
- TLS bypass:
NODE_TLS_REJECT_UNAUTHORIZED=0is set in the cron wrapper — required because curteadeconturi.ro serves an intermediate CA chain that Node's bundled CA store doesn't trust. Cert is valid OOB (browser trusts it, Linux ca-certificates trusts it). Same workaround asscrape-anre.sh. - Gentle pacing: 600ms between page fetches. Site is on shared infra, no rate-limit headers observed. Stay polite.
- Stable IDs: Slugs are stable (we verified 7 historical IDs in scope).
slug_id = sha1(category|slug)PK survives slug renames within category if CdC ever changes URLs (would re-insert as "new" — acceptable trade-off). - Cron suggestion: weekly. New audits drip in at ~5-15/day on financiar.
45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
5. Files
services/seap-scraper/sql/035_curteacont.sqlservices/seap-scraper/src/scrape-curteacont.tsservices/seap-scraper/cron/scrape-curteacont.shservices/seap-scraper/CURTEACONT-PLAN.md(this file)