Files
Claude VM a6c03a091e initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix).
- 22 pages migrated, 127 files total
- All internal links: /achizitii/X → /X (176 occurrences fixed)
- AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub
- BaseLayout new (vreau.digital branding, OG tags, site URL)
- astro.config.mjs: site https://vreau.digital, server output (was static)
- docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital
- deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log

Backend shared with gov-agreg:
- PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...)
- Photon, Martin tiles
- Infisical /vreaudigital path (DATABASE_URL etc. shared)

build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
2026-05-13 00:10:32 +03:00

8.4 KiB
Raw Permalink Blame History

Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap

Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.

Stage 1 — DONE in this session

What was built:

  • services/seap-scraper/sql/035_curteacont.sql — schema:
    • curteacont.rapoarte (PK slug_id = sha1(category|slug))
    • curteacont.scrape_runs (one row per CLI invocation)
  • services/seap-scraper/src/scrape-curteacont.ts — listing-page walker:
    • Three sources: financiar, conformitate, performanta
    • Parses title → audit_year, doc_number, doc_date, audited_entity_name
    • Detects follow-up reports (title prefix Follow-up)
    • Reads <time datetime>publication_date
    • Idempotent UPSERT on slug_id
  • services/seap-scraper/cron/scrape-curteacont.sh — Infisical → docker run --env-file wrapper. Mirrors scrape-anre.sh. NODE_TLS_REJECT_UNAUTHORIZED=0 required (CdC serves an intermediate CA chain node's bundle doesn't trust).

Stage 1 ingest stats (2026-05-10):

category universe ingested parse rate (entity+doc_date)
financiar ~1,890 500 100%
conformitate ~2,580 500 TBD (similar pattern)
performanta ~135 133 100%
total ~4,605 1,133

Speed: ~25s per 500 reports (gentle 600ms delay between pages).

Page-count reference (verified by probing 2026-05-10)

financiar    ~127 pages × 15 = ~1,890 reports (last page=127 had 14)
conformitate ~173 pages × 15 = ~2,580 reports (last page=173 had 14)
performanta    9 pages × 15 = ~135 reports   (last page=9 had 13)

Run a full backfill:

sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh

Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.

Stage 2 — TODO (next session, ~6-10h focused work)

Goal: resolve numeric download_id, mirror PDFs, parse first 3 pages, fuzzy-match audited_entity_cui.

2.1 — Resolve download_id from detail pages (~2h)

For each row with download_id IS NULL:

  1. Fetch detail_url.
  2. Regex /rapoarte-audit/downloads/(\d+)download_id.
  3. Regex \(([0-9,]+) (KB|MB|GB)\) next to download anchor → pdf_size_bytes.
  4. UPSERT.

Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as scrape-curteacont-resolve.ts --batch=100. Idempotent on slug_id.

2.2 — Mirror PDFs to satra disk (~3-4h, optional)

  • Path: /opt/vreaudigital/data/cdc/{category}/{download_id}.pdf
  • Skip if pdf_path IS NOT NULL AND file exists.
  • Average size: ~2-3 MB → ~12-15 GB total for full corpus.
  • Update pdf_path after successful download.

2.3 — PDF first-page abstract + findings count (~2-3h)

  • Use pdftotext (poppler) — already on satra. Faster than pdfminer.
  • Read first 3 pages → summary (cleaned, dehyphenated text, 4-8 KB).
  • Count occurrences of "constatare", "abateri", "deficiență" → findings_count.
  • Some reports have a "Sinteza constatărilor" section — cheap regex to find it.

2.4 — CUI fuzzy match against firms.entities (~2h)

  • We already have services/seap-scraper/src/matching/cui-matcher.ts (commit f3477e2 — "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id] profile page"). Reuse it.
  • Input: audited_entity_name (already populated by Stage 1).
  • Strategy:
    1. Exact match against firms.entities.denumire — high confidence.
    2. Trigram similarity (pg_trgm, index already exists) for top-3 candidates, then UAT-aware ranking (UATC = comună, UATM = municipiu, UATO = oraș, UATJ = județ). Most CdC entities are UATs — this is high-leverage.
    3. Fallback: store best-similarity score + leave NULL if < 0.6.
  • Update audited_entity_cui.
  • Expect 70-80% match rate on first pass; manual cleanup later.

3. Cross-source recipe drafts (draft SQL)

These SQLs reference Stage 2 data (audited_entity_cui populated). They give the strategic value of CdC ingest — per-CUI audit history × SEAP awards.

Recipe A — "Top autorități audited de N ori în 5 ani"

Repeat-audit signal: agencies audited many times in a short window typically have persistent issues. Powerful for the "Profil autoritate" page.

SELECT
  r.audited_entity_cui,
  fe.denumire,
  count(*) AS audit_count_5y,
  count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
  count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
  max(r.publication_date) AS last_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
  AND r.publication_date > now() - interval '5 years'
GROUP BY r.audited_entity_cui, fe.denumire
HAVING count(*) >= 3
ORDER BY audit_count_5y DESC, last_audit DESC
LIMIT 50;

Recipe B — "Spitale audited POST SEAP award" (paralelă cu CNAS)

Match SEAP contracts at hospitals against CdC audits issued AFTER award. A red-flag indicator that the procurement raised audit attention.

WITH hospital_seap AS (
  SELECT
    s.contracting_authority_cui AS cui,
    s.contracting_authority_name AS denumire,
    s.id AS seap_id,
    s.award_date,
    s.contract_value
  FROM seap.announcements s
  JOIN cnas.spitale_furnizori cf ON cf.cui = s.contracting_authority_cui
  WHERE s.award_date > now() - interval '5 years'
)
SELECT
  hs.cui,
  hs.denumire,
  count(DISTINCT hs.seap_id) AS seap_awards,
  sum(hs.contract_value)     AS total_value_ron,
  count(DISTINCT r.slug_id) FILTER (
    WHERE r.publication_date > hs.award_date
  )                           AS audits_after_award,
  array_agg(DISTINCT r.audit_type) FILTER (WHERE r.publication_date > hs.award_date) AS audit_types
FROM hospital_seap hs
LEFT JOIN curteacont.rapoarte r ON r.audited_entity_cui = hs.cui
GROUP BY hs.cui, hs.denumire
HAVING count(DISTINCT r.slug_id) FILTER (WHERE r.publication_date > hs.award_date) > 0
ORDER BY audits_after_award DESC, total_value_ron DESC
LIMIT 50;

Recipe C — "Autorități cu audit follow-up — probleme persistente"

Follow-up reports = CdC came back to verify whether earlier findings were remediated. Existence of follow-ups means the original audit had material issues. Cross-link to financial dependency on state contracts.

SELECT
  r.audited_entity_cui,
  fe.denumire,
  fe.judet,
  count(*) FILTER (WHERE r.audit_type = 'follow-up')  AS follow_ups,
  count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
  array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
  -- Cross-source: SEAP wins in same window
  (SELECT count(*) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
  (SELECT sum(contract_value) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
LIMIT 50;

4. Operational notes

  • TLS bypass: NODE_TLS_REJECT_UNAUTHORIZED=0 is set in the cron wrapper — required because curteadeconturi.ro serves an intermediate CA chain that Node's bundled CA store doesn't trust. Cert is valid OOB (browser trusts it, Linux ca-certificates trusts it). Same workaround as scrape-anre.sh.
  • Gentle pacing: 600ms between page fetches. Site is on shared infra, no rate-limit headers observed. Stay polite.
  • Stable IDs: Slugs are stable (we verified 7 historical IDs in scope). slug_id = sha1(category|slug) PK survives slug renames within category if CdC ever changes URLs (would re-insert as "new" — acceptable trade-off).
  • Cron suggestion: weekly. New audits drip in at ~5-15/day on financiar. 45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh

5. Files

  • services/seap-scraper/sql/035_curteacont.sql
  • services/seap-scraper/src/scrape-curteacont.ts
  • services/seap-scraper/cron/scrape-curteacont.sh
  • services/seap-scraper/CURTEACONT-PLAN.md (this file)