vreau-digital/services/seap-scraper/CURTEACONT-PLAN.md

# Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap

Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.

## Stage 1 — DONE in this session

What was built:

- `services/seap-scraper/sql/035_curteacont.sql` — schema:
  - `curteacont.rapoarte` (PK `slug_id` = sha1(category|slug))
  - `curteacont.scrape_runs` (one row per CLI invocation)
- `services/seap-scraper/src/scrape-curteacont.ts` — listing-page walker:
  - Three sources: `financiar`, `conformitate`, `performanta`
  - Parses title → `audit_year`, `doc_number`, `doc_date`, `audited_entity_name`
  - Detects follow-up reports (title prefix `Follow-up`)
  - Reads `<time datetime>` → `publication_date`
  - Idempotent UPSERT on `slug_id`
- `services/seap-scraper/cron/scrape-curteacont.sh` — Infisical → docker run
  --env-file wrapper. Mirrors `scrape-anre.sh`. Sets `NODE_TLS_REJECT_UNAUTHORIZED=0`,
  which is required because CdC serves an intermediate CA chain that Node's
  bundled CA store doesn't trust.
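
A minimal sketch of the two deterministic pieces above, the `slug_id` key and follow-up detection. The function names, hex encoding, UTF-8 input, and case-insensitive prefix check are assumptions for illustration; the real `scrape-curteacont.ts` may differ.

```typescript
import { createHash } from "node:crypto";

type Category = "financiar" | "conformitate" | "performanta";

// slug_id = sha1(category|slug). Hex output and UTF-8 input are assumptions;
// the plan only fixes the hash input format.
function slugId(category: Category, slug: string): string {
  return createHash("sha1").update(`${category}|${slug}`, "utf8").digest("hex");
}

// Follow-up detection by title prefix. Trimming and case-insensitivity are
// assumptions layered on the documented `Follow-up` prefix.
function isFollowUp(title: string): boolean {
  return /^follow-up/i.test(title.trim());
}
```

Because the category is part of the hash input, the same slug under two categories yields two distinct keys, and re-running the scraper on an unchanged slug always produces the same key, which is what makes the UPSERT idempotent.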

Stage 1 ingest stats (2026-05-10):

| category     | universe   | ingested  | parse rate (entity + doc_date) |
|--------------|------------|-----------|--------------------------------|
| financiar    | ~1,904     | 500       | 100%                           |
| conformitate | ~2,594     | 500       | TBD (similar pattern)          |
| performanta  | 133        | 133       | 100%                           |
| **total**    | **~4,631** | **1,133** | n/a                            |

Speed: ~25s per 500 reports (gentle 600ms delay between pages).

## Page-count reference (verified by probing 2026-05-10)

```
financiar    127 pages: 126 × 15 + 14 = 1,904 reports
conformitate 173 pages: 172 × 15 + 14 = 2,594 reports
performanta    9 pages:   8 × 15 + 13 =   133 reports
```
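
Assuming every page except the last holds 15 reports, the per-category totals can be recomputed from the probed page counts. The sketch below also derives the lower bound on backfill time that comes from the 600ms pacing delay alone; the helper name `reportCount` is hypothetical.

```typescript
const PAGE_SIZE = 15;
const PAGE_DELAY_MS = 600; // pacing delay between listing pages

// Total reports for a listing: every page is full except the last one.
function reportCount(pages: number, lastPageCount: number, pageSize = PAGE_SIZE): number {
  if (pages < 1) return 0;
  return (pages - 1) * pageSize + lastPageCount;
}

const universe = {
  financiar: reportCount(127, 14),    // 126 × 15 + 14
  conformitate: reportCount(173, 14), // 172 × 15 + 14
  performanta: reportCount(9, 13),    //   8 × 15 + 13
};
const totalReports = universe.financiar + universe.conformitate + universe.performanta;

// Time spent purely waiting between pages, ignoring fetch and DB time.
const totalPages = 127 + 173 + 9;
const pacingFloorSeconds = (totalPages * PAGE_DELAY_MS) / 1000;

console.log(universe, totalReports, pacingFloorSeconds);
```

The pacing floor works out to roughly three minutes for 309 pages; actual wall time is higher once fetch latency and UPSERTs are included.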

Run a full backfill:

```bash
sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
```

Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.

## Stage 2 — TODO (next session, ~6-10h focused work)

Goal: resolve numeric `download_id`, mirror PDFs, parse first 3 pages, fuzzy-match `audited_entity_cui`.

### 2.1 — Resolve `download_id` from detail pages (~2h)

For each row with `download_id IS NULL`:

1. Fetch `detail_url`.
2. Regex `/rapoarte-audit/downloads/(\d+)` → `download_id`.
3. Regex `\(([0-9,]+) (KB|MB|GB)\)` next to download anchor → `pdf_size_bytes`.
4. UPSERT.

Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as
`scrape-curteacont-resolve.ts --batch=100`. Idempotent on `slug_id`.
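
Steps 2 and 3 could be sketched as a pure parsing function, shown here with the two regexes from the list. This is an illustration under assumptions, not the planned implementation: `parseDetailPage` is a hypothetical name, the comma is treated as a decimal separator (`2,5 MB`), units are taken as binary (1 KB = 1024 bytes), and the size regex is applied to the whole page rather than only next to the download anchor.

```typescript
interface DetailInfo {
  downloadId: number | null;
  pdfSizeBytes: number | null;
}

const DOWNLOAD_RE = /\/rapoarte-audit\/downloads\/(\d+)/;
const SIZE_RE = /\(([0-9,]+) (KB|MB|GB)\)/;
// ASSUMPTION: binary units; the site may use decimal (1 KB = 1000 bytes).
const UNIT_BYTES: Record<string, number> = {
  KB: 1024,
  MB: 1024 * 1024,
  GB: 1024 * 1024 * 1024,
};

function parseDetailPage(html: string): DetailInfo {
  const dl = DOWNLOAD_RE.exec(html);
  const size = SIZE_RE.exec(html);

  let pdfSizeBytes: number | null = null;
  if (size) {
    // ASSUMPTION: comma is a decimal separator, as in Romanian locale.
    // A value with several commas parses to NaN and is left as null.
    const value = Number(size[1].replace(",", "."));
    if (Number.isFinite(value)) {
      pdfSizeBytes = Math.round(value * UNIT_BYTES[size[2]]);
    }
  }
  return { downloadId: dl ? Number(dl[1]) : null, pdfSizeBytes };
}
```

Keeping the parser free of network and database calls makes it easy to test against saved detail pages before running the ~40 minute resolve pass.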

### 2.2 — Mirror PDFs to satra disk (~3-4h, optional)

- Path: `/opt/vreaudigital/data/cdc/{category}/{download_id}.pdf`
- Skip if `pdf_path IS NOT NULL` AND file exists.
- Average size: ~2-3 MB per PDF → ~9-14 GB total for the full corpus (~4,600 files).
- Update `pdf_path` after successful download.
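
The path layout and skip rule above might look like the following sketch. The `ReportRow` shape and both helper names are assumptions; only `category`, `download_id`, and `pdf_path` come from this plan.

```typescript
import { existsSync } from "node:fs";
import { posix } from "node:path";

const CDC_ROOT = "/opt/vreaudigital/data/cdc";

// Hypothetical subset of a curteacont.rapoarte row.
interface ReportRow {
  category: string;
  download_id: number | null;
  pdf_path: string | null;
}

// /opt/vreaudigital/data/cdc/{category}/{download_id}.pdf
function pdfPathFor(category: string, downloadId: number, root = CDC_ROOT): string {
  return posix.join(root, category, `${downloadId}.pdf`);
}

// Skip only when pdf_path is set AND the file is really on disk, so a row
// pointing at a deleted file gets re-downloaded.
function needsDownload(row: ReportRow): boolean {
  if (row.download_id === null) return false; // nothing to fetch until 2.1 resolves the ID
  const alreadyMirrored = row.pdf_path !== null && existsSync(row.pdf_path);
  return !alreadyMirrored;
}
```

Checking the filesystem in addition to the column keeps the mirror self-healing if the data directory is ever pruned or restored from an older backup.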

### 2.3 — PDF first-page abstract + findings count (~2-3h)

- Use `pdftotext` (poppler) — already on satra. Faster than pdfminer.
- Read first 3 pages → `summary` (cleaned, dehyphenated text, 4-8 KB).
- Count occurrences of "constatare", "abateri", "deficiență" → `findings_count`.
- Some reports have a "Sinteza constatărilor" section — cheap regex to find it.
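
A rough sketch of the text-cleaning side of this step, assuming the first three pages are extracted with poppler's `pdftotext` using its `-f`/`-l` page-range options and `-` for stdout. The helper names are hypothetical, the keyword count is a plain substring match (so inflected forms such as "constatarea" also count), and the dehyphenation rule will also join genuine compounds that happen to break at a line end.

```typescript
// Arguments for: pdftotext -f 1 -l 3 <pdf> -   (text of pages 1-3 to stdout)
function pdftotextArgs(pdfPath: string, lastPage = 3): string[] {
  return ["-f", "1", "-l", String(lastPage), pdfPath, "-"];
}

// Join words split across lines: "audi-\ntul" becomes "auditul".
// The letter class covers ASCII plus Romanian diacritics (both ș/ț encodings).
const LETTER = "A-Za-zĂÂÎȘȚăâîșțŞŢşţ";
const HYPHEN_BREAK = new RegExp(`([${LETTER}])-\\r?\\n([${LETTER}])`, "g");

function dehyphenate(text: string): string {
  return text.replace(HYPHEN_BREAK, "$1$2");
}

// Dehyphenate, collapse runs of spaces, and limit blank lines.
function cleanSummary(raw: string): string {
  return dehyphenate(raw)
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Count keyword occurrences for findings_count.
function countFindings(text: string): number {
  const hits = text.match(/constatare|abateri|deficiență/gi);
  return hits ? hits.length : 0;
}
```

Older PDFs sometimes encode "ț" with a cedilla ("ţ") instead of a comma below, so the keyword list may need both spellings once real extraction output is inspected.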

### 2.4 — CUI fuzzy match against `firms.entities` (~2h)

- We already have `services/seap-scraper/src/matching/cui-matcher.ts`
  (commit f3477e2 — "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id]
  profile page"). Reuse it.
- Input: `audited_entity_name` (already populated by Stage 1).
- Strategy:
  1. Exact match against `firms.entities.denumire` — high confidence.
  2. Trigram similarity (`pg_trgm`, index already exists) for top-3 candidates,
     then UAT-aware ranking (UATC = comună, UATM = municipiu, UATO = oraș,
     UATJ = județ). Most CdC entities are UATs — this is high-leverage.
  3. Fallback: store best-similarity score + leave NULL if < 0.6.
- Update `audited_entity_cui`.
- Expect 70-80% match rate on first pass; manual cleanup later.
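
The three-step strategy could be sketched as a pure selection function over candidates that already carry a `pg_trgm` similarity score from the database. Everything here is illustrative: it does not reflect the actual API of `cui-matcher.ts`, the `Candidate` and `MatchResult` shapes are hypothetical, and the UAT-aware rule (prefer candidates whose name contains the UAT kind implied by the prefix) is one possible ranking, assumed for the example.

```typescript
interface Candidate { cui: string; denumire: string; similarity: number }
interface MatchResult {
  cui: string | null;
  bestScore: number | null;
  method: "exact" | "trigram" | "none";
}

// UAT prefix → kind keyword, diacritics removed for comparison.
const UAT_KINDS: Record<string, string> = {
  UATC: "comuna", UATM: "municipiu", UATO: "oras", UATJ: "judet",
};

// Lowercase and strip diacritics so "COMUNA FLOREȘTI" compares cleanly.
function fold(s: string): string {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

function uatKind(entityName: string): string | null {
  const m = /^(UAT[CMOJ])\b/i.exec(entityName.trim());
  return m ? UAT_KINDS[m[1].toUpperCase()] : null;
}

function resolveCui(entityName: string, candidates: Candidate[], minScore = 0.6): MatchResult {
  // Step 1: exact name match is taken as high confidence.
  const exact = candidates.find((c) => c.denumire === entityName);
  if (exact) return { cui: exact.cui, bestScore: 1, method: "exact" };

  // Step 2: top-3 by similarity, then UAT-kind matches first.
  const kind = uatKind(entityName);
  const hasKind = (c: Candidate) => (kind !== null && fold(c.denumire).includes(kind) ? 1 : 0);
  const ranked = [...candidates]
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 3)
    .sort((a, b) => hasKind(b) - hasKind(a) || b.similarity - a.similarity);

  // Step 3: keep the score, but leave the CUI NULL below the threshold.
  const best = ranked[0];
  if (!best) return { cui: null, bestScore: null, method: "none" };
  if (best.similarity < minScore) return { cui: null, bestScore: best.similarity, method: "none" };
  return { cui: best.cui, bestScore: best.similarity, method: "trigram" };
}
```

Returning the score even when the CUI stays NULL gives the later manual cleanup pass a way to sort unmatched rows by how close they came.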

## 3. Cross-source recipes (draft SQL)

These queries depend on Stage 2 data (`audited_entity_cui` populated). They
capture the strategic value of the CdC ingest: per-CUI audit history joined
with SEAP awards.

### Recipe A — "Top authorities audited N times in 5 years"

Repeat-audit signal: agencies audited many times in a short window typically
have persistent issues. Powerful for the "Profil autoritate" page.

```sql
SELECT
  r.audited_entity_cui,
  fe.denumire,
  count(*) AS audit_count_5y,
  count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
  count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
  max(r.publication_date) AS last_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
  AND r.publication_date > now() - interval '5 years'
GROUP BY r.audited_entity_cui, fe.denumire
HAVING count(*) >= 3
ORDER BY audit_count_5y DESC, last_audit DESC
LIMIT 50;
```

### Recipe B — "Hospitals audited after a SEAP award" (parallel to CNAS)

Match SEAP contracts at hospitals against CdC audits published after the award.
A possible red flag that the procurement drew audit attention.

```sql
WITH hospital_seap AS (
  -- One row per hospital: aggregate awards BEFORE joining audits, otherwise
  -- the join fan-out multiplies sum(contract_value) by the audit count.
  SELECT
    s.contracting_authority_cui       AS cui,
    min(s.contracting_authority_name) AS denumire,
    count(DISTINCT s.id)              AS seap_awards,
    sum(s.contract_value)             AS total_value_ron,
    min(s.award_date)                 AS first_award_date
  FROM seap.announcements s
  -- Semi-join, so duplicate CUIs in the CNAS table cannot duplicate awards.
  WHERE s.contracting_authority_cui IN (SELECT cf.cui FROM cnas.spitale_furnizori cf)
    AND s.award_date > now() - interval '5 years'
  GROUP BY s.contracting_authority_cui
)
SELECT
  hs.cui,
  hs.denumire,
  hs.seap_awards,
  hs.total_value_ron,
  -- An audit counts as "after award" if it postdates at least one award,
  -- i.e. the earliest award in the window.
  count(DISTINCT r.slug_id)        AS audits_after_award,
  array_agg(DISTINCT r.audit_type) AS audit_types
FROM hospital_seap hs
JOIN curteacont.rapoarte r
  ON r.audited_entity_cui = hs.cui
 AND r.publication_date > hs.first_award_date
GROUP BY hs.cui, hs.denumire, hs.seap_awards, hs.total_value_ron
ORDER BY audits_after_award DESC, total_value_ron DESC
LIMIT 50;
```

### Recipe C — "Authorities with follow-up audits: persistent problems"

Follow-up reports = CdC came back to verify whether earlier findings were
remediated. Existence of follow-ups means the original audit had material
issues. Cross-link to financial dependency on state contracts.

```sql
SELECT
  r.audited_entity_cui,
  fe.denumire,
  fe.judet,
  count(*) FILTER (WHERE r.audit_type = 'follow-up')  AS follow_ups,
  count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
  array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
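  -- Cross-source: SEAP awards by the same authority after its first audit.
  -- min(r.publication_date) is an outer-level aggregate, evaluated per group.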
  (SELECT count(*) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
  (SELECT sum(contract_value) FROM seap.announcements s
    WHERE s.contracting_authority_cui = r.audited_entity_cui
      AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
FROM curteacont.rapoarte r
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
WHERE r.audited_entity_cui IS NOT NULL
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
LIMIT 50;
```

## 4. Operational notes

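- **TLS bypass**: `NODE_TLS_REJECT_UNAUTHORIZED=0` is set in the cron wrapper.
  It is required because curteadeconturi.ro serves an intermediate CA chain that
  Node's bundled CA store doesn't trust. The cert is otherwise valid (browsers
  and Linux ca-certificates trust it). Same workaround as `scrape-anre.sh`.
  Note that the flag disables certificate verification for every TLS connection
  the process makes, not only those to CdC; pointing `NODE_EXTRA_CA_CERTS` at
  the missing intermediate would be a narrower fix if it works for this chain.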
- **Gentle pacing**: 600ms between page fetches. Site is on shared infra,
  no rate-limit headers observed. Stay polite.
- **Stable IDs**: Slugs are stable (we verified 7 historical IDs in scope).
  The `slug_id = sha1(category|slug)` PK does not survive slug renames: if CdC
  ever changes a URL, the report is re-inserted as "new" (acceptable trade-off).
- **Cron suggestion**: weekly. New audits drip in at ~5-15/day on financiar.
  `45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh`

## 5. Files

- `services/seap-scraper/sql/035_curteacont.sql`
- `services/seap-scraper/src/scrape-curteacont.ts`
- `services/seap-scraper/cron/scrape-curteacont.sh`
- `services/seap-scraper/CURTEACONT-PLAN.md` (this file)