a6c03a091e
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
206 lines
8.4 KiB
Markdown
206 lines
8.4 KiB
Markdown
# Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap
|
||
|
||
Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.
|
||
|
||
## Stage 1 — DONE in this session
|
||
|
||
What was built:
|
||
|
||
- `services/seap-scraper/sql/035_curteacont.sql` — schema:
|
||
- `curteacont.rapoarte` (PK `slug_id` = sha1(category|slug))
|
||
- `curteacont.scrape_runs` (one row per CLI invocation)
|
||
- `services/seap-scraper/src/scrape-curteacont.ts` — listing-page walker:
|
||
- Three sources: `financiar`, `conformitate`, `performanta`
|
||
- Parses title → `audit_year`, `doc_number`, `doc_date`, `audited_entity_name`
|
||
- Detects follow-up reports (title prefix `Follow-up`)
|
||
- Reads `<time datetime>` → `publication_date`
|
||
- Idempotent UPSERT on `slug_id`
|
||
- `services/seap-scraper/cron/scrape-curteacont.sh` — Infisical → docker run
|
||
--env-file wrapper. Mirrors `scrape-anre.sh`. NODE_TLS_REJECT_UNAUTHORIZED=0
|
||
required (CdC serves an intermediate CA chain node's bundle doesn't trust).
|
||
|
||
Stage 1 ingest stats (2026-05-10):
|
||
|
||
| category | universe | ingested | parse rate (entity+doc_date) |
|
||
|-------------|----------|----------|-------------------------------|
|
||
| financiar | ~1,890 | 500 | 100% |
|
||
| conformitate| ~2,580 | 500 | TBD (similar pattern) |
|
||
| performanta | ~135 | 133 | 100% |
|
||
| **total** | **~4,605** | **1,133** | — |
|
||
|
||
Speed: ~25s per 500 reports (gentle 600ms delay between pages).
|
||
|
||
## Page-count reference (verified by probing 2026-05-10)
|
||
|
||
```
|
||
financiar ~127 pages × 15 = ~1,890 reports (last page=127 had 14)
|
||
conformitate ~173 pages × 15 = ~2,580 reports (last page=173 had 14)
|
||
performanta 9 pages × 15 = ~135 reports (last page=9 had 13)
|
||
```
|
||
|
||
Run a full backfill:
|
||
|
||
```bash
|
||
sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
|
||
```
|
||
|
||
Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.
|
||
|
||
## Stage 2 — TODO (next session, ~6-10h focused work)
|
||
|
||
Goal: resolve numeric `download_id`, mirror PDFs, parse first 3 pages, fuzzy-match `audited_entity_cui`.
|
||
|
||
### 2.1 — Resolve `download_id` from detail pages (~2h)
|
||
|
||
For each row with `download_id IS NULL`:
|
||
|
||
1. Fetch `detail_url`.
|
||
2. Regex `/rapoarte-audit/downloads/(\d+)` → `download_id`.
|
||
3. Regex `\(([0-9,]+) (KB|MB|GB)\)` next to download anchor → `pdf_size_bytes`.
|
||
4. UPSERT.
|
||
|
||
Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as
|
||
`scrape-curteacont-resolve.ts --batch=100`. Idempotent on `slug_id`.
|
||
|
||
### 2.2 — Mirror PDFs to satra disk (~3-4h, optional)
|
||
|
||
- Path: `/opt/vreaudigital/data/cdc/{category}/{download_id}.pdf`
|
||
- Skip if `pdf_path IS NOT NULL` AND file exists.
|
||
- Average size: ~2-3 MB → ~12-15 GB total for full corpus.
|
||
- Update `pdf_path` after successful download.
|
||
|
||
### 2.3 — PDF first-page abstract + findings count (~2-3h)
|
||
|
||
- Use `pdftotext` (poppler) — already on satra. Faster than pdfminer.
|
||
- Read first 3 pages → `summary` (cleaned, dehyphenated text, 4-8 KB).
|
||
- Count occurrences of "constatare", "abateri", "deficiență" → `findings_count`.
|
||
- Some reports have a "Sinteza constatărilor" section — cheap regex to find it.
|
||
|
||
### 2.4 — CUI fuzzy match against `firms.entities` (~2h)
|
||
|
||
- We already have `services/seap-scraper/src/matching/cui-matcher.ts`
|
||
(commit f3477e2 — "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id]
|
||
profile page"). Reuse it.
|
||
- Input: `audited_entity_name` (already populated by Stage 1).
|
||
- Strategy:
|
||
1. Exact match against `firms.entities.denumire` — high confidence.
|
||
2. Trigram similarity (`pg_trgm`, index already exists) for top-3 candidates,
|
||
then UAT-aware ranking (UATC = comună, UATM = municipiu, UATO = oraș,
|
||
UATJ = județ). Most CdC entities are UATs — this is high-leverage.
|
||
3. Fallback: store best-similarity score + leave NULL if < 0.6.
|
||
- Update `audited_entity_cui`.
|
||
- Expect 70-80% match rate on first pass; manual cleanup later.
|
||
|
||
## 3. Cross-source recipe drafts (draft SQL)
|
||
|
||
These SQLs reference Stage 2 data (`audited_entity_cui` populated). They give
|
||
the strategic value of CdC ingest — per-CUI audit history × SEAP awards.
|
||
|
||
### Recipe A — "Top autorități audited de N ori în 5 ani"
|
||
|
||
Repeat-audit signal: agencies audited many times in a short window typically
|
||
have persistent issues. Powerful for the "Profil autoritate" page.
|
||
|
||
```sql
|
||
SELECT
|
||
r.audited_entity_cui,
|
||
fe.denumire,
|
||
count(*) AS audit_count_5y,
|
||
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
|
||
count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
|
||
max(r.publication_date) AS last_audit
|
||
FROM curteacont.rapoarte r
|
||
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
|
||
WHERE r.audited_entity_cui IS NOT NULL
|
||
AND r.publication_date > now() - interval '5 years'
|
||
GROUP BY r.audited_entity_cui, fe.denumire
|
||
HAVING count(*) >= 3
|
||
ORDER BY audit_count_5y DESC, last_audit DESC
|
||
LIMIT 50;
|
||
```
|
||
|
||
### Recipe B — "Spitale audited POST SEAP award" (paralelă cu CNAS)
|
||
|
||
Match SEAP contracts at hospitals against CdC audits issued AFTER award.
|
||
A red-flag indicator that the procurement raised audit attention.
|
||
|
||
```sql
|
||
WITH hospital_seap AS (
|
||
SELECT
|
||
s.contracting_authority_cui AS cui,
|
||
s.contracting_authority_name AS denumire,
|
||
s.id AS seap_id,
|
||
s.award_date,
|
||
s.contract_value
|
||
FROM seap.announcements s
|
||
JOIN cnas.spitale_furnizori cf ON cf.cui = s.contracting_authority_cui
|
||
WHERE s.award_date > now() - interval '5 years'
|
||
)
|
||
SELECT
|
||
hs.cui,
|
||
hs.denumire,
|
||
count(DISTINCT hs.seap_id) AS seap_awards,
|
||
sum(hs.contract_value) AS total_value_ron,
|
||
count(DISTINCT r.slug_id) FILTER (
|
||
WHERE r.publication_date > hs.award_date
|
||
) AS audits_after_award,
|
||
array_agg(DISTINCT r.audit_type) FILTER (WHERE r.publication_date > hs.award_date) AS audit_types
|
||
FROM hospital_seap hs
|
||
LEFT JOIN curteacont.rapoarte r ON r.audited_entity_cui = hs.cui
|
||
GROUP BY hs.cui, hs.denumire
|
||
HAVING count(DISTINCT r.slug_id) FILTER (WHERE r.publication_date > hs.award_date) > 0
|
||
ORDER BY audits_after_award DESC, total_value_ron DESC
|
||
LIMIT 50;
|
||
```
|
||
|
||
### Recipe C — "Autorități cu audit follow-up — probleme persistente"
|
||
|
||
Follow-up reports = CdC came back to verify whether earlier findings were
|
||
remediated. Existence of follow-ups means the original audit had material
|
||
issues. Cross-link to financial dependency on state contracts.
|
||
|
||
```sql
|
||
SELECT
|
||
r.audited_entity_cui,
|
||
fe.denumire,
|
||
fe.judet,
|
||
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
|
||
count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
|
||
array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
|
||
-- Cross-source: SEAP wins in same window
|
||
(SELECT count(*) FROM seap.announcements s
|
||
WHERE s.contracting_authority_cui = r.audited_entity_cui
|
||
AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
|
||
(SELECT sum(contract_value) FROM seap.announcements s
|
||
WHERE s.contracting_authority_cui = r.audited_entity_cui
|
||
AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
|
||
FROM curteacont.rapoarte r
|
||
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
|
||
WHERE r.audited_entity_cui IS NOT NULL
|
||
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
|
||
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
|
||
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
|
||
LIMIT 50;
|
||
```
|
||
|
||
## 4. Operational notes
|
||
|
||
- **TLS bypass**: `NODE_TLS_REJECT_UNAUTHORIZED=0` is set in the cron wrapper
|
||
— required because curteadeconturi.ro serves an intermediate CA chain that
|
||
Node's bundled CA store doesn't trust. Cert is valid OOB (browser trusts
|
||
it, Linux ca-certificates trusts it). Same workaround as `scrape-anre.sh`.
|
||
- **Gentle pacing**: 600ms between page fetches. Site is on shared infra,
|
||
no rate-limit headers observed. Stay polite.
|
||
- **Stable IDs**: Slugs are stable (we verified 7 historical IDs in scope).
|
||
`slug_id = sha1(category|slug)` PK survives slug renames within category
|
||
if CdC ever changes URLs (would re-insert as "new" — acceptable trade-off).
|
||
- **Cron suggestion**: weekly. New audits drip in at ~5-15/day on financiar.
|
||
`45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh`
|
||
|
||
## 5. Files
|
||
|
||
- `services/seap-scraper/sql/035_curteacont.sql`
|
||
- `services/seap-scraper/src/scrape-curteacont.ts`
|
||
- `services/seap-scraper/cron/scrape-curteacont.sh`
|
||
- `services/seap-scraper/CURTEACONT-PLAN.md` (this file)
|