initial: split from gov-agreg — vreau.digital standalone platform
Moved from gov-agreg/src/pages/achizitii/* to root (drop prefix). - 22 pages migrated, 127 files total - All internal links: /achizitii/X → /X (176 occurrences fixed) - AchizitiiLayout subnav rewritten: /X paths, top-right link to vreaudigital.ro hub - BaseLayout new (vreau.digital branding, OG tags, site URL) - astro.config.mjs: site https://vreau.digital, server output (was static) - docker-compose: port 5096 (vreaudigital is 5095), container vreau-digital - deploy.sh: paths /opt/vreau-digital, log /var/log/vreau-digital-deploy.log Backend shared with gov-agreg: - PostgreSQL satra (same schemas: seap, firms, anaf, anre, ...) - Photon, Martin tiles - Infisical /vreaudigital path (DATABASE_URL etc. shared) build: PASS (npx astro check 0 errors, npm run build 5s vite + 10s server)
This commit is contained in:
@@ -0,0 +1,205 @@
|
||||
# Curtea de Conturi (CdC) — Stage 1 done, Stage 2 roadmap
|
||||
|
||||
Ingest of audit reports from https://www.curteadeconturi.ro/rapoarte-audit/.
|
||||
|
||||
## Stage 1 — DONE in this session
|
||||
|
||||
What was built:
|
||||
|
||||
- `services/seap-scraper/sql/035_curteacont.sql` — schema:
|
||||
- `curteacont.rapoarte` (PK `slug_id` = sha1(category|slug))
|
||||
- `curteacont.scrape_runs` (one row per CLI invocation)
|
||||
- `services/seap-scraper/src/scrape-curteacont.ts` — listing-page walker:
|
||||
- Three sources: `financiar`, `conformitate`, `performanta`
|
||||
- Parses title → `audit_year`, `doc_number`, `doc_date`, `audited_entity_name`
|
||||
- Detects follow-up reports (title prefix `Follow-up`)
|
||||
- Reads `<time datetime>` → `publication_date`
|
||||
- Idempotent UPSERT on `slug_id`
|
||||
- `services/seap-scraper/cron/scrape-curteacont.sh` — Infisical → docker run
|
||||
--env-file wrapper. Mirrors `scrape-anre.sh`. NODE_TLS_REJECT_UNAUTHORIZED=0
|
||||
required (CdC serves an intermediate CA chain node's bundle doesn't trust).
|
||||
|
||||
Stage 1 ingest stats (2026-05-10):
|
||||
|
||||
| category | universe | ingested | parse rate (entity+doc_date) |
|
||||
|-------------|----------|----------|-------------------------------|
|
||||
| financiar | ~1,890 | 500 | 100% |
|
||||
| conformitate| ~2,580 | 500 | TBD (similar pattern) |
|
||||
| performanta | ~135 | 133 | 100% |
|
||||
| **total** | **~4,605** | **1,133** | — |
|
||||
|
||||
Speed: ~25s per 500 reports (gentle 600ms delay between pages).
|
||||
|
||||
## Page-count reference (verified by probing 2026-05-10)
|
||||
|
||||
```
|
||||
financiar ~127 pages × 15 = ~1,890 reports (last page=127 had 14)
|
||||
conformitate ~173 pages × 15 = ~2,580 reports (last page=173 had 14)
|
||||
performanta 9 pages × 15 = ~135 reports (last page=9 had 13)
|
||||
```
|
||||
|
||||
Run a full backfill:
|
||||
|
||||
```bash
|
||||
sudo SOURCE=all /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh
|
||||
```
|
||||
|
||||
Estimated wall time: ~6 minutes for ~4,600 rows + page fetches.
|
||||
|
||||
## Stage 2 — TODO (next session, ~6-10h focused work)
|
||||
|
||||
Goal: resolve numeric `download_id`, mirror PDFs, parse first 3 pages, fuzzy-match `audited_entity_cui`.
|
||||
|
||||
### 2.1 — Resolve `download_id` from detail pages (~2h)
|
||||
|
||||
For each row with `download_id IS NULL`:
|
||||
|
||||
1. Fetch `detail_url`.
|
||||
2. Regex `/rapoarte-audit/downloads/(\d+)` → `download_id`.
|
||||
3. Regex `\(([0-9,]+) (KB|MB|GB)\)` next to download anchor → `pdf_size_bytes`.
|
||||
4. UPSERT.
|
||||
|
||||
Rate: ~2 req/s (gentle), ~40 min for 4,600 rows. Implement as
|
||||
`scrape-curteacont-resolve.ts --batch=100`. Idempotent on `slug_id`.
|
||||
|
||||
### 2.2 — Mirror PDFs to satra disk (~3-4h, optional)
|
||||
|
||||
- Path: `/opt/vreaudigital/data/cdc/{category}/{download_id}.pdf`
|
||||
- Skip if `pdf_path IS NOT NULL` AND file exists.
|
||||
- Average size: ~2-3 MB → ~12-15 GB total for full corpus.
|
||||
- Update `pdf_path` after successful download.
|
||||
|
||||
### 2.3 — PDF first-page abstract + findings count (~2-3h)
|
||||
|
||||
- Use `pdftotext` (poppler) — already on satra. Faster than pdfminer.
|
||||
- Read first 3 pages → `summary` (cleaned, dehyphenated text, 4-8 KB).
|
||||
- Count occurrences of "constatare", "abateri", "deficiență" → `findings_count`.
|
||||
- Some reports have a "Sinteza constatărilor" section — cheap regex to find it.
|
||||
|
||||
### 2.4 — CUI fuzzy match against `firms.entities` (~2h)
|
||||
|
||||
- We already have `services/seap-scraper/src/matching/cui-matcher.ts`
|
||||
(commit f3477e2 — "CUI fuzzy matcher + /achizitii/beneficiar-privat/[id]
|
||||
profile page"). Reuse it.
|
||||
- Input: `audited_entity_name` (already populated by Stage 1).
|
||||
- Strategy:
|
||||
1. Exact match against `firms.entities.denumire` — high confidence.
|
||||
2. Trigram similarity (`pg_trgm`, index already exists) for top-3 candidates,
|
||||
then UAT-aware ranking (UATC = comună, UATM = municipiu, UATO = oraș,
|
||||
UATJ = județ). Most CdC entities are UATs — this is high-leverage.
|
||||
3. Fallback: store best-similarity score + leave NULL if < 0.6.
|
||||
- Update `audited_entity_cui`.
|
||||
- Expect 70-80% match rate on first pass; manual cleanup later.
|
||||
|
||||
## 3. Cross-source recipe drafts (draft SQL)
|
||||
|
||||
These SQLs reference Stage 2 data (`audited_entity_cui` populated). They give
|
||||
the strategic value of CdC ingest — per-CUI audit history × SEAP awards.
|
||||
|
||||
### Recipe A — "Top autorități audited de N ori în 5 ani"
|
||||
|
||||
Repeat-audit signal: agencies audited many times in a short window typically
|
||||
have persistent issues. Powerful for the "Profil autoritate" page.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
r.audited_entity_cui,
|
||||
fe.denumire,
|
||||
count(*) AS audit_count_5y,
|
||||
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
|
||||
count(*) FILTER (WHERE r.audit_type = 'performanta') AS perf_audits,
|
||||
max(r.publication_date) AS last_audit
|
||||
FROM curteacont.rapoarte r
|
||||
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
|
||||
WHERE r.audited_entity_cui IS NOT NULL
|
||||
AND r.publication_date > now() - interval '5 years'
|
||||
GROUP BY r.audited_entity_cui, fe.denumire
|
||||
HAVING count(*) >= 3
|
||||
ORDER BY audit_count_5y DESC, last_audit DESC
|
||||
LIMIT 50;
|
||||
```
|
||||
|
||||
### Recipe B — "Spitale audited POST SEAP award" (paralelă cu CNAS)
|
||||
|
||||
Match SEAP contracts at hospitals against CdC audits issued AFTER award.
|
||||
A red-flag indicator that the procurement raised audit attention.
|
||||
|
||||
```sql
|
||||
WITH hospital_seap AS (
|
||||
SELECT
|
||||
s.contracting_authority_cui AS cui,
|
||||
s.contracting_authority_name AS denumire,
|
||||
s.id AS seap_id,
|
||||
s.award_date,
|
||||
s.contract_value
|
||||
FROM seap.announcements s
|
||||
JOIN cnas.spitale_furnizori cf ON cf.cui = s.contracting_authority_cui
|
||||
WHERE s.award_date > now() - interval '5 years'
|
||||
)
|
||||
SELECT
|
||||
hs.cui,
|
||||
hs.denumire,
|
||||
count(DISTINCT hs.seap_id) AS seap_awards,
|
||||
sum(hs.contract_value) AS total_value_ron,
|
||||
count(DISTINCT r.slug_id) FILTER (
|
||||
WHERE r.publication_date > hs.award_date
|
||||
) AS audits_after_award,
|
||||
array_agg(DISTINCT r.audit_type) FILTER (WHERE r.publication_date > hs.award_date) AS audit_types
|
||||
FROM hospital_seap hs
|
||||
LEFT JOIN curteacont.rapoarte r ON r.audited_entity_cui = hs.cui
|
||||
GROUP BY hs.cui, hs.denumire
|
||||
HAVING count(DISTINCT r.slug_id) FILTER (WHERE r.publication_date > hs.award_date) > 0
|
||||
ORDER BY audits_after_award DESC, total_value_ron DESC
|
||||
LIMIT 50;
|
||||
```
|
||||
|
||||
### Recipe C — "Autorități cu audit follow-up — probleme persistente"
|
||||
|
||||
Follow-up reports = CdC came back to verify whether earlier findings were
|
||||
remediated. Existence of follow-ups means the original audit had material
|
||||
issues. Cross-link to financial dependency on state contracts.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
r.audited_entity_cui,
|
||||
fe.denumire,
|
||||
fe.judet,
|
||||
count(*) FILTER (WHERE r.audit_type = 'follow-up') AS follow_ups,
|
||||
count(*) FILTER (WHERE r.audit_type <> 'follow-up') AS regular_audits,
|
||||
array_agg(DISTINCT r.audit_year) FILTER (WHERE r.audit_type = 'follow-up') AS follow_up_years,
|
||||
-- Cross-source: SEAP wins in same window
|
||||
(SELECT count(*) FROM seap.announcements s
|
||||
WHERE s.contracting_authority_cui = r.audited_entity_cui
|
||||
AND s.award_date > min(r.publication_date)) AS seap_awards_post_first_audit,
|
||||
(SELECT sum(contract_value) FROM seap.announcements s
|
||||
WHERE s.contracting_authority_cui = r.audited_entity_cui
|
||||
AND s.award_date > min(r.publication_date)) AS seap_value_post_first_audit
|
||||
FROM curteacont.rapoarte r
|
||||
LEFT JOIN firms.entities fe ON fe.cui = r.audited_entity_cui
|
||||
WHERE r.audited_entity_cui IS NOT NULL
|
||||
GROUP BY r.audited_entity_cui, fe.denumire, fe.judet
|
||||
HAVING count(*) FILTER (WHERE r.audit_type = 'follow-up') >= 1
|
||||
ORDER BY follow_ups DESC, seap_value_post_first_audit DESC NULLS LAST
|
||||
LIMIT 50;
|
||||
```
|
||||
|
||||
## 4. Operational notes
|
||||
|
||||
- **TLS bypass**: `NODE_TLS_REJECT_UNAUTHORIZED=0` is set in the cron wrapper
|
||||
— required because curteadeconturi.ro serves an intermediate CA chain that
|
||||
Node's bundled CA store doesn't trust. Cert is valid OOB (browser trusts
|
||||
it, Linux ca-certificates trusts it). Same workaround as `scrape-anre.sh`.
|
||||
- **Gentle pacing**: 600ms between page fetches. Site is on shared infra,
|
||||
no rate-limit headers observed. Stay polite.
|
||||
- **Stable IDs**: Slugs are stable (we verified 7 historical IDs in scope).
|
||||
`slug_id = sha1(category|slug)` PK survives slug renames within category
|
||||
if CdC ever changes URLs (would re-insert as "new" — acceptable trade-off).
|
||||
- **Cron suggestion**: weekly. New audits drip in at ~5-15/day on financiar.
|
||||
`45 03 * * 1 root /opt/vreaudigital/services/seap-scraper/cron/scrape-curteacont.sh`
|
||||
|
||||
## 5. Files
|
||||
|
||||
- `services/seap-scraper/sql/035_curteacont.sql`
|
||||
- `services/seap-scraper/src/scrape-curteacont.ts`
|
||||
- `services/seap-scraper/cron/scrape-curteacont.sh`
|
||||
- `services/seap-scraper/CURTEACONT-PLAN.md` (this file)
|
||||
Reference in New Issue
Block a user