harden(epay): cart-hygiene invariant uses confirmed cart count + add service architecture plan

- cartCount tracks actual cart rows (decrement only on confirmed delete) so a
  failed cleanup delete can't trigger a false dirty-cart abort.
- docs/plans/006: the multi-tenant CF-service architecture (DB-backed
  fulfiller, account pool, catalog dedup, per-tenant credential model,
  reversible flag flip) — the executable next phase. The Phase-F flag flip is
  gated on the orchestrator fulfiller existing (Plan 003 Faza F was wrong).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude VM
2026-06-05 00:06:06 +03:00
parent f49fdb1da0
commit 28c870fb12
6 changed files with 1703 additions and 11 deletions
@@ -0,0 +1,143 @@
# Plan 006 — ePay CF-Extract as a Multi-Tenant Service
**Status:** design (executable). **Author:** deep-dive 2026-06-04 + hardening 2026-06-05.
**Prereq reading:** plans 002/003 (architools thin-client cutover), `project_epay_cf_roadmap_2026_06` memory.
## Why this exists
The CF-extract capability (ANCPI ePay paid extracts + the free `cf-intern` copycf
circuit) will be offered beyond internal use as a multi-tenant service (ArchiTools, eterra-live,
Planhub, external paying customers). Today it runs as an **in-process queue inside
ArchiTools** (`src/modules/parcel-sync/services/epay-*`). That path is now hardened
(commit `f49fdb1`: cart hygiene, auth/IDOR gates, single-page fetch, parallel
downloads, recover-by-extractId) and is **billing-safe and correct for the internal
tool** — but it is the wrong *shape* for a service:
- Queue + ePay session are in-memory globals → die on redeploy mid-batch.
- One serial cart per process → no multi-tenant throughput.
- No catalog dedup on the paid path → the same parcel is paid for repeatedly.
- `EPAY_ORDERING_VIA_GIS_AC` must stay `false` because **gis-api `POST /enrichment/cf` inserts a
pending row that nothing fulfills**: the orchestrator has no ePay worker. Plan 003
Faza F (Phase F, "endpoints already exist") is wrong: the fulfiller is the unwritten keystone.
This plan is the path from the hardened-internal state to a real service. Each phase
is independently shippable; do them in order, validate, then flip the flag per-tenant.
## Invariants carried over from the hardened internal path (do NOT regress)
These were learned the hard way (2026-06-04 incident, order 10009605). The new worker
MUST preserve every one:
1. **Submit is timeout-resilient.** A slow `EditCartSubmit` that ANCPI completes must
never be marked failed. Resolve the order via `findNewOrderId(previous, known)` which
never adopts a stale/known id. (`SUBMIT_TIMEOUT_MS`, today's fix.)
2. **Cart hygiene invariant.** ePay has ONE global cart per account; `EditCartSubmit`
checks out everything in it. After N adds a clean cart reports exactly N items — any
excess = orphan from a crash → wipe + abort, never submit a cart you didn't build.
3. **CF-number matching is authoritative; index fallback is `review`, not `completed`.**
4. **`%PDF` magic-byte check** on every download (expired session returns login HTML).
5. **Single-page order fetch** via `itemsPerPage` (5/page default silently drops docs).
6. **Recover is idempotent** (re-poll + re-download an already-paid order, no new charge).
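Invariants 1, 2 and 4 reduce to small pure checks. The following is a minimal sketch, not the hardened `epay-client` code: only the name `findNewOrderId` comes from this plan, while its third parameter, the "exactly one unseen id" rule, and the helpers `checkCart` and `looksLikePdf` are assumptions made for illustration.

```typescript
// Sketch only. Signatures and helper names are hypothetical unless noted.

/** Invariant 1: resolve the order created by this submit, never a stale or known id. */
export function findNewOrderId(
  previousLatest: string | null, // latest orderId snapshotted before EditCartSubmit
  known: ReadonlySet<string>,    // order ids already attached to local rows
  listedNow: readonly string[],  // assumption: order ids ANCPI lists after the submit
): string | null {
  const fresh = listedNow.filter((id) => id !== previousLatest && !known.has(id));
  const [only, ...rest] = fresh;
  // Assumption: exactly one unseen id is the only safe answer; otherwise do not guess.
  return only !== undefined && rest.length === 0 ? only : null;
}

export type CartVerdict = "clean" | "orphaned" | "short";

/** Invariant 2: after N confirmed adds, a clean cart reports exactly N items. */
export function checkCart(confirmedAdds: number, reportedItems: number): CartVerdict {
  if (reportedItems > confirmedAdds) return "orphaned"; // wipe + abort, never submit
  // The plan only defines the excess case; "short" is an assumption of this sketch.
  if (reportedItems < confirmedAdds) return "short";
  return "clean";
}

/** Invariant 4: a real download starts with the bytes "%PDF". */
export function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%", "P", "D", "F"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}
```

An expired session that returns login HTML fails `looksLikePdf` on the first byte, so the row can be retried instead of being stored as a document.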
## Phase A — DB-backed fulfiller worker (`eterra.cf-epay`) — THE KEYSTONE
A pg-boss worker in `gis-sync-orchestrator` (next to `enrichment-drainer`, cron 12 min).
**The CfExtract row IS the work item** — no in-memory queue.
- **Claim:** `SELECT … FROM gis_enrichment."CfExtract" WHERE status='pending' AND
type='epay' [AND account-compatible] ORDER BY "createdAt" FOR UPDATE SKIP LOCKED LIMIT N;
UPDATE → status='claimed', claimedAt=now()`. SKIP LOCKED → two instances never grab the
same rows.
- **State machine** (each transition = one UPDATE = a precise resumable marker):
`pending → claimed → cart → submitted_unconfirmed → polling → downloading → completed |
review | failed | cancelled`. Extend gis-api's `ExtractStatus` enum
(`gis-api/src/routes/enrichment.ts:9`) with `claimed`, `submitted_unconfirmed`, `review`.
- **Crash recovery:** a boot **reaper** requeues rows stuck in a transitional state past a
heartbeat TTL. `submitted_unconfirmed` rows are resolved via the recover pattern (find
the order at ANCPI, never re-charge). This structurally eliminates the in-memory-queue
orphan class (criticals C2).
- **Idempotent submit:** before `EditCartSubmit`, persist on the claimed rows the account's
current latest orderId + the intended `nrCadastral` set. On timeout/crash, resume
re-runs `findNewOrderId` against that snapshot — never adopts a stale id.
- Port the hardened `epay-client` here (see Phase G — shared package).
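The state machine and the reaper decision can be sketched as a transition table plus one pure function. This is an illustration under stated assumptions: the status names come from this plan, but which states may fall to `failed` or `cancelled`, the requeue edges, and the rule that every post-submit state goes through recover rather than requeue are guesses, and `canTransition` and `reaperAction` are hypothetical names.

```typescript
export type ExtractStatus =
  | "pending" | "claimed" | "cart" | "submitted_unconfirmed" | "polling"
  | "downloading" | "completed" | "review" | "failed" | "cancelled";

// Allowed next states. The happy path is from the plan; side exits are assumptions.
const NEXT: Record<ExtractStatus, readonly ExtractStatus[]> = {
  pending: ["claimed", "cancelled"],
  claimed: ["cart", "pending", "failed"], // back to pending = reaper requeue
  cart: ["submitted_unconfirmed", "pending", "failed"],
  submitted_unconfirmed: ["polling", "failed"],
  polling: ["downloading", "failed"],
  downloading: ["completed", "review", "failed"],
  completed: [],
  review: [],
  failed: [],
  cancelled: [],
};

export function canTransition(from: ExtractStatus, to: ExtractStatus): boolean {
  return NEXT[from].includes(to);
}

const PRE_SUBMIT: readonly ExtractStatus[] = ["claimed", "cart"];
const POST_SUBMIT: readonly ExtractStatus[] = [
  "submitted_unconfirmed", "polling", "downloading",
];

export type ReaperAction = "leave" | "requeue" | "recover";

/** Decide what the boot reaper does with one row, given its last heartbeat. */
export function reaperAction(
  status: ExtractStatus,
  heartbeatAt: Date,
  now: Date,
  ttlMs: number,
): ReaperAction {
  const stale = now.getTime() - heartbeatAt.getTime() > ttlMs;
  if (!stale) return "leave";
  // Nothing was submitted yet, so putting the row back in the queue cannot charge twice.
  if (PRE_SUBMIT.includes(status)) return "requeue";
  // Credits may already be spent: find the order at ANCPI instead of re-submitting.
  if (POST_SUBMIT.includes(status)) return "recover";
  return "leave"; // pending and terminal states are not the reaper's business
}
```

Keeping the table in one place lets the worker reject an illegal `UPDATE` before it reaches the database, which is what makes each status a trustworthy resume marker.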
## Phase B — `epay_accounts` pool with one-batch-per-account lock
Mirror `gis_meta.eterra_accounts` (busc-infra migration 004): AES-256-GCM creds,
`status active/blocked/retired` (plus the transient `busy` that `pickEpayAccount` sets),
`blocked_reason`, `credits_cached`, optional hourly cap, `in_flight_batch_id`.
- `pickEpayAccount`: `FOR UPDATE SKIP LOCKED`, but because ePay's cart is **global per
account**, atomically set `in_flight_batch_id` (status `busy`) so no second batch can
touch that account's cart. This is the structural fix for cart contamination (C1) in the
pooled world.
- Refuse to claim a batch larger than the account's cached credits. ePay credits are a
**hard consumable (real money)** — unlike the soft eTerra quota, the credit cap is
mandatory, not advisory.
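The selection rule can be sketched in memory. In the real worker this would be a single SQL statement using `FOR UPDATE SKIP LOCKED`; the array below is only a stand-in so the rule is readable. The `EpayAccount` shape and `releaseEpayAccount` are assumptions; `pickEpayAccount`, `in_flight_batch_id`, the `busy` status and the hard credit cap come from this plan.

```typescript
export interface EpayAccount {
  id: string;
  status: "active" | "busy" | "blocked" | "retired";
  creditsCached: number;
  inFlightBatchId: string | null;
}

/**
 * Pick one account for a batch and mark it busy in the same step.
 * Returns null when no account is free or none can afford the batch.
 */
export function pickEpayAccount(
  accounts: EpayAccount[],
  batchId: string,
  batchSize: number,
): EpayAccount | null {
  const account = accounts.find(
    (a) =>
      a.status === "active" &&
      a.inFlightBatchId === null &&   // one batch per account: the cart is global
      a.creditsCached >= batchSize,   // hard cap: credits are real money
  );
  if (!account) return null;
  account.status = "busy";
  account.inFlightBatchId = batchId;
  return account;
}

/** Hypothetical counterpart: free the account and book the credits it spent. */
export function releaseEpayAccount(account: EpayAccount, creditsUsed: number): void {
  account.creditsCached = Math.max(0, account.creditsCached - creditsUsed);
  account.inFlightBatchId = null;
  if (account.status === "busy") account.status = "active";
}
```

Because the busy marker is set in the same step as the pick, a second batch asking for an account while the first is mid-cart gets `null` rather than a shared cart.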
## Phase C — Catalog dedup (largest recurring economic win)
`CfExtractCatalog` is written **only** on the `cf-intern` path today; nothing writes it
when a paid ePay order completes → a paid extract by tenant A is never "fresh" for tenant
B, so the 30-day money-saver is structurally unrealized.
- On ePay completion, `upsert CfExtractCatalog(nrCadastral, latestId,
expiresAt=documentDate+30d, isFresh=true)`.
- `POST /enrichment/cf/claim {nrCadastral}`: on a catalog hit, create a B-owned row
`type='catalog', status='completed', creditsUsed=0` pointing at the shared MinIO object
(or a copy). This turns today's 409 `catalog_hit` (`enrichment.ts:226`) into **instant,
free fulfillment**. RLS unchanged (B reads B's row). One paid extract serves every tenant
that needs that parcel within 30 days, at zero marginal ANCPI cost.
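The catalog write and the claim path can be sketched with an in-memory map standing in for `CfExtractCatalog`. This is an illustration only: the class `Catalog`, the function `claimFromCatalog` and the row shape are hypothetical, and freshness is derived from `expiresAt` here rather than from a stored `isFresh` column.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

interface CatalogEntry {
  nrCadastral: string;
  latestId: string; // id of the paid extract that holds the document
  expiresAt: Date;  // documentDate + 30 days
}

export interface ClaimedRow {
  tenantId: string;
  nrCadastral: string;
  type: "catalog";
  status: "completed";
  creditsUsed: 0;
  sourceExtractId: string;
}

export class Catalog {
  private byParcel = new Map<string, CatalogEntry>();

  /** Called when a paid ePay order completes. */
  upsert(nrCadastral: string, latestId: string, documentDate: Date): void {
    const expiresAt = new Date(documentDate.getTime() + THIRTY_DAYS_MS);
    this.byParcel.set(nrCadastral, { nrCadastral, latestId, expiresAt });
  }

  /** Returns the entry only while it is still inside the 30-day window. */
  fresh(nrCadastral: string, now: Date): CatalogEntry | null {
    const entry = this.byParcel.get(nrCadastral);
    return entry && entry.expiresAt.getTime() > now.getTime() ? entry : null;
  }
}

/** On a catalog hit, give the tenant its own zero-credit row; otherwise null. */
export function claimFromCatalog(
  catalog: Catalog,
  tenantId: string,
  nrCadastral: string,
  now: Date,
): ClaimedRow | null {
  const hit = catalog.fresh(nrCadastral, now);
  if (!hit) return null; // caller falls through to the paid path
  return {
    tenantId,
    nrCadastral,
    type: "catalog",
    status: "completed",
    creditsUsed: 0,
    sourceExtractId: hit.latestId,
  };
}
```

The claimed row belongs to the requesting tenant, so the existing RLS rule (a tenant reads its own rows) needs no change.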
## Phase D — Credential model (tenant-policy-driven)
Store the strategy per-tenant; don't pick one globally:
- **Internal Beletage group** → pooled company accounts (Infisical, encrypted in
`epay_accounts`). Best batching + catalog sharing; per-credit attribution via audit.
- **External paying tenants** (eterra-live model) → dedicated per-tenant accounts so
credits/billing stay clean.
- Record `account_id` + `creditsUsed` on every `CfExtract` for attribution regardless.
- All three apps converge as thin callers of `POST /enrichment/cf` (Authentik multi-issuer
+ tenant claim already in place, `gis-api/src/lib/auth.ts`). Reuse eterra-live
`crypto.ts` (AES-256-GCM) + a 1-byte key-version prefix for rotation.
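The key-version prefix can be sketched with Node's built-in `crypto` module. This does not reproduce eterra-live's `crypto.ts`, whose layout is not shown in this plan; the byte layout (`version | iv | tag | ciphertext`), the `KeyRing` type and both function names are assumptions.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

/** Assumption: key version (0..255) mapped to a 32-byte AES-256 key. */
export type KeyRing = ReadonlyMap<number, Buffer>;

const IV_LEN = 12;  // recommended nonce size for GCM
const TAG_LEN = 16; // default GCM auth tag size in Node

export function encryptCred(plain: string, version: number, keys: KeyRing): Buffer {
  if (!Number.isInteger(version) || version < 0 || version > 255) {
    throw new Error("key version must fit in one byte");
  }
  const key = keys.get(version);
  if (!key) throw new Error(`unknown key version ${version}`);
  const iv = randomBytes(IV_LEN);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Layout (assumed): 1-byte version, 12-byte iv, 16-byte tag, ciphertext.
  return Buffer.concat([Buffer.from([version]), iv, tag, body]);
}

export function decryptCred(blob: Buffer, keys: KeyRing): string {
  if (blob.length < 1 + IV_LEN + TAG_LEN) throw new Error("blob too short");
  const version = blob.readUInt8(0);
  const key = keys.get(version);
  if (!key) throw new Error(`unknown key version ${version}`);
  const iv = blob.subarray(1, 1 + IV_LEN);
  const tag = blob.subarray(1 + IV_LEN, 1 + IV_LEN + TAG_LEN);
  const body = blob.subarray(1 + IV_LEN + TAG_LEN);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not match, so tampering is detected here.
  return Buffer.concat([decipher.update(body), decipher.final()]).toString("utf8");
}
```

Rotation then means adding a new version to the key ring and encrypting new rows under it; rows written under the old version keep decrypting until they are re-encrypted.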
## Phase E — gis-api gaps for async consumption
1. **Completion webhook/SSE**, tenant-scoped + RLS-filtered (`GET /enrichment/cf/events`)
→ kills polling and the dead-Brevo dependency.
2. **Bulk-zip** `GET /enrichment/cf/zip?orderId=` streaming from MinIO (port the V3
streaming-zip approach).
3. `ExtractStatus` enum additions (see Phase A).
4. List filters `creditsUsed=0` / `type='catalog'` so the UI can label shared extracts.
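Tenant scoping for the completion stream can be sketched as a filter plus the standard server-sent-events framing (`event:` and `data:` lines ended by a blank line). The event name `cf-extract`, the `CfEvent` shape and `toSseFrame` are hypothetical; in the real endpoint the tenant would come from the verified token and RLS, not from a function argument.

```typescript
export interface CfEvent {
  tenantId: string;
  extractId: string;
  status: string;
}

/**
 * Serialize one event for one subscriber, or return null when the event
 * belongs to another tenant and must not be sent at all.
 */
export function toSseFrame(event: CfEvent, subscriberTenant: string): string | null {
  if (event.tenantId !== subscriberTenant) return null;
  // Do not echo tenantId back; the subscriber already knows who it is.
  const payload = JSON.stringify({ extractId: event.extractId, status: event.status });
  return `event: cf-extract\ndata: ${payload}\n\n`;
}
```

A handler would write each non-null frame to a response opened with `Content-Type: text/event-stream`.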
## Phase F — Reversible migration, per-tenant flip
- **Phase 0 (now):** `EPAY_ORDERING_VIA_GIS_AC=false`, hardened legacy queue is the sole
fulfiller. `/api/ancpi/recover` stays as the manual safety net.
- **Phase 1:** deploy worker + pool + catalog-write; seed `epay_accounts` with ONLY the
Beletage account; flip the flag for `claims.tenant === 'architools'`.
- **Phase 2:** run both paths in parallel for a grace window; reconcile on orderId (no double
charge).
- **Phase 3:** onboard external tenants with dedicated accounts; delete `epay-queue` /
`epay-client` / `epay-session-store` + `src/app/api/ancpi/*` from ArchiTools. The flag is
the kill-switch throughout.
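The per-tenant flip with a global kill-switch can be sketched as one routing function. The names `orderingPath` and `enabledTenants` are hypothetical, and how the per-tenant list is stored is not decided by this plan; only the flag `EPAY_ORDERING_VIA_GIS_AC` and the `claims.tenant` check come from it.

```typescript
export interface Claims {
  tenant: string;
}

export type OrderingPath = "gis-api" | "legacy-queue";

/**
 * Route one request. The global flag wins: turning it off sends every
 * tenant back to the hardened legacy queue, whatever the per-tenant list says.
 */
export function orderingPath(
  claims: Claims,
  epayOrderingViaGisAc: boolean,        // value of EPAY_ORDERING_VIA_GIS_AC
  enabledTenants: ReadonlySet<string>,  // assumption: tenants already flipped
): OrderingPath {
  if (!epayOrderingViaGisAc) return "legacy-queue";
  return enabledTenants.has(claims.tenant) ? "gis-api" : "legacy-queue";
}
```

For Phase 1 the enabled set would contain only `architools`; rolling back is a flag change, not a deploy.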
## Phase G — Shared `epay-client` package (do regardless of phase)
ArchiTools and eterra-live each have a near-identical `epay-client.ts` that has **already
diverged dangerously**: ArchiTools got today's fixes; eterra-live got the method-internal
ports (commit `eterra-live d30128b`) but lacks cart-hygiene + the per-page parser refactor.
Extract `@beletage/epay-client` (natural home: `gis-sync-orchestrator`, which owns the
account pool) so a fix lands once. Until then, any epay-client change MUST be mirrored to
both repos in the same change.
## Known follow-ups not yet done
- eterra-live still lacks the cart-hygiene `numberOfItems` invariant (single-order flow
makes it lower-risk, but a crashed prior order can still orphan a cart row). Needs a
route-level touch + testing on that product before shipping.
- `BREVO_API_KEY` returns 401 "Key not found" → ArchiTools email notifications are dead;
the correct fix is the Phase E webhook, not patching Brevo. SMTP relay creds still work.
- ArchiTools `auth-options.ts` has pre-existing `react-hooks/rules-of-hooks` lint errors on
the `useGisAcFlag`/`useBasicPanelFlag` session calls (tolerated by `next build`).