ArchiTools/docs/plans/006-epay-cf-service-architecture.md
Claude VM 28c870fb12 harden(epay): cart-hygiene invariant uses confirmed cart count + add service architecture plan
- cartCount tracks actual cart rows (decrement only on confirmed delete) so a
  failed cleanup delete can't trigger a false dirty-cart abort.
- docs/plans/006: the multi-tenant CF-service architecture (DB-backed
  fulfiller, account pool, catalog dedup, per-tenant credential model,
  reversible flag flip) — the executable next phase. The Phase-F flag flip is
  gated on the orchestrator fulfiller existing (Plan 003 Faza F was wrong).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 00:06:06 +03:00


Plan 006 — ePay CF-Extract as a Multi-Tenant Service

Status: design (executable). Author: deep-dive 2026-06-04 + hardening 2026-06-05. Prereq reading: plans 002/003 (architools thin-client cutover), project_epay_cf_roadmap_2026_06 memory.

Why this exists

The CF-extract capability (ANCPI ePay paid extracts + the free cf-intern copycf circuit) will be offered beyond internal use — multi-tenant (ArchiTools, eterra-live, Planhub, external paying customers). Today it runs as an in-process queue inside ArchiTools (src/modules/parcel-sync/services/epay-*). That path is now hardened (commit f49fdb1: cart hygiene, auth/IDOR gates, single-page fetch, parallel downloads, recover-by-extractId) and is billing-safe and correct for the internal tool — but it is the wrong shape for a service:

  • Queue + ePay session are in-memory globals → die on redeploy mid-batch.
  • One serial cart per process → no multi-tenant throughput.
  • No catalog dedup on the paid path → the same parcel is paid for repeatedly.
  • EPAY_ORDERING_VIA_GIS_AC=false because gis-api POST /enrichment/cf inserts a pending row that nothing fulfills — the orchestrator has no ePay worker. Plan 003 Faza F ("endpoints already exist") is wrong: the fulfiller is the unwritten keystone.

This plan is the path from the hardened-internal state to a real service. Each phase is independently shippable; do them in order, validate, then flip the flag per-tenant.

Invariants carried over from the hardened internal path (do NOT regress)

These were learned the hard way (2026-06-04 incident, order 10009605). The new worker MUST preserve every one:

  1. Submit is timeout-resilient. A slow EditCartSubmit that ANCPI completes must never be marked failed. Resolve the order via findNewOrderId(previous, known), which never adopts a stale or already-known id. (SUBMIT_TIMEOUT_MS, today's fix.)
  2. Cart hygiene invariant. ePay has ONE global cart per account; EditCartSubmit checks out everything in it. After N adds a clean cart reports exactly N items — any excess = orphan from a crash → wipe + abort, never submit a cart you didn't build.
  3. CF-number matching is authoritative; a document matched only by index (fallback) goes to review, never to completed.
  4. %PDF magic-byte check on every download (expired session returns login HTML).
  5. Single-page order fetch via itemsPerPage (the default of 5 per page silently drops documents).
  6. Recover is idempotent (re-poll + re-download an already-paid order, no new charge).
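Invariants 1, 2 and 4 are the easiest to regress in a port, so here is a minimal sketch of each as a pure function. It assumes ePay order ids are numeric and strictly increasing; resolveNewOrderId, CartLedger and looksLikePdf are hypothetical names, not the real epay-client API, and resolveNewOrderId deliberately takes an explicit list of order ids rather than mirroring findNewOrderId's signature.

```typescript
// Hypothetical sketch of invariants 1, 2 and 4. resolveNewOrderId, CartLedger
// and looksLikePdf are illustrative names, not the real epay-client API.

type SubmitResolution =
  | { kind: "found"; orderId: number }
  | { kind: "not_visible_yet" }
  | { kind: "ambiguous"; candidates: number[] };

// Invariant 1. `previous` = the account's latest orderId snapshotted before
// EditCartSubmit; `known` = ids already attributed to earlier batches.
// Assumption: ePay order ids are numeric and strictly increasing.
function resolveNewOrderId(
  previous: number | null,
  known: ReadonlySet<number>,
  listed: readonly number[],
): SubmitResolution {
  const fresh = listed.filter(
    (id) => (previous === null || id > previous) && !known.has(id),
  );
  if (fresh.length === 1) return { kind: "found", orderId: fresh[0] };
  // Nothing new yet: keep polling; a slow submit is never marked failed.
  if (fresh.length === 0) return { kind: "not_visible_yet" };
  // Several candidates: never guess, hand the batch to review.
  return { kind: "ambiguous", candidates: fresh };
}

// Invariant 2. ePay has one global cart per account, so the batch counts the
// rows it actually holds there. The count drops only on a confirmed delete:
// a failed cleanup delete leaves the row in both the cart and the count, so it
// cannot trigger a false dirty-cart abort.
class CartLedger {
  private cartCount = 0;

  added(): void {
    this.cartCount += 1;
  }

  deleteConfirmed(): void {
    this.cartCount = Math.max(0, this.cartCount - 1);
  }

  // numberOfItems as reported by ePay just before EditCartSubmit.
  assertClean(numberOfItems: number): void {
    if (numberOfItems !== this.cartCount) {
      // Any mismatch means this is not the cart the batch built (e.g. a crash
      // orphan): the caller must wipe the cart and abort, never submit it.
      throw new Error(`dirty cart: ePay reports ${numberOfItems}, batch holds ${this.cartCount}`);
    }
  }
}

// Invariant 4. An expired session serves the login HTML instead of the PDF,
// so every download must start with the "%PDF" magic bytes.
function looksLikePdf(body: Uint8Array): boolean {
  return (
    body.length >= 4 &&
    body[0] === 0x25 && // %
    body[1] === 0x50 && // P
    body[2] === 0x44 && // D
    body[3] === 0x46 // F
  );
}
```

The three-way result matters: collapsing "not visible yet" and "ambiguous" into one null would either mark a paid order failed or adopt the wrong id.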

Phase A — DB-backed fulfiller worker (eterra.cf-epay) — THE KEYSTONE

A pg-boss worker in gis-sync-orchestrator (next to enrichment-drainer, cron 12 min). The CfExtract row IS the work item — no in-memory queue.

  • Claim: SELECT … FROM gis_enrichment."CfExtract" WHERE status='pending' AND type='epay' [AND account-compatible] ORDER BY "createdAt" FOR UPDATE SKIP LOCKED LIMIT N; UPDATE → status='claimed', claimedAt=now(). SKIP LOCKED → two instances never grab the same rows.
  • State machine (each transition = one UPDATE = a precise resumable marker): pending → claimed → cart → submitted_unconfirmed → polling → downloading → completed | review | failed | cancelled. Extend gis-api's ExtractStatus enum (gis-api/src/routes/enrichment.ts:9) with claimed, submitted_unconfirmed, review.
  • Crash recovery: a boot reaper requeues rows stuck in a transitional state past a heartbeat TTL. submitted_unconfirmed rows are resolved via the recover pattern (find the order at ANCPI, never re-charge). This structurally eliminates the in-memory-queue orphan class (criticals C2).
  • Idempotent submit: before EditCartSubmit, persist on the claimed rows the account's current latest orderId + the intended nrCadastral set. On timeout/crash, resume re-runs findNewOrderId against that snapshot — never adopts a stale id.
  • Port the hardened epay-client here (see Phase G — shared package).
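A minimal sketch of the claim query and the state machine above, assuming an "id" primary key and a quoted "claimedAt" column (neither is confirmed by the schema); the optional account-compatibility filter is left out. The reaper split is an interpretation of the crash-recovery bullet: only pre-submit rows are requeued, and anything that may already be paid goes through the idempotent recover path.

```typescript
// Hypothetical sketch of Phase A. Status names come from the plan; the "id"
// column, "claimedAt" spelling, $1 parameter and all function names are assumed.

const CLAIM_SQL = `
  WITH picked AS (
    SELECT id FROM gis_enrichment."CfExtract"
    WHERE status = 'pending' AND type = 'epay'
    ORDER BY "createdAt"
    LIMIT $1
    FOR UPDATE SKIP LOCKED   -- concurrent workers skip rows another one holds
  )
  UPDATE gis_enrichment."CfExtract" AS c
  SET status = 'claimed', "claimedAt" = now()
  FROM picked
  WHERE c.id = picked.id
  RETURNING c.*`;

type ExtractState =
  | "pending" | "claimed" | "cart" | "submitted_unconfirmed" | "polling"
  | "downloading" | "completed" | "review" | "failed" | "cancelled";

const TERMINAL = new Set<ExtractState>(["completed", "review", "failed", "cancelled"]);
const PRE_SUBMIT = new Set<ExtractState>(["pending", "claimed", "cart"]);

// Happy-path successor of each non-terminal state.
const NEXT: Partial<Record<ExtractState, ExtractState>> = {
  pending: "claimed",
  claimed: "cart",
  cart: "submitted_unconfirmed",
  submitted_unconfirmed: "polling",
  polling: "downloading",
};

function canTransition(from: ExtractState, to: ExtractState): boolean {
  if (TERMINAL.has(from)) return false;
  if (NEXT[from] === to) return true;
  if (from === "downloading") return TERMINAL.has(to);
  // Assumption: only rows that were never submitted may fail or be cancelled
  // early; after submit a row only advances (invariant 1).
  return PRE_SUBMIT.has(from) && (to === "failed" || to === "cancelled");
}

type ReaperAction = "none" | "requeue" | "recover";

// Boot reaper decision for one row, given how stale its heartbeat is.
function reaperAction(state: ExtractState, heartbeatAgeMs: number, ttlMs: number): ReaperAction {
  if (TERMINAL.has(state) || state === "pending" || heartbeatAgeMs <= ttlMs) return "none";
  // Nothing was charged yet; any leftover cart row is caught by the
  // cart-hygiene check of the next batch on that account.
  if (state === "claimed" || state === "cart") return "requeue";
  // submitted_unconfirmed, polling, downloading: find the order, never re-charge.
  return "recover";
}
```

Keeping the transition table as data makes it easy to reuse in the gis-api enum migration and in tests.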

Phase B — epay_accounts pool with one-batch-per-account lock

Mirror gis_meta.eterra_accounts (busc-infra migration 004): AES-256-GCM creds, status active/blocked/retired, blocked_reason, credits_cached, optional hourly cap, in_flight_batch_id.

  • pickEpayAccount: FOR UPDATE SKIP LOCKED, but because ePay's cart is global per account, atomically set in_flight_batch_id (status busy) so no second batch can touch that account's cart. This is the structural fix for cart contamination (C1) in the pooled world.
  • Refuse to claim a batch larger than the account's cached credits. ePay credits are a hard consumable (real money) — unlike the soft eTerra quota, the credit cap is mandatory, not advisory.
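The lease can be modelled in memory to pin down its semantics. In Postgres the same step would be a single UPDATE ... SET in_flight_batch_id = ..., status = 'busy' WHERE id = (SELECT id ... FOR UPDATE SKIP LOCKED LIMIT 1). This sketch assumes one credit per extract and ignores the optional hourly cap; pickEpayAccount follows the plan's name, while releaseEpayAccount and the field types are hypothetical.

```typescript
// In-memory model of the Phase B account lease, for illustration only.
// Assumption: one ePay credit per extract; hourly cap not modelled.

type EpayAccountStatus = "active" | "busy" | "blocked" | "retired";

interface EpayAccount {
  id: string;
  status: EpayAccountStatus;
  credits_cached: number;
  in_flight_batch_id: string | null;
}

function pickEpayAccount(pool: EpayAccount[], batchId: string, batchSize: number): EpayAccount | null {
  const acct = pool.find(
    (a) => a.status === "active" && a.in_flight_batch_id === null && a.credits_cached >= batchSize,
  );
  // Credits are real money: if no account can pay for the whole batch, refuse.
  if (!acct) return null;
  acct.status = "busy";
  acct.in_flight_batch_id = batchId; // no other batch may touch this account's cart now
  return acct;
}

function releaseEpayAccount(acct: EpayAccount, batchId: string, creditsSpent: number): void {
  if (acct.in_flight_batch_id !== batchId) throw new Error("lease not held by this batch");
  acct.credits_cached -= creditsSpent;
  acct.in_flight_batch_id = null;
  acct.status = "active";
}
```

Returning null rather than a partially funded account keeps the "credit cap is mandatory" rule in one place; the caller can retry with a smaller batch.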

Phase C — Catalog dedup (largest recurring economic win)

CfExtractCatalog is written only on the cf-intern path today; nothing writes it when a paid ePay order completes → a paid extract by tenant A is never "fresh" for tenant B, so the 30-day money-saver is structurally unrealized.

  • On ePay completion, upsert CfExtractCatalog(nrCadastral, latestId, expiresAt=documentDate+30d, isFresh=true).
  • POST /enrichment/cf/claim {nrCadastral}: on a catalog hit, create a B-owned row type='catalog', status='completed', creditsUsed=0 pointing at the shared MinIO object (or a copy). This turns today's 409 catalog_hit (enrichment.ts:226) into instant, free fulfillment. RLS unchanged (B reads B's row). One paid extract serves every tenant that needs that parcel within 30 days, at marginal zero ANCPI cost.
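A minimal sketch of the write-on-completion and claim-on-hit pair, with a Map standing in for CfExtractCatalog. The objectKey field, the "keep the entry that expires later" upsert rule and both function names are assumptions; expiresAt = documentDate + 30 days follows the plan.

```typescript
// Hypothetical sketch of Phase C catalog dedup; a Map replaces the table.

const DAY_MS = 24 * 60 * 60 * 1000;
const FRESH_DAYS = 30;

interface CatalogEntry {
  nrCadastral: string;
  latestId: string; // extract that last refreshed this parcel
  objectKey: string; // shared MinIO object (hypothetical field)
  expiresAt: Date;
  isFresh: boolean;
}

interface CatalogClaim {
  tenant: string;
  nrCadastral: string;
  type: "catalog";
  status: "completed";
  creditsUsed: 0;
  objectKey: string;
}

// Called when a paid ePay extract completes.
function upsertCatalogOnEpayCompletion(
  catalog: Map<string, CatalogEntry>,
  nrCadastral: string,
  latestId: string,
  objectKey: string,
  documentDate: Date,
): CatalogEntry {
  const expiresAt = new Date(documentDate.getTime() + FRESH_DAYS * DAY_MS);
  const existing = catalog.get(nrCadastral);
  // Assumption: keep whichever extract stays fresh longer.
  if (existing && existing.expiresAt.getTime() >= expiresAt.getTime()) return existing;
  const entry: CatalogEntry = { nrCadastral, latestId, objectKey, expiresAt, isFresh: true };
  catalog.set(nrCadastral, entry);
  return entry;
}

// POST /enrichment/cf/claim: a fresh hit becomes a free row owned by the caller.
function claimFromCatalog(
  catalog: ReadonlyMap<string, CatalogEntry>,
  tenant: string,
  nrCadastral: string,
  now: Date,
): CatalogClaim | null {
  const hit = catalog.get(nrCadastral);
  if (!hit || !hit.isFresh || now.getTime() >= hit.expiresAt.getTime()) return null; // paid path
  return { tenant, nrCadastral, type: "catalog", status: "completed", creditsUsed: 0, objectKey: hit.objectKey };
}
```

The claimed row belongs to the calling tenant, so RLS needs no special case; only the MinIO object is shared.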

Phase D — Credential model (tenant-policy-driven)

Store the strategy per-tenant; don't pick one globally:

  • Internal Beletage group → pooled company accounts (Infisical, encrypted in epay_accounts). Best batching + catalog sharing; per-credit attribution via audit.
  • External paying tenants (eterra-live model) → dedicated per-tenant accounts so credits/billing stay clean.
  • Record account_id + creditsUsed on every CfExtract for attribution regardless.
  • All three apps converge as thin callers of POST /enrichment/cf (Authentik multi-issuer + tenant claim already in place, gis-api/src/lib/auth.ts).
  • Reuse eterra-live crypto.ts (AES-256-GCM) + a 1-byte key-version prefix for rotation.
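For the credential bullet, here is a standalone sketch of AES-256-GCM with a 1-byte key-version prefix using Node's node:crypto. The blob layout [version | 12-byte IV | 16-byte tag | ciphertext] and binding the version byte as additional authenticated data are assumptions; check both against eterra-live crypto.ts before reading any stored credentials with it.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical sketch: AES-256-GCM with a 1-byte key-version prefix.
// Assumed blob layout: [version:1][iv:12][authTag:16][ciphertext].
type KeyRing = Map<number, Buffer>; // key version (0..255) -> 32-byte key

const IV_LEN = 12;
const TAG_LEN = 16;
const HEADER_LEN = 1 + IV_LEN + TAG_LEN;

function encryptSecret(plaintext: string, keys: KeyRing, version: number): Buffer {
  if (!Number.isInteger(version) || version < 0 || version > 255) {
    throw new Error(`key version must fit in one byte, got ${version}`);
  }
  const key = keys.get(version);
  if (!key || key.length !== 32) throw new Error(`no 32-byte key for version ${version}`);
  const iv = randomBytes(IV_LEN); // fresh nonce per encryption; never reuse with the same key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  cipher.setAAD(Buffer.from([version])); // authenticate the version byte too
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([Buffer.from([version]), iv, cipher.getAuthTag(), ciphertext]);
}

function decryptSecret(blob: Buffer, keys: KeyRing): string {
  if (blob.length < HEADER_LEN) throw new Error("blob too short");
  const version = blob[0]; // selects the key, so old blobs still decrypt after rotation
  const key = keys.get(version);
  if (!key) throw new Error(`unknown key version ${version}`);
  const decipher = createDecipheriv("aes-256-gcm", key, blob.subarray(1, 1 + IV_LEN));
  decipher.setAAD(blob.subarray(0, 1));
  decipher.setAuthTag(blob.subarray(1 + IV_LEN, HEADER_LEN)); // final() throws on tampering
  const plain = Buffer.concat([decipher.update(blob.subarray(HEADER_LEN)), decipher.final()]);
  return plain.toString("utf8");
}
```

Rotation then becomes: add the new key version to the ring, encrypt new writes with it, and re-encrypt old rows lazily by decrypting with whatever version their first byte names.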

Phase E — gis-api gaps for async consumption

  1. Completion webhook/SSE, tenant-scoped + RLS-filtered (GET /enrichment/cf/events) → kills polling and the dead-Brevo dependency.
  2. Bulk-zip GET /enrichment/cf/zip?orderId= streaming from MinIO (port the V3 streaming-zip approach).
  3. ExtractStatus enum additions (see Phase A).
  4. List filters creditsUsed=0 / type='catalog' so the UI can label shared extracts.
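For item 1, a minimal sketch of the tenant-scoped Server-Sent Events framing. The event shape, the "cf-extract" event name and both function names are hypothetical; the in-process tenant filter is shown only as defence in depth on top of the RLS-filtered query, not as a replacement for it.

```typescript
// Hypothetical sketch of frames for GET /enrichment/cf/events.

interface CfEvent {
  id: string;
  tenant: string;
  extractId: string;
  status: string;
}

// One SSE frame: "id:" lets a reconnecting client resume via Last-Event-ID;
// JSON.stringify emits no newlines, so a single "data:" line is safe.
function toSseFrame(ev: CfEvent): string {
  const data = JSON.stringify({ extractId: ev.extractId, status: ev.status });
  return `id: ${ev.id}\nevent: cf-extract\ndata: ${data}\n\n`;
}

// Only the caller's tenant ever reaches its stream.
function framesForTenant(events: readonly CfEvent[], tenant: string): string[] {
  return events.filter((e) => e.tenant === tenant).map(toSseFrame);
}
```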

Phase F — Reversible migration, per-tenant flip

  • Phase 0 (now): EPAY_ORDERING_VIA_GIS_AC=false, hardened legacy queue is the sole fulfiller. /api/ancpi/recover stays as the manual safety net.
  • Phase 1: deploy worker + pool + catalog-write; seed epay_accounts with ONLY the Beletage account; flip the flag for claims.tenant === 'architools'.
  • Phase 2: run both paths in parallel a grace window; reconcile on orderId (no double charge).
  • Phase 3: onboard external tenants with dedicated accounts; delete epay-queue / epay-client / epay-session-store + src/app/api/ancpi/* from ArchiTools. The flag is the kill-switch throughout.
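The flip and the Phase 2 reconciliation can be sketched as two small functions. Only EPAY_ORDERING_VIA_GIS_AC comes from this plan; the tenant allow-list, routeEpayOrder, reconcile and the PaidOrder shape are hypothetical. The reconciliation is one reading of "reconcile on orderId": an orderId recorded by both paths is the same ANCPI order seen twice, while a parcel paid under two different orderIds in the grace window is a likely double charge.

```typescript
// Hypothetical sketch of Phase F routing and Phase 2 reconciliation.

type EpayPath = "legacy_queue" | "gis_ac_worker";

function routeEpayOrder(flagEnabled: boolean, enabledTenants: ReadonlySet<string>, tenant: string): EpayPath {
  // The global flag is the kill-switch: off sends every tenant to the legacy queue.
  if (!flagEnabled) return "legacy_queue";
  return enabledTenants.has(tenant) ? "gis_ac_worker" : "legacy_queue";
}

interface PaidOrder {
  orderId: string;
  nrCadastral: string;
}

function reconcile(legacy: readonly PaidOrder[], worker: readonly PaidOrder[]) {
  const legacyIds = new Set(legacy.map((o) => o.orderId));
  // Same order recorded by both paths: merge the records, no extra charge.
  const sharedOrderIds = Array.from(new Set(worker.map((o) => o.orderId))).filter((id) => legacyIds.has(id));
  const ordersByParcel = new Map<string, Set<string>>();
  for (const o of legacy.concat(worker)) {
    const ids = ordersByParcel.get(o.nrCadastral) ?? new Set<string>();
    ids.add(o.orderId);
    ordersByParcel.set(o.nrCadastral, ids);
  }
  // Same parcel under two distinct orders within the window: flag for review.
  const paidTwice = Array.from(ordersByParcel)
    .filter(([, ids]) => ids.size > 1)
    .map(([nr]) => nr);
  return { sharedOrderIds, paidTwice };
}
```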

Phase G — Shared epay-client package (do regardless of phase)

ArchiTools and eterra-live each have a near-identical epay-client.ts that has already diverged dangerously: ArchiTools got today's fixes; eterra-live got the method-internal ports (commit eterra-live d30128b) but lacks cart-hygiene + the per-page parser refactor. Extract @beletage/epay-client (natural home: gis-sync-orchestrator, which owns the account pool) so a fix lands once. Until then, any epay-client change MUST be mirrored to both repos in the same change.

Known follow-ups not yet done

  • eterra-live still lacks the cart-hygiene numberOfItems invariant (the single-order flow makes it lower-risk, but a crashed prior order can still orphan a cart row). It needs a route-level change plus testing on that product before shipping.
  • BREVO_API_KEY returns 401 "Key not found" → ArchiTools email notifications are dead; the correct fix is the Phase E webhook, not patching Brevo. SMTP relay creds still work.
  • ArchiTools auth-options.ts has pre-existing react-hooks/rules-of-hooks lint errors on the useGisAcFlag/useBasicPanelFlag session calls (tolerated by next build).