RFC: Continuous Recon Checks

Date: 2026-06-10

Status: Draft

Scope: recon-engine; the four quiet-window invariant checks

A sidequest off No-Downtime Upgrades §5. Four recon-engine checks are pinned to the daily downtime window because their two sides have no common clock. That RFC argues the fix is epoch-pinned as-of comparison. This RFC says how we get there safely: do not touch the existing checks. Build the continuous versions as new checks, run both side by side in signal-only mode, and let weeks of recorded agreement, not code review, decide when the new ones earn paging duty. The old checks are financial-integrity tripwires; we don't swap a tripwire, we run its replacement in parallel until it has proven it trips on the same things and nothing else.

1. What gets built

One new check per quiet-window check, registered alongside the old one in all_checks with its own id() and a frequent default schedule. Old and new differ only in how they read, never in what invariant they assert.

Existing (windowed, pages) | New (continuous, signal-only) | Mechanism change
-------------------------- | ----------------------------- | ----------------
transactions-balance | transactions-balance-live | Read (sum of balances, epoch E) atomically from PG; compare against CH transaction sums as of E.
position-realized-pnls-balance | position-realized-pnls-balance-live | Pin both CH tables to the same epoch: positions as-of E vs transactions as-of E.
pnl-is-zero-sum | pnl-is-zero-sum-live | Already cycle-pinned (#1693); the live variant runs on every mark cycle instead of twice daily, measuring whether the full-scan cost is actually tolerable at frequency.
omnibus-balances-square | omnibus-balances-square-live | Precondition becomes "no in-flight custody transfers" (detected), not "exchange downtime" (scheduled); skip the run when transfers straddle the reads.

Default schedule for the live checks: 0 */5 * * * * (matching the cheap checks), except pnl-is-zero-sum-live which keys off mark cycles and omnibus-balances-square-live which runs every 15 minutes subject to its precondition.

Epoch plumbing

The shared mechanism, per the parent RFC: the trade-engine's tandem commit (state_machine_driver.rs) writes balances, transactions, and the resume token with latest_execution_ns atomically in one PG transaction, and CH writes for a batch are issued before that batch's token commits. So:

  1. PG side: a new query reads the invariant aggregates and trade_engine.resume_tokens.latest_execution_ns in a single repeatable-read transaction. The pair is a consistent cut at a known epoch E by construction.
  2. CH side: query as of E. query_sum_by_transaction_type_as_of already exists; positions need an as-of variant (filter timestamp_ns <= E and select with argMax; never trust row order).
  3. Visibility horizon: async inserts mean CH rows through E may lag a few seconds behind the PG commit. The check polls CH's max ingested timestamp_ns and proceeds once it reaches E, with a bounded retry budget; exhausting the budget is an Error (infra signal), not a Fail (invariant signal). Conflating the two would poison the soak data.
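
The visibility-horizon wait in step 3 could be sketched roughly as below. This is a minimal illustration, not the planned helper: `wait_for_horizon`, the `Horizon` enum, and the `max_ingested_ns` closure (standing in for a CH query that returns the max ingested timestamp_ns) are all hypothetical names, and the fixed backoff is an assumption.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of waiting for ClickHouse to catch up to epoch E (hypothetical type).
#[derive(Debug, PartialEq)]
enum Horizon {
    /// CH has ingested rows through at least E; the as-of query may run.
    Reached { attempts: u32 },
    /// Retry budget exhausted; the check should report Error, not Fail.
    BudgetExhausted { last_seen_ns: u64 },
}

/// Polls `max_ingested_ns` until it reaches `epoch_ns`, for at most
/// `max_attempts` polls, sleeping `backoff` between polls.
/// `max_ingested_ns` is an assumed stand-in for the real CH query.
fn wait_for_horizon<F>(
    epoch_ns: u64,
    max_attempts: u32,
    backoff: Duration,
    mut max_ingested_ns: F,
) -> Horizon
where
    F: FnMut() -> u64,
{
    let mut last_seen_ns = 0;
    for attempt in 1..=max_attempts {
        last_seen_ns = max_ingested_ns();
        if last_seen_ns >= epoch_ns {
            return Horizon::Reached { attempts: attempt };
        }
        // No point sleeping after the final poll.
        if attempt < max_attempts {
            sleep(backoff);
        }
    }
    Horizon::BudgetExhausted { last_seen_ns }
}
```

Keeping the exhausted-budget case as its own variant makes it hard to accidentally record an ingestion lag as an invariant failure; the caller maps `BudgetExhausted` to Error and only ever produces Fail from the comparison itself.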

Skip semantics

CheckStatus is Pass | Fail | Error. The omnibus live check needs a fourth outcome: "precondition not met, nothing asserted." Add CheckStatus::Skipped, recorded in invariants_log like any other status. A skip is not a pass: the soak analysis must track skip rate, because a check that skips 95% of the time is not a viable replacement, and we want to learn that during the soak, not after promotion.
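
As a rough sketch of what the extra variant and the skip-rate measurement might look like: the variant names come from this RFC, but the derives, the `asserted` helper, and `skip_rate` are illustrative assumptions rather than the real definitions in the codebase.

```rust
/// Outcome of one check run. `Skipped` is the variant proposed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckStatus {
    Pass,
    Fail,
    Error,
    Skipped,
}

impl CheckStatus {
    /// True only when the invariant was actually evaluated.
    /// Error and Skipped both mean "nothing asserted" (hypothetical helper).
    fn asserted(self) -> bool {
        matches!(self, CheckStatus::Pass | CheckStatus::Fail)
    }
}

/// Fraction of runs that were skipped; None for an empty history.
fn skip_rate(runs: &[CheckStatus]) -> Option<f64> {
    if runs.is_empty() {
        return None;
    }
    let skipped = runs.iter().filter(|s| **s == CheckStatus::Skipped).count();
    Some(skipped as f64 / runs.len() as f64)
}
```

Treating Skipped as its own variant, rather than folding it into Pass, is what lets the soak analysis count it separately.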

Signal-only mode

New checks must be incapable of paging during the soak. Extend the per-check CheckConfig (config.rs) with paging: bool (default true); the incident.io sink consults it before posting. The four live checks ship with paging = false in the watch config. Failures still log at warn and land in invariants_log, visible everywhere except the pager. This flag is also the promotion lever: promotion day is a config flip, not a deploy.
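
A minimal sketch of the flag and the gate, under stated assumptions: only the `paging` field and its default of true come from this RFC. The struct shape, the `schedule` field and its placeholder default, and the `should_page` and `promote` helpers are hypothetical; the real CheckConfig in config.rs and the incident.io sink will differ.

```rust
/// Per-check configuration. Only `paging` is the field proposed here;
/// `schedule` and its default value are placeholders for illustration.
#[derive(Debug, Clone, PartialEq)]
struct CheckConfig {
    schedule: String,
    paging: bool,
}

impl Default for CheckConfig {
    fn default() -> Self {
        CheckConfig {
            schedule: "0 */5 * * * *".to_string(),
            // Defaults to true so existing checks keep paging unchanged.
            paging: true,
        }
    }
}

/// Gate applied before posting an incident. `wants_to_page` stands for
/// whatever decision the sink already makes today; this only adds the flag.
fn should_page(cfg: &CheckConfig, wants_to_page: bool) -> bool {
    cfg.paging && wants_to_page
}

/// Promotion as a pure config flip: the live check takes over paging duty.
fn promote(old: &mut CheckConfig, live: &mut CheckConfig) {
    live.paging = true;
    old.paging = false;
}
```

Because the gate sits in the sink rather than in the check, a live check with paging = false still runs, logs, and records its result exactly as it will after promotion.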

2. The soak

Both check families run concurrently. Old checks keep their schedules and their paging duty; nothing about the existing safety posture changes. The live checks accumulate a record in invariants_log under their own check_ids, which makes the comparison a query, not a judgment call.

Sequencing: land on ax-demo first and let it run for ~1 week to shake out check bugs (wrong as-of filter, visibility-horizon misjudgment) where divergence is harmless. Then enable on prod for a 4-week soak.

Divergence scan (weekly, scripted; same discipline as the position-feed promotion soak): for each old-check fire time during the window, join against the nearest live-check run within ±10 minutes and compare status and left/right values. Separately, enumerate every live-check Fail, Error, and Skipped episode for adjudication.
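
The old-vs-live join could look roughly like the following in-memory sketch. The `Run` and `Verdict` types are hypothetical, and two details are assumptions rather than facts about invariants_log: values are modeled as integer cents, and "penny tolerance" is read as an absolute difference of at most one cent.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckStatus { Pass, Fail, Error, Skipped }

/// One recorded run (hypothetical shape; values modeled as integer cents).
#[derive(Debug, Clone, Copy)]
struct Run {
    at_secs: i64,
    status: CheckStatus,
    left_cents: i64,
    right_cents: i64,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Agree,
    Diverge,
    /// No live run within the window; reported rather than silently dropped.
    NoLiveRunInWindow,
}

const WINDOW_SECS: i64 = 10 * 60; // the ±10 minute join window
const TOLERANCE_CENTS: i64 = 1; // assumption: penny tolerance = 1 cent

/// Joins one old-check fire against the nearest live run inside the window.
fn compare(old: &Run, live_runs: &[Run]) -> Verdict {
    let nearest = live_runs
        .iter()
        .filter(|r| (r.at_secs - old.at_secs).abs() <= WINDOW_SECS)
        .min_by_key(|r| (r.at_secs - old.at_secs).abs());
    match nearest {
        None => Verdict::NoLiveRunInWindow,
        Some(live) => {
            let same_status = live.status == old.status;
            let values_close = (live.left_cents - old.left_cents).abs() <= TOLERANCE_CENTS
                && (live.right_cents - old.right_cents).abs() <= TOLERANCE_CENTS;
            if same_status && values_close { Verdict::Agree } else { Verdict::Diverge }
        }
    }
}
```

Surfacing a missing live run as its own verdict keeps a gap in live-check coverage from reading as agreement.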

Acceptance criteria

The live checks earn paging duty when, over the full prod soak:

  1. Agreement at the overlap. At every old-check fire, the nearest live-check run agrees on status, and left/right values match within the existing penny tolerance. The old checks run at the one moment both methodologies are valid, so disagreement there means the live check is wrong, full stop.
  2. Zero unexplained failures. Every live-check Fail is adjudicated: either it reflects a real invariant break confirmed by the old check or an incident (which is the live check working, and catching it hours earlier), or it is root-caused as a check bug, fixed, and that check's soak clock restarts.
  3. Error and skip budgets. At least 99% of scheduled runs complete without Error (excluding platform-wide outages). The omnibus-balances-square-live skip rate must be low enough that the check still asserts the invariant at least daily.
  4. Cost at frequency. p99 check duration and CH read volume at the 5-min cadence stay within budgets agreed with whoever owns the CH bill; this is where pnl-is-zero-sum-live either proves the full scan tolerable or makes the case for the accumulator follow-up.
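
Criterion 3 is mechanical enough to script. The sketch below is one possible reading, not an agreed definition: the function names are hypothetical, runs are assumed to be sorted by time with platform-wide outages already filtered out by the caller, and "at least daily" is interpreted as no gap longer than 24 hours between asserting (Pass or Fail) runs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckStatus { Pass, Fail, Error, Skipped }

const SECS_PER_DAY: u64 = 86_400;

/// True when at least 99% of runs finished without Error.
/// Integer arithmetic avoids float rounding right at the threshold:
/// errors / total <= 1%  is the same as  errors * 100 <= total.
fn within_error_budget(runs: &[(u64, CheckStatus)]) -> bool {
    let errors = runs.iter().filter(|(_, s)| *s == CheckStatus::Error).count();
    !runs.is_empty() && errors * 100 <= runs.len()
}

/// True when no stretch longer than one day passes without an asserting run.
/// Assumes `runs` is sorted by timestamp (seconds) and lies inside the window.
fn asserts_at_least_daily(
    runs: &[(u64, CheckStatus)],
    window_start: u64,
    window_end: u64,
) -> bool {
    let mut last = window_start;
    for &(at, status) in runs {
        if matches!(status, CheckStatus::Pass | CheckStatus::Fail) {
            if at.saturating_sub(last) > SECS_PER_DAY {
                return false;
            }
            last = at;
        }
    }
    // The tail of the window counts as a gap too.
    window_end.saturating_sub(last) <= SECS_PER_DAY
}
```

A calendar-day reading of "daily" would be slightly more lenient than the 24-hour gap rule; whichever is chosen should be fixed before the soak starts so the criterion is not tuned after the fact.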

Promotion and retirement

When the criteria hold: flip paging = true on the live checks, flip paging = false on the old ones, and keep the old checks running on their downtime schedule for two further weeks as a silent cross-check. Then disable the old checks via config (enabled = false), and remove the code one release later. The -live ids stay: renaming back would split each check's history in invariants_log across two ids for cosmetic benefit.

If criteria are not met by week 4, the soak doesn't auto-extend indefinitely: adjudicate, fix, and restart the affected check's clock. If a check fundamentally can't reach agreement (most plausible for the Anchorage-coupled one), document why and leave the old check in place. A partial promotion of three checks is still three checks the downtime window no longer holds hostage.

3. Non-goals

4. Work items

  1. CheckStatus::Skipped; paging flag in CheckConfig + incident.io sink respects it.
  2. PG epoch read (aggregates + latest_execution_ns, one repeatable-read txn); CH positions as-of query; visibility-horizon retry helper.
  3. The four -live checks, registered in all_checks, paging = false.
  4. Divergence-scan script over invariants_log (old-vs-live join + episode enumeration), runnable by anyone, output checked into the soak log.
  5. Demo bake (~1 week) → prod soak (4 weeks) → promotion config flip → 2-week silent cross-check → remove old checks.

Companion to No-Downtime Upgrades §5. Code references as of 99138ce.