RFC: No-Downtime Upgrades

Date: 2026-06-07

Status: Draft

Scope: all AX runtime services

A unified discipline for upgrading every AX service without interrupting the exchange. We classify services by how they hold state, and give each class a deterministic upgrade procedure. The hard case — services that define financial integrity — is solved by stage & splice: run the candidate in parallel against shadow state, prove it byte-for-byte against the incumbent, then hand off authority at an agreed point in the dropcopy sequence.

1. Principles

Five invariants govern every upgrade, regardless of service class.

One writer per authority. For any piece of authoritative state, exactly one process may write it at any instant. An upgrade changes which process that is — it never transiently creates zero writers (downtime) or two (corruption).
State is a replayable fold. The exchange's truth is a deterministic function of the EP3 dropcopy stream. Resume tokens checkpoint position; idempotent, content-addressed writes (by trade_id / execution_id) make re-processing a no-op. This is what makes a parallel candidate possible — it can independently re-derive the same state from the same stream.
Prove before you cut. A candidate never earns authority by assertion. It runs against shadow tables and is verified equal to the incumbent's output over a real overlapping window. Verification is the gate, not the deploy.
Cutover is a handoff, not a restart. The zero-writer window must have zero duration. The incumbent stops at a sequence point; the candidate resumes from that same point. The seam is a number, agreed in advance.
Reversible until the seam. Up to the splice point the candidate has touched only shadow state. Abandoning it is free. Risk is concentrated into a single, observable, pre-rehearsed instant.
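
A minimal sketch of the "replayable fold" invariant, assuming a simplified in-memory model. `Execution`, `State`, `apply_execution`, and `fold` are illustrative names, not the trade-engine's actual types. It shows two properties the rest of this RFC leans on: an independent fold of the same stream yields the same state, and re-processing input that was already seen changes nothing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Execution:
    seq: int        # position in the stream (stand-in for a resume token)
    trade_id: str   # content address used for deduplication
    account: str
    qty: int


@dataclass
class State:
    trades: dict = field(default_factory=dict)     # trade_id -> Execution
    positions: dict = field(default_factory=dict)  # account -> net quantity
    token: int = 0                                 # last sequence number folded


def apply_execution(state: State, ex: Execution) -> None:
    """Fold one execution into state; a repeated trade_id is a no-op."""
    if ex.trade_id not in state.trades:
        state.trades[ex.trade_id] = ex
        state.positions[ex.account] = state.positions.get(ex.account, 0) + ex.qty
    state.token = max(state.token, ex.seq)


def fold(stream, state=None) -> State:
    """Apply every execution in order, starting from `state` (or empty)."""
    if state is None:
        state = State()
    for ex in stream:
        apply_execution(state, ex)
    return state


stream = [
    Execution(1, "t-1", "alice", 10),
    Execution(2, "t-2", "bob", -10),
    Execution(3, "t-3", "alice", -4),
]

incumbent = fold(stream)
candidate = fold(stream)                   # independent re-derivation
replayed = fold(stream[1:], fold(stream))  # re-processing already-seen input
```

In this toy model `incumbent == candidate` and `replayed == incumbent` both hold, which is the property that lets a shadow candidate exist at all.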

2. Taxonomy of services

Every deployable AX service falls into one of three classes. The class is determined by one question: what does it mean for two copies to run at once?

Class	Defining property	Two copies at once?	Upgrade strategy
A · Replicable	Stateless-ish request servers; correctness independent of instance count.	Fine — that's the normal operating mode.	Blue-green / canary behind a proxy.
B · Intermittent	Scheduled/batch processors; idle between runs.	Avoid by construction — deploy in the idle gap.	Swap in the window.
C · Obligate singleton	Owns authoritative financial state; must be the sole writer.	Forbidden for writes; allowed for a verified shadow.	Stage & splice.

Where each service lands

Service	Class	Why
`order-gateway`	A	Per-connection order entry; in-memory state is per-session and already has graceful drain (`cancel_on_disconnect/shutdown.rs`).
`api-gateway`	A	HTTP request server; reads Postgres/ClickHouse. Balance mutations are guarded by `pg_advisory_xact_lock`, so concurrent instances are safe.
`onboarding-gateway`	A	Pure HTTP API over Postgres.
`marketdata-publisher`	A	Broadcasts EP3 market data; keeps no durable cursor (always reconnects fresh). Clients tolerate reconnect.
`settlement-engine`	B	Cron daemon (FX/equities/futures settlement); date-keyed, runs on a schedule.
`recon-engine`	B	Reconciliation checks; read-only, scheduled or ad-hoc.
`index-publisher`	B	Fixed-interval index price publish; a skipped tick is recoverable via backfill.
`trade-engine`	C	Folds EP3 dropcopy into authoritative `trades`/`positions`; the canonical singleton.
`risk-engine`	C	Authoritative margin/buying-power state machine; gates order admission.
`risk-monitor`	C	Owns breach/limit state and fires enforcement; a second copy would double-enforce.
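
One way deploy tooling could encode the two tables above, so that the strategy is looked up rather than remembered. This is a sketch only: `ServiceClass`, `STRATEGY`, `SERVICES`, and `upgrade_strategy` are hypothetical names, and the mapping simply mirrors the classifications listed in this section.

```python
from enum import Enum


class ServiceClass(Enum):
    REPLICABLE = "A"
    INTERMITTENT = "B"
    OBLIGATE_SINGLETON = "C"


STRATEGY = {
    ServiceClass.REPLICABLE: "blue-green / canary behind a proxy",
    ServiceClass.INTERMITTENT: "swap in the window",
    ServiceClass.OBLIGATE_SINGLETON: "stage & splice",
}

SERVICES = {
    "order-gateway": ServiceClass.REPLICABLE,
    "api-gateway": ServiceClass.REPLICABLE,
    "onboarding-gateway": ServiceClass.REPLICABLE,
    "marketdata-publisher": ServiceClass.REPLICABLE,
    "settlement-engine": ServiceClass.INTERMITTENT,
    "recon-engine": ServiceClass.INTERMITTENT,
    "index-publisher": ServiceClass.INTERMITTENT,
    "trade-engine": ServiceClass.OBLIGATE_SINGLETON,
    "risk-engine": ServiceClass.OBLIGATE_SINGLETON,
    "risk-monitor": ServiceClass.OBLIGATE_SINGLETON,
}


def upgrade_strategy(service: str) -> str:
    """Look up the procedure for a service; never guess for an unknown one."""
    if service not in SERVICES:
        raise KeyError(f"{service!r} is unclassified; classify it before deploying")
    return STRATEGY[SERVICES[service]]
```

Raising on an unclassified service is a deliberate choice in this sketch: a new service with no class has no safe default procedure.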

Topology caveat. Today each environment is a single EC2 box running all services under one compose.yml behind one nginx. "Blue-green" here means two processes on the same host on different ports, with nginx shifting upstreams — not two hosts. The schemas below assume this; multi-host LB is a future extension, not a prerequisite.

3. Upgrade schema per taxon

Class A — Replicable: blue-green / canary

Correctness does not depend on instance count, so we lean on the proxy. The only real work is clean drain: in-flight requests and per-connection obligations (e.g. cancel-on-disconnect) must finish before the old process exits.

Start green. Launch the new version on a second port. Gate readiness on its own /health (and, for order-gateway, a live EP3 dropcopy).
Shift. Reload nginx to point new connections at green. Optionally weight (canary): a fraction first, watch heartbeats/error rates, then 100%.
Drain blue. Stop routing new traffic to blue; let it finish in-flight requests and CoD obligations within the drain budget, then exit. Reuse the existing order-gateway drain loop as the template for all Class-A services.
Done. Blue exits cleanly; green is sole serving version. No request was dropped.

Prerequisite to close: every Class-A service must honor SIGTERM with graceful shutdown. Today only order-gateway does; the rest rely on the docker stop timeout. Generalize the drain pattern before claiming no-downtime for A.
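
The generalized drain pattern might look roughly like the following. It is sketched in Python for brevity and is not a transcription of the existing order-gateway loop; `Drainer`, `try_begin`, `wait_idle`, and the drain budget parameter are all hypothetical. The idea: SIGTERM only flips a flag, new work is refused once draining starts, and the process exits when in-flight work reaches zero or the budget expires.

```python
import signal
import threading
import time


class Drainer:
    """Tracks in-flight work and lets shutdown wait for it to finish."""

    def __init__(self):
        # Condition() uses a re-entrant lock by default, so the signal
        # handler below can safely acquire it on the main thread.
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a request unless draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def start_draining(self) -> None:
        with self._cond:
            self._draining = True

    def wait_idle(self, budget_s: float) -> bool:
        """Block until nothing is in flight; False if the budget ran out."""
        deadline = time.monotonic() + budget_s
        with self._cond:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True


drainer = Drainer()


def install_sigterm_handler() -> None:
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, lambda signum, frame: drainer.start_draining())
```

A service's main loop would call `install_sigterm_handler()` at startup, wrap each request in `try_begin()` / `end()`, and on shutdown call `wait_idle(budget)` before exiting. A `False` return means the budget expired with work still in flight, which is worth logging as a failed drain.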

Class B — Intermittent: swap in the window

These don't serve continuous traffic, so there is a window where downtime disturbs nothing. The upgrade is choosing that window.

Find the gap. Compute the interval to the next scheduled fire from the shared scheduler (cron + tz). It must comfortably exceed deploy time.
Quiesce. Confirm no run is in progress (a mid-run swap is a Class-C problem in disguise — never interrupt an in-flight settlement). If one is active, wait for completion.
Swap. Replace the binary while idle. No traffic shaping needed.
Verify next run. The next scheduled fire executes on the new version; assert its output (recon checks, settlement artifacts) before considering the upgrade landed.

Jobs must be idempotent per period (date-/window-keyed) so a swap that straddles a boundary, or a re-run, cannot double-apply. This already holds for settlement (date-keyed) and index-publisher (backfillable).
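
The "find the gap" and "quiesce" checks, plus per-period idempotency, can be sketched as below. Everything here is an assumption for illustration: `safe_to_swap`, the `safety_factor` of 3 standing in for "comfortably exceed", and `run_once` with its in-memory ledger are not existing AX code. The next fire time is taken as an input rather than computed from the cron expression.

```python
from datetime import datetime, timedelta, timezone


def safe_to_swap(now, next_fire, deploy_time, run_in_progress, safety_factor=3.0):
    """Return (ok, reason) for swapping a Class-B binary right now."""
    if run_in_progress:
        return False, "run in progress; wait for completion"
    gap = next_fire - now
    needed = deploy_time * safety_factor
    if gap < needed:
        return False, f"gap {gap} is shorter than required {needed}"
    return True, "idle gap is wide enough"


def run_once(ledger: dict, period: str, compute):
    """Apply a period's job at most once, keyed by its date or window."""
    if period not in ledger:
        ledger[period] = compute(period)
    return ledger[period]


now = datetime(2026, 6, 7, 12, 0, tzinfo=timezone.utc)
decision = safe_to_swap(
    now,
    next_fire=now + timedelta(hours=4),
    deploy_time=timedelta(minutes=5),
    run_in_progress=False,
)
```

With a four-hour gap and a five-minute deploy, `decision[0]` is `True`. The `run_once` ledger is what makes a re-run or a boundary-straddling swap harmless: the second call for the same period returns the stored result instead of applying the job again.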

Class C — Obligate singleton: stage & splice

These define exchange financial integrity and admit exactly one writer. We cannot run two; we cannot afford a gap. The resolution is to let a candidate shadow — derive the same state from the same stream into separate tables — verify it, then hand off authority at an agreed sequence point. This is the heart of the RFC.

It works because Class-C services are shaped as EP3 dropcopy → fold → write state, and that fold is replayable and idempotent (resume tokens checkpoint the stream; writes are keyed by trade_id/execution_id). The candidate can independently reconstruct authoritative state and be compared, byte for byte, against the incumbent.

Stage

Deploy the candidate pointed at shadow storage — already a first-class config: trade-engine2 takes SHADOW_* Postgres for balances and resume tokens, falling back to prod only when unset (config.rs). It opens its own dropcopy subscription with its own resume token (service_id in trade_engine.resume_tokens).
Both processes consume the live dropcopy concurrently. The incumbent remains the sole writer of authoritative tables; the candidate writes only shadow tables. There is still exactly one authority.

Synchronize & verify

Let the candidate run until it has caught up — its latest_execution_with_token_ns tracks the incumbent within tolerance.
Diff shadow vs. authoritative over the overlapping window: trades, positions, balances. Equality over a real window is the gate. (Reuse admin-cli reconcile, which already targets the shadow DB.)
This window is also where bugs surface harmlessly — the candidate can be wrong, fixed, and restarted with zero exchange impact, because it only ever touched shadow state. (Cf. the A-3112 shadow repro tooling.)
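
The shape of the diff gate, as a sketch. This is not how `admin-cli reconcile` works internally; `diff_tables` and `gate_passes` are hypothetical, tables are modeled as plain dicts keyed by row id, and each row is assumed to carry a `seq` field locating it in the overlapping window.

```python
def diff_tables(authoritative: dict, shadow: dict, lo: int, hi: int) -> dict:
    """Compare rows whose sequence number falls in the window [lo, hi]."""

    def window(table):
        return {k: v for k, v in table.items() if lo <= v["seq"] <= hi}

    a, s = window(authoritative), window(shadow)
    return {
        "missing_in_shadow": sorted(a.keys() - s.keys()),
        "extra_in_shadow": sorted(s.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & s.keys() if a[k] != s[k]),
    }


def gate_passes(report: dict) -> bool:
    """The gate is equality: any discrepancy of any kind blocks cutover."""
    return not any(report.values())


authoritative = {
    "t-1": {"seq": 1, "qty": 10},
    "t-2": {"seq": 2, "qty": 5},
    "t-3": {"seq": 3, "qty": 7},
}
shadow = {
    "t-2": {"seq": 2, "qty": 5},
    "t-3": {"seq": 3, "qty": 8},   # candidate bug: wrong quantity
    "t-4": {"seq": 4, "qty": 1},   # candidate is ahead of the window
}

report = diff_tables(authoritative, shadow, lo=2, hi=3)
```

Restricting the comparison to the window both engines have fully processed keeps catch-up lag (`t-1`, `t-4` here) from being reported as a discrepancy, while the real mismatch on `t-3` still fails the gate.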

Splice

Once shadow is proven, pick a splice point — a resume token S in the dropcopy sequence — and execute a coordinated handoff:

incumbent:  … process(S-1) ; flush+commit through S-1 ; STOP (do not consume S)
                                          │
                                   splice point S
                                          │
candidate:  resume_token = S-1  →  receives batch S onward  →  becomes authority

The incumbent voluntarily stops short of S: it commits everything up to and including the batch ending at S-1, persists its checkpoint, publishes a durable "stopped at S-1" marker, and only then exits.
The candidate is reconfigured from shadow tables to authoritative tables and resumes from S-1 (recall EP3 redelivers the batch after the supplied token — see resume-token semantics). Idempotent writes make any one-batch overlap a no-op, so the seam is exact with no gap and no double-count.
Flip readers (order-gateway risk lookups, api-gateway) to the new authority. Now exactly one writer again — the new one.
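
A toy model of the handoff, under stated assumptions: `StopMarkerStore`, `Engine`, and `splice` are hypothetical, the "durable" marker is an in-memory object, and batches are numbered 1, 2, 3, … with the resume token being the last batch fully committed. The point it illustrates is the ordering: the candidate takes authority only after reading a marker that says exactly S-1.

```python
class StopMarkerStore:
    """Stand-in for a durable record of where the incumbent stopped."""

    def __init__(self):
        self._marker = None

    def publish(self, last_committed: int) -> None:
        if self._marker is not None:
            raise RuntimeError("stop marker already published")
        self._marker = last_committed

    def read(self):
        return self._marker


class Engine:
    def __init__(self, table: dict):
        self.table = table  # trade_id -> qty
        self.token = 0      # last batch fully committed

    def consume(self, stream, stop_before=None) -> None:
        """Fold batches after self.token, stopping before `stop_before`."""
        for seq, trades in stream:
            if seq <= self.token:
                continue
            if stop_before is not None and seq >= stop_before:
                break
            for trade_id, qty in trades:
                self.table.setdefault(trade_id, qty)  # idempotent by trade_id
            self.token = seq


def splice(incumbent, candidate, stream, s, markers, authoritative) -> None:
    # 1. Incumbent commits through S-1, publishes the marker, and stops.
    incumbent.consume(stream, stop_before=s)
    markers.publish(incumbent.token)
    # 2. Candidate refuses authority unless the marker says exactly S-1.
    if markers.read() != s - 1:
        raise RuntimeError("incumbent did not stop at S-1; aborting splice")
    # 3. Candidate switches from shadow to authoritative and resumes at S.
    candidate.table = authoritative
    candidate.token = s - 1
    candidate.consume(stream)


stream = [(seq, [(f"t-{seq}", seq * 10)]) for seq in range(1, 7)]
authoritative, shadow = {}, {}
incumbent, candidate = Engine(authoritative), Engine(shadow)
candidate.consume(stream[:2])  # candidate has been shadowing
markers = StopMarkerStore()
splice(incumbent, candidate, stream, 4, markers, authoritative)
```

After the call, the authoritative table holds batches 1 through 6 exactly once: 1 to 3 written by the incumbent, 4 to 6 by the candidate. If the marker does not match, the candidate has still touched only shadow state, so aborting remains free.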

The seam is the whole game. Zero-writer windows mean downtime; two-writer windows mean corruption. The handoff must be a single owned transition with a pre-agreed sequence number and a durable stop marker — never "stop the old one, then go start the new one and hope." Rehearse it against ax-demo before prod.

Validate the full lifecycle before rollout (per CLAUDE.md): client disconnect, server-initiated stop, crash mid-batch, and reconnect/recovery — for both incumbent-stop and candidate-resume paths. The three partial-write states (trades-only / trades+positions / +token) must each recover correctly across the splice, exactly as they do across a crash.

4. State of readiness

Capability	Status	Evidence / gap
Replayable stream + resume tokens	✅ Have	EP3 dropcopy; `trade_engine.resume_tokens`; documented semantics.
Idempotent, content-addressed writes	✅ Have	Three-state replay recovery in `trade-engine2`; dedup by `trade_id`/`execution_id`.
Shadow storage config	✅ Have	`SHADOW_*` Postgres; per-service resume tokens; `admin-cli reconcile` targets shadow.
Parallel candidate engines	✅ Have	`trade-engine2`, `risk-engine2` exist; gold image ships all binaries.
Graceful drain (Class A)	◐ Partial	Only `order-gateway`; generalize SIGTERM drain to all A services.
Proxy traffic shaping	◐ Partial	nginx reload works; no weighted upstreams / upstream health checks yet.
Coordinated splice protocol	✗ Missing	No "stop-at-`S`" handshake or durable stop marker. The key build item.
Automated shadow-vs-prod diff gate	✗ Missing	Reconcile exists; wire it into a go/no-go cutover gate.
Quiet-window-free invariant checks	✗ Missing	Four recon checks are pinned to the daily downtime window; need epoch-pinned as-of comparison (§5).

5. Blocker: invariant checks that assume a quiet exchange

This RFC removes the exchange's need for downtime — but some of our verification currently depends on it. Four recon-engine checks default to the scheduled daily downtime window (0 5,20 16 * * * US/Eastern) precisely because they cannot produce a trustworthy answer while the exchange is moving:

Check	Compares	Why it needs quiet
`transactions-balance`	PG `balances` vs CH `transactions` sums	The two stores are written at different times; no common clock.
`position-realized-pnls-balance`	CH `positions` vs CH `transactions`	Same store, but the two tables are inserted at different times.
`omnibus-balances-square`	Anchorage custodian vs PG balances − CH PnL	Three sources, one of them an external party.
`pnl-is-zero-sum`	realized vs unrealized PnL	Already pinned to an atomic mark cycle (#1693); kept in the window for full-scan cost.

In a continuously-deployed, continuously-running world there is no moment where "the exchange is quiet" is guaranteed. Either these invariants stop being checked (unacceptable — they are the financial-integrity tripwires that make aggressive deployment safe), or they must become valid while the exchange runs. Worse, the splice gate of §3C needs exactly this capability: proving shadow equals authoritative is a cross-store comparison performed while both engines are processing live flow. Treat this as a blocker for retiring the downtime window, and as a prerequisite for the Class-C cutover gate.

Why quiet currently substitutes for synchronization

Each check samples two (or three) stores independently. A live exchange means the sample on one side includes events the other side hasn't absorbed yet — the comparison is between two different points in the dropcopy sequence, and any nonzero diff is uninterpretable (bug, or just skew?). A quiet exchange freezes the sequence, making "whenever you happen to read" a consistent cut. Quiescence is a poor man's snapshot barrier.

Architecture: epoch-pinned cuts instead of quiescence

The system already has the logical clock these checks lack — the dropcopy sequence — and the trade-engine fold already exposes it transactionally:

PG balances, transactions, and the resume token (with latest_execution_ns) commit atomically in one PG transaction (the tandem commit in state_machine_driver.rs). Any PG snapshot is therefore a consistent cut at a known epoch E.
CH writes for a batch land before that batch's token commits (the dropcopy restart fence depends on this same ordering). Observing epoch E in PG implies CH has all rows through E, modulo async-insert visibility.
The CH side already supports as-of reads (query_sum_by_transaction_type_as_of); pnl-is-zero-sum already pins both of its sides to a single atomic mark-cycle timestamp — the pattern works in production today, within one store.

The generalization: a check reads (aggregates, E) from PG in one repeatable-read transaction, waits out the CH visibility horizon, queries CH as of E, and compares. Both sides now describe the same point in the sequence — exact at any throughput, no quiet window, runnable every five minutes like the cheap checks. This converts transactions-balance and position-realized-pnls-balance (and the gate of §3C) outright.
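
The generalized check, sketched against toy stores. The three callables are placeholders for real queries, and `ch_visible_through` in particular is an assumed way of asking ClickHouse how far its visible data extends; the actual mechanism for waiting out async-insert visibility is not specified here.

```python
import time


def epoch_pinned_check(read_pg_cut, ch_visible_through, ch_sum_as_of,
                       timeout_s=5.0, poll_s=0.01):
    """Compare PG and CH totals at the same epoch E.

    read_pg_cut():        returns (total, epoch) from one consistent snapshot
    ch_visible_through(): returns the highest epoch fully visible in CH
    ch_sum_as_of(epoch):  returns the CH total restricted to rows <= epoch
    """
    pg_total, epoch = read_pg_cut()
    deadline = time.monotonic() + timeout_s
    while ch_visible_through() < epoch:  # wait out the visibility horizon
        if time.monotonic() >= deadline:
            raise TimeoutError(f"CH not visible through epoch {epoch}")
        time.sleep(poll_s)
    ch_total = ch_sum_as_of(epoch)
    return {"epoch": epoch, "pg": pg_total, "ch": ch_total,
            "equal": pg_total == ch_total}


# Toy stores: CH has already absorbed epoch 4; the PG snapshot is at epoch 3.
ch_rows = [(1, 5), (2, 7), (3, -2), (4, 9)]  # (epoch, amount)
pg_cut = (10, 3)                              # (sum through epoch 3, epoch)

# Naive comparison: "whatever is there now" on both sides.
naive_equal = sum(amount for _, amount in ch_rows) == pg_cut[0]

result = epoch_pinned_check(
    read_pg_cut=lambda: pg_cut,
    ch_visible_through=lambda: max(epoch for epoch, _ in ch_rows),
    ch_sum_as_of=lambda e: sum(amount for epoch, amount in ch_rows if epoch <= e),
)
```

Here the naive comparison reports a mismatch (19 against 10) that is pure skew, while the pinned comparison at epoch 3 finds both sides equal to 10. A timeout is reported as its own failure rather than as a diff, so "CH is behind" is never confused with "the books do not balance".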

Two refinements ride along:

Incremental accumulators. The full-scan cost that keeps pnl-is-zero-sum in the window grows without bound as history accumulates. Have the fold (or a SummingMergeTree materialized view keyed by epoch) maintain running per-type aggregates, so a check is an O(1) read of two accumulator rows at epoch E. Stronger still: assert the per-batch delta invariant in the fold itself (each batch's PnL deltas sum to zero). The global invariant then holds by induction from one audited base, and the periodic scan becomes a belt-and-suspenders audit rather than the primary tripwire.
External clocks stay windowed — but on the right window. omnibus-balances-square compares against Anchorage, which cannot be pinned to a dropcopy epoch. Its precondition isn't "exchange quiet" — trading doesn't move the omnibus vault — it's "no in-flight custody transfers." Detect that condition (no pending deposits/withdrawals straddling the two reads) instead of inheriting the exchange's downtime schedule, and/or adjust the comparison by the known in-flight set.
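
A sketch of the first refinement, the per-batch delta invariant with a running accumulator. `ZeroSumAccumulator` is hypothetical, the aggregation key is left generic, and amounts use `Decimal` to avoid float rounding; none of this reflects the current fold's data model.

```python
from decimal import Decimal


class ZeroSumAccumulator:
    """Running aggregates that admit only batches whose deltas sum to zero."""

    def __init__(self):
        self.epoch = 0               # last batch folded
        self.by_key = {}             # aggregation key -> running total
        self.total = Decimal("0")    # global sum; zero by induction

    def apply_batch(self, epoch: int, deltas) -> None:
        """deltas: iterable of (key, Decimal) PnL changes for one batch."""
        deltas = list(deltas)
        if epoch <= self.epoch:
            return  # replayed batch: already folded, nothing to do
        batch_sum = sum((d for _, d in deltas), Decimal("0"))
        if batch_sum != 0:
            # Reject before mutating anything, so state stays audited.
            raise ValueError(f"batch {epoch} PnL deltas sum to {batch_sum}, not 0")
        for key, d in deltas:
            self.by_key[key] = self.by_key.get(key, Decimal("0")) + d
        self.total += batch_sum
        self.epoch = epoch


acc = ZeroSumAccumulator()
acc.apply_batch(1, [("alice", Decimal("5.25")), ("bob", Decimal("-5.25"))])
acc.apply_batch(2, [("alice", Decimal("-1")), ("carol", Decimal("1"))])
acc.apply_batch(2, [("alice", Decimal("-1")), ("carol", Decimal("1"))])  # replay
```

Reading `(acc.epoch, acc.total)` is constant-time regardless of history length, and a violating batch is caught at the moment it is folded rather than at the next scheduled scan.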

Splice tie-in

The §3C go/no-go gate — "shadow equals authoritative over a real window" — is the same primitive: diff two stores as of the same epoch while both are written live. Build epoch-pinned comparison once and it serves both the continuous invariant checks and the cutover gate; the readiness-table rows "automated shadow-vs-prod diff gate" and "quiet-window-free invariant checks" are one work item wearing two hats.

6. Summary

A single question (what does it mean for two copies to run at once?) partitions every service into three classes, and each class gets a procedure that preserves the single invariant that matters: exactly one writer of authority, always. Class A leans on the proxy and clean drain; Class B hides the swap in an idle window; Class C, the services that are the exchange's integrity, stages a verified shadow and splices authority at an agreed point in the sequence. The replayability and idempotency the engines already have are precisely what make the singleton case tractable. The remaining work is the splice protocol and the verification gate that turns "looks right" into "proven equal", plus retiring the recon checks' dependence on a quiet exchange (§5), since a world without downtime windows must verify its invariants while moving.

Drafted from the AX codebase as of commit 26203ed. Service classifications and mechanisms cited inline; corrections welcome.