RFC: No-Downtime Upgrades

Date: 2026-06-07

Status: Draft

Scope: all AX runtime services

A unified discipline for upgrading every AX service without interrupting the exchange. We classify services by how they hold state, and give each class a deterministic upgrade procedure. The hard case (services that define financial integrity) is solved by stage & splice: run the candidate in parallel against shadow state, prove it byte-for-byte against the incumbent, then hand off authority at an agreed point in the dropcopy sequence.

1. Principles

Five invariants govern every upgrade, regardless of service class.

  1. One writer per authority. For any piece of authoritative state, exactly one process may write it at any instant. An upgrade changes which process that is; it never transiently creates zero writers (downtime) or two (corruption).
  2. State is a replayable fold. The exchange's truth is a deterministic function of the EP3 dropcopy stream. Resume tokens checkpoint position; idempotent, content-addressed writes (by trade_id / execution_id) make re-processing a no-op. This is what makes a parallel candidate possible: it can independently re-derive the same state from the same stream.
  3. Prove before you cut. A candidate never earns authority by assertion. It runs against shadow tables and is verified equal to the incumbent's output over a real overlapping window. Verification is the gate, not the deploy.
  4. Cutover is a handoff, not a restart. The zero-writer window must have zero length. The incumbent stops at a sequence point; the candidate resumes from that same point. The seam is a number, agreed in advance.
  5. Reversible until the seam. Up to the splice point the candidate has touched only shadow state. Abandoning it is free. Risk is concentrated into a single, observable, pre-rehearsed instant.
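
Invariant 2 can be illustrated with a minimal sketch. The Fill type, its field names, and the in-memory state dict are hypothetical stand-ins, not the trade-engine's actual schema; the point is only that writes keyed by trade_id make an overlapping replay converge on the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    seq: int        # position in the dropcopy stream (hypothetical field name)
    trade_id: str   # content address: the key that makes writes idempotent
    account: str
    qty: int


def fold(stream, state=None):
    """Fold fills into state; re-processing a fill overwrites it with itself."""
    if state is None:
        state = {"trades": {}, "resume_token": -1}
    for fill in stream:
        state["trades"][fill.trade_id] = fill          # keyed write, not an append
        state["resume_token"] = max(state["resume_token"], fill.seq)
    return state


def positions(state):
    """Positions derived from the deduplicated trade set."""
    out = {}
    for fill in state["trades"].values():
        out[fill.account] = out.get(fill.account, 0) + fill.qty
    return out


stream = [
    Fill(0, "t0", "alice", 10),
    Fill(1, "t1", "bob", -10),
    Fill(2, "t2", "alice", 5),
]

incumbent = fold(stream)

# The candidate stops after seq 1, then resumes with an overlap: seq 1 is
# delivered twice. Because the write is keyed, the replay is a no-op.
candidate = fold(stream[:2])
candidate = fold(stream[1:], candidate)
```

Both engines end with identical positions and the same resume token, which is the property a parallel candidate relies on.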

2. Taxonomy of services

Every deployable AX service falls into one of three classes. The class is determined by one question: what does it mean for two copies to run at once?

| Class | Defining property | Two copies at once? | Upgrade strategy |
|---|---|---|---|
| A · Replicable | Stateless-ish request servers; correctness independent of instance count. | Fine; that's the normal operating mode. | Blue-green / canary behind a proxy. |
| B · Intermittent | Scheduled/batch processors; idle between runs. | Avoid by construction: deploy in the idle gap. | Swap in the window. |
| C · Obligate singleton | Owns authoritative financial state; must be the sole writer. | Forbidden for writes; allowed for a verified shadow. | Stage & splice. |

Where each service lands

| Service | Class | Why |
|---|---|---|
| order-gateway | A | Per-connection order entry; in-memory state is per-session and already has graceful drain (cancel_on_disconnect/shutdown.rs). |
| api-gateway | A | HTTP request server; reads Postgres/ClickHouse. Balance mutations are guarded by pg_advisory_xact_lock, so concurrent instances are safe. |
| onboarding-gateway | A | Pure HTTP API over Postgres. |
| marketdata-publisher | A | Broadcasts EP3 market data; keeps no durable cursor (always reconnects fresh). Clients tolerate reconnect. |
| settlement-engine | B | Cron daemon (FX/equities/futures settlement); date-keyed, runs on a schedule. |
| recon-engine | B | Reconciliation checks; read-only, scheduled or ad-hoc. |
| index-publisher | B | Fixed-interval index price publish; a skipped tick is recoverable via backfill. |
| trade-engine | C | Folds EP3 dropcopy into authoritative trades/positions; the canonical singleton. |
| risk-engine | C | Authoritative margin/buying-power state machine; gates order admission. |
| risk-monitor | C | Owns breach/limit state and fires enforcement; a second copy would double-enforce. |

Topology caveat. Today each environment is a single EC2 box running all services under one compose.yml behind one nginx. "Blue-green" here means two processes on the same host on different ports, with nginx shifting upstreams, not two hosts. The schemas below assume this; multi-host LB is a future extension, not a prerequisite.

3. Upgrade schema per taxon

Class A · Replicable: blue-green / canary

Correctness does not depend on instance count, so we lean on the proxy. The only real work is clean drain: in-flight requests and per-connection obligations (e.g. cancel-on-disconnect) must finish before the old process exits.

  1. Start green. Launch the new version on a second port. Gate readiness on its own /health (and, for order-gateway, a live EP3 dropcopy).
  2. Shift. Reload nginx to point new connections at green. Optionally weight (canary): a fraction first, watch heartbeats/error rates, then 100%.
  3. Drain blue. Stop routing new traffic to blue; let it finish in-flight requests and CoD obligations within the drain budget, then exit. Reuse the existing order-gateway drain loop as the template for all Class-A services.
  4. Done. Blue exits cleanly; green is sole serving version. No request was dropped.

Prerequisite to close: every Class-A service must honor SIGTERM with graceful shutdown. Today only order-gateway does; the rest rely on the docker stop timeout. Generalize the drain pattern before claiming no-downtime for A.
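
A generalized drain pattern could look roughly like the sketch below. DrainGate and install_sigterm are hypothetical names, not the existing order-gateway drain loop; the sketch only shows the shape: refuse new work once SIGTERM arrives, let in-flight work finish, and give up when the drain budget is exhausted.

```python
import signal
import threading
import time


class DrainGate:
    """Tracks in-flight requests and refuses new ones once draining starts."""

    def __init__(self):
        self._idle = threading.Condition(threading.Lock())
        self._in_flight = 0
        self._draining = False

    def try_enter(self) -> bool:
        """Call at the start of a request; False means 'reject, we are draining'."""
        with self._idle:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Call when a request (or a per-connection obligation) completes."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def begin_drain(self, *_signal_args) -> None:
        # Plain attribute write, so it is safe to call from a signal handler.
        self._draining = True

    def wait_drained(self, budget_s: float) -> bool:
        """Block until in-flight work reaches zero or the drain budget expires."""
        deadline = time.monotonic() + budget_s
        with self._idle:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._idle.wait(remaining)
            return True


def install_sigterm(gate: DrainGate) -> None:
    """Route SIGTERM to the gate. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, gate.begin_drain)
```

A service's main loop would call install_sigterm(gate) once at startup, wrap each request in try_enter()/leave(), and exit only after wait_drained(budget) returns. The budget would need to sit inside the docker stop timeout for the drain to be honored.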

Class B · Intermittent: swap in the window

These don't serve continuous traffic, so there is a window where downtime disturbs nothing. The upgrade is choosing that window.

  1. Find the gap. Compute the interval to the next scheduled fire from the shared scheduler (cron + tz). It must comfortably exceed deploy time.
  2. Quiesce. Confirm no run is in progress (a mid-run swap is a Class-C problem in disguise; never interrupt an in-flight settlement). If one is active, wait for completion.
  3. Swap. Replace the binary while idle. No traffic shaping needed.
  4. Verify next run. The next scheduled fire executes on the new version; assert its output (recon checks, settlement artifacts) before considering the upgrade landed.
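
Steps 1 and 2 reduce to a small predicate. The sketch below assumes a simple daily schedule expressed as wall-clock fire times; the function names, the example times, and the safety factor of 3 are illustrative, and a real implementation would read the schedule from the shared scheduler rather than hard-code it.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_fire(now: datetime, fire_times: list[time]) -> datetime:
    """Next scheduled fire strictly after `now`, in `now`'s own timezone."""
    candidates = []
    for day_offset in (0, 1):  # today's remaining fires, then tomorrow's
        day = (now + timedelta(days=day_offset)).date()
        for t in fire_times:
            candidates.append(datetime.combine(day, t, tzinfo=now.tzinfo))
    return min(c for c in candidates if c > now)


def safe_to_swap(now, fire_times, deploy_time, run_in_progress, safety_factor=3.0):
    """True only if nothing is running and the idle gap comfortably exceeds deploy time."""
    if run_in_progress:
        return False  # never interrupt an in-flight run; wait for completion
    gap = next_daily_fire(now, fire_times) - now
    return gap >= deploy_time * safety_factor


# Hypothetical schedule: two fires per day, 16:05 and 16:20.
fires = [time(16, 5), time(16, 20)]
deploy = timedelta(minutes=5)

between_fires = datetime(2026, 6, 7, 16, 10, tzinfo=timezone.utc)
after_last_fire = datetime(2026, 6, 7, 16, 30, tzinfo=timezone.utc)
```

With these inputs the ten-minute gap between the two fires is rejected (it is less than three deploy times), while the overnight gap after the last fire is accepted. UTC is used here only to keep the sketch self-contained; a tz-aware schedule would pass a zoneinfo.ZoneInfo instead.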

Jobs must be idempotent per period (date-/window-keyed) so a swap that straddles a boundary, or a re-run, cannot double-apply. This already holds for settlement (date-keyed) and index-publisher (backfillable).

Class C · Obligate singleton: stage & splice

These define exchange financial integrity and admit exactly one writer. We cannot run two; we cannot afford a gap. The resolution is to let a candidate shadow the incumbent (derive the same state from the same stream into separate tables), verify it, then hand off authority at an agreed sequence point. This is the heart of the RFC.

It works because Class-C services are shaped as EP3 dropcopy → fold → write state, and that fold is replayable and idempotent (resume tokens checkpoint the stream; writes are keyed by trade_id/execution_id). The candidate can independently reconstruct authoritative state and be compared, bit for bit, against the incumbent.
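
One way to make "compared, bit for bit" concrete is a digest over canonically serialized rows, cut at a common sequence number. This is a sketch under assumptions: the row shape, the seq column, and the function names are hypothetical, and the existing admin-cli reconcile may work differently.

```python
import hashlib
import json


def canonical(row: dict) -> bytes:
    """Stable byte encoding: sorted keys, no whitespace."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def state_digest(rows, as_of_seq: int) -> str:
    """SHA-256 over all rows at or before as_of_seq, ordered by trade_id."""
    h = hashlib.sha256()
    window = [r for r in rows if r["seq"] <= as_of_seq]
    for row in sorted(window, key=lambda r: r["trade_id"]):
        h.update(canonical(row))
        h.update(b"\n")
    return h.hexdigest()


def diff_keys(incumbent_rows, shadow_rows, as_of_seq: int):
    """trade_ids whose rows differ (or exist on one side only) within the window."""
    a = {r["trade_id"]: canonical(r) for r in incumbent_rows if r["seq"] <= as_of_seq}
    b = {r["trade_id"]: canonical(r) for r in shadow_rows if r["seq"] <= as_of_seq}
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


incumbent_rows = [
    {"seq": 0, "trade_id": "t0", "account": "alice", "qty": 10},
    {"seq": 1, "trade_id": "t1", "account": "bob", "qty": -10},
]
# The shadow wrote in a different order and is already one event ahead.
shadow_rows = [
    {"seq": 1, "trade_id": "t1", "account": "bob", "qty": -10},
    {"seq": 0, "trade_id": "t0", "account": "alice", "qty": 10},
    {"seq": 2, "trade_id": "t2", "account": "alice", "qty": 5},
]
# A candidate with a bug in how it folds t1.
buggy_rows = [dict(r, qty=-9) if r["trade_id"] == "t1" else r for r in shadow_rows]
```

Comparing as of seq 1 ignores the shadow's extra row and yields equal digests; the buggy candidate is caught, and diff_keys names the offending trade_id. Integer quantities are used deliberately, since float formatting would undermine a byte-level comparison.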

Stage

Synchronize & verify

Splice

Once shadow is proven, pick a splice point (a resume token S in the dropcopy sequence) and execute a coordinated handoff:

incumbent:  … process(S-1) ; flush+commit through S-1 ; STOP (do not consume S)
                                          │
                                   splice point S
                                          │
candidate:  resume_token = S-1  →  receives batch S onward  →  becomes authority

The seam is the whole game. Zero-writer windows mean downtime; two-writer windows mean corruption. The handoff must be a single owned transition with a pre-agreed sequence number and a durable stop marker, never "stop the old one, then go start the new one and hope." Rehearse it against ax-demo before prod.
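
The handoff can be simulated in a few lines. Everything here is hypothetical (SpliceMarker, Engine, the in-memory authoritative list); no such protocol exists in the codebase yet. The sketch shows the two guards that matter: the incumbent stops before consuming S and records that it has done so, and the candidate refuses authority unless that marker is set and its own token sits at S-1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpliceMarker:
    """Stand-in for a durable record; in-memory here for illustration only."""
    splice_at: Optional[int] = None   # the agreed sequence number S
    incumbent_stopped: bool = False   # set once the incumbent committed through S-1


class Engine:
    def __init__(self, name: str, resume_token: int = -1):
        self.name = name
        self.resume_token = resume_token  # last batch this engine has folded


def run_incumbent(engine, batches, marker, authoritative):
    for seq, payload in batches:
        if seq <= engine.resume_token:
            continue
        if marker.splice_at is not None and seq >= marker.splice_at:
            marker.incumbent_stopped = True  # committed through S-1; do not consume S
            return
        authoritative.append((seq, engine.name, payload))
        engine.resume_token = seq


def take_authority(engine, batches, marker, authoritative):
    if marker.splice_at is None or not marker.incumbent_stopped:
        raise RuntimeError("incumbent has not stopped; refusing to create two writers")
    if engine.resume_token != marker.splice_at - 1:
        raise RuntimeError("candidate is not positioned at S-1; refusing to leave a gap")
    for seq, payload in batches:
        if seq <= engine.resume_token:
            continue
        authoritative.append((seq, engine.name, payload))
        engine.resume_token = seq


batches = [(i, f"batch-{i}") for i in range(6)]
marker = SpliceMarker(splice_at=3)
authoritative = []

incumbent = Engine("incumbent")
candidate = Engine("candidate", resume_token=2)  # shadow already folded through S-1

run_incumbent(incumbent, batches, marker, authoritative)
take_authority(candidate, batches, marker, authoritative)
```

Every batch lands in the authoritative log exactly once: 0 through 2 from the incumbent, 3 through 5 from the candidate. A real protocol would also need the marker to survive a crash of either side, which this in-memory version does not attempt.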

Validate the full lifecycle before rollout (per CLAUDE.md): client disconnect, server-initiated stop, crash mid-batch, and reconnect/recovery for both incumbent-stop and candidate-resume paths. The three partial-write states (trades-only / trades+positions / +token) must each recover correctly across the splice, exactly as they do across a crash.
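
A rehearsal of the three partial-write states might be structured like the sketch below. It assumes a batch is committed in three steps (trades, then positions, then token) and that positions are recomputed from the keyed trade set; both are simplifications for illustration, and the real trade-engine write path may differ. Each crash point is followed by a restart that replays from the stored token, which is the same path a candidate takes when it resumes across the splice.

```python
class Crash(Exception):
    """Simulated process death between write steps."""


def new_store():
    return {"trades": {}, "positions": {}, "resume_token": -1}


def recompute_positions(trades):
    positions = {}
    for t in trades.values():
        positions[t["account"]] = positions.get(t["account"], 0) + t["qty"]
    return positions


def process_batch(store, seq, fills, crash_after=None):
    for f in fills:
        store["trades"][f["trade_id"]] = f          # step 1: keyed trade writes
    if crash_after == "trades":
        raise Crash
    store["positions"] = recompute_positions(store["trades"])  # step 2
    if crash_after == "positions":
        raise Crash
    store["resume_token"] = seq                     # step 3: checkpoint
    if crash_after == "token":
        raise Crash


def run(store, batches, crash_at=None):
    """Process every batch after the stored token. crash_at = (seq, step)."""
    for seq, fills in batches:
        if seq <= store["resume_token"]:
            continue
        step = crash_at[1] if crash_at and crash_at[0] == seq else None
        process_batch(store, seq, fills, crash_after=step)


BATCHES = [
    (0, [{"trade_id": "t0", "account": "alice", "qty": 10}]),
    (1, [{"trade_id": "t1", "account": "bob", "qty": -10},
         {"trade_id": "t2", "account": "alice", "qty": 5}]),
    (2, [{"trade_id": "t3", "account": "bob", "qty": 4}]),
]


def recovered_state(step):
    """Crash during batch 1 at the given step, then restart and replay."""
    store = new_store()
    try:
        run(store, BATCHES, crash_at=(1, step))
    except Crash:
        pass
    run(store, BATCHES)
    return store


clean = new_store()
run(clean, BATCHES)
```

For each of the three crash points, recovered_state(step) should equal the clean run; a rehearsal against ax-demo would assert the same equality on real tables.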

4. State of readiness

| Capability | Status | Evidence / gap |
|---|---|---|
| Replayable stream + resume tokens | Have | EP3 dropcopy; trade_engine.resume_tokens; documented semantics. |
| Idempotent, content-addressed writes | Have | Three-state replay recovery in trade-engine2; dedup by trade_id/execution_id. |
| Shadow storage config | Have | SHADOW_* Postgres; per-service resume tokens; admin-cli reconcile targets shadow. |
| Parallel candidate engines | Have | trade-engine2, risk-engine2 exist; gold image ships all binaries. |
| Graceful drain (Class A) | Partial | Only order-gateway; generalize SIGTERM drain to all A services. |
| Proxy traffic shaping | Partial | nginx reload works; no weighted upstreams / upstream health checks yet. |
| Coordinated splice protocol | Missing | No "stop-at-S" handshake or durable stop marker. The key build item. |
| Automated shadow-vs-prod diff gate | Missing | Reconcile exists; wire it into a go/no-go cutover gate. |
| Quiet-window-free invariant checks | Missing | Four recon checks are pinned to the daily downtime window; need epoch-pinned as-of comparison (§5). |

5. Blocker: invariant checks that assume a quiet exchange

This RFC removes the exchange's need for downtime, but some of our verification currently depends on it. Four recon-engine checks default to the scheduled daily downtime window (0 5,20 16 * * * US/Eastern) precisely because they cannot produce a trustworthy answer while the exchange is moving:

| Check | Compares | Why it needs quiet |
|---|---|---|
| transactions-balance | PG balances vs CH transactions sums | The two stores are written at different times; no common clock. |
| position-realized-pnls-balance | CH positions vs CH transactions | Same store, but the two tables are inserted at different times. |
| omnibus-balances-square | Anchorage custodian vs PG balances and CH PnL | Three sources, one of them an external party. |
| pnl-is-zero-sum | realized vs unrealized PnL | Already pinned to an atomic mark cycle (#1693); kept in the window for full-scan cost. |

In a continuously-deployed, continuously-running world there is no moment where "the exchange is quiet" is guaranteed. Either these invariants stop being checked (unacceptable: they are the financial-integrity tripwires that make aggressive deployment safe), or they must become valid while the exchange runs. Worse, the splice gate of §3C needs exactly this capability: proving shadow equals authoritative is a cross-store comparison performed while both engines are processing live flow. Treat this as a blocker for retiring the downtime window, and as a prerequisite for the Class-C cutover gate.

Why quiet currently substitutes for synchronization

Each check samples two (or three) stores independently. A live exchange means the sample on one side includes events the other side hasn't absorbed yet: the comparison is between two different points in the dropcopy sequence, and any nonzero diff is uninterpretable (bug, or just skew?). A quiet exchange freezes the sequence, making "whenever you happen to read" a consistent cut. Quiescence is a poor man's snapshot barrier.

Architecture: epoch-pinned cuts instead of quiescence

The system already has the logical clock these checks lack (the dropcopy sequence), and the trade-engine fold already exposes it transactionally:

The generalization: a check reads (aggregates, E) from PG in one repeatable-read transaction, waits out the CH visibility horizon, queries CH as of E, and compares. Both sides now describe the same point in the sequence: exact at any throughput, no quiet window, runnable every five minutes like the cheap checks. This converts transactions-balance and position-realized-pnls-balance (and the gate of §3C) outright.
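
The difference between a naive sample and an epoch-pinned cut shows up even in a toy model. The Exchange class below is entirely hypothetical: PG absorbs each event immediately together with its epoch, while CH sees it a fixed number of events later. No real database is involved; the sketch only demonstrates why pinning both reads to E removes the skew.

```python
from collections import defaultdict


class Exchange:
    """Toy model: PG folds each event at once; CH absorbs it `lag` events later."""

    def __init__(self, lag: int = 2):
        self.lag = lag
        self.events = []                     # (seq, account, amount)
        self.pg_balances = defaultdict(int)
        self.pg_epoch = -1                   # last seq folded into PG
        self.ch_rows = []                    # rows currently visible in CH

    def emit(self, account: str, amount: int) -> None:
        seq = len(self.events)
        self.events.append((seq, account, amount))
        self.pg_balances[account] += amount  # balance and epoch move together
        self.pg_epoch = seq
        visible = seq - self.lag
        if visible >= 0:
            self.ch_rows.append(self.events[visible])

    def pg_snapshot(self):
        """Stands in for one repeatable-read transaction returning (aggregates, E)."""
        return dict(self.pg_balances), self.pg_epoch

    def ch_visible_through(self) -> int:
        return self.ch_rows[-1][0] if self.ch_rows else -1

    def ch_sums(self, as_of=None):
        sums = defaultdict(int)
        for seq, account, amount in self.ch_rows:
            if as_of is None or seq <= as_of:
                sums[account] += amount
        return dict(sums)


def naive_check(ex: Exchange) -> bool:
    """Samples both stores 'whenever'; only trustworthy on a quiet exchange."""
    return ex.pg_snapshot()[0] == ex.ch_sums()


def epoch_pinned_check(ex: Exchange, keep_trading):
    balances, epoch = ex.pg_snapshot()
    while ex.ch_visible_through() < epoch:   # wait out the visibility horizon
        keep_trading()                       # the exchange keeps moving meanwhile
    return balances == ex.ch_sums(as_of=epoch), epoch


ex = Exchange(lag=2)
ex.emit("alice", 10)
ex.emit("bob", -10)
ex.emit("alice", 5)

naive_ok = naive_check(ex)
pinned_ok, pinned_epoch = epoch_pinned_check(ex, keep_trading=lambda: ex.emit("carol", 7))
```

The naive check fails purely from skew, while the pinned check passes at E = 2 even though new events arrived during the wait. In a real deployment the wait would be bounded by a timeout, and the CH query would filter on whatever column carries the dropcopy sequence.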

Two refinements ride along:

Splice tie-in

The §3C go/no-go gate ("shadow equals authoritative over a real window") is the same primitive: diff two stores as of the same epoch while both are written live. Build epoch-pinned comparison once and it serves both the continuous invariant checks and the cutover gate; the readiness-table rows "automated shadow-vs-prod diff gate" and "quiet-window-free invariant checks" are one work item wearing two hats.

6. Summary

One question (what does it mean for two copies to run at once?) partitions every service into three classes, and each class gets a procedure that preserves the single invariant that matters: exactly one writer of authority, always. Class A leans on the proxy and clean drain; Class B hides the swap in an idle window; Class C (the services that are the exchange's integrity) stages a verified shadow and splices authority at an agreed point in the sequence. The replayability and idempotency the engines already have are precisely what make the singleton case tractable. The remaining work is the splice protocol and the verification gate that turns "looks right" into "proven equal", plus retiring the recon checks' dependence on a quiet exchange (§5), since a world without downtime windows must verify its invariants while moving.


Drafted from the AX codebase as of commit 26203ed. Service classifications and mechanisms cited inline; corrections welcome.