Date: 2026-06-07
Status: Draft
Scope: all AX runtime services
A unified discipline for upgrading every AX service without interrupting the exchange. We classify services by how they hold state, and give each class a deterministic upgrade procedure. The hard case — services that define financial integrity — is solved by stage & splice: run the candidate in parallel against shadow state, prove it byte-for-byte against the incumbent, then hand off authority at an agreed point in the dropcopy sequence.
Five invariants govern every upgrade, regardless of service class.
trade_id / execution_id) make re-processing a no-op. This is what makes a parallel candidate possible: it can independently re-derive the same state from the same stream.
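A minimal sketch of why keyed writes make replay safe. The `Execution` record, the `fold` function, and the in-memory dicts standing in for the trades and positions tables are all hypothetical illustrations, not the engine's actual schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    trade_id: str
    execution_id: str
    account: str
    qty: int  # signed: positive buys, negative sells


def fold(trades: dict, positions: dict, batch: list) -> int:
    """Apply a batch; return how many executions were new."""
    applied = 0
    for ex in batch:
        key = (ex.trade_id, ex.execution_id)
        if key in trades:
            continue  # already processed: re-delivery is a no-op
        trades[key] = ex
        positions[ex.account] = positions.get(ex.account, 0) + ex.qty
        applied += 1
    return applied


stream = [
    Execution("t1", "e1", "alice", 10),
    Execution("t1", "e2", "bob", -10),
    Execution("t2", "e1", "alice", -4),
]

# Incumbent folds the stream, then sees the whole stream again.
incumbent_trades, incumbent_positions = {}, {}
fold(incumbent_trades, incumbent_positions, stream)
replayed = fold(incumbent_trades, incumbent_positions, stream)

# A candidate folding the same stream independently reaches the same state.
candidate_trades, candidate_positions = {}, {}
fold(candidate_trades, candidate_positions, stream)
```

The second pass over the stream applies nothing, and the candidate's tables equal the incumbent's, which is the property a shadow comparison relies on.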
Every deployable AX service falls into one of three classes. The class is determined by one question: what does it mean for two copies to run at once?
| Class | Defining property | Two copies at once? | Upgrade strategy |
|---|---|---|---|
| A · Replicable | Stateless-ish request servers; correctness independent of instance count. | Fine — that's the normal operating mode. | Blue-green / canary behind a proxy. |
| B · Intermittent | Scheduled/batch processors; idle between runs. | Avoid by construction — deploy in the idle gap. | Swap in the window. |
| C · Obligate singleton | Owns authoritative financial state; must be the sole writer. | Forbidden for writes; allowed for a verified shadow. | Stage & splice. |
| Service | Class | Why |
|---|---|---|
| order-gateway | A | Per-connection order entry; in-memory state is per-session and already has graceful drain (cancel_on_disconnect/shutdown.rs). |
| api-gateway | A | HTTP request server; reads Postgres/ClickHouse. Balance mutations are guarded by pg_advisory_xact_lock, so concurrent instances are safe. |
| onboarding-gateway | A | Pure HTTP API over Postgres. |
| marketdata-publisher | A | Broadcasts EP3 market data; keeps no durable cursor (always reconnects fresh). Clients tolerate reconnect. |
| settlement-engine | B | Cron daemon (FX/equities/futures settlement); date-keyed, runs on a schedule. |
| recon-engine | B | Reconciliation checks; read-only, scheduled or ad-hoc. |
| index-publisher | B | Fixed-interval index price publish; a skipped tick is recoverable via backfill. |
| trade-engine | C | Folds EP3 dropcopy into authoritative trades/positions; the canonical singleton. |
| risk-engine | C | Authoritative margin/buying-power state machine; gates order admission. |
| risk-monitor | C | Owns breach/limit state and fires enforcement; a second copy would double-enforce. |
Topology caveat. Today each environment is a single EC2 box running all services under one compose.yml behind one nginx. "Blue-green" here means two processes on the same host on different ports, with nginx shifting upstreams, not two hosts. The schemas below assume this; multi-host LB is a future extension, not a prerequisite.
Correctness does not depend on instance count, so we lean on the proxy. The only real work is clean drain: in-flight requests and per-connection obligations (e.g. cancel-on-disconnect) must finish before the old process exits.
/health (and, for order-gateway, a live EP3 dropcopy).
order-gateway drain loop as the template for all Class-A services.
Prerequisite to close: every Class-A service must honor SIGTERM with graceful shutdown. Today only order-gateway does; the rest rely on the docker stop timeout. Generalize the drain pattern before claiming no-downtime for A.
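The drain requirement can be sketched as a small controller that refuses new work once shutdown is requested and lets in-flight work finish. This is an illustrative pattern only: `DrainController` and `install_sigterm_handler` are hypothetical names, and the existing order-gateway drain loop is not reproduced here.

```python
import signal
import threading


class DrainController:
    """Tracks in-flight work and a shutdown flag (hypothetical helper)."""

    def __init__(self):
        # Condition() wraps an RLock, so the handler may re-enter on the main thread.
        self._cond = threading.Condition()
        self._in_flight = 0
        self._stopping = False

    def request_shutdown(self, *_signal_args):
        """Usable directly as a signal handler (receives signum, frame)."""
        with self._cond:
            self._stopping = True
            self._cond.notify_all()

    def try_begin(self) -> bool:
        """Admit a request unless shutdown has started."""
        with self._cond:
            if self._stopping:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_drained(self, timeout: float) -> bool:
        """Block until shutdown was requested and no work remains, or timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._stopping and self._in_flight == 0, timeout
            )


def install_sigterm_handler(ctrl: DrainController) -> None:
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, ctrl.request_shutdown)
```

A service would call `install_sigterm_handler` at startup, wrap each request in `try_begin()` / `end()`, and exit only after `wait_drained` returns true, with the timeout set below the container stop timeout.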
These don't serve continuous traffic, so there is a window where downtime disturbs nothing. The upgrade is choosing that window.
scheduler (cron + tz). It must comfortably exceed deploy time.
Jobs must be idempotent per period (date-/window-keyed) so a swap that straddles a boundary, or a re-run, cannot double-apply. This already holds for settlement (date-keyed) and index-publisher (backfillable).
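Both Class-B conditions can be sketched in a few lines: a check that the idle gap comfortably exceeds deploy time, and a per-period ledger that makes a re-run a no-op. The names `fits_idle_window` and `PeriodLedger`, and the safety factor of 3, are assumptions chosen for illustration; the RFC does not specify a margin.

```python
from datetime import date, datetime, timedelta


def fits_idle_window(now: datetime, next_run: datetime,
                     deploy_time: timedelta, safety_factor: float = 3.0) -> bool:
    """True if the gap to the next scheduled run comfortably exceeds deploy time."""
    return (next_run - now) >= deploy_time * safety_factor


class PeriodLedger:
    """Records which periods were applied, so a re-run cannot double-apply."""

    def __init__(self):
        self.applied = {}  # period key -> job result

    def run_once(self, period: date, job) -> bool:
        if period in self.applied:
            return False  # already applied for this period: no-op
        self.applied[period] = job(period)
        return True
```

In a real service the ledger would live in durable storage and be written in the same transaction as the job's effects; the in-memory dict only shows the keying.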
These define exchange financial integrity and admit exactly one writer. We cannot run two; we cannot afford a gap. The resolution is to let a candidate shadow — derive the same state from the same stream into separate tables — verify it, then hand off authority at an agreed sequence point. This is the heart of the RFC.
It works because Class-C services are shaped as EP3 dropcopy → fold → write state, and that fold is replayable and idempotent (resume tokens checkpoint the stream; writes are keyed by trade_id/execution_id). The candidate can independently reconstruct authoritative state and be compared, bit for bit, against the incumbent.
trade-engine2 takes SHADOW_* Postgres for balances and resume tokens, falling back to prod only when unset (config.rs). It opens its own dropcopy subscription with its own resume token (service_id in trade_engine.resume_tokens).
latest_execution_with_token_ns tracks the incumbent within tolerance.
trades, positions, balances. Equality over a real window is the gate. (Reuse admin-cli reconcile, which already targets the shadow DB.)
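The equality gate can be sketched as an order-independent digest per table, compared between prod and shadow. This is an assumption-laden illustration: `digest`, `shadow_gate`, and `cutover_allowed` are hypothetical names, rows are plain dicts, and amounts are carried as strings or integers so that serialization is exact. It does not describe how admin-cli reconcile works.

```python
import hashlib
import json


def digest(rows: list) -> str:
    """Canonical SHA-256 over rows, independent of row order and key order."""
    canonical = sorted(
        json.dumps(r, sort_keys=True, separators=(",", ":")) for r in rows
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def shadow_gate(prod: dict, shadow: dict,
                tables=("trades", "positions", "balances")) -> dict:
    """Return {table: matches} for each compared table."""
    return {t: digest(prod[t]) == digest(shadow[t]) for t in tables}


def cutover_allowed(prod: dict, shadow: dict) -> bool:
    """Go/no-go: every compared table must match exactly."""
    return all(shadow_gate(prod, shadow).values())
```

A per-table result is more useful than a single boolean during rehearsal, because it points at which fold output diverged.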
Once shadow is proven, pick a splice point (a resume token S in the dropcopy sequence) and execute a coordinated handoff:
incumbent: … process(S-1) ; flush+commit through S-1 ; STOP (do not consume S)
│
splice point S
│
candidate: resume_token = S-1 → receives batch S onward → becomes authority
S: it commits everything up to and including the batch ending at S-1, persists its checkpoint, and exits. It must publish a durable "stopped at S-1" marker.
S-1 (recall EP3 redelivers the batch after the supplied token; see resume-token semantics). Idempotent writes make any one-batch overlap a no-op, so the seam is exact with no gap and no double-count.
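The handoff above can be simulated in memory to show the two properties that matter: the candidate refuses authority until the stop marker confirms S-1, and an overlapping batch changes nothing. Everything here is a hypothetical sketch: sequence numbers are plain integers, a single `Store` stands in for the authoritative tables plus the marker, and none of these names exist in the codebase (the readiness table lists the protocol as missing).

```python
class Store:
    """In-memory stand-in for the authoritative tables plus the stop marker."""

    def __init__(self):
        self.trades = {}         # (trade_id, execution_id) -> qty
        self.stop_marker = None  # last sequence the incumbent committed


def apply_batch(store: Store, batch: list) -> None:
    for trade_id, execution_id, qty in batch:
        # Keyed write: a second delivery of the same execution is a no-op.
        store.trades.setdefault((trade_id, execution_id), qty)


def run_incumbent(store: Store, stream: dict, splice: int) -> None:
    """Commit every batch below `splice`, publish the marker, then stop."""
    for seq in sorted(stream):
        if seq >= splice:
            break  # do not consume S
        apply_batch(store, stream[seq])
    store.stop_marker = splice - 1


def run_candidate(store: Store, stream: dict, splice: int, resume_token: int) -> None:
    """Take authority only after the incumbent's marker confirms S-1."""
    if store.stop_marker != splice - 1:
        raise RuntimeError("incumbent has not stopped at S-1; refusing authority")
    for seq in sorted(stream):
        if seq > resume_token:  # delivery starts with the batch after the token
            apply_batch(store, stream[seq])
```

Resuming one batch early (a token of S-2) re-delivers batch S-1, which the keyed writes absorb, so the seam tolerates overlap but never a gap.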
The seam is the whole game. Zero-writer windows mean downtime; two-writer windows mean corruption. The handoff must be a single owned transition with a pre-agreed sequence number and a durable stop marker — never "stop the old one, then go start the new one and hope." Rehearse it against ax-demo before prod.
Validate the full lifecycle before rollout (per CLAUDE.md): client disconnect, server-initiated stop, crash mid-batch, and reconnect/recovery — for both incumbent-stop and candidate-resume paths. The three partial-write states (trades-only / trades+positions / +token) must each recover correctly across the splice, exactly as they do across a crash.
| Capability | Status | Evidence / gap |
|---|---|---|
| Replayable stream + resume tokens | ✅ Have | EP3 dropcopy; trade_engine.resume_tokens; documented semantics. |
| Idempotent, content-addressed writes | ✅ Have | Three-state replay recovery in trade-engine2; dedup by trade_id/execution_id. |
| Shadow storage config | ✅ Have | SHADOW_* Postgres; per-service resume tokens; admin-cli reconcile targets shadow. |
| Parallel candidate engines | ✅ Have | trade-engine2, risk-engine2 exist; gold image ships all binaries. |
| Graceful drain (Class A) | ◐ Partial | Only order-gateway; generalize SIGTERM drain to all A services. |
| Proxy traffic shaping | ◐ Partial | nginx reload works; no weighted upstreams / upstream health checks yet. |
| Coordinated splice protocol | ✗ Missing | No "stop-at-S" handshake or durable stop marker. The key build item. |
| Automated shadow-vs-prod diff gate | ✗ Missing | Reconcile exists; wire it into a go/no-go cutover gate. |
| Quiet-window-free invariant checks | ✗ Missing | Four recon checks are pinned to the daily downtime window; need epoch-pinned as-of comparison (§5). |
This RFC removes the exchange's need for downtime, but some of our verification currently depends on it. Four recon-engine checks default to the scheduled daily downtime window (0 5,20 16 * * * US/Eastern) precisely because they cannot produce a trustworthy answer while the exchange is moving:
| Check | Compares | Why it needs quiet |
|---|---|---|
| transactions-balance | PG balances vs CH transactions sums | The two stores are written at different times; no common clock. |
| position-realized-pnls-balance | CH positions vs CH transactions | Same store, but the two tables are inserted at different times. |
| omnibus-balances-square | Anchorage custodian vs PG balances − CH PnL | Three sources, one of them an external party. |
| pnl-is-zero-sum | realized vs unrealized PnL | Already pinned to an atomic mark cycle (#1693); kept in the window for full-scan cost. |
In a continuously-deployed, continuously-running world there is no moment where "the exchange is quiet" is guaranteed. Either these invariants stop being checked (unacceptable — they are the financial-integrity tripwires that make aggressive deployment safe), or they must become valid while the exchange runs. Worse, the splice gate of §3C needs exactly this capability: proving shadow equals authoritative is a cross-store comparison performed while both engines are processing live flow. Treat this as a blocker for retiring the downtime window, and as a prerequisite for the Class-C cutover gate.
Each check samples two (or three) stores independently. A live exchange means the sample on one side includes events the other side hasn't absorbed yet — the comparison is between two different points in the dropcopy sequence, and any nonzero diff is uninterpretable (bug, or just skew?). A quiet exchange freezes the sequence, making "whenever you happen to read" a consistent cut. Quiescence is a poor man's snapshot barrier.
The system already has the logical clock these checks lack — the dropcopy sequence — and the trade-engine fold already exposes it transactionally:
balances, transactions, and the resume token (with latest_execution_ns) commit atomically in one PG transaction (the tandem commit in state_machine_driver.rs). Any PG snapshot is therefore a consistent cut at a known epoch E.
E in PG implies CH has all rows through E, modulo async-insert visibility.
query_sum_by_transaction_type_as_of); pnl-is-zero-sum already pins both of its sides to a single atomic mark-cycle timestamp: the pattern works in production today, within one store.
The generalization: a check reads (aggregates, E) from PG in one repeatable-read transaction, waits out the CH visibility horizon, queries CH as of E, and compares. Both sides now describe the same point in the sequence: exact at any throughput, no quiet window, runnable every five minutes like the cheap checks. This converts transactions-balance and position-realized-pnls-balance (and the gate of §3C) outright.
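The epoch-pinned read can be sketched with two in-memory stand-ins. `PgSim`, `ChSim`, and `epoch_pinned_check` are hypothetical names used only to show the shape of the comparison; the real check would issue SQL against Postgres and ClickHouse, and the visibility horizon value is an assumption.

```python
import time


class PgSim:
    """Stand-in for Postgres: the aggregate and the epoch commit together."""

    def __init__(self):
        self.balance_total = 0
        self.epoch = 0

    def commit(self, epoch: int, delta: int) -> None:
        self.balance_total += delta
        self.epoch = epoch

    def snapshot(self):
        # Models one repeatable-read transaction: both values from the same cut.
        return self.balance_total, self.epoch


class ChSim:
    """Stand-in for ClickHouse: append-only rows tagged with their epoch."""

    def __init__(self):
        self.rows = []  # (epoch, amount)

    def insert(self, epoch: int, amount: int) -> None:
        self.rows.append((epoch, amount))

    def sum_as_of(self, epoch: int) -> int:
        return sum(amount for e, amount in self.rows if e <= epoch)


def epoch_pinned_check(pg: PgSim, ch: ChSim, visibility_horizon_s: float = 0.0) -> bool:
    pg_total, epoch = pg.snapshot()
    time.sleep(visibility_horizon_s)  # wait out async-insert visibility
    return pg_total == ch.sum_as_of(epoch)
```

With ClickHouse one epoch ahead of the Postgres snapshot, a naive sum over all rows disagrees with Postgres while the as-of-E sum agrees, which is the skew the quiet window used to hide.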
Two refinements ride along:
pnl-is-zero-sum) gets worse forever as history grows. Have the fold (or a SummingMergeTree materialized view keyed by epoch) maintain running per-type aggregates, so a check is an O(1) read of two accumulator rows at epoch E. Stronger still: assert the per-batch delta invariant in the fold itself (each batch's PnL deltas sum to zero); the global invariant then holds by induction from one audited base, and the periodic scan becomes a belt-and-suspenders audit rather than the primary tripwire.
omnibus-balances-square compares against Anchorage, which cannot be pinned to a dropcopy epoch. Its precondition isn't "exchange quiet" (trading doesn't move the omnibus vault); it's "no in-flight custody transfers." Detect that condition (no pending deposits/withdrawals straddling the two reads) instead of inheriting the exchange's downtime schedule, and/or adjust the comparison by the known in-flight set.
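Both refinements can be sketched together. The first part keeps running per-type totals and rejects any batch whose deltas do not net to zero; the second adjusts the custodian comparison by a known in-flight set, or skips when adjustment is disabled. All names (`PnlAccumulator`, `Transfer`, `omnibus_check`), the transaction-type labels, and the use of integer minor units are assumptions for illustration, and "ledger" stands for whatever internal figure the real check derives from PG balances and CH PnL.

```python
from dataclasses import dataclass


class PnlAccumulator:
    """Running per-type totals, updated by the fold one batch at a time."""

    def __init__(self):
        self.epoch = 0
        self.totals = {}  # transaction type -> running sum

    def apply_batch(self, epoch: int, deltas: list) -> None:
        """deltas: (tx_type, amount) pairs for one batch; must net to zero."""
        if sum(amount for _, amount in deltas) != 0:
            raise ValueError(f"batch at epoch {epoch} is not zero-sum")
        for tx_type, amount in deltas:
            self.totals[tx_type] = self.totals.get(tx_type, 0) + amount
        self.epoch = epoch

    def is_zero_sum(self) -> bool:
        # Cost depends on the number of types, not on history length.
        return sum(self.totals.values()) == 0


@dataclass(frozen=True)
class Transfer:
    amount: int         # signed: deposits positive, withdrawals negative
    at_custodian: bool  # custodian balance already reflects it
    in_ledger: bool     # internal ledger already reflects it


def omnibus_diff(custodian: int, ledger: int, in_flight: list) -> int:
    """Custodian minus ledger, net of the effect of known in-flight transfers."""
    adjustment = sum(
        t.amount * (int(t.at_custodian) - int(t.in_ledger)) for t in in_flight
    )
    return custodian - ledger - adjustment


def omnibus_check(custodian: int, ledger: int, in_flight: list,
                  adjust: bool = True) -> str:
    if in_flight and not adjust:
        return "skipped"  # precondition "no in-flight custody transfers" not met
    return "ok" if omnibus_diff(custodian, ledger, in_flight) == 0 else "breach"
```

Because `apply_batch` validates before mutating, a rejected batch leaves the totals and epoch untouched, so the accumulator stays at the last audited state.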
The §3C go/no-go gate — "shadow equals authoritative over a real window" — is the same primitive: diff two stores as of the same epoch while both are written live. Build epoch-pinned comparison once and it serves both the continuous invariant checks and the cutover gate; the readiness-table rows "automated shadow-vs-prod diff gate" and "quiet-window-free invariant checks" are one work item wearing two hats.
One question — what does two copies running mean? — partitions every service into three classes, and each class gets a procedure that preserves the single invariant that matters: exactly one writer of authority, always. Class A leans on the proxy and clean drain; Class B hides the swap in an idle window; Class C — the services that are the exchange's integrity — stages a verified shadow and splices authority at an agreed point in the sequence. The replayability and idempotency the engines already have is precisely what makes the singleton case tractable. The remaining work is the splice protocol and the verification gate that turns "looks right" into "proven equal" — plus retiring the recon checks' dependence on a quiet exchange (§5), since a world without downtime windows must verify its invariants while moving.
Drafted from the AX codebase as of commit 26203ed. Service classifications and mechanisms cited inline; corrections welcome.