Date: 2026-06-10
Status: Draft
Scope: the in-house matching engine we are building to replace EP3, and the portfolio-margin architecture required to list full options chains under it. Whether and when it actually displaces EP3 in production is a separate cutover decision; the engine gets built either way.
This RFC distills a survey of production matching-engine and margin-system architecture (LMAX, NASDAQ INET/Genium, CME Globex, Eurex T7, Coinbase, OCC TIMS, CME SPAN/SPAN 2, Eurex Prisma, Binance/Deribit portfolio margin) into the design for that engine. Two prior RFCs set the stage: ContractId reforms make listing a dated contract pure data and reserve the bit layout for options, and ROME deferred cross-engine sharding as a v2 concern. Both decisions collide with the same question — what does AX matching look like at options-chain scale, with portfolio margin? — and this RFC answers it before any code forces the answer.
The three headline conclusions, stated up front because everything else follows from them:
These are the load-bearing commitments. Everything in §2–§5 is elaboration.
The reference design (WK Selph's, corroborated by liquibook/Chronicle/Databento write-ups), in Rust:
Limit nodes (handles sparse/wide price domains), or a flat tick-indexed ladder (O(1) access; the QuantCup-winning design, where cancel is a single store). Choose per product class: ladders for bounded tick domains (most listed derivatives), trees as the general fallback.
Order nodes: price-time priority with O(1) insert/remove.
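A minimal sketch of the ladder variant, assuming a bounded tick domain and a single book side. The names `Ladder`, `Level`, and the `u32` order handle are illustrative and not taken from the workspace. Note that this version unlinks a cancelled order from its level's list, which is O(1) but is not the single-store cancel mentioned above.

```rust
/// Hypothetical sketch: one side of a book as a flat tick-indexed ladder.
/// Orders live in a slab and are linked per price level, so appending at the
/// tail and cancelling by handle are both O(1).
const NIL: u32 = u32::MAX;

#[derive(Clone, Copy)]
struct Order {
    qty: u64,
    tick: u32,
    prev: u32,
    next: u32,
    live: bool,
}

#[derive(Clone, Copy)]
struct Level {
    head: u32,
    tail: u32,
    total_qty: u64,
}

pub struct Ladder {
    levels: Vec<Level>, // indexed directly by tick
    orders: Vec<Order>, // slab; a handle is an index into it
    free: Vec<u32>,     // recycled slab slots
}

impl Ladder {
    pub fn new(ticks: usize) -> Self {
        Ladder {
            levels: vec![Level { head: NIL, tail: NIL, total_qty: 0 }; ticks],
            orders: Vec::new(),
            free: Vec::new(),
        }
    }

    /// Append at the tail of the level: time priority within a price.
    pub fn insert(&mut self, tick: u32, qty: u64) -> u32 {
        let tail = self.levels[tick as usize].tail;
        let order = Order { qty, tick, prev: tail, next: NIL, live: true };
        let h = match self.free.pop() {
            Some(h) => { self.orders[h as usize] = order; h }
            None => { self.orders.push(order); (self.orders.len() - 1) as u32 }
        };
        if tail == NIL {
            self.levels[tick as usize].head = h;
        } else {
            self.orders[tail as usize].next = h;
        }
        let lvl = &mut self.levels[tick as usize];
        lvl.tail = h;
        lvl.total_qty += qty;
        h
    }

    /// Unlink by handle; returns false if the handle is not live.
    pub fn cancel(&mut self, h: u32) -> bool {
        let o = match self.orders.get(h as usize) {
            Some(o) if o.live => *o,
            _ => return false,
        };
        if o.prev == NIL { self.levels[o.tick as usize].head = o.next; }
        else { self.orders[o.prev as usize].next = o.next; }
        if o.next == NIL { self.levels[o.tick as usize].tail = o.prev; }
        else { self.orders[o.next as usize].prev = o.prev; }
        self.levels[o.tick as usize].total_qty -= o.qty;
        self.orders[h as usize].live = false;
        self.free.push(h);
        true
    }

    pub fn level_qty(&self, tick: u32) -> u64 {
        self.levels[tick as usize].total_qty
    }

    /// Handle of the order with time priority at this tick, if any.
    pub fn front(&self, tick: u32) -> Option<u32> {
        let h = self.levels[tick as usize].head;
        if h == NIL { None } else { Some(h) }
    }
}
```

The slab plus free list keeps the steady state allocation-free once the `Vec`s have grown to working size, which is the property the cache-behavior estimate below depends on.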
Realistic expectation: tens to a few hundred nanoseconds per book operation in optimized native code, dominated by cache behavior. An open benchmark of this design measured ~442ns/order (~2M orders/sec) on commodity hardware with 100k resting orders per book.
The shard process is the LMAX shape: input unmarshalling, journaling, replication, and output marshalling each on their own pinned threads, communicating through pre-allocated ring buffers (the Disruptor paper documents 25M+ msgs/sec and ~52ns mean inter-thread hop); the matching thread is pinned to an isolated core, busy-polling. Standard mechanical sympathy applies: cache-line padding against false sharing, single-writer principle, hyper-threading off on engine hosts, NIC-NUMA-local processing, and (when we get there) kernel bypass on the gateway path — the kernel stack costs ~20–50µs per transit; Onload/DPDK-class bypass cuts that to ~1–5µs.
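The ring-buffer hand-off and the cache-line padding can be illustrated with a small single-producer/single-consumer ring. This is a sketch under stated assumptions: `SpscRing` and `Padded` are hypothetical names, messages are plain `u64` values to keep the example free of `unsafe`, and a 64-byte cache line is assumed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Gives a counter its own cache line (64 bytes assumed) so the producer's
/// and consumer's counters do not false-share.
#[repr(align(64))]
struct Padded(AtomicU64);

/// Hypothetical pre-allocated SPSC ring. Each counter has exactly one writer
/// (single-writer principle): the consumer owns `head`, the producer `tail`.
pub struct SpscRing {
    slots: Vec<AtomicU64>,
    mask: u64,
    head: Padded, // next sequence to read
    tail: Padded, // next sequence to write
}

impl SpscRing {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        SpscRing {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            mask: capacity as u64 - 1,
            head: Padded(AtomicU64::new(0)),
            tail: Padded(AtomicU64::new(0)),
        }
    }

    /// Producer side. Returns false when the ring is full.
    pub fn try_push(&self, msg: u64) -> bool {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if tail - head > self.mask {
            return false; // all slots hold unread messages
        }
        self.slots[(tail & self.mask) as usize].store(msg, Ordering::Relaxed);
        // Release publishes the slot write before the new tail becomes visible.
        self.tail.0.store(tail + 1, Ordering::Release);
        true
    }

    /// Consumer side. Returns None when the ring is empty.
    pub fn try_pop(&self) -> Option<u64> {
        let head = self.head.0.load(Ordering::Relaxed);
        let tail = self.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let msg = self.slots[(head & self.mask) as usize].load(Ordering::Relaxed);
        // Release hands the slot back to the producer only after it was read.
        self.head.0.store(head + 1, Ordering::Release);
        Some(msg)
    }
}
```

A busy-polling matching thread would spin on `try_pop` rather than block; thread pinning and core isolation are operating-system configuration and are not shown here.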
The binding constraints at scale are memory and market-data fan-out, not match throughput. A near-empty book costs on the order of ~1KB (planning figure — measure before fleet-sizing), so 1M books ≈ 1GB: one host comfortably holds 1M markets. What one core cannot do is keep millions of active books cache-warm or absorb options-chain quote traffic for many hot underliers. Planning numbers:
Gateway (existing order-gateway surface, eventually kernel-bypass) → sequencer (assigns global order within the shard, writes the replicated journal) → matching core → outputs (trade reports, drop copy, sequenced market data).
Market data disseminates MoldUDP64-style: sequenced multicast with gap detection, a re-request server, and a snapshot service for late join (the NASDAQ ITCH/GLIMPSE pattern).
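Receiver-side gap detection for a MoldUDP64-style feed can be sketched as follows. The header layout used here (10-byte session, 8-byte big-endian sequence number, 2-byte big-endian message count) follows the public MoldUDP64 layout; `GapDetector`, `Action`, and the decision to simply drop out-of-order packets until the gap is filled are illustrative assumptions, not a description of AX's feed handler.

```rust
/// Downstream packet header: 20 bytes, big-endian integers.
pub struct Header {
    pub session: [u8; 10],
    pub seq: u64,   // sequence number of the first message in the packet
    pub count: u16, // 0 = heartbeat, 0xFFFF = end of session
}

pub fn parse_header(pkt: &[u8]) -> Option<Header> {
    if pkt.len() < 20 {
        return None;
    }
    let mut session = [0u8; 10];
    session.copy_from_slice(&pkt[..10]);
    let mut seq = [0u8; 8];
    seq.copy_from_slice(&pkt[10..18]);
    Some(Header {
        session,
        seq: u64::from_be_bytes(seq),
        count: u16::from_be_bytes([pkt[18], pkt[19]]),
    })
}

#[derive(Debug, PartialEq)]
pub enum Action {
    /// Process the packet, skipping `skip` leading messages already seen.
    Deliver { skip: u64 },
    /// Nothing new: a duplicate, or a heartbeat while fully caught up.
    Stale,
    /// Messages were missed; ask the re-request server for this range.
    Rerequest { from: u64, count: u64 },
}

/// Hypothetical detector: tracks the next sequence number it expects.
pub struct GapDetector {
    next_expected: u64,
}

impl GapDetector {
    pub fn new(first_seq: u64) -> Self {
        GapDetector { next_expected: first_seq }
    }

    pub fn on_packet(&mut self, h: &Header) -> Action {
        // Heartbeats and end-of-session packets carry no messages.
        let n = if h.count == 0xFFFF { 0 } else { h.count as u64 };
        if h.seq > self.next_expected {
            // Simplification: the out-of-order packet is not buffered.
            return Action::Rerequest {
                from: self.next_expected,
                count: h.seq - self.next_expected,
            };
        }
        let end = h.seq + n;
        if end <= self.next_expected {
            return Action::Stale;
        }
        let skip = self.next_expected - h.seq;
        self.next_expected = end;
        Action::Deliver { skip }
    }
}
```

Because a heartbeat carries the next sequence number the sender will use, the same comparison also reveals loss at the tail of a burst. A late joiner would seed `GapDetector::new` from the snapshot service's sequence number.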
Two credible options, to be settled by prototype in Stage 2 (§6):
| | Aeron Cluster (Raft) | Chronicle-style journal + in-house replication |
|---|---|---|
| Consensus/HA | Integrated (leader election, majority ack) | Build it ourselves |
| Latency | µs-class; used by CME, Coinbase, Man Group | Single-digit µs write-to-read |
| Rust story | Bindings/port required (Aeron is C/Java native) | Memory-mapped journal is straightforward to own in Rust |
| Risk | Foreign runtime in the most critical path | We own a consensus-adjacent protocol |
Durability strategy regardless of choice: replicate to a majority over a fast network rather than fsync-ing the hot path to local disk (Chronicle's own benchmarking argues sync replication beats local fsync on both latency and fault tolerance), persist via memory-mapped journal, bound replay time with periodic state snapshots.
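The "replicate to a majority" rule reduces to a small computation on the leader: a journal position is durable once a majority of members have acknowledged it. The sketch below assumes each member reports the highest position it has persisted; `majority_commit` is a hypothetical helper, not an Aeron or Chronicle API.

```rust
/// Highest journal position acknowledged by a majority of the cluster.
/// `acked[i]` is the last position member `i` has acknowledged; the leader's
/// own position is included as one of the entries.
pub fn majority_commit(acked: &[u64]) -> Option<u64> {
    if acked.is_empty() {
        return None;
    }
    let mut sorted = acked.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    let quorum = acked.len() / 2 + 1;
    // The quorum-th highest position is held by at least `quorum` members.
    Some(sorted[quorum - 1])
}

/// Outputs (trade reports, market data) for a journal position may be
/// released only once that position is majority-committed.
pub fn releasable(position: u64, acked: &[u64]) -> bool {
    matches!(majority_commit(acked), Some(c) if position <= c)
}
```

With three members at positions 10, 7, and 4, the commit position is 7: the matching core may have processed up to 10, but nothing past 7 should leave the shard yet.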
Raft-style failover can lose the in-flight message at the instant of leader loss — Aeron's docs say so explicitly. The client protocol must therefore support post-failover reconciliation of in-flight order state from day one. This is a correctness requirement, not an optimization, and it lands in Stage 2, not later. (House rule applies with force here: disconnect, server-initiated disconnect, crash/restart, and reconnect/recovery paths are all first-class test surfaces before anything goes near production.)
The industry has one basic trick: precompute scenario P&L, then aggregate with offsets up a hierarchy.
AX's methodology choice (scenario grid shape, offsets, lookbacks) is a risk-policy decision out of scope here; the architecture below works for any member of this family.
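The shared trick can be shown on a toy grid. This is a sketch only: the four-scenario grid, the integer P&L units, and the rule "full offset within an underlier, none across" are assumptions chosen for illustration, since the real grid shape and offsets are the risk-policy decision left out of scope above.

```rust
/// Hypothetical grid size; real grids are a risk-policy choice.
pub const SCENARIOS: usize = 4;

/// P&L of one unit of a position under each scenario (negative = loss).
pub type ScenarioPnl = [i64; SCENARIOS];

/// Net scenario P&L for one underlier: each position's precomputed per-unit
/// vector scaled by its signed quantity, summed scenario by scenario.
pub fn net_vector(positions: &[(i64, ScenarioPnl)]) -> ScenarioPnl {
    let mut net = [0i64; SCENARIOS];
    for (qty, per_unit) in positions {
        for (n, p) in net.iter_mut().zip(per_unit.iter()) {
            *n += qty * p;
        }
    }
    net
}

/// Largest loss across scenarios, floored at zero.
pub fn worst_loss(v: &ScenarioPnl) -> i64 {
    v.iter().map(|p| -p).max().unwrap_or(0).max(0)
}

/// Aggregate up one level of the hierarchy: positions offset inside an
/// underlier, and (in this sketch) underliers receive no cross-offset.
pub fn initial_margin(underliers: &[ScenarioPnl]) -> i64 {
    underliers.iter().map(worst_loss).sum()
}
```

For example, a long position with per-unit vector `[-100, -50, 50, 100]` and a short position in a related series with per-unit vector `[-80, -40, 40, 80]` net to `[-20, -10, 10, 20]`, a requirement of 20 when both sit under one underlier versus 100 + 80 = 180 when margined separately.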
Hot path (synchronous, in-shard, sub-microsecond budget):
new order cost = increase in IM + order's open loss against a virtual available balance); the pattern is proven at retail-crypto scale, and we tighten it to HFT-grade by keeping it in-process.
Authoritative tier (asynchronous): a central risk aggregator continuously runs the full scenario/VaR revaluation across the whole portfolio, granting cross-underlier offset credit, re-anchoring the per-account cached vectors (correcting Taylor-approximation drift), and rebalancing per-shard budgets. This is the one place GPU acceleration is justified, and only once scenario counts make CPUs the bottleneck.
Backstop: a liquidation engine keyed off the authoritative tier. The hot path is conservative by design and will occasionally over-margin hedged portfolios; the async tier corrects within its cadence; if an account is genuinely under-margined despite admission control, liquidation — not order blocking — is the safety mechanism. All three layers exist or the design is unsound: admission control without true-up drifts, true-up without liquidation has no teeth.
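The hot-path admission check (new order cost = increase in IM + order's open loss, tested against a virtual available balance) could look roughly like this. Everything named here is hypothetical: `Account`, `NewOrder`, the four-scenario vector, the clamp of a negative IM change to zero, and in particular the definition of open loss as the amount by which the limit price is worse than the mark.

```rust
pub type ScenarioPnl = [i64; 4];

fn worst_loss(v: &ScenarioPnl) -> i64 {
    v.iter().map(|p| -p).max().unwrap_or(0).max(0)
}

/// Per-account state cached in the shard (illustrative fields).
pub struct Account {
    pub net: ScenarioPnl, // cached net scenario vector for the underlier
    pub available: i64,   // virtual available balance
}

pub struct NewOrder {
    pub qty: i64,              // signed: positive buys, negative sells
    pub per_unit: ScenarioPnl, // precomputed scenario P&L of one unit
    pub limit: i64,
    pub mark: i64,
}

/// Cost of admitting the order as if it filled in full.
pub fn order_cost(acct: &Account, o: &NewOrder) -> i64 {
    let mut after = acct.net;
    for (a, p) in after.iter_mut().zip(o.per_unit.iter()) {
        *a += o.qty * p;
    }
    // Assumption: a risk-reducing order earns no credit at admission time.
    let im_increase = (worst_loss(&after) - worst_loss(&acct.net)).max(0);
    // Assumed definition: loss locked in if filled at limit versus mark.
    // Positive for a buy above mark or a sell below mark, else zero.
    let open_loss = (o.qty * (o.limit - o.mark)).max(0);
    im_increase + open_loss
}

/// Reserve the cost against the balance, or reject without side effects.
pub fn admit(acct: &mut Account, o: &NewOrder) -> bool {
    let cost = order_cost(acct, o);
    if cost > acct.available {
        return false;
    }
    acct.available -= cost;
    true
}
```

The check is one vector add and two max-scans over the cached vector, with no repricing. Because it treats the order as fully filled and gives no credit for risk reduction, it errs toward over-margining, which is the behavior the asynchronous tier is there to correct.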
Because options on one underlier are mathematically related, risk must aggregate at the underlier level even if series were spread across engines. Sharding by underlier (D3) means the dominant offsets — verticals, calendars, covered structures — are local to one shard's budget and priced correctly in the synchronous path. Only the smaller cross-underlier correlation credit lives exclusively in the central aggregator, and it is exactly the part that tolerates asynchrony. This is the formalized version of SPAN's "combined commodity" and TIMS's class-group hierarchy; the survey's conclusion is that mature exchanges already operate this way (Eurex pins all products of one underlier to one partition), so the idea needs no novelty budget.
Failure posture: budget rebalancing lag is bounded loss-of-efficiency, not loss-of-safety — a shard that can't reach the aggregator keeps admitting against its last-granted (conservative) budget and degrades to rejecting growth, never to admitting unchecked risk.
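The failure posture can be made concrete with a per-shard budget that never needs the aggregator on the admission path. This is an illustrative sketch; `ShardBudget` and its methods are hypothetical names, and the budget is a single scalar for simplicity.

```rust
/// Hypothetical per-shard margin budget. The aggregator only ever calls
/// `regrant`; admission reads local state and never waits on the network.
pub struct ShardBudget {
    granted: i64, // last budget received from the central aggregator
    used: i64,    // margin currently consumed in this shard
}

impl ShardBudget {
    pub fn new(granted: i64) -> Self {
        ShardBudget { granted, used: 0 }
    }

    /// `delta` is the change in consumed margin. Growth is admitted only
    /// inside the last-granted budget; reductions are always admitted.
    pub fn try_admit(&mut self, delta: i64) -> bool {
        if delta > 0 && self.used + delta > self.granted {
            return false;
        }
        self.used += delta;
        true
    }

    /// Applied whenever the aggregator is reachable. If it is not, the old
    /// grant simply stays in force.
    pub fn regrant(&mut self, granted: i64) {
        self.granted = granted;
    }

    pub fn headroom(&self) -> i64 {
        self.granted - self.used
    }
}
```

If the aggregator goes quiet, the shard runs down its remaining headroom and then rejects growth while still accepting risk-reducing flow, so the worst case is lost efficiency rather than unchecked exposure.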
risk-engine2 and the gateway's pre-trade checks remain the system of record for the current EP3-backed product set; nothing in this RFC changes them. The scenario-vector / budget machinery is new build, and §6's staging is arranged so the margin tier (Stage 4) can be developed and validated against recorded production flow before it gates a single real order, the same signal-only-soak discipline as continuous recon checks.
These are why D3 is by-underlier and non-negotiable:
Stages overlap deliberately; each names the benchmark that would change the plan rather than a calendar promise.
Stage 1 — Core engine. One deterministic single-threaded matching core per partition, in Rust in the existing workspace: book structures per §2.1, ring-buffer hand-off per §2.2, replay-from-journal recovery. Target <1µs in-engine match latency, ≥1M orders/sec/core. Plan-changing benchmark: below ~500k orders/sec under realistic book depth, stop and profile cache behavior and allocation before reaching for shards.
Stage 2 — Sequenced replicated log + HA. Sequencer, majority-replicated journal (settle the §3.2 table by prototyping both), snapshot+replay recovery, sequenced-multicast market data with re-request and snapshots, and the in-flight reconciliation protocol (§3.3). Exit criterion: kill the leader mid-burst in test and reconcile every in-flight order correctly, repeatedly.
Stage 3 — Sharding. Partition by underlier; route by product identifier through per-segment gateways (CME MSGW / Eurex PS-gateway model, strict FIFO per gateway). Pooled shards for the long tail. This is also where ROME's deferred sharding question gets its answer for free: RFQ flow routes to the shard that owns the underlier.
Stage 4 — Tiered portfolio margin. Hot-path scenario-vector check + per-shard budgets; central aggregator with full revaluation and budget rebalancing; liquidation engine. Soak signal-only against recorded/production flow before it gates orders. Plan-changing threshold: if the conservative estimate's false-rejection rate is material to revenue, tighten revaluation cadence or give hot underliers dedicated shards with richer local scenario sets — before contemplating abandoning pre-trade checks.
Stage 5 — Hardware acceleration (optional, last). FPGA pre-trade risk gateways and feed handlers if and only if latency demands it; GPUs only in the Stage-4 async tier. Never GPU matching.
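The Stage 3 routing rule (dedicated shards for hot underliers, pooled shards for the long tail) can be sketched as a pure function of the underlier. The names `Router` and `shard_for`, the `u32` underlier id, and modulo placement in the pool are assumptions for illustration; how the underlier is extracted from a ContractId is not shown.

```rust
use std::collections::HashMap;

/// Hypothetical router from underlier id to shard id.
pub struct Router {
    dedicated: HashMap<u32, u16>, // hot underliers pinned to their own shard
    pooled: Vec<u16>,             // shards shared by the long tail
}

impl Router {
    pub fn new(dedicated: HashMap<u32, u16>, pooled: Vec<u16>) -> Self {
        assert!(!pooled.is_empty(), "need at least one pooled shard");
        Router { dedicated, pooled }
    }

    /// Every series of an underlier maps to the same shard because the
    /// result depends only on the underlier, never on strike or expiry.
    pub fn shard_for(&self, underlier: u32) -> u16 {
        match self.dedicated.get(&underlier) {
            Some(shard) => *shard,
            None => self.pooled[underlier as usize % self.pooled.len()],
        }
    }
}
```

Modulo placement reshuffles the long tail whenever the pool size changes; a production router would more likely use an explicit assignment table so that books move only when deliberately migrated.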
risk-engine2/settlement-engine boundaries.
Numbers above are planning figures from a survey, not measurements of our code: the ~1KB/book footprint is order-of-magnitude only; QuantCup's original winning score is unrecoverable (figures are from reimplementations); vendor GPU claims beyond J.P. Morgan's 40x and Cboe Hanweck's "millions of valuations/sec" were excluded as unsourced; OPRA peak rates mix 1-second and 1-millisecond microburst timescales (microbursts run roughly an order of magnitude hotter). Re-derive every capacity number empirically at each stage gate before it drives fleet sizing or spend.