RFC: In-House Matching Engine and Portfolio Margin

Date: 2026-06-10

Status: Draft

Scope: the in-house matching engine we are building to replace EP3, and the portfolio-margin architecture required to list full options chains under it. Whether and when it actually displaces EP3 in production is a separate cutover decision; the engine gets built either way.

This RFC distills a survey of production matching-engine and margin-system architecture (LMAX, NASDAQ INET/Genium, CME Globex, Eurex T7, Coinbase, OCC TIMS, CME SPAN/SPAN 2, Eurex Prisma, Binance/Deribit portfolio margin) into the design for that engine. Two prior RFCs set the stage: ContractId reforms make listing a dated contract pure data and reserve the bit layout for options, and ROME deferred cross-engine sharding as a v2 concern. Both decisions collide with the same question: what does AX matching look like at options-chain scale, with portfolio margin? This RFC answers it before any code forces the answer.

The three headline conclusions, stated up front because everything else follows from them:

  1. One logical order book runs on one thread. Every leading venue runs a single-threaded, deterministic matching core per partition; throughput comes from sharding instruments across engines, never from parallelizing a book. A single core sustains roughly 1-6M orders/sec (LMAX's documented figure is 6M on one thread).
  2. Pre-trade portfolio margin cannot be a synchronous full-portfolio revaluation. Real systems precompute SPAN-style scenario arrays so the hot-path check is a table lookup / incremental dot product against a cached per-account loss vector, backed by an asynchronous authoritative revaluation and a liquidation engine.
  3. GPUs do not belong in matching. Branch divergence, PCIe latency, and strict ordering make them strictly worse than a pinned CPU core. They are legitimately useful only in the asynchronous risk tier (J.P. Morgan's documented 40x on Monte Carlo risk; Cboe's GPU-based Hanweck/Volera options-analytics engine). The realistic hardware-acceleration path for matching-adjacent work is FPGA (pre-trade risk gateways, feed handlers), and only much later.

1. Decisions

These are the load-bearing commitments. Everything in §2-§5 is elaboration.

2. Matching core

2.1 Order book data structures

The reference design (WK Selph's, corroborated by liquibook/Chronicle/Databento write-ups), in Rust:
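A minimal sketch of a book in that family, not the reference design itself. It keeps the three-part shape (price levels in an ordered map, a FIFO queue per level, an id index for O(1) cancel) but swaps Selph's intrusive doubly-linked lists for `VecDeque` plus lazy cancellation, to stay in safe Rust. All type and method names here are illustrative assumptions, and order ids are assumed unique.

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Side { Buy, Sell }

#[derive(Clone, Copy, Debug)]
pub struct Order { pub id: u64, pub side: Side, pub price: u64, pub qty: u64 }

#[derive(Debug, PartialEq)]
pub struct Fill { pub maker: u64, pub taker: u64, pub price: u64, pub qty: u64 }

#[derive(Default)]
pub struct Book {
    bids: BTreeMap<u64, VecDeque<u64>>, // price tick -> FIFO queue of order ids
    asks: BTreeMap<u64, VecDeque<u64>>,
    orders: HashMap<u64, Order>,        // id -> live resting order
}

impl Book {
    /// O(1) cancel: drop the order from the index; its queue entry becomes a
    /// tombstone that matching skips later (assumption: ids are never reused).
    pub fn cancel(&mut self, id: u64) -> bool {
        self.orders.remove(&id).is_some()
    }

    /// Match a limit order in price-time priority, then rest any remainder.
    pub fn submit(&mut self, mut taker: Order) -> Vec<Fill> {
        let mut fills = Vec::new();
        while taker.qty > 0 {
            // Best opposite price: lowest ask for a buy, highest bid for a sell.
            let (opposite, best) = match taker.side {
                Side::Buy => {
                    let best = self.asks.keys().next().copied();
                    (&mut self.asks, best)
                }
                Side::Sell => {
                    let best = self.bids.keys().next_back().copied();
                    (&mut self.bids, best)
                }
            };
            let Some(price) = best else { break };
            let crosses = match taker.side {
                Side::Buy => price <= taker.price,
                Side::Sell => price >= taker.price,
            };
            if !crosses { break; }
            let Some(queue) = opposite.get_mut(&price) else { break };
            // An exhausted level is removed so the next-best price becomes visible.
            let Some(&maker_id) = queue.front() else {
                opposite.remove(&price);
                continue;
            };
            // A missing index entry means the order was cancelled: skip the tombstone.
            let Some(maker) = self.orders.get_mut(&maker_id) else {
                queue.pop_front();
                continue;
            };
            let qty = maker.qty.min(taker.qty);
            maker.qty -= qty;
            taker.qty -= qty;
            let maker_done = maker.qty == 0;
            fills.push(Fill { maker: maker_id, taker: taker.id, price, qty });
            if maker_done {
                self.orders.remove(&maker_id);
                queue.pop_front();
            }
        }
        if taker.qty > 0 {
            let own = match taker.side {
                Side::Buy => &mut self.bids,
                Side::Sell => &mut self.asks,
            };
            own.entry(taker.price).or_default().push_back(taker.id);
            self.orders.insert(taker.id, taker);
        }
        fills
    }
}
```

The lazy-cancel trade-off is that tombstones occupy queue slots until matching reaches them, so a top-of-book query would have to skip them. A production book would more likely use an arena of intrusive list nodes or a dense tick ladder, which is exactly the tick-ladder vs. tree question left open in §9.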

Realistic expectation: tens to a few hundred nanoseconds per book operation in optimized native code, dominated by cache behavior. An open benchmark of this design measured ~442ns/order (~2M orders/sec) on commodity hardware with 100k resting orders per book.

2.2 Threading and hand-off

The shard process is the LMAX shape: input unmarshalling, journaling, replication, and output marshalling each on their own pinned threads, communicating through pre-allocated ring buffers (the Disruptor paper documents 25M+ msgs/sec and ~52ns mean inter-thread hop); the matching thread is pinned to an isolated core, busy-polling. Standard mechanical sympathy applies: cache-line padding against false sharing, single-writer principle, hyper-threading off on engine hosts, NIC-NUMA-local processing, and (when we get there) kernel bypass on the gateway path. The kernel stack costs ~20-50µs per transit; Onload/DPDK-class bypass cuts that to ~1-5µs.
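To make the hand-off concrete, here is a minimal single-producer/single-consumer ring in safe Rust. It is a simplification of the Disruptor pattern, which also supports multiple consumers and sequence barriers, not a port of it. The `SpscRing` name and the choice of a `u64` payload (for example, an index into a pre-allocated message pool) are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Each cursor gets its own cache line so producer and consumer do not false-share.
#[repr(align(64))]
struct Padded(AtomicUsize);

pub struct SpscRing {
    slots: Box<[AtomicU64]>, // pre-allocated once; no allocation on the hot path
    mask: usize,
    head: Padded, // next slot to read; written only by the consumer
    tail: Padded, // next slot to write; written only by the producer
}

impl SpscRing {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        let slots: Box<[AtomicU64]> = (0..capacity).map(|_| AtomicU64::new(0)).collect();
        SpscRing {
            slots,
            mask: capacity - 1,
            head: Padded(AtomicUsize::new(0)),
            tail: Padded(AtomicUsize::new(0)),
        }
    }

    /// Producer side. Returns false when the ring is full (caller busy-polls).
    pub fn try_push(&self, value: u64) -> bool {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == self.slots.len() {
            return false;
        }
        self.slots[tail & self.mask].store(value, Ordering::Relaxed);
        // Release publishes the slot write before the consumer can observe the new tail.
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side. Returns None when the ring is empty.
    pub fn try_pop(&self) -> Option<u64> {
        let head = self.head.0.load(Ordering::Relaxed);
        let tail = self.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let value = self.slots[head & self.mask].load(Ordering::Relaxed);
        // Release hands the slot back to the producer only after the read completes.
        self.head.0.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```

Each cursor has exactly one writer, which is the single-writer principle in miniature. Thread pinning and core isolation are deployment concerns and are not shown.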

2.3 Capacity per shard

The binding constraints at scale are memory and market-data fan-out, not match throughput. A near-empty book costs on the order of ~1KB (a planning figure; measure before fleet-sizing), so 1M books ≈ 1GB: one host comfortably holds 1M markets. What one core cannot do is keep millions of active books cache-warm or absorb options-chain quote traffic for many hot underliers. Planning numbers:

3. Sequencing, replication, recovery

3.1 The pipeline

Gateway (existing order-gateway surface, eventually kernel-bypass) → sequencer (assigns global order within the shard, writes the replicated journal) → matching core → outputs (trade reports, drop copy, sequenced market data). Market data disseminates MoldUDP64-style: sequenced multicast with gap detection, a re-request server, and a snapshot service for late join (the NASDAQ ITCH/GLIMPSE pattern).
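The receiver side of that dissemination pattern can be sketched as follows. The 20-byte header layout (10-byte session, 8-byte big-endian sequence number, 2-byte big-endian message count, with 0xFFFF marking end of session) follows the published MoldUDP64 format. The `GapDetector` type and its policy of not advancing on a gap are assumptions for illustration.

```rust
pub const HEADER_LEN: usize = 20;
pub const END_OF_SESSION: u16 = 0xFFFF;

pub struct Header {
    pub session: [u8; 10],
    pub sequence: u64, // sequence number of the first message in the packet
    pub count: u16,    // number of messages; 0 is a heartbeat
}

pub fn parse_header(packet: &[u8]) -> Option<Header> {
    if packet.len() < HEADER_LEN {
        return None;
    }
    let mut session = [0u8; 10];
    session.copy_from_slice(&packet[..10]);
    let mut seq = [0u8; 8];
    seq.copy_from_slice(&packet[10..18]);
    let mut cnt = [0u8; 2];
    cnt.copy_from_slice(&packet[18..20]);
    Some(Header {
        session,
        sequence: u64::from_be_bytes(seq),
        count: u16::from_be_bytes(cnt),
    })
}

#[derive(Debug, PartialEq)]
pub enum Delivery {
    /// Messages `first .. first + count` are new and in order.
    Deliver { first: u64, count: u64 },
    /// Messages were missed; ask the re-request server for this range.
    Gap { missing_from: u64, missing_count: u64 },
    /// Heartbeat or duplicate: nothing new to process.
    Nothing,
}

pub struct GapDetector {
    next_expected: u64,
}

impl GapDetector {
    pub fn new(first_sequence: u64) -> Self {
        GapDetector { next_expected: first_sequence }
    }

    pub fn on_header(&mut self, h: &Header) -> Delivery {
        let count = if h.count == END_OF_SESSION { 0 } else { u64::from(h.count) };
        if h.sequence > self.next_expected {
            // Assumed policy: do not advance; the caller re-requests, then replays.
            return Delivery::Gap {
                missing_from: self.next_expected,
                missing_count: h.sequence - self.next_expected,
            };
        }
        let end = h.sequence + count;
        if end <= self.next_expected {
            return Delivery::Nothing;
        }
        // A partially overlapping packet delivers only its unseen tail.
        let first = self.next_expected;
        self.next_expected = end;
        Delivery::Deliver { first, count: end - first }
    }
}
```

Heartbeats carry the next sequence number, so an idle receiver still detects a gap. A late joiner would seed `GapDetector::new` from the snapshot service's sequence number.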

3.2 Log technology

Two credible options, to be settled by prototype in Stage 2 (§6):

|              | Aeron Cluster (Raft) | Chronicle-style journal + in-house replication |
|--------------|----------------------|------------------------------------------------|
| Consensus/HA | Integrated (leader election, majority ack) | Build it ourselves |
| Latency      | µs-class; used by CME, Coinbase, Man Group | Single-digit µs write-to-read |
| Rust story   | Bindings/port required (Aeron is C/Java native) | Memory-mapped journal is straightforward to own in Rust |
| Risk         | Foreign runtime in the most critical path | We own a consensus-adjacent protocol |

Durability strategy regardless of choice: replicate to a majority over a fast network rather than fsync-ing the hot path to local disk (Chronicle's own benchmarking argues sync replication beats local fsync on both latency and fault tolerance), persist via memory-mapped journal, bound replay time with periodic state snapshots.
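Two pieces of arithmetic sit behind that strategy. The sketch below shows them under stated assumptions: a journal position counts as durable once a majority of replicas (leader included) have acknowledged it, and worst-case replay time is bounded by the events accumulated in one snapshot interval divided by the replay rate. Function names and rate parameters are hypothetical, not measured values.

```rust
/// Highest journal position acknowledged by a majority of replicas.
/// `acked` holds one position per replica, including the leader's own.
pub fn majority_commit_index(acked: &[u64]) -> Option<u64> {
    if acked.is_empty() {
        return None;
    }
    let mut sorted = acked.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    let majority = acked.len() / 2 + 1;
    // The majority-th highest position is held by at least `majority` replicas.
    Some(sorted[majority - 1])
}

/// Upper bound on replay time after a crash, in seconds: everything journaled
/// since the last snapshot must be re-applied at `replay_rate` events/sec.
pub fn worst_case_replay_secs(
    snapshot_interval_secs: f64,
    ingest_rate: f64,
    replay_rate: f64,
) -> f64 {
    snapshot_interval_secs * ingest_rate / replay_rate
}
```

For example, with three replicas at positions 10, 7, and 9, position 9 is committed because two of the three hold it. The replay bound is the quantity that open question 4 in §9 trades against snapshot cost.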

3.3 Failover is a client-protocol problem too

Raft-style failover can lose the in-flight message at the instant of leader loss; Aeron's docs say so explicitly. The client protocol must therefore support post-failover reconciliation of in-flight order state from day one. This is a correctness requirement, not an optimization, and it lands in Stage 2, not later. (House rule applies with force here: disconnect, server-initiated disconnect, crash/restart, and reconnect/recovery paths are all first-class test surfaces before anything goes near production.)
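One possible shape for that reconciliation step is sketched below. It assumes a protocol that this RFC does not yet specify: the client tracks unacknowledged orders by a client-assigned id, and on reconnect the new leader reports which of those ids reached the replicated journal. The names `Resolution` and `reconcile` are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Resolution {
    /// The new leader's journal contains the order; its acks and fills will follow.
    Sequenced,
    /// The order never reached the replicated journal and was lost in flight.
    NotReceived,
}

/// Resolve every order the client sent but never saw acknowledged.
/// `in_flight`: client order ids awaiting an ack at the moment of disconnect.
/// `journaled`: client order ids the new leader reports as sequenced.
pub fn reconcile(
    in_flight: &BTreeSet<u64>,
    journaled: &BTreeSet<u64>,
) -> BTreeMap<u64, Resolution> {
    in_flight
        .iter()
        .map(|id| {
            let resolution = if journaled.contains(id) {
                Resolution::Sequenced
            } else {
                Resolution::NotReceived
            };
            (*id, resolution)
        })
        .collect()
}
```

Whether a `NotReceived` order is resubmitted is deliberately left to the client, since the price may be stale by the time failover completes. The Stage 2 exit criterion amounts to this map being correct for every order, on every leader kill.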

4. Portfolio margin

4.1 Methodology family

The industry has one basic trick: precompute scenario P&L, then aggregate with offsets up a hierarchy.

AX's methodology choice (scenario grid shape, offsets, lookbacks) is a risk-policy decision out of scope here; the architecture below works for any member of this family.

4.2 The tiered architecture

Hot path (synchronous, in-shard, sub-microsecond budget):

  1. Each instrument carries a precomputed scenario P&L vector (SPAN-style risk array), refreshed by the risk tier on its own cadence.
  2. Each account carries a cached scenario-loss vector per shard. An incoming order's marginal margin impact is incremental vector math: add the order's per-scenario contribution, then take the new worst case plus a Greeks-based first-order term for anything the grid doesn't capture.
  3. The result is checked against the shard's local risk budget for that account (D5). Admit or reject; never call out of the shard.
  4. Binance's own PM order check has exactly this shape (new order cost = increase in IM + order's open loss against a virtual available balance). The pattern is proven at retail-crypto scale, and we tighten it to HFT-grade by keeping it in-process.
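The steps above can be sketched as a few lines of vector arithmetic. This is an illustration under stated assumptions, not the margin methodology: four scenarios instead of a real grid, integer minor-currency units, the Greeks term passed in as a precomputed add-on, and the order treated as fully filled at admission. All names are hypothetical.

```rust
pub const SCENARIOS: usize = 4; // illustrative; real grids are larger

/// Per-scenario P&L in minor currency units (negative = loss).
pub type ScenarioPnl = [i64; SCENARIOS];

pub struct AccountRisk {
    pub pnl: ScenarioPnl, // cached per-account scenario vector for this shard
    pub budget: i64,      // shard-local risk budget for this account
}

/// Worst-case loss across scenarios, as a non-negative number.
pub fn worst_loss(pnl: &ScenarioPnl) -> i64 {
    pnl.iter().map(|p| -p).max().unwrap_or(0).max(0)
}

impl AccountRisk {
    /// Admit or reject using only in-shard state; never calls out of the shard.
    /// `risk_array` is the instrument's per-unit scenario P&L, `signed_qty` is
    /// positive for buys and negative for sells, and `greeks_addon` stands in
    /// for the first-order term covering what the grid does not capture.
    pub fn try_admit(
        &mut self,
        risk_array: &ScenarioPnl,
        signed_qty: i64,
        greeks_addon: i64,
    ) -> bool {
        let mut candidate = self.pnl;
        for (c, r) in candidate.iter_mut().zip(risk_array) {
            *c += r * signed_qty;
        }
        let required = worst_loss(&candidate) + greeks_addon;
        if required > self.budget {
            return false;
        }
        self.pnl = candidate; // commit the incremental update
        true
    }
}
```

Usage, with made-up risk arrays: an account holding one long call with vector `[-300, -100, 100, 300]` has a worst loss of 300. Selling one call whose per-unit vector is `[-200, -50, 50, 200]` moves the account to `[-100, -50, 50, 100]`, so the vertical lowers the requirement to 100 rather than adding to it. That is the offset the synchronous path needs to price correctly.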

Authoritative tier (asynchronous): a central risk aggregator continuously runs the full scenario/VaR revaluation across the whole portfolio, granting cross-underlier offset credit, re-anchoring the per-account cached vectors (correcting Taylor-approximation drift), and rebalancing per-shard budgets. This is the one place GPU acceleration is justified, and only once scenario counts make CPUs the bottleneck.

Backstop: a liquidation engine keyed off the authoritative tier. The hot path is conservative by design and will occasionally over-margin hedged portfolios; the async tier corrects within its cadence; if an account is genuinely under-margined despite admission control, liquidation, not order blocking, is the safety mechanism. All three layers exist or the design is unsound: admission control without true-up drifts, true-up without liquidation has no teeth.

4.3 Why this fits the sharding ("homologation")

Because options on one underlier are mathematically related, risk must aggregate at the underlier level even if series were spread across engines. Sharding by underlier (D3) means the dominant offsets (verticals, calendars, covered structures) are local to one shard's budget and priced correctly in the synchronous path. Only the smaller cross-underlier correlation credit lives exclusively in the central aggregator, and it is exactly the part that tolerates asynchrony. This is the formalized version of SPAN's "combined commodity" and TIMS's class-group hierarchy; the survey's conclusion is that mature exchanges already operate this way (Eurex pins all products of one underlier to one partition), so the idea needs no novelty budget.

Failure posture: budget rebalancing lag is bounded loss-of-efficiency, not loss-of-safety. A shard that can't reach the aggregator keeps admitting against its last-granted (conservative) budget and degrades to rejecting growth, never to admitting unchecked risk.
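One reading of that failure posture, sketched under assumptions: the shard only ever holds the last budget the aggregator granted, an order is admitted if the post-trade requirement fits that budget, and risk-reducing orders are always admitted so an account can still de-risk while the aggregator is unreachable. The always-admit-reductions rule and all names are assumptions, not decisions recorded in this RFC.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Decision {
    Admit,
    RejectGrowth,
}

pub struct ShardBudget {
    /// Last budget granted by the central aggregator; the shard never raises it itself.
    pub granted: i64,
}

impl ShardBudget {
    /// `required_before` / `required_after`: the account's local margin
    /// requirement without and with the incoming order.
    pub fn decide(&self, required_before: i64, required_after: i64) -> Decision {
        let fits = required_after <= self.granted;
        let reduces_risk = required_after <= required_before;
        if fits || reduces_risk {
            Decision::Admit
        } else {
            Decision::RejectGrowth
        }
    }

    /// Applied only when a rebalance message from the aggregator arrives.
    pub fn on_grant(&mut self, new_grant: i64) {
        self.granted = new_grant;
    }
}
```

With the aggregator down, `granted` simply stops changing. The shard keeps working inside its last grant and the only cost is rejected growth, which is the bounded loss of efficiency described above.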

4.4 Relationship to today's risk stack

risk-engine2 and the gateway's pre-trade checks remain the system of record for the current EP3-backed product set; nothing in this RFC changes them. The scenario-vector / budget machinery is new build, and §6's staging is arranged so the margin tier (Stage 4) can be developed and validated against recorded production flow before it gates a single real order, the same signal-only-soak discipline as continuous recon checks.

5. Product mechanics the sharding must support

These are why D3 is by-underlier and non-negotiable:

6. Staged plan

Stages overlap deliberately; each names the benchmark that would change the plan rather than a calendar promise.

Stage 1: Core engine. One deterministic single-threaded matching core per partition, in Rust in the existing workspace: book structures per §2.1, ring-buffer hand-off per §2.2, replay-from-journal recovery. Target <1µs in-engine match latency, 1M orders/sec/core. Plan-changing benchmark: below ~500k orders/sec under realistic book depth, stop and profile cache behavior and allocation before reaching for shards.

Stage 2: Sequenced replicated log + HA. Sequencer, majority-replicated journal (settle the §3.2 table by prototyping both), snapshot+replay recovery, sequenced-multicast market data with re-request and snapshots, and the in-flight reconciliation protocol (§3.3). Exit criterion: kill the leader mid-burst in test and reconcile every in-flight order correctly, repeatedly.

Stage 3: Sharding. Partition by underlier; route by product identifier through per-segment gateways (CME MSGW / Eurex PS-gateway model, strict FIFO per gateway). Pooled shards for the long tail. This is also where ROME's deferred sharding question gets its answer for free: RFQ flow routes to the shard that owns the underlier.
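A minimal sketch of the routing rule, assuming the underlier has already been extracted from the product identifier (the ContractId bit layout is out of scope here). Hot underliers get an explicit dedicated shard; the long tail hashes onto a pooled set. The `ShardRouter` type and the use of FNV-1a as a stable hash are assumptions for illustration.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit: a simple hash whose output is stable across builds,
/// so every gateway maps an underlier to the same pooled shard.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

pub struct ShardRouter {
    dedicated: HashMap<String, u32>, // hot underlier -> its own shard
    pooled: Vec<u32>,                // shards shared by the long tail
}

impl ShardRouter {
    pub fn new(dedicated: HashMap<String, u32>, pooled: Vec<u32>) -> Self {
        ShardRouter { dedicated, pooled }
    }

    /// Every series and every RFQ on one underlier lands on the same shard.
    pub fn shard_for(&self, underlier: &str) -> Option<u32> {
        if let Some(&shard) = self.dedicated.get(underlier) {
            return Some(shard);
        }
        if self.pooled.is_empty() {
            return None;
        }
        let slot = (fnv1a64(underlier.as_bytes()) % self.pooled.len() as u64) as usize;
        Some(self.pooled[slot])
    }
}
```

Modulo hashing reshuffles the long tail whenever the pool size changes, so a production router would more plausibly use an explicit assignment table or consistent hashing. The sketch only shows the invariant that matters for margin: one underlier, one shard.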

Stage 4: Tiered portfolio margin. Hot-path scenario-vector check + per-shard budgets; central aggregator with full revaluation and budget rebalancing; liquidation engine. Soak signal-only against recorded/production flow before it gates orders. Plan-changing threshold: if the conservative estimate's false-rejection rate is material to revenue, tighten revaluation cadence or give hot underliers dedicated shards with richer local scenario sets before contemplating abandoning pre-trade checks.

Stage 5: Hardware acceleration (optional, last). FPGA pre-trade risk gateways and feed handlers if and only if latency demands it; GPUs only in the Stage-4 async tier. Never GPU matching.

7. Alternatives considered

8. Non-goals

9. Open questions

  1. Aeron vs. in-house journal+replication (§3.2). Aeron buys proven Raft at the cost of a foreign runtime in the most critical path and an immature Rust story; in-house buys ownership at the cost of building consensus-adjacent machinery. Settle by Stage-2 prototype, not debate.
  2. Budget rebalancing cadence and sizing policy: how much headroom per shard per account, rebalance frequency vs. capital efficiency. Needs modeling against real flow.
  3. Where the liquidation engine sits relative to today's risk-engine2/settlement-engine boundaries.
  4. Snapshot cadence vs. replay-time budget for shard recovery SLOs.
  5. Tick-ladder vs. tree per product class: measure with our actual tick tables.

10. Evidence caveats

Numbers above are planning figures from a survey, not measurements of our code: the ~1KB/book footprint is order-of-magnitude only; QuantCup's original winning score is unrecoverable (figures are from reimplementations); vendor GPU claims beyond J.P. Morgan's 40x and Cboe Hanweck's "millions of valuations/sec" were excluded as unsourced; OPRA peak rates mix 1-second and 1-millisecond microburst timescales (microbursts run roughly an order of magnitude hotter). Re-derive every capacity number empirically at each stage gate before it drives fleet sizing or spend.