RFC: In-House Matching Engine and Portfolio Margin

This RFC distills a survey of production matching-engine and margin-system architecture (LMAX, NASDAQ INET/Genium, CME Globex, Eurex T7, Coinbase, OCC TIMS, CME SPAN/SPAN 2, Eurex Prisma, Binance/Deribit portfolio margin) into the design for that engine. Two prior RFCs set the stage: ContractId reforms make listing a dated contract pure data and reserve the bit layout for options, and ROME deferred cross-engine sharding as a v2 concern. Both decisions collide with the same question — what does AX matching look like at options-chain scale, with portfolio margin? — and this RFC answers it before any code forces the answer.

One logical order book runs on one thread. Every leading venue runs a single-threaded, deterministic matching core per partition; throughput comes from sharding instruments across engines, never from parallelizing a book. A single core sustains roughly 1–6M orders/sec (LMAX's documented figure is 6M on one thread).
Pre-trade portfolio margin cannot be a synchronous full-portfolio revaluation. Real systems precompute SPAN-style scenario arrays so the hot-path check is a table lookup / incremental dot product against a cached per-account loss vector, backed by an asynchronous authoritative revaluation and a liquidation engine.
GPUs do not belong in matching. Branch divergence, PCIe latency, and strict ordering make them strictly worse than a pinned CPU core. They are legitimately useful only in the asynchronous risk tier (J.P. Morgan's documented 40x on Monte Carlo risk; Cboe's GPU-based Hanweck/Volera options-analytics engine). The realistic hardware-acceleration path for matching-adjacent work is FPGA (pre-trade risk gateways, feed handlers), and only much later.

1. Decisions

D1 — Deterministic single-threaded core per shard. Each shard is a single-writer state machine: all I/O, journaling, and replication live on surrounding threads (Disruptor-style ring buffers); matching itself is one thread, in-memory, no locks, no allocation on the hot path. Determinism is not a performance trick — it is what makes replay-based recovery and replication correct.
D2 — Sequenced, replicated input log in front of every shard. A sequencer assigns a total order to commands and replicates the log to a majority (Raft à la Aeron Cluster, or Chronicle-style journaling with in-house replication — see §3.2). Recovery is deterministic replay plus periodic snapshots. Kafka/Redpanda are downstream/analytics only, never the hot path.
D3 — Shard by underlier, not by series. One shard owns an underlier's entire chain and curve: all expiries, all strikes, all combos. This is the Eurex/CME pattern, and it is forced by the products themselves — implied spread matching, atomic mass quotes, mass cancels, purge ports, and kill switches are all cross-instrument within an underlier. It also localizes the dominant margin offsets (vertical/calendar spreads) to a single shard's risk budget. The long tail of illiquid underliers shares pooled shards.
D4 — Tiered portfolio margin. A fast conservative approximation in the synchronous path (precomputed per-instrument scenario vectors + cached per-account loss vector + Greeks-based marginal impact), an asynchronous authoritative full scenario/VaR revaluation that periodically re-anchors the caches, and a liquidation engine as the backstop. Admission is conservative-then-true-up, never exact-and-slow.
D5 — Pre-allocated per-shard risk budgets; no synchronous cross-shard calls. Each shard admits orders against a locally held budget. The central risk aggregator owns the authoritative portfolio view, rebalances budgets asynchronously, and grants cross-underlier offset credit. The hot path never waits on another shard or on the aggregator.
D6 — No GPUs in matching; GPUs permitted in the async risk tier; FPGA is the only matching-adjacent acceleration path, deferred.
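The recovery claim behind D1 and D2 can be illustrated with a toy state machine. This is a minimal sketch, not the engine design: `Command`, `ShardState`, and `recover` are hypothetical names, and the "book" is reduced to a map of resting quantity per order. What it shows is the property the decisions rely on: because `apply` depends only on the prior state and the sequenced command, a snapshot plus replay of the log suffix reaches the same state as a full replay.

```rust
use std::collections::BTreeMap;

/// Hypothetical sequenced command; a real one would carry full order detail.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Command {
    Add { order_id: u64, qty: u64 },
    Cancel { order_id: u64 },
}

/// Toy shard state: resting quantity per order id.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ShardState {
    pub last_seq: u64,
    pub resting: BTreeMap<u64, u64>,
}

impl ShardState {
    /// Apply one command. The result depends only on (state, seq, cmd):
    /// no clocks, no randomness, no I/O.
    pub fn apply(&mut self, seq: u64, cmd: Command) {
        assert_eq!(seq, self.last_seq + 1, "input log must be gap-free");
        match cmd {
            Command::Add { order_id, qty } => {
                self.resting.insert(order_id, qty);
            }
            Command::Cancel { order_id } => {
                self.resting.remove(&order_id);
            }
        }
        self.last_seq = seq;
    }
}

/// Recover a shard: start from a snapshot and replay every later log entry.
pub fn recover(snapshot: ShardState, log: &[(u64, Command)]) -> ShardState {
    let mut state = snapshot;
    for &(seq, cmd) in log {
        if seq > state.last_seq {
            state.apply(seq, cmd);
        }
    }
    state
}
```

Calling `recover(ShardState::default(), &log)` and calling `recover` on a mid-log snapshot with the same log should yield equal states; a replica that applies the same log is identical for the same reason.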

2. Matching core

2.1 Order book data structures

Per side: a price-level structure sorted by price. Two viable shapes: a B-tree of Limit nodes (handles sparse/wide price domains), or a flat tick-indexed ladder (O(1) access; the QuantCup-winning design, where cancel is a single store). Choose per product class: ladders for bounded tick domains (most listed derivatives), trees as the general fallback.
Per price level: an intrusive doubly-linked FIFO of Order nodes — price-time priority with O(1) insert/remove.
Order ID → node map for O(1) cancel/amend.
Arena/pool allocation for all of the above. No allocator traffic and no pointer-chasing across cache lines on the hot path; cache-line-aware layout; pre-sized at shard start.
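The tick-ladder variant of these structures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the chosen layout: `SideLadder`, `Node`, and `Level` are hypothetical names, arena slots are addressed by `u32` index rather than pointer, and `HashMap` stands in for whatever pre-sized, allocation-free id map the engine ends up using. It shows one side of one book with O(1) insert at the tail of a level and O(1) cancel by unlinking.

```rust
use std::collections::HashMap;

const NIL: u32 = u32::MAX; // sentinel for "no node"

#[derive(Clone, Copy)]
struct Node {
    order_id: u64,
    qty: u64,
    tick: u32,
    prev: u32, // intrusive links: indices into the arena
    next: u32,
}

#[derive(Clone, Copy)]
struct Level {
    head: u32, // oldest order at this price (first to match)
    tail: u32, // newest order at this price
}

pub struct SideLadder {
    levels: Vec<Level>,       // flat ladder indexed by tick
    nodes: Vec<Node>,         // arena, sized once at shard start
    free: Vec<u32>,           // free slots in the arena
    index: HashMap<u64, u32>, // order id -> arena slot (stand-in map)
}

impl SideLadder {
    pub fn new(ticks: usize, capacity: usize) -> Self {
        SideLadder {
            levels: vec![Level { head: NIL, tail: NIL }; ticks],
            nodes: vec![Node { order_id: 0, qty: 0, tick: 0, prev: NIL, next: NIL }; capacity],
            free: (0..capacity as u32).rev().collect(),
            index: HashMap::with_capacity(capacity),
        }
    }

    /// Append to the FIFO at `tick`. Returns false on a bad tick, a duplicate id, or a full arena.
    pub fn insert(&mut self, order_id: u64, tick: u32, qty: u64) -> bool {
        let t = tick as usize;
        if t >= self.levels.len() || self.index.contains_key(&order_id) {
            return false;
        }
        let slot = match self.free.pop() {
            Some(s) => s,
            None => return false,
        };
        let tail = self.levels[t].tail;
        self.nodes[slot as usize] = Node { order_id, qty, tick, prev: tail, next: NIL };
        if tail == NIL {
            self.levels[t].head = slot;
        } else {
            self.nodes[tail as usize].next = slot;
        }
        self.levels[t].tail = slot;
        self.index.insert(order_id, slot);
        true
    }

    /// Unlink an order from its level and return its slot to the arena.
    pub fn cancel(&mut self, order_id: u64) -> bool {
        let slot = match self.index.remove(&order_id) {
            Some(s) => s,
            None => return false,
        };
        let Node { tick, prev, next, .. } = self.nodes[slot as usize];
        let t = tick as usize;
        if prev == NIL {
            self.levels[t].head = next;
        } else {
            self.nodes[prev as usize].next = next;
        }
        if next == NIL {
            self.levels[t].tail = prev;
        } else {
            self.nodes[next as usize].prev = prev;
        }
        self.free.push(slot);
        true
    }

    /// Order with time priority at `tick`, as (order id, quantity).
    pub fn front(&self, tick: u32) -> Option<(u64, u64)> {
        let head = self.levels.get(tick as usize)?.head;
        if head == NIL {
            None
        } else {
            let n = &self.nodes[head as usize];
            Some((n.order_id, n.qty))
        }
    }
}
```

Best-price tracking, matching, and the tree fallback for sparse price domains are deliberately left out; the open question in §9 on ladder vs. tree still has to be settled by measurement.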

2.2 Threading and hand-off

The shard process is the LMAX shape: input unmarshalling, journaling, replication, and output marshalling each on their own pinned threads, communicating through pre-allocated ring buffers (the Disruptor paper documents 25M+ msgs/sec and ~52ns mean inter-thread hop); the matching thread is pinned to an isolated core, busy-polling. Standard mechanical sympathy applies: cache-line padding against false sharing, single-writer principle, hyper-threading off on engine hosts, NIC-NUMA-local processing, and (when we get there) kernel bypass on the gateway path — the kernel stack costs ~20–50µs per transit; Onload/DPDK-class bypass cuts that to ~1–5µs.
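The hand-off between a surrounding thread and the matching thread can be sketched as a single-producer, single-consumer ring. This is a minimal sketch in the spirit of the Disruptor pattern, not a port of it: `spsc`, `Producer`, `Consumer`, and the 64-byte cache-line size are assumptions, and messages are restricted to `Copy` types to keep the unsafe surface small. It shows the three properties the paragraph names: pre-allocated slots, one writer per counter, and padding so the two counters do not share a cache line.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Gives each counter its own cache line (64 bytes assumed) to avoid false sharing.
#[repr(align(64))]
struct Padded(AtomicUsize);

struct Ring<T> {
    slots: Box<[UnsafeCell<T>]>,
    mask: usize,
    head: Padded, // next slot to read; written only by the consumer
    tail: Padded, // next slot to write; written only by the producer
}

// Sound because Producer and Consumer are not Clone and take &mut self,
// so a given slot is accessed by at most one thread at a time.
unsafe impl<T: Send> Sync for Ring<T> {}

pub struct Producer<T>(Arc<Ring<T>>);
pub struct Consumer<T>(Arc<Ring<T>>);

/// Pre-allocates `capacity` slots (must be a power of two) and returns both ends.
pub fn spsc<T: Copy + Default>(capacity: usize) -> (Producer<T>, Consumer<T>) {
    assert!(capacity.is_power_of_two());
    let ring = Arc::new(Ring {
        slots: (0..capacity).map(|_| UnsafeCell::new(T::default())).collect(),
        mask: capacity - 1,
        head: Padded(AtomicUsize::new(0)),
        tail: Padded(AtomicUsize::new(0)),
    });
    (Producer(Arc::clone(&ring)), Consumer(ring))
}

impl<T: Copy> Producer<T> {
    /// Returns false when the ring is full; the caller busy-polls or sheds load.
    pub fn push(&mut self, value: T) -> bool {
        let ring = &*self.0;
        let tail = ring.tail.0.load(Ordering::Relaxed);
        let head = ring.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) > ring.mask {
            return false;
        }
        // The slot is free: the consumer has already published a head past it.
        unsafe { *ring.slots[tail & ring.mask].get() = value };
        ring.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        true
    }
}

impl<T: Copy> Consumer<T> {
    /// Returns None when the ring is empty.
    pub fn pop(&mut self) -> Option<T> {
        let ring = &*self.0;
        let head = ring.head.0.load(Ordering::Relaxed);
        let tail = ring.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        // The slot is published: the producer's Release store of tail covers the write.
        let value = unsafe { *ring.slots[head & ring.mask].get() };
        ring.head.0.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```

Thread pinning, busy-poll back-off, and batching are outside this sketch; the latency and throughput figures quoted above are the Disruptor paper's, not measurements of this code.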

2.3 Capacity per shard

The binding constraints at scale are memory and market-data fan-out, not match throughput. A near-empty book costs on the order of ~1KB (planning figure — measure before fleet-sizing), so 1M books ≈ 1GB: one host comfortably holds 1M markets. What one core cannot do is keep millions of active books cache-warm or absorb options-chain quote traffic for many hot underliers. Planning numbers:

3. Sequencing, replication, recovery

3.1 The pipeline

Gateway (existing order-gateway surface, eventually kernel-bypass) → sequencer (assigns global order within the shard, writes the replicated journal) → matching core → outputs (trade reports, drop copy, sequenced market data). Market data disseminates MoldUDP64-style: sequenced multicast with gap detection, a re-request server, and a snapshot service for late join — the NASDAQ ITCH/GLIMPSE pattern.
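On the receiving side, MoldUDP64-style gap detection reduces to comparing each packet header against the next expected sequence number. The sketch below assumes the published MoldUDP64 downstream header layout (10-byte session, 8-byte big-endian sequence number of the first message, 2-byte big-endian message count, with a count of 0xFFFF marking end of session); `GapDetector` and `Action` are hypothetical names, and the re-request and snapshot paths are left to the caller.

```rust
use std::convert::TryInto;

pub const HEADER_LEN: usize = 20;
pub const END_OF_SESSION: u16 = 0xFFFF;

#[derive(Debug)]
pub struct Header {
    pub session: [u8; 10],
    pub sequence: u64, // sequence number of the first message in the packet
    pub count: u16,    // number of messages in the packet (0 = heartbeat)
}

pub fn parse_header(packet: &[u8]) -> Option<Header> {
    if packet.len() < HEADER_LEN {
        return None;
    }
    let mut session = [0u8; 10];
    session.copy_from_slice(&packet[..10]);
    let sequence = u64::from_be_bytes(packet[10..18].try_into().ok()?);
    let count = u16::from_be_bytes(packet[18..20].try_into().ok()?);
    Some(Header { session, sequence, count })
}

#[derive(Debug, PartialEq)]
pub enum Action {
    /// Deliver the packet's messages, skipping the first `skip` already seen.
    Deliver { skip: u64 },
    /// Heartbeat or pure duplicate: nothing to deliver.
    NothingNew,
    /// Messages `from .. from + count` are missing; ask the re-request server.
    Gap { from: u64, count: u64 },
    EndOfSession,
}

pub struct GapDetector {
    next_expected: u64,
}

impl GapDetector {
    pub fn new(first_sequence: u64) -> Self {
        GapDetector { next_expected: first_sequence }
    }

    pub fn on_packet(&mut self, h: &Header) -> Action {
        if h.sequence > self.next_expected {
            // Do not advance: the caller recovers the gap, then replays this packet.
            return Action::Gap {
                from: self.next_expected,
                count: h.sequence - self.next_expected,
            };
        }
        if h.count == END_OF_SESSION {
            return Action::EndOfSession;
        }
        let end = h.sequence + u64::from(h.count); // one past the last message
        if end <= self.next_expected {
            return Action::NothingNew;
        }
        let skip = self.next_expected - h.sequence;
        self.next_expected = end;
        Action::Deliver { skip }
    }
}
```

A late joiner would seed `GapDetector::new` from the snapshot service's sequence number rather than from 1; that wiring is not shown.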

3.2 Log technology

Criterion	Aeron Cluster (Raft)	Chronicle-style journal + in-house replication
Consensus/HA	Integrated (leader election, majority ack)	Build it ourselves
Latency	µs-class; used by CME, Coinbase, Man Group	Single-digit µs write-to-read
Rust story	Bindings/port required (Aeron is C/Java native)	Memory-mapped journal is straightforward to own in Rust
Risk	Foreign runtime in the most critical path	We own a consensus-adjacent protocol

3.3 Failover is a client-protocol problem too

Raft-style failover can lose the in-flight message at the instant of leader loss — Aeron's docs say so explicitly. The client protocol must therefore support post-failover reconciliation of in-flight order state from day one. This is a correctness requirement, not an optimization, and it lands in Stage 2, not later. (House rule applies with force here: disconnect, server-initiated disconnect, crash/restart, and reconnect/recovery paths are all first-class test surfaces before anything goes near production.)
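One possible shape for the client-side reconciliation step is sketched below. Everything here is hypothetical: the RFC does not define the protocol, so the sketch assumes that after reconnect the new leader can report which client order ids it has sequenced, and that the engine deduplicates on client order id so a resend is safe.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Resolution {
    /// The engine sequenced the order; the client adopts the engine's state for it.
    Sequenced,
    /// The engine never saw the order; the client may resend under the same id.
    Lost,
}

/// Classify every order that was unacknowledged at the moment of failover.
/// `sequenced` is the (assumed) post-failover report from the new leader.
pub fn reconcile(in_flight: &[u64], sequenced: &HashSet<u64>) -> Vec<(u64, Resolution)> {
    in_flight
        .iter()
        .map(|&id| {
            let resolution = if sequenced.contains(&id) {
                Resolution::Sequenced
            } else {
                Resolution::Lost
            };
            (id, resolution)
        })
        .collect()
}
```

The point of the sketch is only that the client must hold its in-flight set until the server resolves it; the wire format and the lookup window are Stage 2 design work.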

4. Portfolio margin

4.1 Methodology family

SPAN (legacy CME): 16 scenarios per instrument (price × vol shocks), published as risk arrays a few times daily; account margin is worst-case scenario loss with inter-commodity spread credits. Per-position margin is a table lookup — no pricing model needed at evaluation time.
OCC TIMS: ±price-move grid priced with binomial/Black-Scholes, aggregated class group → product group → portfolio group with offsets at each level.
SPAN 2 / Eurex Prisma: full historical VaR, thousands of scenarios, parameter files in the tens of GB — decisively a periodic batch computation, never per-order.
Crypto venues: Deribit-style ~21–23-scenario matrices per currency; Binance uniMMR is a continuously evaluated equity/maintenance-margin ratio with liquidation at a threshold, not a per-order blocking check.
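The SPAN-style table lookup can be made concrete with a small sketch. It assumes the usual risk-array sign convention (each entry is the loss on one long contract under that scenario, so gains are negative) and covers only the scan-risk step; inter-commodity spread credits and the other SPAN charges are omitted. `RiskArray` and `scan_risk` are illustrative names.

```rust
pub const SCENARIOS: usize = 16;

/// Loss per long contract under each scenario (negative = gain), in currency units.
pub type RiskArray = [i64; SCENARIOS];

/// Worst-case portfolio loss across scenarios, floored at zero.
/// Each position is (signed quantity, risk array of its instrument).
pub fn scan_risk(positions: &[(i64, RiskArray)]) -> i64 {
    let mut total = [0i64; SCENARIOS];
    for (qty, array) in positions {
        for (t, loss) in total.iter_mut().zip(array.iter()) {
            *t += qty * loss;
        }
    }
    total.iter().copied().max().unwrap_or(0).max(0)
}
```

No pricing model runs here: offsets within the portfolio appear because gains and losses are summed per scenario before the maximum is taken.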

4.2 The tiered architecture

Hot-path tier (synchronous): each instrument carries a precomputed scenario P&L vector (SPAN-style risk array), refreshed by the risk tier on its own cadence.
Each account carries a cached scenario-loss vector per shard. An incoming order's marginal margin impact is incremental vector math — add the order's per-scenario contribution, take the new worst case — plus a Greeks-based first-order term for anything the grid doesn't capture.
The result is checked against the shard's local risk budget for that account (D5). Admit or reject; never call out of the shard.
Binance's own PM order check has exactly this shape (new order cost = increase in IM + order's open loss against a virtual available balance) — the pattern is proven at retail-crypto scale, and we tighten it to HFT-grade by keeping it in-process.
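The incremental check described above can be sketched in a few lines. This is a simplified sketch under stated assumptions: vectors hold per-scenario losses (positive = loss), the order is assumed to fill in full, and the Greeks-based first-order term and open-order exposure are left out. `AccountRisk`, `margin_if_filled`, and `admit` are hypothetical names.

```rust
pub const SCENARIOS: usize = 16;
pub type ScenarioVec = [i64; SCENARIOS];

pub struct AccountRisk {
    /// Cached per-scenario loss of the account's positions on this shard.
    pub loss: ScenarioVec,
    /// Risk budget held locally by the shard for this account (D5).
    pub budget: i64,
}

impl AccountRisk {
    /// Worst-case loss if the order fills in full; the cache is not modified.
    pub fn margin_if_filled(&self, instrument: &ScenarioVec, signed_qty: i64) -> i64 {
        let mut worst = 0i64;
        for (cached, per_unit) in self.loss.iter().zip(instrument) {
            worst = worst.max(cached + signed_qty * per_unit);
        }
        worst
    }

    /// Admission decision: purely local, no call out of the shard.
    pub fn admit(&self, instrument: &ScenarioVec, signed_qty: i64) -> bool {
        self.margin_if_filled(instrument, signed_qty) <= self.budget
    }

    /// Fold a fill into the cached vector.
    pub fn on_fill(&mut self, instrument: &ScenarioVec, signed_qty: i64) {
        for (cached, per_unit) in self.loss.iter_mut().zip(instrument) {
            *cached += signed_qty * per_unit;
        }
    }
}
```

The per-order cost is one pass over the scenario vector, which is what keeps the check in-process; the asynchronous tier would overwrite `loss` and `budget` when it re-anchors.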

Authoritative tier (asynchronous): a central risk aggregator continuously runs the full scenario/VaR revaluation across the whole portfolio, granting cross-underlier offset credit, re-anchoring the per-account cached vectors (correcting Taylor-approximation drift), and rebalancing per-shard budgets. This is the one place GPU acceleration is justified, and only once scenario counts make CPUs the bottleneck.

Backstop: a liquidation engine keyed off the authoritative tier. The hot path is conservative by design and will occasionally over-margin hedged portfolios; the async tier corrects within its cadence; if an account is genuinely under-margined despite admission control, liquidation — not order blocking — is the safety mechanism. All three layers exist or the design is unsound: admission control without true-up drifts, true-up without liquidation has no teeth.

4.3 Why this fits the sharding ("homologation")

Because options on one underlier are mathematically related, risk must aggregate at the underlier level even if series were spread across engines. Sharding by underlier (D3) means the dominant offsets — verticals, calendars, covered structures — are local to one shard's budget and priced correctly in the synchronous path. Only the smaller cross-underlier correlation credit lives exclusively in the central aggregator, and it is exactly the part that tolerates asynchrony. This is the formalized version of SPAN's "combined commodity" and TIMS's class-group hierarchy; the survey's conclusion is that mature exchanges already operate this way (Eurex pins all products of one underlier to one partition), so the idea needs no novelty budget.

4.4 Relationship to today's risk stack

risk-engine2 and the gateway's pre-trade checks remain the system of record for the current EP3-backed product set; nothing in this RFC changes them. The scenario-vector / budget machinery is new build, and §6's staging is arranged so the margin tier (Stage 4) can be developed and validated against recorded production flow before it gates a single real order — the same signal-only-soak discipline as continuous recon checks.

5. Product mechanics the sharding must support

Implied spread matching (CME Globex pattern): calendar spreads match against outright legs and vice versa, with lot predetermination across sources before per-instrument allocation. Inherently cross-instrument within a curve.
Mass quoting: options market makers stream quote volumes that dwarf equities (per Cboe, OPRA data rates are >60x equities). Atomic quote-set updates and self-match prevention must be engine-local.
Purge ports and kill switches: a dedicated cancel-only fast path so cancels are never queued behind order traffic in a volatility spike; with no filter, a purge is a firm-level kill switch. Cboe recommends ≥2 purge ports per venue. These only work atomically if the whole chain is on one engine.
Combo/complex books: multi-leg orders match against an implied combo book and the leg markets, requiring race-free atomic multi-leg execution — again engine-local.
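The cross-instrument nature of implied matching is visible even in the simplest case, an implied-in calendar spread quote derived from two outright books. The sketch assumes the convention that buying the spread means buying the near leg and selling the far leg (conventions vary by product), prices in integer ticks, and top-of-book only; `Quote` and `implied_spread` are illustrative names, and lot predetermination across multiple implied sources is not modeled.

```rust
/// Top of book for one instrument: (price in ticks, quantity) per side.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Quote {
    pub bid: Option<(i64, u64)>,
    pub ask: Option<(i64, u64)>,
}

/// Implied-in quote for the calendar spread (near minus far).
/// A spread seller sells near and buys far, so the implied spread bid is
/// the near bid minus the far ask; the implied ask is the mirror image.
pub fn implied_spread(near: Quote, far: Quote) -> Quote {
    let bid = match (near.bid, far.ask) {
        (Some((near_bid, nq)), Some((far_ask, fq))) => Some((near_bid - far_ask, nq.min(fq))),
        _ => None,
    };
    let ask = match (near.ask, far.bid) {
        (Some((near_ask, nq)), Some((far_bid, fq))) => Some((near_ask - far_bid, nq.min(fq))),
        _ => None,
    };
    Quote { bid, ask }
}
```

Both legs must be read and updated atomically for the implied quantity to be real, which is why the near book, the far book, and the spread book need to live on the same engine.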

6. Staged plan

Stage 1 — Core engine. One deterministic single-threaded matching core per partition, in Rust in the existing workspace: book structures per §2.1, ring-buffer hand-off per §2.2, replay-from-journal recovery. Target <1µs in-engine match latency, ≥1M orders/sec/core. Plan-changing benchmark: below ~500k orders/sec under realistic book depth, stop and profile cache behavior and allocation before reaching for shards.

Stage 4 — Tiered portfolio margin. Hot-path scenario-vector check + per-shard budgets; central aggregator with full revaluation and budget rebalancing; liquidation engine. Soak signal-only against recorded/production flow before it gates orders. Plan-changing threshold: if the conservative estimate's false-rejection rate is material to revenue, tighten revaluation cadence or give hot underliers dedicated shards with richer local scenario sets — before contemplating abandoning pre-trade checks.

7. Alternatives considered

GPU-parallel matching. Rejected. Warp-lockstep execution serializes divergent branches; matching is branchy, sequential (price-time priority is a total order), and latency-oriented; PCIe transfers sit in the hot path. No production exchange does this, for these reasons.
Synchronous full-portfolio revaluation per order. Rejected. SPAN 2/Prisma-class revaluation is thousands of scenarios over tens of GB of parameters — a periodic batch job by construction. Every real system that does pre-trade portfolio checks does them via precomputed arrays/marginal impact.
One mega-engine for all instruments. Rejected. Even at 6M orders/sec/core, a single engine is a fan-out, cache-residency, and blast-radius problem long before it is a throughput problem; OPRA-scale options traffic is handled in production only by partitioning.
Synchronous cross-shard margin calls. Rejected. Any design where order admission awaits another shard or a central service reintroduces the latency and availability coupling that sharding exists to remove. Budgets + async rebalance is the standard credit-allocation answer.
Kafka/Redpanda as the sequenced input log. Rejected for the hot path (ms-class). Fine downstream for analytics/drop-copy fan-out.

8. Non-goals

The EP3 cutover plan. The engine gets built regardless; when and how it takes over live flow from EP3 — migration sequencing, parallel-run period, rollback posture — is a commercial-and-operational decision for its own RFC.
Choosing the margin methodology (scenario grid shape, shock sizes, offsets, regulatory regime). §4 is the architecture for whichever methodology risk policy selects. Note the regulatory caveat: TIMS/SPAN have specific standing (SEC 15c3-5, portfolio-margin rules); a crypto-style uniMMR clone does not automatically satisfy securities/futures regulation.
Colocation/proximity hosting, fee schedules, market-maker programs.
ROME v2 sharding details beyond the routing consequence noted in Stage 3.

9. Open questions

Aeron vs. in-house journal+replication (§3.2). Aeron buys proven Raft at the cost of a foreign runtime in the most critical path and an immature Rust story; in-house buys ownership at the cost of building consensus-adjacent machinery. Settle by Stage-2 prototype, not debate.
Budget rebalancing cadence and sizing policy — how much headroom per shard per account, rebalance frequency vs. capital efficiency. Needs modeling against real flow.
Where the liquidation engine sits relative to today's risk-engine2/settlement-engine boundaries.
Snapshot cadence vs. replay-time budget for shard recovery SLOs.
Tick-ladder vs. tree per product class — measure with our actual tick tables.

10. Evidence caveats

Numbers above are planning figures from a survey, not measurements of our code: the ~1KB/book footprint is order-of-magnitude only; QuantCup's original winning score is unrecoverable (figures are from reimplementations); vendor GPU claims beyond J.P. Morgan's 40x and Cboe Hanweck's "millions of valuations/sec" were excluded as unsourced; OPRA peak rates mix 1-second and 1-millisecond microburst timescales (microbursts run roughly an order of magnitude hotter). Re-derive every capacity number empirically at each stage gate before it drives fleet sizing or spend.