RFC: Battery Testing Against ax-demo

Date: 2026-04-10

Background

admin-cli test ep3-cancel-replace-discovery (rs/admin-cli/src/ep3_cancel_replace_discovery.rs) is a "space probe" that runs 14 sequential scenarios against a real EP3 environment to characterize cancel-replace semantics. It clears the market, runs a specific action pattern, and pretty-prints execution timelines for a human to inspect. It has served well as a discovery tool for one narrow surface area.

It has three limitations that keep it from being a general exchange test battery:

Private API only. It drives EP3 admin gRPC directly, not the public ax-exchange-sdk WS/REST surface that customers actually hit.
No structured assertions. Pass/fail is "did anything throw." Humans read the output to judge correctness.
Sleep-based concurrency. settle_ms between operations is the only sync primitive: flaky under load, wasteful when idle, and prone to hiding timing bugs.

This RFC proposes a successor: an admin-cli test battery tool that runs a scenario catalog against our public API surface, executes against ax-demo (or any environment) from a dev laptop or CI, and produces structured pass/fail plus timeline artifacts. It also proposes the minimum changes to ax-demo needed to make the battery safe to run against live demo infrastructure, and an offline coverage-report pipeline that doesn't perturb ax-demo at all.

Goals

Structured scenarios covering order lifecycle, cancel-replace, self-trade prevention, margin/risk rejects, order flags (IOC/FOK/post-only), book priority, reconnect/resume, drop-copy parity, market-data vs order-gateway consistency, duplicate clord_id, rate limits, reject reasons.
Event-driven harness — assertions run against recorded WS event streams, not polled state. Kills most sleep calls, kills most flakes.
Safe against live ax-demo — battery runs cannot interfere with other demo users or markets, and other demo users cannot interfere with battery runs.
Mixed-surface inspection — public SDK for actions (what customers hit), admin client for setup and post-hoc inspection (positions, drop-copy, risk state).
Offline coverage report — given the git SHA ax-demo is running, produce a line/branch coverage report of the battery without instrumenting ax-demo.
Runnable from a laptop — a developer with two API keys should be able to run the battery against ax-demo and get a pass/fail report in minutes.

Non-Goals

Replacing unit tests or property tests. The battery is end-to-end only.
Load testing. Throughput/latency characterization is a separate tool.
Fuzzing. Scenarios are hand-written, not generated.
Instrumenting the live ax-demo binary. Coverage is computed offline at the same git SHA against a local build.
A YAML scenario DSL. Scenarios are plain Rust until we know what primitives we actually need.

Design Overview

Harness

New crate module rs/admin-cli/src/test_battery/:

test_battery/
  mod.rs          — entry point, scenario registry, CLI wiring
  env.rs          — Env: holds public-SDK clients (maker/taker), admin client,
                    sandbox config (symbol, price scale, accounts)
  recorder.rs     — EventRecorder: subscribes to each client's WS stream,
                    records every event with a monotonic timestamp into a
                    per-client timeline
  scenario.rs     — Scenario trait, Outcome struct, assertion helpers
  assertions.rs   — expect_event_within, expect_order_state, expect_no_event_for,
                    expect_fill_sequence, wait_for<F>
  report.rs       — JUnit XML + per-scenario markdown timeline emission
  scenarios/
    order_lifecycle.rs
    cancel_replace.rs
    self_trade_prevention.rs
    margin_rejects.rs
    order_flags.rs
    book_priority.rs
    reconnect_resume.rs
    drop_copy_parity.rs
    ...

The Scenario trait:

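use async_trait::async_trait;

/// One end-to-end scenario in the battery catalog.
#[async_trait]
pub trait Scenario: Send + Sync {
    /// Stable identifier used in reports.
    fn name(&self) -> &'static str;
    /// API surfaces this scenario exercises; feeds the coverage registry, below.
    fn surface_tags(&self) -> &[SurfaceTag];
    /// Runs the scenario against the shared environment.
    async fn run(&self, env: &Env) -> Result<Outcome>;
}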

Env holds two OrderGatewayWsClient instances (from ax-exchange-sdk) for maker and taker, plus an admin Ep3Client for inspection and setup. The existing rs/admin-cli/src/replace_order_test.rs already demonstrates the public-SDK WS-client shape — reuse that.

Event-driven sync, not sleep

EventRecorder spawns a task per client that pulls WS events into a timeline. Assertions take the form:

env.maker.place_order(...).await?;
env.wait_for(Client::Maker, |ev| matches!(ev, OrderAcked { .. }),
             Duration::from_secs(2)).await?;

wait_for awaits the recorded stream until an event matches the predicate or the timeout elapses. There are no sleep(settle_ms) calls in scenarios. Timeouts are per-wait, not per-scenario, so a failure points at the exact wait that timed out.

For scenarios that assert absence of an event (e.g., "no fill should occur"), expect_no_event_for(duration) waits the full window and asserts nothing matched.
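A minimal synchronous sketch of the recorder and the two wait primitives described above. It uses std threads and a Condvar as a stand-in for the async runtime the real harness would use. Event, Timeline, spawn_recorder, and the cursor argument `from` are hypothetical names for illustration, not SDK types. The property it tries to show: wait_for scans events that were already recorded before it starts waiting, so an ack that lands between place_order returning and the wait beginning is not lost.

```rust
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical event type; the real one would come from ax-exchange-sdk.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    OrderAcked { order_id: u64 },
    Fill { order_id: u64, qty: u32 },
}

/// Per-client timeline: every event is stored with a monotonic timestamp.
#[derive(Default)]
pub struct Timeline {
    events: Mutex<Vec<(Instant, Event)>>,
    arrived: Condvar,
}

impl Timeline {
    /// Appends one event and wakes any waiter.
    pub fn record(&self, ev: Event) {
        self.events.lock().unwrap().push((Instant::now(), ev));
        self.arrived.notify_all();
    }

    /// Returns the first event at index >= `from` that matches `pred`,
    /// or None once `timeout` has elapsed. Recorded events are scanned
    /// before waiting, so nothing that arrived early is missed.
    pub fn wait_for<F>(&self, from: usize, pred: F, timeout: Duration) -> Option<(usize, Event)>
    where
        F: Fn(&Event) -> bool,
    {
        let deadline = Instant::now() + timeout;
        let mut events = self.events.lock().unwrap();
        let mut next = from;
        loop {
            while next < events.len() {
                if pred(&events[next].1) {
                    return Some((next, events[next].1.clone()));
                }
                next += 1;
            }
            let remaining = deadline.saturating_duration_since(Instant::now());
            if remaining.is_zero() {
                return None;
            }
            // Looping also absorbs spurious wakeups.
            events = self.arrived.wait_timeout(events, remaining).unwrap().0;
        }
    }

    /// Ok if nothing matched during `window`; Err with the offending event
    /// otherwise. Success always takes the full window; a violation
    /// returns as soon as it is seen.
    pub fn expect_no_event_for<F>(&self, from: usize, pred: F, window: Duration) -> Result<(), Event>
    where
        F: Fn(&Event) -> bool,
    {
        match self.wait_for(from, pred, window) {
            Some((_, ev)) => Err(ev),
            None => Ok(()),
        }
    }
}

/// Stand-in for the per-client recorder task: drains a channel (in place
/// of the WS stream) into the timeline until the sender is dropped.
pub fn spawn_recorder(stream: Receiver<Event>, timeline: Arc<Timeline>) -> JoinHandle<()> {
    thread::spawn(move || {
        for ev in stream {
            timeline.record(ev);
        }
    })
}
```

The index returned by wait_for can be fed back as `from + 1` for the next wait, which keeps consecutive waits ordered without re-matching an earlier event.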

Scenario structure

Each scenario follows:

setup   — sandbox ACLs guarantee empty markets, so setup is mostly
          price/qty parameterization
run     — a sequence of actions interleaved with wait_for / expect_*
teardown — cancel_all as a defensive measure; harness verifies markets
          are clean before marking the scenario passed

Outcome carries events (per-client recorded timeline), assertions (pass/fail list with source locations), and timing (wall clock + per-wait latencies). The report emitter consumes Outcome to produce JUnit XML for CI and a markdown timeline per scenario for humans.
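One possible shape for Outcome, sketched under assumptions: the field names, the string-rendered events, and the check and to_junit_testcase helpers are all hypothetical. It shows how #[track_caller] can capture the source location of each assertion without macros, and how failed assertions might be folded into a single JUnit failure element.

```rust
use std::panic::Location;
use std::time::Duration;

/// One recorded assertion plus the call site that made it.
#[derive(Debug)]
pub struct AssertionRecord {
    pub description: String,
    pub passed: bool,
    pub file: &'static str,
    pub line: u32,
}

/// Hypothetical Outcome; events are pre-rendered strings for brevity.
#[derive(Debug, Default)]
pub struct Outcome {
    pub events: Vec<(String, String)>, // (client name, rendered event)
    pub assertions: Vec<AssertionRecord>,
    pub wall_clock: Duration,
    pub wait_latencies: Vec<Duration>,
}

impl Outcome {
    /// Records an assertion; #[track_caller] makes Location::caller()
    /// report the scenario line that called check, not this function.
    #[track_caller]
    pub fn check(&mut self, description: &str, passed: bool) {
        let loc = Location::caller();
        self.assertions.push(AssertionRecord {
            description: description.to_string(),
            passed,
            file: loc.file(),
            line: loc.line(),
        });
    }

    /// True when every recorded assertion passed.
    pub fn passed(&self) -> bool {
        self.assertions.iter().all(|a| a.passed)
    }

    /// Renders one JUnit <testcase>, with a single <failure> element
    /// listing every failed assertion and its source location.
    pub fn to_junit_testcase(&self, name: &str) -> String {
        let failed: Vec<String> = self
            .assertions
            .iter()
            .filter(|a| !a.passed)
            .map(|a| format!("{} ({}:{})", a.description, a.file, a.line))
            .collect();
        let head = format!(
            "<testcase name=\"{}\" time=\"{:.3}\"",
            xml_escape(name),
            self.wall_clock.as_secs_f64()
        );
        if failed.is_empty() {
            format!("{}/>", head)
        } else {
            format!(
                "{}><failure message=\"{} assertion(s) failed\">{}</failure></testcase>",
                head,
                failed.len(),
                xml_escape(&failed.join("\n"))
            )
        }
    }
}

/// Escapes the characters that are unsafe in XML text and attributes.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}
```

Recording failures instead of returning on the first one lets a scenario report every violated expectation in one run; whether the real harness wants that or fail-fast behavior is a design choice this sketch does not settle.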

Sandbox in ax-demo (soft sandbox)

The battery runs against a soft sandbox inside ax-demo: dedicated users, firm, and symbols, isolated by authz rules inside ax-demo itself. Not a separate deployment — the whole point is that the battery exercises the same binaries, config, DB, and gateway real users hit.

Namespace:

Firm: firms/BATTERY/
Users: battery-maker-1, battery-taker-1, battery-self-1 (same user both sides, for self-trade prevention scenarios), battery-crossfirm-1 under a second firm for cross-firm scenarios.
Symbols: BATTERY.* — one of each product type (perp, dated future, option if applicable). The variety matters: if the sandbox only has a perp, the battery will never hit the dated-future branches and coverage will look artificially thin exactly where it matters.

Admin helpers, gated:

force_mark_price(symbol, price) — only accepts BATTERY.* symbols, only callable by admin. Needed for margin-reject scenarios.
force_settlement(symbol) — same gating.

These live in the admin surface but are hard-gated on symbol prefix and caller identity. See safety model below.

ACL & safety model

The only thing protecting real demo markets from a battery bug is the ACL, so it must be fail-closed and doubly gated:

Non-battery users cannot touch BATTERY.* symbols (rejected at the gateway). Prevents humans from polluting battery state.
Battery users cannot touch non-BATTERY.* symbols (rejected at the gateway). Prevents a buggy scenario from scribbling on real demo markets.
Battery-only admin endpoints (force_mark_price, etc.) check both that the caller is a battery user and that the target symbol is a battery symbol. Either check alone is a footgun.

Enforcement lives at the order gateway / risk layer, wherever the existing user→symbol authz check already runs. Confirming where that check lives is a prerequisite to this RFC landing.
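The three rules above can be sketched as two pure functions. This is illustrative only: the Caller classification, the Decision type, and the function names are assumptions, and how ax-demo actually identifies a battery caller is one of the open questions. The sketch treats an unclassifiable caller as a reject, which is one reading of "fail-closed".

```rust
const BATTERY_PREFIX: &str = "BATTERY.";

/// Hypothetical caller classification produced by the authn layer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Caller {
    Battery,
    Regular,
    /// Classification failed or is missing; always rejected.
    Unknown,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Reject(&'static str),
}

/// True only for "BATTERY." followed by a non-empty remainder.
pub fn is_battery_symbol(symbol: &str) -> bool {
    symbol
        .strip_prefix(BATTERY_PREFIX)
        .map_or(false, |rest| !rest.is_empty())
}

/// Order-path rule: battery callers and battery symbols must match.
pub fn authorize_order(caller: Caller, symbol: &str) -> Decision {
    match (caller, is_battery_symbol(symbol)) {
        (Caller::Battery, true) | (Caller::Regular, false) => Decision::Allow,
        (Caller::Battery, false) => Decision::Reject("battery user on non-battery symbol"),
        (Caller::Regular, true) => Decision::Reject("non-battery user on battery symbol"),
        (Caller::Unknown, _) => Decision::Reject("caller could not be classified"),
    }
}

/// Admin-helper rule (force_mark_price and friends): both gates must
/// pass independently, so a bug in one check is caught by the other.
pub fn authorize_battery_admin(caller: Caller, symbol: &str) -> Decision {
    if caller != Caller::Battery {
        return Decision::Reject("caller is not a battery identity");
    }
    if !is_battery_symbol(symbol) {
        return Decision::Reject("target is not a battery symbol");
    }
    Decision::Allow
}
```

Keeping the rule as a pure function of (caller, symbol) makes it cheap to unit-test exhaustively, independent of where the gateway ends up calling it.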

Preflight smoke test. Before every battery run, the harness executes two assertions and aborts on failure:

A normal user attempting to trade BATTERY.PERP-1 is rejected.
A battery user attempting to trade a non-battery symbol is rejected.

If either fails, the isolation the battery is relying on doesn't exist, and the run is aborted before any scenario executes. This is the single most important piece of safety machinery in the design.
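A rough sketch of the preflight check, assuming the harness can express "submit one probe order as this user on this symbol" as a closure. ProbeResult, PreflightError, and preflight are hypothetical names. Anything other than an explicit reject, including a transport error, is treated as a failure so the check stays fail-closed.

```rust
/// What the gateway did with one probe order.
#[derive(Debug, PartialEq)]
pub enum ProbeResult {
    Accepted,
    Rejected,
    /// Timeout, disconnect, or any answer that is not a clear reject.
    Inconclusive,
}

#[derive(Debug, PartialEq)]
pub enum PreflightError {
    NormalUserNotRejected,
    BatteryUserNotRejected,
}

/// Runs both isolation probes in order and stops at the first one that
/// is not explicitly rejected. `probe(user, symbol)` is supplied by the
/// harness and talks to the real gateway.
pub fn preflight<F>(
    mut probe: F,
    normal_user: &str,
    battery_user: &str,
    battery_symbol: &str,
    real_symbol: &str,
) -> Result<(), PreflightError>
where
    F: FnMut(&str, &str) -> ProbeResult,
{
    // A normal user must not be able to trade a battery symbol.
    if probe(normal_user, battery_symbol) != ProbeResult::Rejected {
        return Err(PreflightError::NormalUserNotRejected);
    }
    // A battery user must not be able to trade a real demo symbol.
    if probe(battery_user, real_symbol) != ProbeResult::Rejected {
        return Err(PreflightError::BatteryUserNotRejected);
    }
    Ok(())
}
```

If the second probe is unexpectedly accepted, it has just placed an order on a real demo market, so the probe implementation should probably use a price far from the market and cancel immediately on acceptance. That behavior is a suggestion, not something this RFC specifies.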

Concurrency

Even with isolation, two battery runs hitting the same sandbox simultaneously will stomp on each other. Options considered:

(a) Advisory lock held for the run duration, stored in Postgres or Redis.
(b) Per-run sub-sandbox with dynamically created symbols.
(c) Serial execution. CI runs one at a time, humans coordinate.

Start with (c). Add (a) when the first human wants to run the battery ad-hoc while CI is running. (b) is overkill until the battery is large enough that runs take more than a few minutes.

Coverage report (offline)

Coverage is a property of (battery, git SHA, sandbox config). It is computed offline; the only contact with ax-demo is a read of its running SHA:

Read ax-demo's running SHA from /version (or equivalent).
git checkout <sha> and build the stack with RUSTFLAGS="-C instrument-coverage".
Boot the stack locally with a sandbox config snapshot — a fixture that mirrors the BATTERY.* instruments, firms/BATTERY/ users, risk params, and any other state the scenarios depend on. Without this, the local stack won't exercise the same code paths and the report will be dishonest.
Run the battery against the local instrumented stack.
Dump profraw, run grcov, emit HTML + lcov.
Tear down the local stack.

ax-demo is never instrumented. Live ax-demo stays fast; coverage runs happen in CI against a local throwaway stack at the same SHA.
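The parts of the pipeline above whose commands are known can be sketched as a plan builder that executes nothing. Step, coverage_plan, and is_plausible_sha are hypothetical names, and the ./target/debug/ binary path assumes a default debug build. Booting the local stack, running the battery, and teardown are deployment-specific and deliberately left out. The SHA check is an added precaution, since the value comes from a network response and ends up on a git command line.

```rust
/// One external command: environment overrides plus argv.
#[derive(Debug, PartialEq)]
pub struct Step {
    pub env: Vec<(String, String)>,
    pub argv: Vec<String>,
}

fn step(env: &[(&str, &str)], argv: &[&str]) -> Step {
    Step {
        env: env.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        argv: argv.iter().map(|a| a.to_string()).collect(),
    }
}

/// Accepts only 7 to 40 hex characters, i.e. an abbreviated or full SHA-1.
pub fn is_plausible_sha(s: &str) -> bool {
    (7..=40).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Builds the checkout, instrumented build, and report steps for `sha`.
pub fn coverage_plan(sha: &str, out_dir: &str) -> Result<Vec<Step>, String> {
    if !is_plausible_sha(sha) {
        return Err(format!("refusing to check out suspicious SHA {:?}", sha));
    }
    let html = format!("{}/html", out_dir);
    let lcov = format!("{}/lcov.info", out_dir);
    let grcov = |kind: &str, out: &str| {
        step(
            &[],
            &[
                "grcov", ".", "--binary-path", "./target/debug/", "-s", ".",
                "-t", kind, "--branch", "--ignore-not-existing", "-o", out,
            ],
        )
    };
    Ok(vec![
        step(&[], &["git", "checkout", sha]),
        step(&[("RUSTFLAGS", "-C instrument-coverage")], &["cargo", "build"]),
        // Between build and report: boot the stack with LLVM_PROFILE_FILE
        // set, run the battery, then shut the stack down cleanly. The
        // profile runtime writes .profraw at process exit, so a hard kill
        // would likely lose the data.
        grcov("html", html.as_str()),
        grcov("lcov", lcov.as_str()),
    ])
}
```

Returning the plan as data rather than running it keeps the sketch testable and lets CI log exactly what it is about to execute.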

Two-target model: the same battery binary runs in two modes:

Behavior mode, against live ax-demo, gives real race-condition coverage and passes/fails scenarios. This is the authoritative result.
Coverage mode, against a local instrumented stack, gives the line/branch coverage report. Instrumentation perturbs timing, so race-dependent scenarios may exercise slightly different code paths here — that's fine for "what does the battery cover" but not for "did the battery pass."

PR-level coverage diff falls out for free: run coverage mode in CI on each PR that touches scenarios, diff against main, post "this PR adds +N lines of coverage" as a PR comment. This is the reporting that actually motivates people to add scenarios.
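The diff itself needs little more than a parser for the lcov tracefile records SF, DA, and end_of_record. The sketch below is one way to do it; covered_lines, coverage_delta, and pr_comment are hypothetical helpers, and the comment wording is only modeled on the example above.

```rust
use std::collections::BTreeSet;

/// Parses an lcov tracefile and returns every (source file, line) pair
/// with a non-zero execution count. Unknown record types are ignored.
pub fn covered_lines(lcov: &str) -> BTreeSet<(String, u32)> {
    let mut current: Option<String> = None;
    let mut covered = BTreeSet::new();
    for line in lcov.lines().map(str::trim) {
        if let Some(path) = line.strip_prefix("SF:") {
            current = Some(path.to_string());
        } else if line == "end_of_record" {
            current = None;
        } else if let (Some(rest), Some(file)) = (line.strip_prefix("DA:"), current.as_ref()) {
            // DA:<line number>,<execution count>[,<checksum>]
            let mut parts = rest.split(',');
            let line_no = parts.next().and_then(|s| s.parse::<u32>().ok());
            let hits = parts.next().and_then(|s| s.parse::<u64>().ok());
            if let (Some(n), Some(h)) = (line_no, hits) {
                if h > 0 {
                    covered.insert((file.clone(), n));
                }
            }
        }
    }
    covered
}

/// Returns (lines newly covered by the PR, lines no longer covered).
pub fn coverage_delta(main_lcov: &str, pr_lcov: &str) -> (usize, usize) {
    let base = covered_lines(main_lcov);
    let head = covered_lines(pr_lcov);
    (head.difference(&base).count(), base.difference(&head).count())
}

/// Renders the PR comment body.
pub fn pr_comment(gained: usize, lost: usize) -> String {
    if lost == 0 {
        format!("this PR adds +{} lines of coverage", gained)
    } else {
        format!("this PR adds +{} lines of coverage and drops {}", gained, lost)
    }
}
```

Comparing by (file, line) is only meaningful when both reports are built from the same exchange source, which holds for PRs that touch scenarios alone; a PR that also edits exchange code would shift line numbers and need a smarter diff.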

API-surface coverage registry (day-one, cheap)

Separate from line coverage: a registry enumerating every public SDK message type and every documented reject reason, with a map from surface-tag to scenarios that exercise it. Each scenario declares its surface_tags(). The registry emits a report: "of N reject reasons, the battery exercises M."

This is cheap to build (a day), immediately useful, and answers a question line coverage can't: "what holes does our battery have at the API level?" Ship this first; line coverage comes second.
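As a sketch of how small the registry could be: the SurfaceTag variants, the Registry type, and the tag strings used in the usage note are placeholders, not the SDK's real message or reject-reason names. Rejecting undeclared tags is an added assumption, meant to catch typos in surface_tags().

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical tag shape: one variant per kind of public surface.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SurfaceTag {
    Message(&'static str),
    RejectReason(&'static str),
}

fn is_reject(tag: &SurfaceTag) -> bool {
    matches!(tag, SurfaceTag::RejectReason(_))
}

pub struct Registry {
    /// Every surface the battery is expected to cover.
    known: BTreeSet<SurfaceTag>,
    /// Surface tag -> names of scenarios that exercise it.
    exercised: BTreeMap<SurfaceTag, Vec<&'static str>>,
}

impl Registry {
    pub fn new(known: &[SurfaceTag]) -> Self {
        Registry {
            known: known.iter().copied().collect(),
            exercised: BTreeMap::new(),
        }
    }

    /// Registers a scenario's tags. Fails on the first tag that is not
    /// in the known set, without recording anything.
    pub fn declare(&mut self, scenario: &'static str, tags: &[SurfaceTag]) -> Result<(), SurfaceTag> {
        if let Some(unknown) = tags.iter().find(|t| !self.known.contains(*t)) {
            return Err(*unknown);
        }
        for tag in tags {
            self.exercised.entry(*tag).or_default().push(scenario);
        }
        Ok(())
    }

    /// Known surfaces that no scenario exercises, in sorted order.
    pub fn uncovered(&self) -> Vec<SurfaceTag> {
        self.known
            .iter()
            .filter(|t| !self.exercised.contains_key(*t))
            .copied()
            .collect()
    }

    /// One-line summary in the wording used above.
    pub fn reject_reason_summary(&self) -> String {
        let total = self.known.iter().filter(|t| is_reject(t)).count();
        let hit = self.exercised.keys().filter(|t| is_reject(t)).count();
        format!("of {} reject reasons, the battery exercises {}", total, hit)
    }
}
```

With three known reject reasons and one scenario declaring a single one of them, reject_reason_summary() yields "of 3 reject reasons, the battery exercises 1", and uncovered() lists the other two as the holes to fill next.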

Open Questions

Where does the user→symbol ACL live today? The battery safety model assumes it's a single chokepoint at the order gateway. If it's scattered, that needs consolidating before the sandbox is safe.
Is there an existing admin "force mark price" endpoint? If so, we extend its gating. If not, we add one, battery-symbol-only.
Does ax-demo expose its git SHA? Required for the offline coverage pipeline. If not, add a /version endpoint.
Sandbox maintenance ownership. When a new product type or risk rule ships, someone has to extend the sandbox to cover it. Proposed: a checklist item in the PR template for new exchange features — "does the battery sandbox need a new instrument/user/scenario?" — plus a CLAUDE.md note.
API key provisioning. Battery users need real API keys. Are these generated per-run, stored in CI secrets, or pre-provisioned once and rotated?

Migration / Rollout

Land the ACL chokepoint audit and any consolidation needed (see the first open question, on where the user→symbol ACL lives today).
Land the sandbox namespace (firms/BATTERY/, BATTERY.* symbols, ACL rules) in ax-demo. Ship the preflight smoke test as an independent binary first, so we can verify isolation works before any scenarios run.
Land the harness skeleton (test_battery/env.rs, recorder.rs, scenario.rs, assertions.rs) with one or two reference scenarios (order lifecycle, cancel-replace) ported from the discovery tool to prove the event-driven pattern.
Land the API-surface coverage registry and wire it into the harness.
Grow the scenario catalog incrementally. Each new exchange feature should ship with at least one battery scenario.
Land the offline coverage pipeline as a nightly CI job.
Once the battery is fast and stable, wire it into per-PR CI against a throwaway stack.

Costs

Sandbox maintenance. Small ongoing cost — someone has to keep the sandbox config in sync with real instrument types as they evolve. Mitigated by PR template checklist.
ACL risk. The soft sandbox's entire safety story rests on the ACL being correct. Mitigated by fail-closed design, double-gating on admin endpoints, and the preflight smoke test.
Shared state spillover. Battery runs consume rate-limit budget, risk-engine CPU, and DB write volume on the shared ax-demo. Probably fine at current scale; worth monitoring if the battery grows to thousands of scenarios.
Two code paths to maintain in the harness (public SDK for actions, admin client for inspection). Acceptable — that's the whole point of mixed surface.