RFC: Battery Testing Against ax-demo

Date: 2026-04-10

Background

admin-cli test ep3-cancel-replace-discovery (rs/admin-cli/src/ep3_cancel_replace_discovery.rs) is a "space probe" that runs 14 sequential scenarios against a real EP3 environment to characterize cancel-replace semantics. It clears the market, runs a specific action pattern, and pretty-prints execution timelines for a human to inspect. It has served well as a discovery tool for one narrow surface area.

It has three limitations that keep it from being a general exchange test battery:

  1. Private API only. It drives EP3 admin gRPC directly, not the public ax-exchange-sdk WS/REST surface that customers actually hit.
  2. No structured assertions. Pass/fail is "did anything throw." Humans read the output to judge correctness.
  3. Sleep-based concurrency. settle_ms between operations is the only sync primitive: it is flaky under load, wasteful when idle, and hides timing bugs.

This RFC proposes a successor: an admin-cli test battery tool that runs a scenario catalog against our public API surface, executes against ax-demo (or any environment) from a dev laptop or CI, and produces structured pass/fail plus timeline artifacts. It also proposes the minimum changes to ax-demo needed to make the battery safe to run against live demo infrastructure, and an offline coverage-report pipeline that doesn't perturb ax-demo at all.

Goals

Non-Goals

Design Overview

Harness

New crate module rs/admin-cli/src/test_battery/:

test_battery/
  mod.rs          — entry point, scenario registry, CLI wiring
  env.rs          — Env: holds public-SDK clients (maker/taker), admin client,
                    sandbox config (symbol, price scale, accounts)
  recorder.rs     — EventRecorder: subscribes to each client's WS stream,
                    records every event with a monotonic timestamp into a
                    per-client timeline
  scenario.rs     — Scenario trait, Outcome struct, assertion helpers
  assertions.rs   — expect_event_within, expect_order_state, expect_no_event_for,
                    expect_fill_sequence, wait_for<F>
  report.rs       — JUnit XML + per-scenario markdown timeline emission
  scenarios/
    order_lifecycle.rs
    cancel_replace.rs
    self_trade_prevention.rs
    margin_rejects.rs
    order_flags.rs
    book_priority.rs
    reconnect_resume.rs
    drop_copy_parity.rs
    ...

The Scenario trait:

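use async_trait::async_trait;

#[async_trait]
pub trait Scenario: Send + Sync {
    /// Scenario name.
    fn name(&self) -> &'static str;
    /// API surfaces this scenario exercises (for the coverage registry, below).
    fn surface_tags(&self) -> &[SurfaceTag];
    /// Runs the scenario against `env` and returns its Outcome.
    async fn run(&self, env: &Env) -> Result<Outcome>;
}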

Env holds two OrderGatewayWsClient instances (from ax-exchange-sdk) for maker and taker, plus an admin Ep3Client for inspection and setup. The existing rs/admin-cli/src/replace_order_test.rs already demonstrates the public-SDK WS-client shape; reuse that.

Event-driven sync, not sleep

EventRecorder spawns a task per client that pulls WS events into a timeline. Assertions take the form:

env.maker.place_order(...).await?;
env.wait_for(Client::Maker, |ev| matches!(ev, OrderAcked { .. }),
             Duration::from_secs(2)).await?;

wait_for blocks on the recorded stream until a predicate matches or times out. There are no sleep(settle_ms) calls in scenarios. Timeouts are per-wait, not per-scenario, so a flake gives a precise failure site.
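The recorder-plus-wait pattern above can be sketched with the standard library alone. This is a synchronous, thread-based illustration, not the harness itself, which is async and works on ax-exchange-sdk events. The Event enum, the Timeline type, and record_from are hypothetical names, and this version scans the whole timeline rather than only events after a cursor.

```rust
use std::sync::{mpsc::Receiver, Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical event type; the real one comes from ax-exchange-sdk.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    OrderAcked { order_id: u64 },
    Fill { order_id: u64, qty: u32 },
}

/// One client's recorded timeline, plus a condvar to wake waiters.
#[derive(Default)]
pub struct Timeline {
    events: Mutex<Vec<(Instant, Event)>>,
    changed: Condvar,
}

impl Timeline {
    /// Drain `rx` on a background thread, stamping each event with a
    /// monotonic timestamp as it arrives.
    pub fn record_from(rx: Receiver<Event>) -> Arc<Timeline> {
        let timeline = Arc::new(Timeline::default());
        let sink = Arc::clone(&timeline);
        thread::spawn(move || {
            for ev in rx {
                sink.events.lock().unwrap().push((Instant::now(), ev));
                sink.changed.notify_all();
            }
        });
        timeline
    }

    /// Block until a recorded event matches `pred`, or fail after `timeout`.
    pub fn wait_for<F>(&self, pred: F, timeout: Duration) -> Result<Event, String>
    where
        F: Fn(&Event) -> bool,
    {
        let deadline = Instant::now() + timeout;
        let mut events = self.events.lock().unwrap();
        loop {
            if let Some((_, ev)) = events.iter().find(|(_, ev)| pred(ev)) {
                return Ok(ev.clone());
            }
            let remaining = deadline.saturating_duration_since(Instant::now());
            if remaining.is_zero() {
                return Err(format!("no matching event within {timeout:?}"));
            }
            // Releases the lock while sleeping; re-checks on every wake-up.
            events = self.changed.wait_timeout(events, remaining).unwrap().0;
        }
    }
}
```

Because the timeout belongs to the individual wait_for call, the error it returns identifies the exact wait that failed rather than the scenario as a whole.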

For scenarios that assert absence of an event (e.g., "no fill should occur"), expect_no_event_for(duration) waits the full window and asserts nothing matched.
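An absence assertion has the opposite shape from a wait: it must sit out the whole window and fail the moment a match appears. A minimal synchronous sketch follows, assuming events arrive on a std channel. The signature is illustrative and not the harness's actual expect_no_event_for, and unlike a recorded timeline this version consumes the events it inspects.

```rust
use std::fmt::Debug;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Wait out the full `window`; fail as soon as an event matches `pred`.
/// A closed stream is treated as a failure, since silence from a dead
/// connection proves nothing.
pub fn expect_no_event_for<E, F>(
    rx: &Receiver<E>,
    pred: F,
    window: Duration,
) -> Result<(), String>
where
    E: Debug,
    F: Fn(&E) -> bool,
{
    let deadline = Instant::now() + window;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(ev) if pred(&ev) => return Err(format!("unexpected event: {ev:?}")),
            Ok(_) => continue, // unrelated event: keep waiting out the window
            Err(RecvTimeoutError::Timeout) => return Ok(()),
            Err(RecvTimeoutError::Disconnected) => {
                return Err("stream closed before the window elapsed".into())
            }
        }
    }
}
```

The cost of this assertion is always the full window, so scenarios should use the shortest window that still makes the absence meaningful.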

Scenario structure

Each scenario follows:

setup   — sandbox ACLs guarantee empty markets, so setup is mostly
          price/qty parameterization
run     — a sequence of actions interleaved with wait_for / expect_*
teardown — cancel_all as a defensive measure; harness verifies markets
          are clean before marking the scenario passed

Outcome carries events (per-client recorded timeline), assertions (pass/fail list with source locations), and timing (wall clock + per-wait latencies). The report emitter consumes Outcome to produce JUnit XML for CI and a markdown timeline per scenario for humans.
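To make the Outcome-to-report path concrete, here is one possible shape, assuming a single testsuite with one testcase per scenario. The field names, the AssertionResult type, and junit_xml are hypothetical, and the recorded events are omitted for brevity.

```rust
use std::time::Duration;

/// Result of one assertion, with the source location that made it.
pub struct AssertionResult {
    pub description: String,
    pub location: &'static str, // e.g. "cancel_replace.rs:42"
    pub passed: bool,
}

/// Hypothetical, trimmed-down Outcome (per-client events left out).
pub struct Outcome {
    pub scenario: &'static str,
    pub assertions: Vec<AssertionResult>,
    pub wall_clock: Duration,
}

fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Render one <testsuite> containing one <testcase> per scenario.
pub fn junit_xml(outcomes: &[Outcome]) -> String {
    let failed = outcomes
        .iter()
        .filter(|o| o.assertions.iter().any(|a| !a.passed))
        .count();
    let mut xml = format!(
        "<testsuite name=\"test_battery\" tests=\"{}\" failures=\"{}\">\n",
        outcomes.len(),
        failed
    );
    for o in outcomes {
        xml.push_str(&format!(
            "  <testcase name=\"{}\" time=\"{:.3}\">\n",
            xml_escape(o.scenario),
            o.wall_clock.as_secs_f64()
        ));
        // Each failed assertion becomes a <failure> carrying its location.
        for a in o.assertions.iter().filter(|a| !a.passed) {
            xml.push_str(&format!(
                "    <failure message=\"{} ({})\"/>\n",
                xml_escape(&a.description),
                xml_escape(a.location)
            ));
        }
        xml.push_str("  </testcase>\n");
    }
    xml.push_str("</testsuite>\n");
    xml
}
```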

Sandbox in ax-demo (soft sandbox)

The battery runs against a soft sandbox inside ax-demo: dedicated users, firm, and symbols, isolated by authz rules inside ax-demo itself. This is not a separate deployment: the whole point is that the battery exercises the same binaries, config, DB, and gateway real users hit.

Namespace:

Admin helpers, gated:

These live in the admin surface but are hard-gated on symbol prefix and caller identity. See safety model below.

ACL & safety model

The only thing protecting real demo markets from a battery bug is the ACL, so it must be fail-closed and doubly-gated:

  1. Non-battery users cannot touch BATTERY.* symbols (rejected at the gateway). Prevents humans from polluting battery state.
  2. Battery users cannot touch non-BATTERY.* symbols (rejected at the gateway). Prevents a buggy scenario from scribbling on real demo markets.
  3. Battery-only admin endpoints (force_mark_price, etc.) check both that the caller is a battery user and that the target symbol is a battery symbol. Either check alone is a footgun.
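The three rules above reduce to two small predicates. The sketch below is illustrative only: it assumes a battery user is identified by belonging to a firm named BATTERY and a battery symbol by the BATTERY. prefix, and the function names are hypothetical. The real check belongs wherever the existing authz runs.

```rust
#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Reject(&'static str),
}

// Assumption: firm membership is how battery users are recognized.
fn is_battery_firm(firm: &str) -> bool {
    firm == "BATTERY"
}

fn is_battery_symbol(symbol: &str) -> bool {
    symbol.starts_with("BATTERY.")
}

/// Rules 1 and 2: order traffic is allowed only when the user and the
/// symbol sit on the same side of the sandbox boundary.
pub fn authorize_order(firm: &str, symbol: &str) -> Decision {
    match (is_battery_firm(firm), is_battery_symbol(symbol)) {
        (true, true) | (false, false) => Decision::Allow,
        (false, true) => Decision::Reject("non-battery user on battery symbol"),
        (true, false) => Decision::Reject("battery user on non-battery symbol"),
    }
}

/// Rule 3: battery-only admin endpoints require BOTH checks to pass.
pub fn authorize_battery_admin(firm: &str, symbol: &str) -> Decision {
    if !is_battery_firm(firm) {
        return Decision::Reject("caller is not a battery user");
    }
    if !is_battery_symbol(symbol) {
        return Decision::Reject("target is not a battery symbol");
    }
    Decision::Allow
}
```

Writing the order check as an exhaustive match over both flags means a new combination cannot be allowed by accident.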

Enforcement lives at the order gateway / risk layer, wherever the existing user-to-symbol authz check already runs. Confirming where that check lives is a prerequisite to this RFC landing.

Preflight smoke test. Before every battery run, the harness executes two assertions and aborts on failure:

If either fails, the isolation the battery is relying on doesn't exist, and the run is aborted before any scenario executes. This is the single most important piece of safety machinery in the design.
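The two assertions are not spelled out in this section. As an illustration only, the sketch below assumes they mirror ACL rules 1 and 2: a battery user probing a non-battery symbol must be rejected, and a non-battery user probing a battery symbol must be rejected. The Gateway trait and all names are hypothetical.

```rust
/// Hypothetical minimal view of the public order gateway.
pub trait Gateway {
    /// Returns true if the gateway ACCEPTED the order.
    fn try_place_order(&self, user: &str, symbol: &str) -> bool;
}

/// Abort-on-failure preflight. Assumption: the two probes mirror ACL
/// rules 1 and 2. A real probe should use an order that cannot trade.
pub fn preflight<G: Gateway>(
    gw: &G,
    battery_user: &str,
    outside_user: &str,
    battery_symbol: &str,
    outside_symbol: &str,
) -> Result<(), String> {
    if gw.try_place_order(battery_user, outside_symbol) {
        return Err(format!(
            "isolation broken: {battery_user} was accepted on {outside_symbol}"
        ));
    }
    if gw.try_place_order(outside_user, battery_symbol) {
        return Err(format!(
            "isolation broken: {outside_user} was accepted on {battery_symbol}"
        ));
    }
    Ok(())
}
```

The harness would call this once before the scenario loop and exit non-zero on Err, so no scenario runs against a sandbox whose boundary is not enforced.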

Concurrency

Even with isolation, two battery runs hitting the same sandbox simultaneously will stomp on each other. Options considered:

Start with (c). Add (a) when the first human wants to run the battery ad-hoc while CI is running. (b) is overkill until the battery is large enough that runs take more than a few minutes.

Coverage report (offline)

Coverage is a property of (battery, git SHA, sandbox config). It is computed entirely offline, without touching ax-demo:

  1. Read ax-demo's running SHA from /version (or equivalent).
  2. git checkout <sha> and build the stack with RUSTFLAGS="-C instrument-coverage".
  3. Boot the stack locally with a sandbox config snapshot: a fixture that mirrors the BATTERY.* instruments, firms/BATTERY/ users, risk params, and any other state the scenarios depend on. Without this, the local stack won't exercise the same code paths and the report will be dishonest.
  4. Run the battery against the local instrumented stack.
  5. Dump profraw, run grcov, emit HTML + lcov.
  6. Tear down the local stack.
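The mechanical parts of this pipeline (steps 2 and 5) can be captured as data before anything is executed. The sketch below only builds the command lines. Booting the stack, running the battery, and teardown are site-specific and left out. The paths, the Step type, and the function names are hypothetical, and the instrumented stack processes are assumed to have written their .profraw files into profraw_dir, for example via the LLVM_PROFILE_FILE environment variable.

```rust
use std::process::{Command, ExitStatus};

/// One external command plus the extra environment it needs.
pub struct Step {
    pub env: Vec<(&'static str, &'static str)>,
    pub argv: Vec<String>,
}

fn step(env: Vec<(&'static str, &'static str)>, argv: &[&str]) -> Step {
    Step { env, argv: argv.iter().map(|s| s.to_string()).collect() }
}

/// Commands for step 2 (checkout + instrumented build) and step 5
/// (grcov, once per output format).
pub fn coverage_steps(sha: &str, repo: &str, profraw_dir: &str, out_dir: &str) -> Vec<Step> {
    let bin_dir = format!("{repo}/target/debug");
    let lcov = format!("{out_dir}/lcov.info");
    let html = format!("{out_dir}/html");
    let grcov = |kind: &str, out: &str| {
        step(vec![], &[
            "grcov", profraw_dir,
            "--binary-path", bin_dir.as_str(),
            "-s", repo,
            "-t", kind,
            "--ignore-not-existing",
            "-o", out,
        ])
    };
    vec![
        step(vec![], &["git", "checkout", sha]),
        step(vec![("RUSTFLAGS", "-C instrument-coverage")], &["cargo", "build"]),
        grcov("lcov", lcov.as_str()),
        grcov("html", html.as_str()),
    ]
}

/// Execute one step inside `cwd` and return its exit status.
pub fn run(step: &Step, cwd: &str) -> std::io::Result<ExitStatus> {
    Command::new(&step.argv[0])
        .args(&step.argv[1..])
        .envs(step.env.iter().copied())
        .current_dir(cwd)
        .status()
}
```

Keeping the plan separate from execution lets CI log the exact commands and lets the plan itself be unit-tested without a toolchain.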

ax-demo is never instrumented. Live ax-demo stays fast; coverage runs happen in CI against a local throwaway stack at the same SHA.

Two-target model: the same battery binary runs in two modes:

PR-level coverage diff falls out for free: run coverage mode in CI on each PR that touches scenarios, diff against main, post "this PR adds +N lines of coverage" as a PR comment. This is the reporting that actually motivates people to add scenarios.
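One way to compute that number is to count covered lines in the two lcov tracefiles and subtract. The sketch below relies only on the SF: and DA: records of the lcov format. The function names are hypothetical, and the result is a net line count, not a per-line set difference.

```rust
use std::collections::HashSet;

/// Count distinct (file, line) pairs with at least one hit.
pub fn covered_lines(lcov: &str) -> usize {
    let mut file = "";
    let mut covered = HashSet::new();
    for line in lcov.lines() {
        if let Some(path) = line.strip_prefix("SF:") {
            file = path; // start of a new source-file record
        } else if let Some(rec) = line.strip_prefix("DA:") {
            // DA:<line number>,<execution count>[,<checksum>]
            let mut parts = rec.split(',');
            if let (Some(n), Some(hits)) = (parts.next(), parts.next()) {
                if hits.trim().parse::<u64>().map_or(false, |h| h > 0) {
                    covered.insert((file, n));
                }
            }
        }
    }
    covered.len()
}

/// Format the PR comment from the base-branch and PR tracefiles.
pub fn pr_comment(base_lcov: &str, pr_lcov: &str) -> String {
    let delta = covered_lines(pr_lcov) as i64 - covered_lines(base_lcov) as i64;
    format!("this PR adds {delta:+} lines of coverage")
}
```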

API-surface coverage registry (day-one, cheap)

Separate from line coverage: a registry enumerating every public SDK message type and every documented reject reason, with a map from surface-tag to scenarios that exercise it. Each scenario declares its surface_tags(). The registry emits a report: "of N reject reasons, the battery exercises M."

This is cheap to build (a day), immediately useful, and answers a question line coverage can't: "what holes does our battery have at the API level?" Ship this first; line coverage comes second.
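A registry of this kind needs little more than a set of known tags and a map from tag to scenarios. The sketch below is one possible shape. The SurfaceTag variants, the Registry type, and the example tag names in the comments are hypothetical, not the SDK's actual message or reject-reason names.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical tag shape: one variant per kind of public API surface.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SurfaceTag {
    Message(&'static str),      // e.g. a public SDK message type
    RejectReason(&'static str), // e.g. a documented reject reason
}

pub struct Registry {
    known: BTreeSet<SurfaceTag>,
    exercised: BTreeMap<SurfaceTag, Vec<&'static str>>,
}

impl Registry {
    /// `known` enumerates the full public surface to be covered.
    pub fn new(known: &[SurfaceTag]) -> Self {
        Registry {
            known: known.iter().copied().collect(),
            exercised: BTreeMap::new(),
        }
    }

    /// Record what a scenario declares in its surface_tags().
    pub fn register(&mut self, scenario: &'static str, tags: &[SurfaceTag]) {
        for tag in tags {
            self.exercised.entry(*tag).or_default().push(scenario);
        }
    }

    /// The headline line: "of N reject reasons, the battery exercises M."
    pub fn reject_reason_report(&self) -> String {
        let reasons: Vec<&SurfaceTag> = self
            .known
            .iter()
            .filter(|t| matches!(t, SurfaceTag::RejectReason(_)))
            .collect();
        let hit = reasons
            .iter()
            .filter(|t| self.exercised.contains_key(**t))
            .count();
        format!("of {} reject reasons, the battery exercises {}", reasons.len(), hit)
    }

    /// Known surfaces that no scenario claims to exercise.
    pub fn holes(&self) -> Vec<SurfaceTag> {
        self.known
            .iter()
            .filter(|t| !self.exercised.contains_key(*t))
            .copied()
            .collect()
    }
}
```

The holes() list is the actionable output: each entry is a candidate for the next scenario to write.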

Open Questions

  1. Where does the user-to-symbol ACL live today? The battery safety model assumes it's a single chokepoint at the order gateway. If it's scattered, that needs consolidating before the sandbox is safe.
  2. Is there an existing admin "force mark price" endpoint? If so, we extend its gating. If not, we add one, battery-symbol-only.
  3. Does ax-demo expose its git SHA? Required for the offline coverage pipeline. If not, add a /version endpoint.
  4. Sandbox maintenance ownership. When a new product type or risk rule ships, someone has to extend the sandbox to cover it. Proposed: a checklist item in the PR template for new exchange features ("does the battery sandbox need a new instrument/user/scenario?") plus a CLAUDE.md note.
  5. API key provisioning. Battery users need real API keys. Are these generated per-run, stored in CI secrets, or pre-provisioned once and rotated?

Migration / Rollout

  1. Land the ACL chokepoint audit and any consolidation needed (open question 1).
  2. Land the sandbox namespace (firms/BATTERY/, BATTERY.* symbols, ACL rules) in ax-demo. Ship the preflight smoke test as an independent binary first, so we can verify isolation works before any scenarios run.
  3. Land the harness skeleton (test_battery/env.rs, recorder.rs, scenario.rs, assertions.rs) with one or two reference scenarios (order lifecycle, cancel-replace) ported from the discovery tool to prove the event-driven pattern.
  4. Land the API-surface coverage registry and wire it into the harness.
  5. Grow the scenario catalog incrementally. Each new exchange feature should ship with at least one battery scenario.
  6. Land the offline coverage pipeline as a nightly CI job.
  7. Once the battery is fast and stable, wire it into per-PR CI against a throwaway stack.

Costs