Position Feed Promotion

Design & phased cutover — promoting the order-gateway's EP3-fed position cache to the authoritative margin source, then retiring the risk-engine's initial-margin compute.

Tracking: A-2764 (umbrella)
IM-authority shift: A-2766 / A-3406 / A-3407 / A-3408
Account-rekey: A-3402 / A-3582
Status snapshot: 2026-06-10

§0 Principles

Neither feed is trustworthy alone, so a clean tape diff is the gate

The Redis risk-snapshot lags under fast fills; the OG cache can mishandle EP3 edge cases — and a position EP3 never delivered is invisible to the feed entirely (get_position returns 0 for absent). So we promote on evidence, not faith: both feeds are recorded to position tapes and diffed off-line (compare_position_tapes.py). The independent second witness is what earns the flip.

The rollout is reversible config — until exactly one door

Authority is selected by ORDER_GATEWAY_POSITIONS_SOURCE (risk-engine default ↔ ep3): the mechanism lands with zero behavior change and the flip is an env-var roll, undone in seconds. Every step is reversible except one — the risk-engine dropping its IM compute (#2062, which folded in the former #2063), which deletes the second witness for good. Note that #2062 now lands the publish-stop and the compute-drop together, so there is no longer a reversible publish-stop intermediate: merging #2062 is crossing the door.

The soak is the only safety margin for the one-way step

Once #2062 lands there is no independent cross-check left, and no technical fix recovers it. The single real judgment call is how long to soak ep3 in prod before crossing — that soak time is the safety margin traded away with the second witness. As of this snapshot prod has not flipped to ep3 yet (demo has), so that soak clock has not even started.

§1 Progress / Tracker

Live snapshot — phase status and the bar below are computed from GitHub PR state at build time (2026-06-10). 6/10 phases done · 60%.

Done: 6 · In progress: 3 · Not started: 1

1. Shadow position cache + position tapes · done

Order-gateway subscribes directly to the EP3 position stream and computes IM locally in shadow mode, logging divergence against the Redis path. Both feeds are recorded to disk with daily rotation, and a Python script diffs them off-line. This is the apparatus the whole cutover is gated on.

  • #1196 O1 — order-gateway listens to EP3 position updates (shadow cache + tapes)
  • #1455 auto-rotate and prune position tapes daily at 5 PM ET
  • #1619 compare_position_tapes.py off-line tape diff
  • #2303 log each source's signed position in the divergence warn + label the counter position_mismatch (A-3605)
  • #2347 divergence warn attribution — equity, snapshot age, RE-published figures, per-symbol position-IM deltas — plus an avail-gap decision-flip warn (A-3621)
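
The off-line diff can be pictured with a minimal sketch. This is not the actual compare_position_tapes.py: the tape record format (one JSON object per line with `user`, `symbol`, and `position` fields) and both function names are assumptions made for illustration, and the sketch compares only the final replayed state rather than aligning the two feeds in time.

```python
import json
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (user, symbol); hypothetical tape key


def final_positions(tape_lines: Iterable[str]) -> Dict[Key, int]:
    """Replay a tape and keep the last signed position seen per (user, symbol)."""
    positions: Dict[Key, int] = {}
    for line in tape_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the tape
        rec = json.loads(line)
        positions[(rec["user"], rec["symbol"])] = int(rec["position"])
    return positions


def diff_tapes(redis_tape: Iterable[str], ep3_tape: Iterable[str]) -> List[Tuple[Key, int, int]]:
    """Return (key, redis_position, ep3_position) for every disagreement."""
    a = final_positions(redis_tape)
    b = final_positions(ep3_tape)
    diffs = []
    for key in sorted(set(a) | set(b)):
        # An absent position reads as 0, mirroring get_position's behaviour.
        pa, pb = a.get(key, 0), b.get(key, 0)
        if pa != pb:
            diffs.append((key, pa, pb))
    return diffs
```

A position present on one tape and absent from the other shows up as a diff against 0, which is the never-delivered case that neither feed can detect alone.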
2. Order-gateway publishes its own open-order margin · done

O3 — the order-gateway computes and publishes the open-order margin contribution itself, rather than depending on the risk-engine for it. A prerequisite for OG owning the full margin picture once the cache is authoritative.

  • #1205 O3 — publish open-order margin from order-gateway itself → alee/promote-local-position-cache
3. Instrument can_send_order timing · done

og_can_send_order_duration_us — measures the margin-check latency on the hot path, so the cache-vs-Redis promotion can be evaluated for performance, not just correctness.

  • #2163 time can_send_order
  • #2270 criterion bench suite for the position-cache hot paths (reads, writes, read-under-write)
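
As a rough illustration of what a microsecond-resolution duration measurement involves, here is a small sketch. The real og_can_send_order_duration_us metric is recorded inside the order-gateway; the `timed_us` helper below is hypothetical and only shows the shape of the measurement.

```python
import time
from typing import Any, Callable, Tuple


def timed_us(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed microseconds) using a monotonic clock."""
    start_ns = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_us = (time.perf_counter_ns() - start_ns) / 1_000
    return result, elapsed_us
```

Wrapping the margin check this way keeps the timing outside the decision logic, so the same check can be timed under either positions source.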
4. Harden the cache margin path: fail closed · done

Resolves the present-but-unvaluable failure (A-3476). A position in the cache whose instrument/mark can't be resolved was summed as zero IM — fail-open, overstating available margin. It now bail!s. Margin-increasing orders fail closed; reducing/neutral orders skip full-portfolio valuation.

  • #2164 harden position-cache margin path; fail-closed (A-3476)
  • Superseded / related
  • #2107 earlier fail-closed + divergence instrumentation (absorbed into #2164)
  • #2273 closed/deferred — best-effort valuation of underwater reduces purely to record available for observability (A-3586)
5. EP3 empty-bookmark snapshot fix + subscription probe · done

Resolves the demo divergence that opened A-3476 — a dormant 13-lot XAU-PERP that was a never-in-cache case caused by a protocol bug. EP3 sent the position in the full snapshot, then sent an empty "bookmark" snapshot the old apply_snapshot treated as a full replace and wiped. The fix marks the cache ready without replacing contents (A-3478).

  • #2112 ignore EP3 empty-bookmark position snapshot (A-3478) → alee/og-position-cache-red
  • #2110 admin-cli EP3 position-subscription probe (A-3477)
6. Land the cutover mechanism: Stack A (no behavior change) · done

Introduce ORDER_GATEWAY_POSITIONS_SOURCE (risk-engine default / ep3) and route read/replace paths through it. Both PRs ship on the risk-engine default, so production behavior is identical and rollback is config-only: this step lands the apparatus, not the flip. Closes A-2764.

  • #2165 ORDER_GATEWAY_POSITIONS_SOURCE primary/shadow margin framework
  • #2166 route read/replace paths through the flag (Closes A-2764)
  • Superseded / related
  • #2033 earlier runtime-selectable margin source on the abandoned alee/og-promote-position-cache branch
7. Operational verification: the second witness as gate · in progress

Position tapes went on in prod 2026-05-26 to start the verification clock; ≥72h of taping has long since elapsed, but the gate is divergence, not wall-clock. Each round surfaced a new divergence class that re-opened the window: present-but-unvaluable (A-3476, #2164), the empty-bookmark wipe (A-3478, #2112), and a telemetry gap that hid the signed position behind the warn (A-3605, #2303).

2026-06-09, the first decomposed day (#2303's pos= field live in prod): of 61 warns, 57 agreed on the position; the 4 position_mismatch cases were ±1–7-lot fill-propagation races where the ep3 cache held the fresher (correct) figure.

The raw warn count is dominated by margin-valuation noise, not feed divergence. Both decision paths share the OG-computed open-orders margin term, which ran far above the RE-published figure for the most active MM (computed avail=2422 vs published avail=388k at one 15:44 reject). That depresses avail toward the decision boundary, so ms-scale valuation deltas flip decisions and spray warns. The gap is invisible to the divergence comparison itself (it depresses both sources equally) and is tracked separately (A-3621); #2347 adds the attribution telemetry (published figures + per-symbol position-IM deltas in the warn, plus an avail-gap decision-flip warn).

Gate reframed: judge the feed on the position_mismatch=true divergence rate, which must be explained and ~0, not on raw warn counts, which cannot converge while MM avail sits at the boundary.

Salvage item: the tape-pull/analyze tooling lives only on the abandoned #2033 branch; recover or rebuild it before signing off.
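
The decision-flip effect is easiest to see with arithmetic. The sketch below assumes available margin is equity minus position IM minus open-orders margin, which is a simplification. The equity, the two position-IM figures, and the order's margin requirement are hypothetical values chosen so that avail lands near the quoted 2422 and 388k; they are not measured data. The two open-orders margin figures are the approximate ones reported for 2026-06-09.

```python
def avail(equity: int, position_im: int, open_orders_margin: int) -> int:
    """Available margin under the simplified model assumed in this sketch."""
    return equity - position_im - open_orders_margin


def accepts(avail_value: int, order_margin: int) -> bool:
    """An order is accepted when available margin covers its requirement."""
    return avail_value >= order_margin


EQUITY = 1_000_000          # hypothetical
IM_PRIMARY = 335_578        # hypothetical position IM seen by the primary path
IM_SHADOW = 336_500         # hypothetical: same positions, valued milliseconds later
ORDER_MARGIN = 2_000        # hypothetical requirement of the incoming order

OOM_OG = 662_000            # OG-computed open-orders margin (approx., from the text)
OOM_RE = 276_000            # RE-published open-orders margin (approx., from the text)

# Near the boundary a small valuation delta flips the decision.
near_primary = avail(EQUITY, IM_PRIMARY, OOM_OG)   # 2_422
near_shadow = avail(EQUITY, IM_SHADOW, OOM_OG)     # 1_500
flip_near = accepts(near_primary, ORDER_MARGIN) != accepts(near_shadow, ORDER_MARGIN)

# Far from the boundary the same delta is harmless.
far_primary = avail(EQUITY, IM_PRIMARY, OOM_RE)    # 388_422
far_shadow = avail(EQUITY, IM_SHADOW, OOM_RE)      # 387_500
flip_far = accepts(far_primary, ORDER_MARGIN) != accepts(far_shadow, ORDER_MARGIN)
```

Both paths see the same positions in this example; only the valuation differs, yet `flip_near` is true and `flip_far` is false. That is why raw warn counts say little about feed content.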

⏲ prod position-taping / verification window (open since 2026-05-26): 100% complete; elapsed 349h ≥ 72h target
Awaiting sign-off: position-mismatch divergences explained and ~0 (valuation noise attributed via #2347 / A-3621)
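
A minimal sketch of the reframed gate follows, assuming each divergence warn can be reduced to a record carrying the position_mismatch label. The `explained` flag and the `gate` function are hypothetical; the document does not state a numeric threshold for "~0", so the sketch simply requires that no mismatch is left unexplained.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Warn:
    position_mismatch: bool      # label carried by the divergence warn
    explained: bool = False      # hypothetical triage flag set by a reviewer


def gate(warns: List[Warn], max_unexplained: int = 0) -> Dict[str, Union[int, bool]]:
    """Judge the feed on position mismatches, ignoring valuation-only warns."""
    mismatches = [w for w in warns if w.position_mismatch]
    unexplained = [w for w in mismatches if not w.explained]
    return {
        "warns": len(warns),
        "mismatches": len(mismatches),
        "unexplained": len(unexplained),
        "passes": len(unexplained) <= max_unexplained,
    }


# 2026-06-09 as described above: 61 warns, 57 agreeing on position,
# 4 mismatches attributed to fill-propagation races.
day = [Warn(position_mismatch=False)] * 57 + [Warn(True, explained=True)] * 4
summary = gate(day)
```

The raw warn count (61) never enters the pass/fail decision; only the mismatches and whether each has an explanation do.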
8. Flip the source to ep3 (demo → prod soak) · not started

Set ORDER_GATEWAY_POSITIONS_SOURCE=ep3 on demo, soak, then prod. The cache path becomes authoritative for can_send_order; Redis runs as shadow. Fully reversible via config — the last reversible step. The soak duration is the safety margin for the one-way step that follows. Demo is flipped and soaking — demo decision divergence logs show ep3 (primary) / risk-engine (shadow), with only cents-scale disagreements. Prod has not flipped: prod still shows risk-engine (primary) / ep3 (shadow), so the prod soak clock has not started.

Awaiting sign-off: ep3 live on demo (demo logs show ep3 primary)
⏲ prod soak on ep3: armed (prod still on risk-engine primary); 120h target; starts when started is set
Awaiting sign-off: prod soak clean, cleared to land the witness-drop (#2062)
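
The armed-versus-running behaviour of the soak clock can be sketched as below. The 120h target comes from the tracker; the function name, the return shape, and the use of an optional `started` timestamp are assumptions. Elapsed time alone does not clear the gate, since a clean divergence record is still required.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Union


def soak_status(
    started: Optional[datetime],
    now: datetime,
    target_hours: float = 120.0,
) -> Dict[str, Union[str, float, bool]]:
    """Report the soak clock: armed until `started` is set, then running."""
    if started is None:
        return {"state": "armed", "elapsed_h": 0.0, "target_met": False}
    elapsed_h = (now - started).total_seconds() / 3600.0
    return {
        "state": "running",
        "elapsed_h": elapsed_h,
        "target_met": elapsed_h >= target_hours,
    }


now = datetime(2026, 6, 10, tzinfo=timezone.utc)
armed = soak_status(None, now)                              # prod not flipped yet
done = soak_status(now - timedelta(hours=120), now)         # hypothetical later state
```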
9. Shift IM authority RE→OG: Stack B · in progress

Move initial-margin authority from the risk-engine to the order-gateway, then retire the RE compute. Strictly ordered; the last step is the only irreversible one in the whole plan. 9a OG publishes og:margin:{user} (merged); 9b api-gateway fuses it into UserRiskSnapshot; 9c/9d are now one PR — #2062 stops publishing IM and drops the compute together (it folded in the former #2063), so the witness-drop is #2062. Do it only after a clean prod soak — which, per step 8, has not begun.

  • #2034 9a — OG publishes per-user margin snapshot to Redis under og:margin:{user_id} (merged on main)
  • #2061 9b — api-gateway sources og:margin into UserRiskSnapshot (per-user + admin all-users)
  • #2062 9c+9d — risk-engine stops publishing and computing IM — the witness-drop (folded in #2063) → alee/api-gw-fuse-ogw-margin
  • Superseded / related
  • #2063 9d as a standalone PR — closed; compute-drop cherry-picked onto #2062 → alee/re-stop-publishing-im
  • #2033 abandoned base branch; Stack B now stacks on main
10. Account-rekey the calculator + tape writer · in progress

Re-key the OG calculator's internal state and the tape writer from UserId to AccountId (A-3582 / A-3402), aligning with the merged transaction-engine rekey. Inert under the current 1:1 user:account mapping, so low-risk, but cleanest to land before Stack B churns the same files.

  • #2261 account-key the calculator's internal state + tape writer (A-3582)
  • #2262 account-key the risk-monitor risk_snapshot_updated consumer (A-3583)
  • Superseded / related
  • #2259 account-key the transaction-engine / api-gateway interface (merged)

Notebook

Reference design — the detailed mechanics behind the tracker.

§2a The two position views

Two position views exist today and are continuously compared. The whole plan is the disciplined promotion of one over the other.

| View | How it's built | Failure mode |
| --- | --- | --- |
| Redis risk-snapshot (current authority) | risk-engine derives positions from fills, computes IM, publishes a snapshot to Redis | Fresh at rest, but lags under fast fills (the Wasabi margin-lag root cause) |
| OG position cache (being promoted) | order-gateway subscribes to the EP3 position stream, computes IM locally; shadow today | Fresher, but can mishandle EP3 protocol edge cases |

Neither source is a priori authoritative — that symmetry is exactly why the position tapes exist as the second witness that gates cutover.

§2b Epistemics: the two "missing position" failures

Distinguishing these two failure classes drove the blocker work and explains why a second witness is needed at all. They look identical in their effect (a position the margin check doesn't see) but differ completely in detectability.

  1. Present-but-unvaluable (fixed: #2164, A-3476). A position is in the cache but its instrument/mark can't be resolved. The old code summed it as zero IM (fail-open, overstating available margin). Now it bail!s, failing closed. Margin-increasing orders fail closed if the portfolio can't be fully valued; reducing/neutral orders skip full-portfolio valuation so a missing mark on an unrelated position can't block a risk-reducing order. See check_margin.rs. This class is detectable from inside the feed: the position is present, you just can't price it.
  2. Never-in-the-cache (the hard case). EP3 never delivered the position, or it was dropped and no subsequent Update arrives (dormant instrument). The feed cannot detect the absence of something it was never told about: get_position returns 0 for absent. This is not fixable from inside the feed.
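
The fail-closed behaviour of case (1) can be sketched as follows. This is an illustration in Python, not the logic in check_margin.rs: the flat `im_rate` margin model, the exception type, and the rule that an order is "increasing" when it grows absolute exposure in its own symbol are all assumptions, and the real path may apply further checks to reducing orders.

```python
from typing import Dict


class UnvaluablePosition(Exception):
    """Raised when a cached position has no resolvable mark."""


def position_im(positions: Dict[str, int], marks: Dict[str, float], im_rate: float) -> float:
    """Sum IM over the portfolio, refusing to price a position without a mark."""
    total = 0.0
    for symbol, qty in positions.items():
        mark = marks.get(symbol)
        if mark is None:
            # Fail closed: the old behaviour would have added 0.0 here.
            raise UnvaluablePosition(symbol)
        total += abs(qty) * mark * im_rate
    return total


def can_send_order(
    positions: Dict[str, int],
    marks: Dict[str, float],
    equity: float,
    symbol: str,
    qty: int,
    im_rate: float = 0.10,  # hypothetical flat initial-margin rate
) -> bool:
    current = positions.get(symbol, 0)
    if abs(current + qty) <= abs(current):
        # Reducing or neutral: skip full-portfolio valuation entirely.
        return True
    after = dict(positions)
    after[symbol] = current + qty
    try:
        return position_im(after, marks, im_rate) <= equity
    except UnvaluablePosition:
        # Margin-increasing order with an unpriceable portfolio: reject.
        return False
```

With a mark missing on an unrelated position, an increasing order is rejected while a reducing order in another symbol still goes through. Case (2) has no equivalent: a position that was never delivered raises nothing, because the loop never sees it.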

The demo divergence that opened A-3476 — a dormant 13-lot XAU-PERP, ~$7.6k — was case (2) caused by a protocol bug, not a bad feed: EP3 sent the position in the full snapshot, then sent an empty "bookmark" snapshot that the old apply_snapshot treated as a full replace and wiped. Fixed in #2112 (A-3478): an empty snapshot now marks the cache ready without replacing contents.
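
The empty-bookmark fix can be illustrated with a small model of the cache. The class below is a hypothetical stand-in, not the order-gateway's implementation: it assumes a snapshot is a mapping of symbol to signed position and that a non-empty snapshot is a full replace.

```python
from typing import Dict


class PositionCache:
    """Toy model of the position cache's snapshot handling after #2112."""

    def __init__(self) -> None:
        self.positions: Dict[str, int] = {}
        self.ready = False

    def apply_snapshot(self, snapshot: Dict[str, int]) -> None:
        if not snapshot:
            # Empty "bookmark" snapshot: mark ready, keep existing contents.
            # The old behaviour treated this as a full replace and wiped them.
            self.ready = True
            return
        self.positions = dict(snapshot)  # assumed: non-empty snapshot replaces
        self.ready = True

    def get_position(self, symbol: str) -> int:
        # Absent reads as 0, which is why a wiped position is invisible.
        return self.positions.get(symbol, 0)


cache = PositionCache()
cache.apply_snapshot({"XAU-PERP": 13})  # full snapshot carries the dormant position
cache.apply_snapshot({})                # empty bookmark follows
```

After both snapshots the dormant 13-lot position is still in the cache and the cache is marked ready.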

This is why dropping the witness is the one-way door. The only general detector of case (2) is cross-checking against an independent source — i.e. the tape/shadow comparison against the Redis/fills-derived view. Dropping the risk-engine IM compute (#2062, which folded in the former #2063) removes that second witness. The plan accepts this trade-off, but gates it behind a clean prod soak on ORDER_GATEWAY_POSITIONS_SOURCE=ep3 (§2e) — a soak that, as of this snapshot, has not started because prod has not flipped.

§2c The cutover mechanism

The promotion is gated entirely behind one env var, so that "which feed is authoritative" is an operational decision, not a deploy.

| Value | Authoritative for can_send_order | Other feed |
| --- | --- | --- |
| risk-engine (default) | Redis risk-snapshot (status quo) | OG cache runs as shadow, tapes recorded |
| ep3 | OG EP3-fed position cache | Redis runs as shadow |

Stack A (step 6) lands this framework and routes the read/replace paths through it (#2165 and #2166, now merged on main), both shipping on the risk-engine default. Because the default reproduces current behavior exactly, the mechanism landed with no soak of its own; the flip to ep3 (step 8) is the operational act, rolled demo → prod and reversible in seconds.
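
A minimal sketch of the selection logic, assuming the variable is read once and mapped to a primary/shadow pair. The enum and function names are hypothetical, and rejecting unknown values is a choice made for this sketch; the document does not say how the order-gateway treats an unrecognised value.

```python
import os
from enum import Enum
from typing import Mapping, Optional, Tuple


class PositionsSource(Enum):
    RISK_ENGINE = "risk-engine"
    EP3 = "ep3"


def positions_source(env: Optional[Mapping[str, str]] = None) -> PositionsSource:
    """Resolve the authoritative source; unset means the risk-engine default."""
    env = os.environ if env is None else env
    raw = env.get("ORDER_GATEWAY_POSITIONS_SOURCE", "").strip().lower()
    if raw == "":
        return PositionsSource.RISK_ENGINE
    return PositionsSource(raw)  # raises ValueError on an unknown value


def roles(primary: PositionsSource) -> Tuple[PositionsSource, PositionsSource]:
    """Return (primary, shadow): whichever feed is not primary runs as shadow."""
    shadow = (
        PositionsSource.EP3
        if primary is PositionsSource.RISK_ENGINE
        else PositionsSource.RISK_ENGINE
    )
    return primary, shadow
```

Because the shadow is always the other feed, flipping the variable swaps the roles without stopping the comparison.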

Lineage note. An earlier runtime-selectable margin source lived on the abandoned alee/og-promote-position-cache branch (#2033, closed). Stack A is the clean re-implementation on main. That dead branch is also where the tape-pull/analyze tooling is stranded — see the salvage item in §1 step 7 and §3.2.

§2d Shifting IM authority RE→OG (Stack B)

Promoting the cache for can_send_order (step 8) makes the OG the authority for the margin check. Stack B finishes the job: it makes the OG the authority for the published IM as well, then retires the risk-engine's compute. The sequence matters because each step assumes the previous one is live.

  1. OG publishes margin (#2034, merged) — order-gateway writes og:margin:{user_id} to Redis on a 2s sweep (30s TTL), backed by a lock-free imbl position cache. It was rebased off the dead #2033 onto main and landed 2026-06-07; additive — nothing read the key yet at merge.
  2. api-gateway fuses it (#2061) — og:margin is sourced into the UserRiskSnapshot response, per-user and admin all-users. Additive — nothing is removed yet, so both sources coexist.
  3. RE stops publishing and computing IM (#2062) — the witness-drop. Originally two PRs — a reversible publish-stop (9c) and the irreversible compute-drop (9d, the former #2063, now closed) — they were folded into one: keeping RE computing IM once it is no longer published bought nothing (no consumer reads historical IM from ClickHouse, and the admin views now fuse OG's live og:margin via #2061), so both land together. The consequence: there is no longer a reversible intermediate — merging #2062 is itself the point of no return. Do it only after step 8 has soaked clean in prod.
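
The relationship between the 2s sweep and the 30s TTL in step 1 can be modelled without Redis. In the sketch below an in-memory store stands in for Redis, the published value is a bare float, and time is passed in explicitly; the store class, the payload shape, and the function names are all assumptions. Only the key pattern og:margin:{user_id} and the two intervals come from the text.

```python
from typing import Dict, Optional, Tuple

SWEEP_SECONDS = 2.0
TTL_SECONDS = 30.0


class TtlStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, float]] = {}  # key -> (value, expires_at)

    def set(self, key: str, value: float, now: float, ttl: float) -> None:
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: float) -> Optional[float]:
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return None  # missing or expired: the reader sees no margin
        return entry[0]


def margin_key(user_id: str) -> str:
    return f"og:margin:{user_id}"


def sweep(store: TtlStore, margins: Dict[str, float], now: float) -> None:
    """One publish sweep: rewrite every user's key and refresh its TTL."""
    for user_id, im in margins.items():
        store.set(margin_key(user_id), im, now, TTL_SECONDS)


# Number of consecutive sweeps that can be missed before a key expires.
MISSED_SWEEPS_BEFORE_EXPIRY = int(TTL_SECONDS // SWEEP_SECONDS)
```

With these intervals a key survives roughly 15 missed sweeps. A stalled publisher therefore shows up to readers as an absent key after 30 seconds, not as a stale figure served indefinitely.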

9a is merged and 9b is additive; the entire irreversibility now sits in #2062. That is the reason Stack B is sequenced after the operational soak rather than bundled with the mechanism — and the reason the publish-stop/compute-drop fold raises the stakes of getting the soak right, since you can no longer stop publishing as a reversible dry-run.

§2e The reversibility boundary

The plan has exactly one irreversible step. Phases 1–8 are reversible (config flip). Phase 9d, now carried by #2062 (the witness-drop), is irreversible: the risk-engine stops computing IM and the shadow comparison loses its independent source. Because #2062 folded the reversible publish-stop in with the compute-drop, the boundary is now a single PR merge with no reversible dry-run before it. The only real judgment call is how long to soak step 8 on ep3 before landing #2062 — that soak is the safety margin traded away with the witness.

§3 Design Questions

  1. Soak length — how long do we run step 8 on ep3 in prod before landing the irreversible #2062?
    • Not yet answered. This is the judgment call of the plan, and it is now sharper: #2062 folds the publish-stop in with the compute-drop, so there is no reversible dry-run before the witness goes. Prod has not even flipped to ep3 yet, so the soak clock has not started. Needs a stated, defensible duration and divergence threshold before #2062 is scheduled.
  2. Prod divergence — is the ep3-negative-available pattern a position-feed error?
    • Answered: negative. Resolved 2026-06-09 with #2303's pos= telemetry: in 57/61 warns the two sources agreed on the position, so the avail disagreements are margin-valuation noise, not feed content. Both decision paths share the OG-computed open-orders margin term, which ran far above the RE-published figure for the most active MM (computed avail $2.4k vs published $388k), pushing avail to the decision boundary where ms-scale repricing deltas flip sign. "ep3 negative available" does not mean the cache thinks the account is underwater. The open-orders input gap itself is a real (path-independent) issue tracked as A-3621, with attribution telemetry merged in #2347.
  3. Open-orders margin inputs — why does the OG's locally computed term exceed the risk-engine's published figure (~$662k vs ~$276k on 2026-06-09)?
    • Not yet answered. Both run the same OpenOrdersMarginComputer, so the inputs differ: OG's in-memory book vs RE's event-derived book. A phantom-orders-from-restart-sync theory was investigated and retracted (the suspect orders had terminated normally; the CH order log is too lossy for high-volume users to witness the book). The gap is orthogonal to the ep3 flip, since it biases both paths equally, but it silently rejects MM orders today. #2347's flip warn quantifies it continuously; the decisive measurement is diffing the OG book against EP3 SearchOrders(Open) live (or reading og:margin:{user}).
  4. Tape-pull tooling — salvage from the dead #2033 branch, or rebuild on main?
    • Not yet answered. Leaning rebuild on main (the branch is otherwise dead). Must be resolved before the step-7 verification can be signed off.
  5. Stack B base — is #2034 still blocked on a rebase off the dead #2033 branch?
    • Answered: negative. #2034 was rebased off the abandoned #2033 onto main and merged 2026-06-07. The rest of Stack B (#2061 → #2062) now stacks on main.
  6. Permanent replacement witness — after #2062, do we need a standing independent detector?
    • Not yet answered. The never-in-cache class is undetectable from inside the feed, and #2062 removes the only general detector. Open whether the answer is "trust the soak" or a permanent lightweight cross-check. Not committed either way.