Position Feed Promotion
Design & phased cutover — promoting the order-gateway's EP3-fed position cache to the authoritative margin source, then retiring the risk-engine's initial-margin compute.
§0 Principles
Neither feed is trustworthy alone, so a clean tape diff is the gate
The Redis risk-snapshot lags under fast fills; the OG cache can mishandle EP3 edge cases — and a position EP3 never delivered is invisible to the feed entirely (get_position returns 0 for absent). So we promote on evidence, not faith: both feeds are recorded to position tapes and diffed off-line (compare_position_tapes.py). The independent second witness is what earns the flip.
The rollout is reversible config — until exactly one door
Authority is selected by ORDER_GATEWAY_POSITIONS_SOURCE (risk-engine default ↔ ep3): the mechanism lands with zero behavior change and the flip is an env-var roll, undone in seconds. Every step is reversible except one — the risk-engine dropping its IM compute (#2062, which folded in the former #2063), which deletes the second witness for good. Note that #2062 now lands the publish-stop and the compute-drop together, so there is no longer a reversible publish-stop intermediate: merging #2062 is crossing the door.
The soak is the only safety margin for the one-way step
Once #2062 lands there is no independent cross-check left, and no technical fix recovers it. The single real judgment call is how long to soak ep3 in prod before crossing — that soak time is the safety margin traded away with the second witness. As of this snapshot prod has not flipped to ep3 yet (demo has), so that soak clock has not even started.
§1 Progress / Tracker
Live snapshot — phase status and the bar below are computed from GitHub PR state at build time (2026-06-10). 6/10 phases done · 60%.
1. Shadow position cache + position tapes [done]
Order-gateway subscribes directly to the EP3 position stream and computes IM locally in shadow mode, logging divergence against the Redis path. Both feeds are recorded to disk with daily rotation, and a Python script diffs them off-line. This is the apparatus the whole cutover is gated on.
- #1196 O1 — order-gateway listens to EP3 position updates (shadow cache + tapes)
- #1455 auto-rotate and prune position tapes daily at 5 PM ET
- #1619 compare_position_tapes.py off-line tape diff
- #2303 log each source's signed position in the divergence warn + label the counter position_mismatch (A-3605)
- #2347 divergence warn attribution (equity, snapshot age, RE-published figures, per-symbol position-IM deltas) plus an avail-gap decision-flip warn (A-3621)
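A minimal sketch of what the off-line diff does conceptually. The snapshot layout (a mapping from user and symbol to signed lots) and the function name are assumptions made for illustration; they are not the actual tape format or interface of compare_position_tapes.py.

```python
from typing import Dict, List, Tuple

# Hypothetical tape snapshot: (user, symbol) -> signed position in lots.
Snapshot = Dict[Tuple[str, str], int]


def diff_snapshots(risk_engine: Snapshot, ep3: Snapshot) -> List[dict]:
    """Return one record per (user, symbol) where the two feeds disagree.

    A key missing from a feed is read as 0, mirroring how get_position
    returns 0 for an absent position.
    """
    mismatches = []
    for key in sorted(set(risk_engine) | set(ep3)):
        re_pos = risk_engine.get(key, 0)
        ep3_pos = ep3.get(key, 0)
        if re_pos != ep3_pos:
            user, symbol = key
            feeds = (("risk-engine", risk_engine), ("ep3", ep3))
            mismatches.append({
                "user": user,
                "symbol": symbol,
                "risk_engine": re_pos,
                "ep3": ep3_pos,
                # Which feed, if any, never carried the position at all.
                "absent_in": [name for name, tape in feeds if key not in tape],
            })
    return mismatches


# Illustrative data only: the 13-lot position never reached the ep3 feed.
re_tape = {("u1", "XAU-PERP"): 13, ("u1", "BTC-PERP"): -2}
ep3_tape = {("u1", "BTC-PERP"): -2}
report = diff_snapshots(re_tape, ep3_tape)
```

The missing position shows up only because the other feed carries it, which is the second-witness argument in miniature: a diff of one feed against itself could not surface it.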
2. Order-gateway publishes its own open-order margin [done]
O3 — the order-gateway computes and publishes the open-order margin contribution itself, rather than depending on the risk-engine for it. A prerequisite for OG owning the full margin picture once the cache is authoritative.
- #1205 O3 — publish open-order margin from order-gateway itself → alee/promote-local-position-cache
3. Instrument can_send_order timing [done]
4. Harden the cache margin path: fail closed [done]
Resolves the present-but-unvaluable failure (A-3476). A position in the cache whose instrument/mark can't be resolved was summed as zero IM — fail-open, overstating available margin. It now bail!s. Margin-increasing orders fail closed; reducing/neutral orders skip full-portfolio valuation.
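The fail-closed rule can be sketched as follows. This is an illustration under stated assumptions, not the logic in check_margin.rs: the flat IM rate, the function names, and the test for "reducing/neutral" (which only looks at the order's own symbol) are all hypothetical simplifications.

```python
from typing import Dict


class UnvaluablePosition(Exception):
    """A cached position has no usable mark (the fail-closed path)."""


def portfolio_im(positions: Dict[str, int], marks: Dict[str, float],
                 im_rate: float) -> float:
    """Sum IM over all positions, refusing to value a position without a mark."""
    total = 0.0
    for symbol, qty in positions.items():
        mark = marks.get(symbol)
        if mark is None:
            # The old behaviour counted this as 0 IM (fail-open).
            raise UnvaluablePosition(symbol)
        total += abs(qty) * mark * im_rate  # hypothetical flat-rate IM
    return total


def can_send_order(positions: Dict[str, int], marks: Dict[str, float],
                   equity: float, symbol: str, qty: int,
                   im_rate: float = 0.1) -> bool:
    current = positions.get(symbol, 0)
    if abs(current + qty) <= abs(current):
        # Reducing or neutral: skip full-portfolio valuation, so a missing
        # mark on an unrelated position cannot block a risk-reducing order.
        return True
    after = dict(positions)
    after[symbol] = current + qty
    try:
        return portfolio_im(after, marks, im_rate) <= equity
    except UnvaluablePosition:
        return False  # margin-increasing order fails closed


# Illustrative data: XAU-PERP is in the cache but has no mark.
book = {"XAU-PERP": 13, "BTC-PERP": 1}
partial_marks = {"BTC-PERP": 100.0}
increase_ok = can_send_order(book, partial_marks, 1_000_000.0, "BTC-PERP", 1)
reduce_ok = can_send_order(book, partial_marks, 1_000_000.0, "BTC-PERP", -1)
```

With the mark missing, the increasing order is rejected regardless of equity, while the reducing order still goes through.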
5. EP3 empty-bookmark snapshot fix + subscription probe [done]
Resolves the demo divergence that opened A-3476 — a dormant 13-lot XAU-PERP that was a never-in-cache case caused by a protocol bug. EP3 sent the position in the full snapshot, then sent an empty "bookmark" snapshot the old apply_snapshot treated as a full replace and wiped. The fix marks the cache ready without replacing contents (A-3478).
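A minimal sketch of the fixed snapshot handling, assuming a dictionary-backed cache. The class shape is hypothetical, and marking the cache ready on a populated snapshot as well is an assumption; the source only states the behaviour for the empty case.

```python
from typing import Dict


class PositionCache:
    """Illustrative stand-in for the order-gateway position cache."""

    def __init__(self) -> None:
        self.positions: Dict[str, int] = {}
        self.ready = False

    def apply_snapshot(self, snapshot: Dict[str, int]) -> None:
        if snapshot:
            # A populated snapshot replaces the cache contents.
            self.positions = dict(snapshot)
        # An empty snapshot is treated as a bookmark: contents are kept
        # and the cache is only marked ready.
        self.ready = True

    def get_position(self, symbol: str) -> int:
        # Absent positions read as 0, which is why a wipe is silent.
        return self.positions.get(symbol, 0)


cache = PositionCache()
cache.apply_snapshot({"XAU-PERP": 13})  # full snapshot
cache.apply_snapshot({})                # empty bookmark snapshot
held = cache.get_position("XAU-PERP")
```

Under the old full-replace behaviour the second call would have left the cache empty and `held` at 0, with nothing inside the feed to flag it.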
6. Land the cutover mechanism: Stack A (no behavior change) [done]
Introduce ORDER_GATEWAY_POSITIONS_SOURCE (risk-engine default / ep3) and route read/replace paths through it. Both PRs (#2165 → #2166, see §2c) ship on the risk-engine default, so production behavior is identical and rollback is config-only: this step lands the apparatus, not the flip. Closes A-2764.
7. Operational verification: the second witness as gate [in progress]
Position tapes went on in prod 2026-05-26 to start the verification clock; ≥72h of taping has long since elapsed, but the gate is divergence, not wall-clock. Each round surfaced a new divergence class that re-opened the window: present-but-unvaluable (A-3476, #2164), the empty-bookmark wipe (A-3478, #2112), and a telemetry gap that hid the signed position behind the warn (A-3605, #2303).

2026-06-09, first decomposed day (#2303's pos= field live in prod): of 61 warns, 57 agreed on the position; the 4 position_mismatch cases were ±1–7-lot fill-propagation races where the ep3 cache held the fresher (correct) figure. The raw warn count is dominated by margin-valuation noise, not feed divergence: both decision paths share the OG-computed open-orders margin term, which ran far above the RE-published figure for the most active MM (computed avail=2422 vs published avail=388k at one 15:44 reject). That depresses avail toward the decision boundary, so ms-scale valuation deltas flip decisions and spray warns. The gap is invisible to the divergence comparison itself (it depresses both sources equally) and is tracked separately (A-3621); #2347 adds the attribution telemetry (published figures + per-symbol position-IM deltas in the warn, plus an avail-gap decision-flip warn).

Gate reframed: judge the feed on the position_mismatch=true divergence rate, which must be explained and ~0, not on raw warn counts, which cannot converge while MM avail sits at the boundary.

Salvage item: the tape-pull/analyze tooling lives only on the abandoned #2033 branch; recover or rebuild it before signing off.
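The reframed gate separates position disagreements from the raw warn count. The sketch below works the 2026-06-09 figures through that split. The record shape and field names are assumptions for illustration, the position values are made up, and the source does not say which denominator the gate's rate should use, so the share of warns shown here is only one candidate.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DivergenceWarn:
    # Hypothetical shape of one parsed divergence warn.
    risk_engine_pos: int
    ep3_pos: int

    @property
    def position_mismatch(self) -> bool:
        return self.risk_engine_pos != self.ep3_pos


def mismatch_summary(warns: Iterable[DivergenceWarn]) -> Dict[str, float]:
    """Split warns into position agreement and position mismatch."""
    items = list(warns)
    mismatched = sum(w.position_mismatch for w in items)
    return {
        "warns": len(items),
        "agreed": len(items) - mismatched,
        "position_mismatch": mismatched,
        "mismatch_share": mismatched / len(items) if items else 0.0,
    }


# Counts from the 2026-06-09 decomposition: 57 agreeing warns, 4 mismatches.
day = [DivergenceWarn(10, 10)] * 57 + [DivergenceWarn(10, 11)] * 4
summary = mismatch_summary(day)
```

Here 4 of 61 warns, about 6.6%, are position mismatches; the other 57 are valuation noise and say nothing about feed content.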
8. Flip the source to ep3 (demo → prod soak) [not started]
Set ORDER_GATEWAY_POSITIONS_SOURCE=ep3 on demo, soak, then prod. The cache path becomes authoritative for can_send_order; Redis runs as shadow. Fully reversible via config — the last reversible step. The soak duration is the safety margin for the one-way step that follows. Demo is flipped and soaking — demo decision divergence logs show ep3 (primary) / risk-engine (shadow), with only cents-scale disagreements. Prod has not flipped: prod still shows risk-engine (primary) / ep3 (shadow), so the prod soak clock has not started.
9. Shift IM authority RE→OG: Stack B [in progress]
Move initial-margin authority from the risk-engine to the order-gateway, then retire the RE compute. Strictly ordered; the last step is the only irreversible one in the whole plan. 9a OG publishes og:margin:{user} (merged); 9b api-gateway fuses it into UserRiskSnapshot; 9c/9d are now one PR — #2062 stops publishing IM and drops the compute together (it folded in the former #2063), so the witness-drop is #2062. Do it only after a clean prod soak — which, per step 8, has not begun.
- #2034 9a: OG publishes per-user margin snapshot to Redis under og:margin:{user_id} (merged on main)
- #2061 9b: api-gateway sources og:margin into UserRiskSnapshot (per-user + admin all-users)
- #2062 9c+9d: risk-engine stops publishing and computing IM, the witness-drop (folded in #2063) → alee/api-gw-fuse-ogw-margin
- Superseded / related
- #2063 9d as a standalone PR — closed; compute-drop cherry-picked onto #2062 → alee/re-stop-publishing-im
- #2033 abandoned base branch; Stack B now stacks on main
10. Account-rekey the calculator + tape writer [in progress]
Re-key the OG calculator's internal state and the tape writer from UserId to AccountId (A-3582 / A-3402), aligning with the merged transaction-engine rekey. Inert under the current 1:1 user:account mapping, so low-risk, but cleanest to land before Stack B churns the same files.
Notebook
Reference design — the detailed mechanics behind the tracker.
§2a The two position views
Two position views exist today and are continuously compared. The whole plan is the disciplined promotion of one over the other.
| View | How it's built | Failure mode |
|---|---|---|
| Redis risk-snapshot (current authority) | risk-engine derives positions from fills, computes IM, publishes a snapshot to Redis | Fresh at rest, but lags under fast fills — the Wasabi margin-lag root cause |
| OG position cache (being promoted) | order-gateway subscribes to the EP3 position stream, computes IM locally; shadow today | Fresher, but can mishandle EP3 protocol edge cases |
Neither source is a priori authoritative — that symmetry is exactly why the position tapes exist as the second witness that gates cutover.
§2b Epistemics: the two "missing position" failures
Distinguishing these two failure classes drove the blocker work and explains why a second witness is needed at all. They look identical in their effect (a position the margin check doesn't see) but differ completely in detectability.
1. Present-but-unvaluable (fixed: #2164, A-3476). A position is in the cache but its instrument/mark can't be resolved. The old code summed it as zero IM (fail-open, overstating available margin). Now it bail!s, failing closed. Margin-increasing orders fail closed if the portfolio can't be fully valued; reducing/neutral orders skip full-portfolio valuation so a missing mark on an unrelated position can't block a risk-reducing order. See check_margin.rs. This class is detectable from inside the feed: the position is present, you just can't price it.
2. Never-in-the-cache (the hard case). EP3 never delivered the position, or it was dropped and no subsequent Update arrives (dormant instrument). The feed cannot detect the absence of something it was never told about: get_position returns 0 for absent. This is not fixable from inside the feed.
The demo divergence that opened A-3476 — a dormant 13-lot XAU-PERP, ~$7.6k — was case (2) caused by a protocol bug, not a bad feed: EP3 sent the position in the full snapshot, then sent an empty "bookmark" snapshot that the old apply_snapshot treated as a full replace and wiped. Fixed in #2112 (A-3478): an empty snapshot now marks the cache ready without replacing contents.
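The only defence against case (2) is the independent second witness, exercised during the prod soak on ORDER_GATEWAY_POSITIONS_SOURCE=ep3 (§2e), a soak that, as of this snapshot, has not started because prod has not flipped.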
§2c The cutover mechanism
The promotion is gated entirely behind one env var, so that "which feed is authoritative" is an operational decision, not a deploy.
| Value | Authoritative for can_send_order | Other feed |
|---|---|---|
| risk-engine default | Redis risk-snapshot (status quo) | OG cache runs as shadow, tapes recorded |
| ep3 | OG EP3-fed position cache | Redis runs as shadow |
Stack A (step 6) lands this framework and routes the read/replace paths through it (#2165 → #2166, both now merged on main), both shipping on the risk-engine default. Because the default reproduces current behavior exactly, the mechanism landed with no soak of its own; the flip to ep3 (step 8) is the operational act, rolled demo → prod and reversible in seconds.
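The selection logic in the table can be sketched in a few lines. The function name and the decision to reject unknown values are assumptions for illustration; only the variable name, its two values, and the risk-engine default come from this design.

```python
import os
from typing import Mapping, Tuple

VALID_SOURCES = ("risk-engine", "ep3")


def select_sources(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (primary, shadow) feeds for can_send_order."""
    # Unset means the status quo: the risk-engine snapshot is authoritative.
    primary = env.get("ORDER_GATEWAY_POSITIONS_SOURCE", "risk-engine")
    if primary not in VALID_SOURCES:
        # Assumption: an unrecognised value is an error, not a silent default.
        raise ValueError(f"unknown positions source: {primary!r}")
    shadow = "ep3" if primary == "risk-engine" else "risk-engine"
    return primary, shadow


default_pair = select_sources({})
flipped_pair = select_sources({"ORDER_GATEWAY_POSITIONS_SOURCE": "ep3"})
```

Because the other feed always keeps running as shadow, flipping the variable back restores the previous authority without a deploy.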
An earlier attempt at this mechanism lived on the alee/og-promote-position-cache branch (#2033, closed). Stack A is the clean re-implementation on main. That dead branch is also where the tape-pull/analyze tooling is stranded; see the salvage item in §1 step 7 and §3.2.
§2d Shifting IM authority RE→OG (Stack B)
Promoting the cache for can_send_order (step 8) makes the OG the authority for the margin check. Stack B finishes the job: it makes the OG the authority for the published IM as well, then retires the risk-engine's compute. The sequence matters because each step assumes the previous one is live.
- OG publishes margin (#2034, merged): order-gateway writes og:margin:{user_id} to Redis on a 2s sweep (30s TTL), backed by a lock-free imbl position cache. It was rebased off the dead #2033 onto main and landed 2026-06-07; additive, since nothing read the key yet at merge.
- api-gateway fuses it (#2061): og:margin is sourced into the UserRiskSnapshot response, per-user and admin all-users. Additive: nothing is removed yet, so both sources coexist.
- RE stops publishing and computing IM (#2062): the witness-drop. Originally two PRs, a reversible publish-stop (9c) and the irreversible compute-drop (9d, the former #2063, now closed), they were folded into one: keeping RE computing IM once it is no longer published bought nothing (no consumer reads historical IM from ClickHouse, and the admin views now fuse OG's live og:margin via #2061), so both land together. The consequence: there is no longer a reversible intermediate, and merging #2062 is itself the point of no return. Do it only after step 8 has soaked clean in prod.
9a is merged and 9b is additive; the entire irreversibility now sits in #2062. That is the reason Stack B is sequenced after the operational soak rather than bundled with the mechanism — and the reason the publish-stop/compute-drop fold raises the stakes of getting the soak right, since you can no longer stop publishing as a reversible dry-run.
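The 2s sweep and 30s TTL from step 9a interact in a way that a small simulation makes concrete. The store below is an in-memory stand-in for a Redis key with expiry, not the real client, and the payload (a single float per user) is a placeholder for whatever the actual snapshot contains.

```python
from typing import Dict, Optional, Tuple

SWEEP_SECONDS = 2   # publish interval stated in the design
TTL_SECONDS = 30    # key expiry stated in the design


class TtlStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, float]] = {}

    def set(self, key: str, value: float, now: float, ttl: float) -> None:
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: float) -> Optional[float]:
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return None  # missing or expired
        return entry[0]


def sweep(store: TtlStore, margins: Dict[str, float], now: float) -> None:
    """Publish every user's margin, refreshing the TTL each time."""
    for user_id, im in margins.items():
        store.set(f"og:margin:{user_id}", im, now, TTL_SECONDS)


store = TtlStore()
sweep(store, {"u1": 1250.0}, now=0.0)
sweep(store, {"u1": 1300.0}, now=float(SWEEP_SECONDS))  # last sweep before a stall
fresh = store.get("og:margin:u1", now=10.0)
expired = store.get("og:margin:u1", now=32.0)
```

If the publisher stalls, the last value stays readable for 30s, i.e. 15 missed sweeps, and then the key disappears, so a reader sees an absent key instead of a stale margin figure. Whether consumers treat absence as an error is not specified in this design.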
§2e The reversibility boundary
The plan has exactly one irreversible step. Phases 1–8 are reversible (config
flip). Phase 9d, now carried by #2062 (the witness-drop), is irreversible:
the risk-engine stops computing IM and the shadow comparison loses its
independent source. Because #2062 folded the reversible publish-stop in with
the compute-drop, the boundary is now a single PR merge with no reversible
dry-run before it. The only real judgment call is how long to soak step 8 on
ep3 before landing #2062 — that soak is the safety margin traded away with
the witness.
§3 Design Questions
- Soak length: how long do we run step 8 on ep3 in prod before landing the irreversible #2062?
  - Not yet answered. The judgment call of the plan, and now sharper: #2062 folds the publish-stop in with the compute-drop, so there is no reversible dry-run before the witness goes. Prod has not even flipped to ep3 yet, so the soak clock has not started. Needs a stated, defensible duration and divergence threshold before #2062 is scheduled.
- Prod divergence: is the ep3-negative-available pattern a position-feed error?
  - Answered (negative). Resolved 2026-06-09 with #2303's pos= telemetry: in 57/61 warns the two sources agreed on the position, so the avail disagreements are margin-valuation noise, not feed content. Both decision paths share the OG-computed open-orders margin term, which ran far above the RE-published figure for the most active MM (computed avail $2.4k vs published $388k), pushing avail to the decision boundary where ms-scale repricing deltas flip sign. "ep3 negative available" does not mean the cache thinks the account is underwater. The open-orders input gap itself is a real (path-independent) issue tracked as A-3621, with attribution telemetry merged in #2347.
- Open-orders margin inputs: why does the OG's locally computed term exceed the risk-engine's published figure (~$662k vs ~$276k on 2026-06-09)?
  - Not yet answered. Both run the same OpenOrdersMarginComputer, so the inputs differ: OG's in-memory book vs RE's event-derived book. A phantom-orders-from-restart-sync theory was investigated and retracted (the suspect orders had terminated normally; the CH order log is too lossy for high-volume users to witness the book). Orthogonal to the ep3 flip, since it biases both paths equally, but it silently rejects MM orders today. #2347's flip warn quantifies it continuously; the decisive measurement is diffing the OG book against EP3 SearchOrders(Open) live (or reading og:margin:{user}).
- Tape-pull tooling: salvage from the dead #2033 branch, or rebuild on main?
  - Not yet answered. Leaning rebuild on main (the branch is otherwise dead). Must be resolved before the step-7 verification can be signed off.
- Stack B base: is #2034 still blocked on a rebase off the dead #2033 branch?
  - Answered (negative). Resolved: #2034 was rebased off the abandoned #2033 onto main and merged 2026-06-07. The rest of Stack B (#2061 → #2062) now stacks on main.
- Permanent replacement witness: after #2062, do we need a standing independent detector?
  - Not yet answered. The never-in-cache class is undetectable from inside the feed, and #2062 removes the only general detector. Open whether the answer is "trust the soak" or a permanent lightweight cross-check. Not committed either way.