Date: 2026-06-10
Status: Draft
Scope: recon-engine; the four quiet-window invariant checks
A sidequest off No-Downtime Upgrades §5. Four recon-engine checks are pinned to the daily downtime window because their two sides have no common clock. That RFC argues the fix is epoch-pinned as-of comparison. This RFC says how we get there safely: do not touch the existing checks. Build the continuous versions as new checks, run both side by side in signal-only mode, and let weeks of recorded agreement — not code review — decide when the new ones earn paging duty. The old checks are financial-integrity tripwires; we don't swap a tripwire, we run its replacement in parallel until it has proven it trips on the same things and nothing else.
One new check per quiet-window check, registered alongside the old one in `all_checks` with its own `id()` and a frequent default schedule. Old and new differ only in how they read, never in what invariant they assert.
| Existing (windowed, pages) | New (continuous, signal-only) | Mechanism change |
|---|---|---|
| `transactions-balance` | `transactions-balance-live` | Read (sum of balances, epoch E) atomically from PG; compare against CH transaction sums as of E. |
| `position-realized-pnls-balance` | `position-realized-pnls-balance-live` | Pin both CH tables to the same epoch: positions as-of E vs transactions as-of E. |
| `pnl-is-zero-sum` | `pnl-is-zero-sum-live` | Already cycle-pinned (#1693); the live variant runs on every mark cycle instead of twice daily, measuring whether the full-scan cost is actually tolerable at frequency. |
| `omnibus-balances-square` | `omnibus-balances-square-live` | Precondition becomes "no in-flight custody transfers" (detected), not "exchange downtime" (scheduled); skip the run when transfers straddle the reads. |
Default schedule for the live checks: `0 */5 * * * *` (matching the cheap checks), except `pnl-is-zero-sum-live`, which keys off mark cycles, and `omnibus-balances-square-live`, which runs every 15 minutes subject to its precondition.
The shared mechanism, per the parent RFC: the trade-engine's tandem commit (`state_machine_driver.rs`) writes balances, transactions, and the resume token with `latest_execution_ns` atomically in one PG transaction, and CH writes for a batch are issued before that batch's token commits.
So:
1. Read the sum of balances and `trade_engine.resume_tokens.latest_execution_ns` from PG in a single repeatable-read transaction. The pair is a consistent cut at a known epoch E by construction.
2. Query CH as of E. `query_sum_by_transaction_type_as_of` already exists; positions need an as-of variant (filter `timestamp_ns <= E` with `argMax` selection; never trust row order).
3. CH visibility of E may lag a few seconds behind the PG commit. The check polls CH's max ingested `timestamp_ns` and proceeds once it reaches E, with a bounded retry budget; exhausting the budget is an `Error` (infra signal), not a `Fail` (invariant signal). Conflating the two would poison the soak data.
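The as-of selection and the visibility wait described above can be sketched in a few lines. This is a minimal in-memory illustration, not the recon-engine code: `Row`, `rows_as_of`, `Visibility`, and `wait_for_visibility` are hypothetical names, the row shape is invented for the example, and the closure stands in for whatever query returns CH's max ingested `timestamp_ns`.

```rust
use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical row shape; the real CH schema is not shown in this RFC.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub key: u64,
    pub timestamp_ns: u64,
    pub value: i64,
}

/// As-of selection: per key, keep the row with the greatest
/// `timestamp_ns <= epoch_ns`, regardless of input order (argMax-style).
pub fn rows_as_of(rows: &[Row], epoch_ns: u64) -> HashMap<u64, Row> {
    let mut latest: HashMap<u64, Row> = HashMap::new();
    for row in rows.iter().filter(|r| r.timestamp_ns <= epoch_ns) {
        let newer = latest
            .get(&row.key)
            .map_or(true, |cur| row.timestamp_ns > cur.timestamp_ns);
        if newer {
            latest.insert(row.key, row.clone());
        }
    }
    latest
}

/// Outcome of waiting for CH to catch up to epoch E.
#[derive(Debug, PartialEq, Eq)]
pub enum Visibility {
    /// Max ingested timestamp reached E; the comparison may proceed.
    Reached { attempts: u32 },
    /// Retry budget exhausted: the caller should record Error, never Fail.
    BudgetExhausted { last_seen_ns: u64 },
}

/// Polls `max_ingested_ns` until it returns a value >= `epoch_ns`,
/// giving up after `max_attempts` polls.
pub fn wait_for_visibility<F>(
    epoch_ns: u64,
    max_attempts: u32,
    backoff: Duration,
    mut max_ingested_ns: F,
) -> Visibility
where
    F: FnMut() -> u64,
{
    let mut last_seen_ns = 0;
    for attempt in 1..=max_attempts {
        last_seen_ns = max_ingested_ns();
        if last_seen_ns >= epoch_ns {
            return Visibility::Reached { attempts: attempt };
        }
        if attempt < max_attempts {
            sleep(backoff);
        }
    }
    Visibility::BudgetExhausted { last_seen_ns }
}
```

Keeping `BudgetExhausted` as its own variant means the caller cannot accidentally report a slow ingest as an invariant break, which is the separation the soak data depends on.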
`CheckStatus` is `Pass | Fail | Error`. The omnibus live check needs a fourth outcome: "precondition not met, nothing asserted." Add `CheckStatus::Skipped`, recorded in `invariants_log` like any other status. A skip is not a pass: the soak analysis must track skip rate, because a check that skips 95% of the time is not a viable replacement, and we want to learn that during the soak, not after promotion.
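A minimal sketch of the fourth status and the skip-rate figure the soak analysis would track. Only the variant names come from this RFC; the derive list, `omnibus_outcome`, and `skip_rate` are hypothetical helpers written for illustration, and the real enum may carry more data.

```rust
/// The three existing statuses plus the proposed `Skipped`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckStatus {
    Pass,
    Fail,
    Error,
    /// Precondition not met, nothing asserted.
    Skipped,
}

/// Hypothetical outcome logic for the omnibus live check: skip when
/// custody transfers straddle the reads, otherwise assert equality.
pub fn omnibus_outcome(transfers_in_flight: bool, left: i128, right: i128) -> CheckStatus {
    if transfers_in_flight {
        CheckStatus::Skipped
    } else if left == right {
        CheckStatus::Pass
    } else {
        CheckStatus::Fail
    }
}

/// Fraction of runs that were skipped; `None` when there are no runs,
/// so an empty log is never mistaken for a 0% skip rate.
pub fn skip_rate(statuses: &[CheckStatus]) -> Option<f64> {
    if statuses.is_empty() {
        return None;
    }
    let skipped = statuses
        .iter()
        .filter(|s| **s == CheckStatus::Skipped)
        .count();
    Some(skipped as f64 / statuses.len() as f64)
}
```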
New checks must be incapable of paging during the soak. Extend the per-check `CheckConfig` (`config.rs`) with `paging: bool` (default `true`); the incident.io sink consults it before posting. The four live checks ship with `paging = false` in the watch config. Failures still log at warn and land in `invariants_log`, visible everywhere except the pager. This flag is also the promotion lever: promotion day is a config flip, not a deploy.
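The gate might look roughly like the following. This is a sketch under assumptions: the real `CheckConfig` in `config.rs` has fields not shown here, `should_page` is a hypothetical name for the test the incident.io sink would apply, and config deserialization is left out.

```rust
/// Assumed subset of the per-check config; only `paging` is new.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckConfig {
    pub enabled: bool,
    /// Whether failures of this check may reach the pager.
    pub paging: bool,
}

impl Default for CheckConfig {
    /// Paging defaults to true so existing checks keep their behavior
    /// when the field is absent from their config.
    fn default() -> Self {
        Self { enabled: true, paging: true }
    }
}

/// Hypothetical gate consulted by the sink before posting an incident.
/// Logging and `invariants_log` writes are not affected by this flag.
pub fn should_page(cfg: &CheckConfig, is_failure: bool) -> bool {
    cfg.enabled && cfg.paging && is_failure
}
```

Defaulting to `true` keeps the change safe for checks whose config never mentions the field: only the four live checks opt out explicitly.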
Both check families run concurrently. Old checks keep their schedules and their paging duty; nothing about the existing safety posture changes. The live checks accumulate a record in `invariants_log` under their own `check_id`s, which makes the comparison a query, not a judgment call.
Sequencing: land on ax-demo first and let it run for ~1 week to shake out check bugs (wrong as-of filter, visibility-horizon misjudgment) where divergence is harmless. Then enable on prod for a 4-week soak.
Divergence scan (weekly, scripted; same discipline as the position-feed promotion soak): for each old-check fire time during the window, join against the nearest live-check run within ±10 minutes and compare status and left/right values. Separately, enumerate every live-check `Fail`, `Error`, and `Skipped` episode for adjudication.
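The nearest-run join can be sketched in memory as below. The actual scan would run as a query over `invariants_log`; the `Run` shape, second-resolution timestamps, and the `Verdict` names are assumptions made for this illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Fail,
    Error,
    Skipped,
}

/// Hypothetical projection of one `invariants_log` row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Run {
    pub at_secs: i64,
    pub status: Status,
    pub left: i128,
    pub right: i128,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Agree,
    StatusMismatch,
    ValueMismatch,
    /// No live run within the window; worth flagging on its own.
    NoLiveRun,
}

/// Join window: ±10 minutes around the old check's fire time.
pub const WINDOW_SECS: i64 = 10 * 60;

/// Nearest live run to `old` within the window, if any.
pub fn nearest_live<'a>(old: &Run, live: &'a [Run]) -> Option<&'a Run> {
    live.iter()
        .filter(|r| (r.at_secs - old.at_secs).abs() <= WINDOW_SECS)
        .min_by_key(|r| (r.at_secs - old.at_secs).abs())
}

/// Compare one old-check run against its nearest live counterpart.
pub fn compare(old: &Run, live: &[Run]) -> Verdict {
    match nearest_live(old, live) {
        None => Verdict::NoLiveRun,
        Some(l) if l.status != old.status => Verdict::StatusMismatch,
        Some(l) if (l.left, l.right) != (old.left, old.right) => Verdict::ValueMismatch,
        Some(_) => Verdict::Agree,
    }
}
```

Returning `NoLiveRun` as a distinct verdict keeps a gap in live coverage from reading as agreement.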
The live checks earn paging duty when, over the full prod soak:
- Every live-check `Fail` is adjudicated: either it reflects a real invariant break (old check / incident confirms), which is the live check working and catching it hours earlier, or it is root-caused as a check bug, fixed, and that check's soak clock restarts.
- Every live-check `Error` is adjudicated as well (excluding platform-wide outages).
- `omnibus-balances-square-live` skip rate low enough that it still asserts the invariant at least daily.
- `pnl-is-zero-sum-live` either proves the full scan tolerable or makes the case for the accumulator follow-up.
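The "asserts the invariant at least daily" criterion lends itself to a mechanical check. The sketch below assumes that a run counts as an assertion only when it ends in `Pass` or `Fail`, that days are 24-hour buckets counted from the soak start, and that timestamps are in seconds; `days_without_assertion` is a hypothetical helper, not part of the existing tooling.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Fail,
    Error,
    Skipped,
}

pub const SECS_PER_DAY: i64 = 24 * 60 * 60;

/// Returns the zero-based soak days on which the check never asserted
/// the invariant, i.e. had no `Pass` or `Fail` run. An empty result
/// means the "at least daily" criterion holds.
pub fn days_without_assertion(
    runs: &[(i64, Status)],
    soak_start_secs: i64,
    soak_days: i64,
) -> Vec<i64> {
    let mut asserted: BTreeSet<i64> = BTreeSet::new();
    for &(at_secs, status) in runs {
        let asserts = matches!(status, Status::Pass | Status::Fail);
        let day = (at_secs - soak_start_secs).div_euclid(SECS_PER_DAY);
        if asserts && (0..soak_days).contains(&day) {
            asserted.insert(day);
        }
    }
    (0..soak_days).filter(|d| !asserted.contains(d)).collect()
}
```

Listing the offending days, rather than returning a bare boolean, gives adjudication a starting point when the criterion fails.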
When the criteria hold: flip `paging = true` on the live checks, flip `paging = false` on the old ones, and keep the old checks running on their downtime schedule for two further weeks as a silent cross-check. Then disable the old checks via config (`enabled = false`), and remove the code one release later. The `-live` ids stay: renaming back would split each check's history in `invariants_log` across two ids for cosmetic benefit.
If criteria are not met by week 4, the soak doesn't auto-extend indefinitely: adjudicate, fix, restart the affected check's clock, and if a check fundamentally can't reach agreement (most plausible for the Anchorage-coupled one), document why and leave the old check in place — a partial promotion of three checks is still three checks the downtime window no longer holds hostage.
1. `CheckStatus::Skipped`; `paging` flag in `CheckConfig` + incident.io sink respects it.
2. PG atomic read (sum of balances + `latest_execution_ns`, one repeatable-read txn); CH positions as-of query; visibility-horizon retry helper.
3. The four `-live` checks, registered in `all_checks`, `paging = false`.
4. Divergence-scan script over `invariants_log` (old-vs-live join + episode enumeration), runnable by anyone, output checked into the soak log.
Companion to No-Downtime Upgrades §5. Code references as of 99138ce.