Date: 2026-06-10
Status: Draft. Nothing here is implemented; this document proposes the end-state design and a phased rollout.
AX has no notifications system. What exists today is a set of disconnected fragments:
- useMarginAlert.ts (gui/packages/app/hooks/useMarginAlert.ts) compares maintenance margin utilization from the risk snapshot against MARGIN_CALL_WARNING_PERCENT / MARGIN_CALL_CRITICAL_PERCENT and renders header pills (gui/apps/web/src/components/GlobalMessages.web.tsx). If the tab is closed, the client is never told they are approaching liquidation. There is no record that the alert was ever shown.
- risk-monitor computes breach state but emits nothing (rs/risk-monitor/src/monitor.rs): margin breaches and limit violations are persisted, with no path to the customer.
- rs/settlement-engine/src/notifier.rs posts settlement results to an internal channel; recon-engine fires incident.io alerts. There is no email, no Telegram, no SMS, and no push anywhere in the codebase.
- Toasts (gui/packages/app/data/notifications/context.tsx: max 12, 5-second auto-dismiss, bottom-left stack) are fed ad hoc by OrdersPub.tsx on order events. There is no inbox, no persistence, no "seen" state: refresh the page and everything is gone.
- The users table has no email column; contact emails exist only inside onboarding questionnaire rows (business_due_diligence_questionnaires.exchange_notifications_contact_*), unverified and unused.
Meanwhile the things an exchange must tell its customers keep growing: scheduled downtime and upgrades, new product/feature launches, dividend and corporate-action announcements on equity index products (#2256 just added equities to the Pyth stream), settlement notices, deposit/withdrawal confirmations, security events (new API key, new session), and — most critically — margin calls, where "we told the customer, and here is the proof" is a compliance requirement, not a nicety.
Non-goals (for now): SMS/mobile push (no mobile trading app yet), marketing campaigns/drip email, in-thread two-way chat over any channel.
Hosted notification platforms (Knock, Courier, Novu) solve template management and channel fan-out well, but they put customer identity, margin state, and announcement timing in a third party's hands, and none of them model our account-permission visibility rules. Self-hosted Novu was considered and rejected as a large operational surface (Mongo + Redis + several Node services) for a feature set we mostly don't need. We build a thin engine on the patterns we already run everywhere (Postgres outbox + per-service WebSocket), and use commodity providers only at the very edge (an email API, the Telegram Bot API, Slack chat.postMessage — the settlement-engine notifier already shows the shape of that last one).
A notification is a single logical message: what happened, addressed to an audience, with a severity and kind. It is immutable once published.
- Kind (margin_call, scheduled_downtime, dividend_announcement, feature_announcement, withdrawal_settled, security_new_api_key, …). Kinds are registered in code with a default routing policy (see below); producers cannot invent kinds at runtime.
- Severity: info | warning | critical | action_required. action_required notifications must be explicitly acknowledged (e.g. margin call); the others are passive.
- Params (e.g. {"utilization_pct": 87.4, "account_id": "...", "threshold": "warning"}). Channel renderers turn kind + params into channel-appropriate copy; a pre-rendered plain title/body is stored as fallback so old notifications survive template changes.
- Audience: users: [user_id...], accounts: [account_id...], or all. Account audiences are resolved to users at publish time via account_permissions.can_read; recipient resolution is a one-time write-path snapshot, never a read-path fan-out (consistent with the single-account-key read principle from PR #2056).
- Display: toast (transient), banner (persistent header strip until dismissed/expired), modal (blocking, for action_required).
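The "registered in code, not invented at runtime" rule can be sketched as a closed enum. This is a minimal illustration, not the engine's actual registry: the type names, the `parse`/`default_severity` helpers, and the subset of kinds shown are assumptions; the severities are taken from the routing table in this document.

```rust
/// Severity levels from the model above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Critical,
    ActionRequired,
}

impl Severity {
    /// Only action_required notifications need an explicit acknowledgement.
    pub fn requires_ack(self) -> bool {
        matches!(self, Severity::ActionRequired)
    }
}

/// A closed set of kinds (hypothetical subset). Adding one is a code change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    MarginCall,
    MarginWarning,
    ScheduledDowntime,
    WithdrawalSettled,
}

impl Kind {
    /// Wire name, as it would be stored in notifications.kind.
    pub fn as_str(self) -> &'static str {
        match self {
            Kind::MarginCall => "margin_call",
            Kind::MarginWarning => "margin_warning",
            Kind::ScheduledDowntime => "scheduled_downtime",
            Kind::WithdrawalSettled => "withdrawal_settled",
        }
    }

    /// Unknown names are rejected, so producers cannot invent kinds.
    pub fn parse(name: &str) -> Option<Kind> {
        match name {
            "margin_call" => Some(Kind::MarginCall),
            "margin_warning" => Some(Kind::MarginWarning),
            "scheduled_downtime" => Some(Kind::ScheduledDowntime),
            "withdrawal_settled" => Some(Kind::WithdrawalSettled),
            _ => None,
        }
    }

    /// Default severity per kind, following the routing table.
    pub fn default_severity(self) -> Severity {
        match self {
            Kind::MarginCall => Severity::ActionRequired,
            Kind::MarginWarning | Kind::ScheduledDowntime => Severity::Warning,
            Kind::WithdrawalSettled => Severity::Info,
        }
    }
}
```

With this shape, an ingest handler would call `Kind::parse` on the incoming name and refuse the publish when it returns `None`.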
Each (notification, user) pair carries lifecycle state, advanced only by the recipient:
addressed ──► delivered(channel)* ──► seen ──► dismissed
└────► acknowledged (action_required only)
- acknowledged: explicit acknowledgement on action_required modals; recorded with timestamp for the compliance trail.
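The recipient-side part of this lifecycle could be modeled as below. It is a sketch under assumptions: the struct mirrors the seen_at / dismissed_at / acknowledged_at columns, timestamps are plain integers, the first write wins, and dismissing or acknowledging implies "seen". None of those rules are stated in this document.

```rust
/// Per-(notification, user) state, mirroring the nullable timestamp columns.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct RecipientState {
    pub seen_at: Option<u64>,
    pub dismissed_at: Option<u64>,
    pub acknowledged_at: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TransitionError {
    /// Acknowledgement only exists for action_required notifications.
    AckNotApplicable,
}

impl RecipientState {
    /// Record the first time the recipient saw the notification.
    pub fn mark_seen(&mut self, now: u64) {
        self.seen_at.get_or_insert(now);
    }

    /// Dismissing implies the notification was seen (assumption).
    pub fn dismiss(&mut self, now: u64) {
        self.mark_seen(now);
        self.dismissed_at.get_or_insert(now);
    }

    /// Acknowledge an action_required notification; rejected otherwise.
    pub fn acknowledge(&mut self, now: u64, action_required: bool) -> Result<(), TransitionError> {
        if !action_required {
            return Err(TransitionError::AckNotApplicable);
        }
        self.mark_seen(now);
        self.acknowledged_at.get_or_insert(now);
        Ok(())
    }
}
```

Keeping the first timestamp rather than the latest is a deliberate choice for an audit trail: a repeated click cannot move the recorded acknowledgement time.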
A contact point is a verified address on a channel: a confirmed email address, a Telegram chat_id captured via a t.me/<bot>?start=<token> linking flow, or a Slack channel ID (AX already runs shared per-client Slack channels; the bot posts into those). The ui channel is implicit and needs no contact point. Unverified addresses are never sent to.
Each kind declares a default routing matrix (which channels, at which severity). Users override per (category, channel): category, not kind, to keep the settings page tractable (account_activity, risk, announcements, security). Two hard rules sit above preferences:
- Mandatory kinds cannot be disabled on the ui channel: margin_call, account_frozen, scheduled_downtime, security events. (Users may still disable the email/telegram copies of most of these; margin calls are default-on for every channel with a verified contact point.)
- … risk and security, which are opt-out.
| kind (examples) | severity | ui | email | telegram | slack |
|---|---|---|---|---|---|
| margin_call | action_required | modal ✔ | ✔ | ✔ | ✔ |
| margin_warning | warning | banner | opt | opt | opt |
| scheduled_downtime | warning | banner ✔ | ✔ | opt | ✔ |
| dividend_announcement | info | inbox | opt | opt | opt |
| feature_announcement | info | inbox | opt | — | — |
| withdrawal_settled | info | toast | ✔ | opt | opt |
| security_new_api_key | warning | toast ✔ | ✔ | ✔ | — |
(✔ = default on, opt = default off, — = never. Exact matrix lives in code next to the kind registry, not in this doc.)
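How defaults, user preferences, and the ui-mandatory rule combine can be sketched as one pure function. The names (`Channel`, `DefaultPolicy`, `should_deliver`) are hypothetical, and the sketch only enforces the mandatory rule on the ui channel; whether the external copies of margin_call can be disabled is left to the real registry.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Ui,
    Email,
    Telegram,
    Slack,
}

/// One cell of the routing table: default on, opt (default off), or never.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DefaultPolicy {
    On,
    Opt,
    Never,
}

/// Decide whether one (notification, user, channel) delivery should happen.
/// `user_pref` is the user's (category, channel) override, if any.
pub fn should_deliver(
    channel: Channel,
    default: DefaultPolicy,
    mandatory_on_ui: bool,
    user_pref: Option<bool>,
    has_verified_contact_point: bool,
) -> bool {
    // "never" cells cannot be enabled by any preference.
    if default == DefaultPolicy::Never {
        return false;
    }
    // Fall back to the table default when the user has no override.
    let wanted = user_pref.unwrap_or(default == DefaultPolicy::On);
    match channel {
        // The ui channel needs no contact point; mandatory kinds ignore prefs.
        Channel::Ui => mandatory_on_ui || wanted,
        // External channels are only used with a verified contact point.
        _ => has_verified_contact_point && wanted,
    }
}
```

For example, a user who turned off every risk channel would still get a margin_call in the UI (`mandatory_on_ui = true`), while their email copy would be skipped.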
A new service, rs/notification-engine, follows the standard skeleton (main.rs/config.rs/service.rs, env config via ax_sdk_internal::env_vars, HealthPublisher, TaskGroup).

    producers                          notification-engine            sinks
    ─────────                          ───────────────────            ─────
    risk-monitor       ─┐                                             ┌─► ui: WS stream + REST inbox
    settlement-engine  ─┤  internal    ┌──────────┐   ┌────────────┐  │      (dispatcher tails Postgres
    api-gateway        ─┼─ publish ───►│ ingest / │──►│ Postgres   │──┤       on seq / state_seq)
    admin GUI ─► api-gw │ (HTTP/IPC)   │ router   │   │ outbox     │  ├─► email worker ──► provider API
    admin routes ───────┘              └──────────┘   │ deliveries │  ├─► telegram worker ► Bot API
                                            │         └────────────┘  └─► slack worker ──► chat.postMessage
                                            ▼               │
                                     rate limiter /         ▼ (batch, analytics-style)
                                     circuit breaker  ClickHouse notification_log
                                                      (append-only audit)

Redis is deliberately absent: every hop in the UI path is either Postgres or a connection whose failure forces the client to re-sync from Postgres. The existing Redis pub/sub pattern fits market data, where the next tick supersedes a lost message; notifications are the opposite shape — low frequency, every message matters — so the durable store is the transport.
Producers call POST /internal/publish on notification-engine (intra-VPC HTTP, same trust posture as other internal service calls), passing kind, params, audience, dedupe_key. The SDK-internal crate gets a small typed client (NotificationPublisher) so producers don't hand-roll requests. Publishing is synchronous up to the Postgres commit: once the producer gets 200, delivery is the engine's durable responsibility. Producers must treat publish as fire-and-forget after that; a producer must never block trading work on notification delivery.
Admin-authored announcements (downtime, dividends, features) come in through the admin GUI at admin.architect.co, via new /admin/notifications/* routes in api-gateway that proxy to notification-engine's internal API. Admin authn/authz stays where it already lives (api-gateway, users.is_admin), and the acting admin's user id is threaded through as the actor on every mutation: it lands in notifications.source and the audit log. See §Management surfaces for the full control panel.
Six tables (db/postgres/1.sql). IDs follow the existing CHAR(16) convention.
CREATE TABLE notifications (
id CHAR(16) PRIMARY KEY,
kind TEXT NOT NULL,
severity TEXT NOT NULL,
params JSONB NOT NULL DEFAULT '{}'::jsonb,
title TEXT NOT NULL,
body TEXT NOT NULL,
source TEXT NOT NULL, -- producing service / admin user id
dedupe_key TEXT UNIQUE,
account_id CHAR(16) REFERENCES trading_accounts(id), -- subject, if account-scoped
publish_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE SEQUENCE notification_state_seq;
CREATE TABLE notification_recipients (
seq BIGINT GENERATED ALWAYS AS IDENTITY,
state_seq BIGINT NOT NULL DEFAULT 0, -- engine sets nextval() on every state change
notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
user_id CHAR(16) NOT NULL REFERENCES users(id),
seen_at TIMESTAMPTZ,
dismissed_at TIMESTAMPTZ,
acknowledged_at TIMESTAMPTZ,
PRIMARY KEY (notification_id, user_id)
);
CREATE INDEX ON notification_recipients (user_id, seq);
CREATE INDEX ON notification_recipients (state_seq);
-- user_contact_points is created first because notification_deliveries references it.
CREATE TABLE user_contact_points (
id CHAR(16) PRIMARY KEY,
user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
channel TEXT NOT NULL,
address TEXT NOT NULL, -- email | telegram chat_id | slack channel id
verified_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE (user_id, channel, address)
);
CREATE TABLE notification_deliveries (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
user_id CHAR(16) NOT NULL REFERENCES users(id),
channel TEXT NOT NULL, -- 'email' | 'telegram' | 'slack'
contact_point_id CHAR(16) NOT NULL REFERENCES user_contact_points(id),
status TEXT NOT NULL DEFAULT 'pending', -- pending|sent|failed|dead|suppressed
attempts INT NOT NULL DEFAULT 0,
next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_error TEXT,
provider_message_id TEXT,
sent_at TIMESTAMPTZ,
UNIQUE (notification_id, user_id, channel)
);
CREATE INDEX ON notification_deliveries (status, next_attempt_at)
WHERE status = 'pending';
CREATE TABLE user_notification_preferences (
user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
category TEXT NOT NULL, -- account_activity|risk|announcements|security
channel TEXT NOT NULL,
enabled BOOLEAN NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, category, channel)
);
CREATE TABLE notification_channel_controls (
channel TEXT PRIMARY KEY, -- 'email' | 'telegram' | 'slack' | '*'
paused BOOLEAN NOT NULL DEFAULT FALSE,
reason TEXT,
actor_user_id CHAR(16) REFERENCES users(id),
updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Design notes:
- Fan-out on write: recipients are materialized as notification_recipients rows inside the publish transaction. At AX's scale (institutional user base, O(10²–10³) users) a broadcast is a few thousand rows. That is trivially fine, and it buys simple single-key inbox reads (WHERE user_id = $1 ORDER BY seq), a clean per-user cursor, and point-in-time audience snapshots (later permission changes don't retroactively change who was notified, which is exactly what an audit wants). Revisit only if the user count grows two orders of magnitude.
- seq is both the catch-up cursor and the realtime transport. It is a global identity column, indexed per user. Clients resume with ?since_seq=, and the WS dispatcher (below) tails the same cursor to discover new rows: no gaps, no timestamp skew, one ordering for both paths. state_seq plays the same role for updates (seen/dismissed/acknowledged): identity columns only advance on insert, and the engine is the sole writer, so its UPDATE statements set state_seq = nextval('notification_state_seq') explicitly, with no triggers.
- Channel pause state lives in notification_channel_controls, not env vars, so an admin can pause/resume a channel from the control panel without a redeploy; who paused it and why is part of the row. One global NOTIFICATIONS_DISABLED env var remains as the break-glass override.
- Suppressed deliveries are recorded with status = 'suppressed' and the reason in last_error: suppression is audited, not silent.
One worker task per channel (spawned under TaskGroup, wrapped in run_durably with the standard DurableTaskConfig exponential backoff). Each loop claims work with the classic skip-locked poll:
SELECT ... FROM notification_deliveries
WHERE channel = $1 AND status = 'pending' AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT $batch
FOR UPDATE SKIP LOCKED;
- Exhausted retries end in status = 'dead' plus an ops alert (incident.io, same pattern as recon-engine). Dead deliveries are re-runnable by admin action.
- Delivery to external channels is at-least-once: provider_message_id is stored when available, and external channels accept the (rare) duplicate rather than risk a silent drop. The UI channel is exactly-once by construction (state lives in Postgres).
- Email: … vendor-internal/ax-reporting, and adding an SES crate would force that cascade for no benefit. Plain reqwest to a provider API avoids it entirely.
- Telegram: sendMessage via reqwest; chat linked via deep-link start token.
- Slack: chat.postMessage into the client's shared channel; generalize settlement-engine/src/notifier.rs into a shared sdk-internal Slack client and have settlement-engine consume it too.
- Margin thresholds are evaluated in risk-monitor; the existing client-side useMarginAlert thresholds become display-only.
- dedupe_key uniqueness makes producer retries free.
- Rate limiting: external sends over the limit are marked suppressed (audited), with the notification still always landing in the UI inbox, which is never rate limited.
- Circuit breaker: when it trips, external sends pause (the '*' row in notification_channel_controls), ops gets paged, and an admin resumes from the control panel after reviewing the queue. A runaway producer becomes an ops incident, not a customer-facing spam incident.
- A NOTIFICATIONS_DISABLED env var remains as break-glass.
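The interaction between the per-user rate limit and the global circuit breaker might look like the sketch below. It is illustrative only: `Limiter`, `Decision`, the single counting window, and the limit values are all assumptions, and a real implementation would reset counters per time window and persist the paused state in notification_channel_controls.

```rust
use std::collections::HashMap;

/// Outcome for one external send attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Send,
    /// Recorded as status = 'suppressed' with this reason in last_error.
    Suppress(&'static str),
    /// Breaker tripped: the delivery stays pending until an admin resumes.
    Paused,
}

/// Single-window limiter (window reset is omitted to keep the sketch short).
pub struct Limiter {
    per_user_limit: u32,
    global_ceiling: u32,
    per_user: HashMap<String, u32>,
    global_count: u32,
    tripped: bool,
}

impl Limiter {
    pub fn new(per_user_limit: u32, global_ceiling: u32) -> Self {
        Self {
            per_user_limit,
            global_ceiling,
            per_user: HashMap::new(),
            global_count: 0,
            tripped: false,
        }
    }

    /// Check one external send. The UI inbox never goes through this path.
    pub fn check(&mut self, user_id: &str) -> Decision {
        if self.tripped {
            return Decision::Paused;
        }
        if self.global_count >= self.global_ceiling {
            // Runaway volume: trip the breaker instead of spamming customers.
            self.tripped = true;
            return Decision::Paused;
        }
        let sent = self.per_user.entry(user_id.to_string()).or_insert(0);
        if *sent >= self.per_user_limit {
            return Decision::Suppress("per-user rate limit");
        }
        *sent += 1;
        self.global_count += 1;
        Decision::Send
    }

    /// Admin resume from the control panel: clear the breaker and counters.
    pub fn resume(&mut self) {
        self.tripped = false;
        self.global_count = 0;
        self.per_user.clear();
    }
}
```

The distinction between `Suppress` and `Paused` follows the text: a rate-limited send is dropped and audited, while a breaker trip holds work for review rather than discarding it.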
notification-engine hosts its own REST + WS endpoints (the order-gateway / marketdata-publisher pattern), authenticated with the same session/api-key middleware as other services:
- GET /notifications?since_seq=&limit=: inbox, keyed on the authenticated user only.
- POST /notifications/seen {up_to_seq}: bulk mark-seen.
- POST /notifications/{id}/dismiss, POST /notifications/{id}/ack.
- GET|PUT /notifications/preferences, contact-point CRUD + verification (email confirmation code, telegram link token).
- WS /notifications/stream: pushes new notifications and state changes (so a dismiss in one tab clears the banner in another).
Realtime mechanism: tail Postgres, no broker. A single dispatcher task inside notification-engine polls notification_recipients on the existing indexes (WHERE seq > $cursor and WHERE state_seq > $cursor, ~1s interval) and routes new/changed rows to connected users in memory. That is one indexed query per second globally against a table growing a handful of rows per minute. Because the dispatcher only ever reads committed rows, there is no publish step that can be missed: a notification is at worst one poll-tick late, never lost. This also makes the engine multi-instance safe for free: each replica tails Postgres and serves its own connections, with no cross-instance fan-out to coordinate. If sub-second latency ever matters, layer LISTEN/NOTIFY on top purely as a dispatcher wake-up (NOTIFY fires transactionally on commit); the seq-cursor poll remains the correctness mechanism either way.
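The cursor-tailing loop can be sketched against an in-memory stand-in for the table. `RecipientRow`, `Dispatcher`, and `poll` are hypothetical names; the slice passed to `poll` plays the role of the rows a `WHERE seq > $cursor` query would return from Postgres.

```rust
/// Minimal projection of a notification_recipients row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RecipientRow {
    pub seq: u64,
    pub user_id: String,
}

/// Tails committed rows by seq and remembers how far it has read.
pub struct Dispatcher {
    cursor: u64,
}

impl Dispatcher {
    pub fn new(cursor: u64) -> Self {
        Self { cursor }
    }

    pub fn cursor(&self) -> u64 {
        self.cursor
    }

    /// One poll tick: return rows with seq > cursor in seq order,
    /// then advance the cursor to the highest seq returned.
    pub fn poll(&mut self, committed: &[RecipientRow]) -> Vec<RecipientRow> {
        let mut fresh: Vec<RecipientRow> = committed
            .iter()
            .filter(|row| row.seq > self.cursor)
            .cloned()
            .collect();
        fresh.sort_by_key(|row| row.seq);
        if let Some(last) = fresh.last() {
            self.cursor = last.seq;
        }
        fresh
    }
}
```

One caveat the sketch makes visible: it assumes rows become visible in seq order. Postgres assigns identity values before commit, so with concurrent publish transactions a lower seq can commit after a higher one and fall behind the cursor. The "no gaps" property therefore depends on the engine serializing publish commits or on the dispatcher re-reading a small window behind its cursor.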
The public SDK (rs/sdk) gains protocol types for the read API only: generic naming, no internal service or datastore references, per the existing SDK hygiene rule.
Reconnect contract (this is connection-state machinery, so per repo policy: client disconnect, server-initiated disconnect, engine restart/crash, and reconnect/catch-up paths all get tests before rollout): the client runs since_seq catch-up on every (re)connect; pushed items are deduped by seq, so a re-push is harmless. The only server-side failure mode left is the engine process crashing, which necessarily drops every WS connection, which triggers client reconnect, which runs catch-up. The failure and its heal are the same event; there is no state in which a healthy socket sits in front of silently-dropped messages.
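The client side of that contract, dedupe by seq plus a resume cursor, might be modeled like this. The GUI itself is TypeScript; this Rust sketch only illustrates the logic, and `ClientInbox` and its methods are hypothetical.

```rust
use std::collections::BTreeMap;

/// Client-side view of the inbox, keyed by seq so re-pushes are harmless.
pub struct ClientInbox {
    items: BTreeMap<u64, String>,
}

impl ClientInbox {
    pub fn new() -> Self {
        Self { items: BTreeMap::new() }
    }

    /// Apply one item from either the WS push or the REST catch-up.
    /// Returns false when the seq was already present (duplicate ignored).
    pub fn apply(&mut self, seq: u64, title: &str) -> bool {
        if self.items.contains_key(&seq) {
            return false;
        }
        self.items.insert(seq, title.to_string());
        true
    }

    /// Cursor to send as ?since_seq= on the next (re)connect.
    pub fn since_seq(&self) -> u64 {
        self.items.keys().next_back().copied().unwrap_or(0)
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

Because both paths funnel through `apply`, the client does not need to know whether an item arrived by push or by catch-up, which is what makes running catch-up on every reconnect safe.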
- A header banner strip renders display: banner notifications (downtime, margin warning), replacing the bespoke GlobalMessages margin pills as the source (the pill UI can stay as the rendering).
- A modal renders action_required notifications, with an explicit acknowledge button wired to the ack endpoint.
- A stream subscriber (like OrdersPub.tsx, built on the shared useWebsocket hook with its heartbeat/backoff handling) routes display: toast items into pushNotification().
Management splits into three tiers by blast radius:
Routing policy is code. The kind registry, default routing matrix, severities, and mandatory (non-opt-out) kinds live in the engine crate and change by PR. This is the highest-blast-radius configuration in the system; code review and git history are better controls than any dashboard form, and the policy can never drift from the renderers that depend on it.
The operational control plane is the admin GUI at admin.architect.co (gui/packages/admin, behind the existing admin auth + Cloudflare Access perimeter). New /admin/notifications/* routes in api-gateway proxy to notification-engine's internal API; the acting admin's user id travels with every mutation and lands in the ClickHouse audit log. The rule: any action that causes sends to customers or changes what customers receive executes as an authenticated exchange admin; "someone behind Cloudflare Access" is not an acceptable audit answer for margin-call infrastructure. Four pages:
- Announcements: publish_at/expires_at scheduling, per-channel preview, and a publish confirmation that displays the resolved recipient count. List of scheduled and active announcements with cancel (scheduled) and expire-now (active) actions.
- Controls: channel pause/resume via notification_channel_controls (with required reason), circuit breaker state and resume, and a read-only view of current rate-limit and ceiling configuration.
Observability is read-only and lives in Grafana over the ClickHouse notification_log (delivery rates, failure/suppression counts, queue depth, breaker state). If interactive drill-down outgrows Grafana, a dedicated internal dashboard (the *.afintech.dev family) can render the same data: strictly read-only, with every mutating affordance linking into admin.architect.co.
Postgres holds operational state; ClickHouse holds the immutable audit trail, written analytics-style (batched inserts, modeled on the transactions table: ReplacingMergeTree, partitioned by month, keyed by delivery id + attempt):
CREATE TABLE notification_log (
timestamp DateTime64(9) NOT NULL,
notification_id String NOT NULL,
user_id FixedString(16) NOT NULL,
kind LowCardinality(String) NOT NULL,
channel LowCardinality(String) NOT NULL,
event LowCardinality(String) NOT NULL, -- published|sent|failed|dead|suppressed|seen|dismissed|acknowledged
attempt UInt8 NOT NULL,
detail String NOT NULL DEFAULT '', -- error text / suppression reason / provider msg id
...
)
Every state transition lands here, including recipient actions, so "customer X was margin-called at T, emailed at T+2s (provider id …), telegram'd at T+2s, saw it in the UI at T+9m, acknowledged at T+10m" is one query. This is the compliance artifact.
Mirroring the EP3-mock philosophy: a DEBUG_NOTIFICATION_SINK mode routes all external sends to a log/file sink (and optionally one override email address), refused in production by the existing AX_ENVIRONMENT guard. Mock senders make the full pipeline testable in CI without provider credentials.
| producer | kinds | trigger |
|---|---|---|
| risk-monitor | margin_warning, margin_call, account_frozen | threshold crossing with hysteresis (new module alongside existing risk domains) |
| admin GUI | scheduled_downtime, feature_announcement, dividend_announcement | human-authored, scheduled |
| settlement-engine / transaction-engine | withdrawal_settled, deposit_credited | settlement events |
| api-gateway | security_new_api_key, security_new_session | auth events |
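"Threshold crossing with hysteresis" for the risk-monitor producer could work as sketched below: a level is entered at its threshold and only left once utilization falls a band below it, so a value oscillating around the threshold does not re-fire. The type names, the 80/95 thresholds, and the 5-point band used in the usage note are illustrative assumptions, not values from the codebase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Normal,
    Warning,
    Call,
}

/// Tracks the current margin level for one account.
pub struct MarginHysteresis {
    warn_pct: f64,
    call_pct: f64,
    band_pct: f64,
    level: Level,
}

impl MarginHysteresis {
    pub fn new(warn_pct: f64, call_pct: f64, band_pct: f64) -> Self {
        Self { warn_pct, call_pct, band_pct, level: Level::Normal }
    }

    /// Classify utilization against thresholds lowered by `offset`.
    fn classify(&self, utilization_pct: f64, offset: f64) -> Level {
        if utilization_pct >= self.call_pct - offset {
            Level::Call
        } else if utilization_pct >= self.warn_pct - offset {
            Level::Warning
        } else {
            Level::Normal
        }
    }

    /// Feed one utilization sample. Returns Some(level) only on escalation,
    /// which is when a margin_warning or margin_call would be published.
    pub fn update(&mut self, utilization_pct: f64) -> Option<Level> {
        let rising = self.classify(utilization_pct, 0.0);
        let falling = self.classify(utilization_pct, self.band_pct);
        let previous = self.level;
        if rising > previous {
            self.level = rising;
        } else if falling < previous {
            // De-escalate silently once below the band; this re-arms the alert.
            self.level = falling;
        }
        if self.level > previous {
            Some(self.level)
        } else {
            None
        }
    }
}
```

With `MarginHysteresis::new(80.0, 95.0, 5.0)`, samples of 81, 79, 81 produce a single warning, and a second warning is only possible after utilization has dropped below 75.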
Phase 1: spine + announcements. Schema, notification-engine (ingest, recipients, dispatcher + UI channel only), REST/WS read path, GUI bell/inbox/banner, and the admin Announcements page (api-gateway /admin/notifications/* routes + gui/packages/admin). Ships value immediately (downtime/feature/dividend announcements) with zero external-channel risk.
Phase 2: risk + email. Contact points with email verification, preferences page, email worker + provider integration, rate limiter + circuit breaker, ClickHouse audit log, risk-monitor margin producer with hysteresis, action_required ack modal, and the remaining admin pages (Deliveries, Controls, Customers), which are required before the first external channel goes live, since they are the operating tools for it. This is the phase with compliance weight; the connection-lifecycle test matrix gates it.
Phase 3: telegram + slack + digests. Telegram linking flow and worker, Slack shared-channel worker (refactor settlement notifier into shared client), optional daily digest email for users who downgrade a category from instant to digest.
- Should audience: all announcements require second-admin approval, or is the confirmation-with-count screen enough for v1?
- Do we need audience: entity (all users of an org) before account/user audiences cover real use?