RFC: Notifications

Date: 2026-06-10

Status: Draft. Nothing here is implemented; this document proposes the end-state design and a phased rollout.

Background

AX has no notifications system. What exists today is a set of disconnected fragments:

Margin alerts are computed in the browser. useMarginAlert.ts (gui/packages/app/hooks/useMarginAlert.ts) compares maintenance margin utilization from the risk snapshot against MARGIN_CALL_WARNING_PERCENT / MARGIN_CALL_CRITICAL_PERCENT and renders header pills (gui/apps/web/src/components/GlobalMessages.web.tsx). If the tab is closed, the client is never told they are approaching liquidation. There is no record that the alert was ever shown.
risk-monitor computes breach state but emits nothing (rs/risk-monitor/src/monitor.rs): margin breaches and limit violations are persisted, with no path to the customer.
Slack sending exists only for ops (rs/settlement-engine/src/notifier.rs posts settlement results to an internal channel; recon-engine fires incident.io alerts). No email, no Telegram, no SMS, no push anywhere in the codebase.
The GUI has a transient toast system (gui/packages/app/data/notifications/context.tsx: max 12, 5-second auto-dismiss, bottom-left stack) fed ad hoc by OrdersPub.tsx on order events. There is no inbox, no persistence, no "seen" state — refresh the page and everything is gone.
No user contact info or preferences. The users table has no email column; contact emails exist only inside onboarding questionnaire rows (business_due_diligence_questionnaires.exchange_notifications_contact_*), unverified and unused.

Meanwhile the things an exchange must tell its customers keep growing: scheduled downtime and upgrades, new product/feature launches, dividend and corporate-action announcements on equity index products (#2256 just added equities to the Pyth stream), settlement notices, deposit/withdrawal confirmations, security events (new API key, new session), and — most critically — margin calls, where "we told the customer, and here is the proof" is a compliance requirement, not a nicety.

Goals

One pipeline from any internal producer to any customer-facing channel (UI inbox/banner/popup, email, Telegram, Slack), with per-kind routing policy and per-user preferences.
Durable, auditable delivery. Every notification has a persistent record of who it was addressed to, which channels it went out on, when, whether it succeeded, and whether the user saw/acknowledged it.
Anti-spam by construction. Dedup, coalescing, hysteresis at the source, rate limits at the sink, and a global circuit breaker so a buggy producer can never email every customer 1,000 times.
Catch-up correctness. A client that reconnects (or logs in days later) sees exactly the notifications it missed, in order, with no gaps and no duplicates.

Non-goals (for now): SMS/mobile push (no mobile trading app yet), marketing campaigns/drip email, in-thread two-way chat over any channel.

Build vs. buy

Hosted notification platforms (Knock, Courier, Novu) solve template management and channel fan-out well, but they put customer identity, margin state, and announcement timing in a third party's hands, and none of them model our account-permission visibility rules. Self-hosted Novu was considered and rejected as a large operational surface (Mongo + Redis + several Node services) for a feature set we mostly don't need. We build a thin engine on the patterns we already run everywhere (Postgres outbox + per-service WebSocket), and use commodity providers only at the very edge (an email API, the Telegram Bot API, Slack chat.postMessage — the settlement-engine notifier already shows the shape of that last one).

Domain model

Notification

A notification is a single logical message: what happened, addressed to an audience, with a severity and kind. It is immutable once published.

kind — machine-readable taxonomy key (margin_call, scheduled_downtime, dividend_announcement, feature_announcement, withdrawal_settled, security_new_api_key, …). Kinds are registered in code with a default routing policy (see below); producers cannot invent kinds at runtime.
severity — info | warning | critical | action_required. action_required notifications must be explicitly acknowledged (e.g. margin call); the others are passive.
params — structured JSONB payload (e.g. {"utilization_pct": 87.4, "account_id": "...", "threshold": "warning"}). Channel renderers turn kind + params into channel-appropriate copy; a pre-rendered plain title/body is stored as fallback so old notifications survive template changes.
audience — one of: users: [user_id...], accounts: [account_id...], or all. Account audiences are resolved to users at publish time via account_permissions.can_read — recipient resolution is a one-time write-path snapshot, never a read-path fan-out (consistent with the single-account-key read principle from PR #2056).
dedupe_key — producer-supplied idempotency key, unique. Re-publishing the same key is a no-op. Producers that retry (and they all should) get exactly-once notification creation for free.
display — UI placement hint derived from kind/severity: toast (transient), banner (persistent header strip until dismissed/expired), modal (blocking, for action_required).
publish_at / expires_at — scheduling and TTL. A downtime notice expires when the window ends; expired notifications drop out of banners and unseen counts but remain in the inbox and audit trail.
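
The dedupe_key contract above can be sketched in a few lines. This is a minimal in-memory illustration, not the engine's code: NotificationStore, PublishOutcome, and the u64 ids are hypothetical stand-ins for the notifications table, its UNIQUE (dedupe_key) constraint, and the CHAR(16) ids.

```rust
use std::collections::HashMap;

/// Result of a publish call (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum PublishOutcome {
    /// A new notification was created with this id.
    Created(u64),
    /// The dedupe_key was already used; the existing id is returned unchanged.
    AlreadyExists(u64),
}

/// In-memory stand-in for the notifications table keyed by dedupe_key.
#[derive(Default)]
pub struct NotificationStore {
    by_dedupe_key: HashMap<String, u64>,
    next_id: u64,
}

impl NotificationStore {
    /// Publishing the same dedupe_key twice is a no-op on the second call.
    pub fn publish(&mut self, dedupe_key: &str) -> PublishOutcome {
        if let Some(&id) = self.by_dedupe_key.get(dedupe_key) {
            return PublishOutcome::AlreadyExists(id);
        }
        self.next_id += 1;
        self.by_dedupe_key.insert(dedupe_key.to_string(), self.next_id);
        PublishOutcome::Created(self.next_id)
    }
}
```

A producer that times out and retries with the same key gets AlreadyExists with the original id, so the retry creates nothing new.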

Per-recipient state

Each (notification, user) pair carries lifecycle state, advanced only by the recipient:

addressed ──► delivered(channel)* ──► seen ──► dismissed
                                        └────► acknowledged   (action_required only)

seen — set in bulk when the user opens the inbox (mark-seen up to cursor); drives the unread badge.
dismissed — per-notification; removes banners.
acknowledged — explicit click-through on action_required modals; recorded with timestamp for the compliance trail.
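
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: the type names are hypothetical, timestamps are plain integers, and two behaviours are assumptions rather than decisions of this RFC: the first write of each timestamp wins, and a dismiss or acknowledge also marks the item seen.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity { Info, Warning, Critical, ActionRequired }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecipientAction { See, Dismiss, Acknowledge }

/// Mirrors the seen_at / dismissed_at / acknowledged_at columns.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct RecipientState {
    pub seen_at: Option<u64>,
    pub dismissed_at: Option<u64>,
    pub acknowledged_at: Option<u64>,
}

impl RecipientState {
    /// Ok(true) if the state changed, Ok(false) if the action was already
    /// recorded, Err if the action is not allowed for this severity.
    pub fn apply(
        &mut self,
        severity: Severity,
        action: RecipientAction,
        now: u64,
    ) -> Result<bool, &'static str> {
        if action == RecipientAction::Acknowledge && severity != Severity::ActionRequired {
            return Err("only action_required notifications can be acknowledged");
        }
        let slot = match action {
            RecipientAction::See => &mut self.seen_at,
            RecipientAction::Dismiss => &mut self.dismissed_at,
            RecipientAction::Acknowledge => &mut self.acknowledged_at,
        };
        if slot.is_some() {
            return Ok(false); // keep the original timestamp for the audit trail
        }
        *slot = Some(now);
        // Assumption: acting on a notification implies it was seen.
        if self.seen_at.is_none() {
            self.seen_at = Some(now);
        }
        Ok(true)
    }
}
```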

Channels and contact points

A contact point is a verified address on a channel: a confirmed email address, a Telegram chat_id captured via a t.me/<bot>?start=<token> linking flow, a Slack channel ID (AX already runs shared per-client Slack channels — the bot posts into those). The ui channel is implicit and needs no contact point. Unverified addresses are never sent to.

Routing policy and preferences

Each kind declares a default routing matrix (which channels, at which severity). Users override per (category, channel) — category, not kind, to keep the settings page tractable (account_activity, risk, announcements, security). Two hard rules sit above preferences:

Mandatory kinds cannot be opted out of on the ui channel: margin_call, account_frozen, scheduled_downtime, security events. (Users may still disable the email/telegram copies of most of these; margin calls are default-on for every channel with a verified contact point.)
Nothing external is opt-out by accident: external channels are opt-in per category except risk and security, which are opt-out.

kind (examples)	severity	ui	email	telegram	slack
margin_call	action_required	modal ✔	✔	✔	✔
margin_warning	warning	banner	opt	opt	opt
scheduled_downtime	warning	banner ✔	✔	opt	✔
dividend_announcement	info	inbox	opt	opt	opt
feature_announcement	info	inbox	opt	—	—
withdrawal_settled	info	toast	✔	opt	opt
security_new_api_key	warning	toast ✔	✔	✔	—

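(✔ = default on, opt = default off, — = never. In the ui column the word is the display placement, and ✔ marks a mandatory kind that cannot be opted out of. Exact matrix lives in code next to the kind registry, not in this doc.)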
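
How the default matrix, a user override, and contact-point verification combine can be sketched as below. The registry rows are copied from the example table; everything else (REGISTRY, DefaultRoute, ExternalChannel, effective_route) is a hypothetical shape, and the caller is assumed to have already looked up the user's override by (category, channel).

```rust
/// One cell of the default routing matrix for an external channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DefaultRoute { On, Off, Never }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExternalChannel { Email = 0, Telegram = 1, Slack = 2 }

/// (kind key, defaults for [email, telegram, slack]).
type Row = (&'static str, [DefaultRoute; 3]);

/// Closed registry: a kind that is not listed here cannot be published.
const REGISTRY: &[Row] = &[
    ("margin_call", [DefaultRoute::On, DefaultRoute::On, DefaultRoute::On]),
    ("margin_warning", [DefaultRoute::Off, DefaultRoute::Off, DefaultRoute::Off]),
    ("scheduled_downtime", [DefaultRoute::On, DefaultRoute::Off, DefaultRoute::On]),
    ("dividend_announcement", [DefaultRoute::Off, DefaultRoute::Off, DefaultRoute::Off]),
    ("feature_announcement", [DefaultRoute::Off, DefaultRoute::Never, DefaultRoute::Never]),
    ("withdrawal_settled", [DefaultRoute::On, DefaultRoute::Off, DefaultRoute::Off]),
    ("security_new_api_key", [DefaultRoute::On, DefaultRoute::On, DefaultRoute::Never]),
];

/// None for an unregistered kind; otherwise whether to create a delivery.
/// `user_pref` is the user's explicit override, if any.
pub fn effective_route(
    kind: &str,
    channel: ExternalChannel,
    user_pref: Option<bool>,
    has_verified_contact_point: bool,
) -> Option<bool> {
    let (_, routes) = REGISTRY.iter().find(|(key, _)| *key == kind)?;
    let enabled = match routes[channel as usize] {
        DefaultRoute::Never => false, // no override can enable this cell
        DefaultRoute::On => user_pref.unwrap_or(true),
        DefaultRoute::Off => user_pref.unwrap_or(false),
    };
    // Unverified addresses are never sent to.
    Some(enabled && has_verified_contact_point)
}
```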

Architecture

A new service, rs/notification-engine, following the standard skeleton (main.rs/config.rs/service.rs, env config via ax_sdk_internal::env_vars, HealthPublisher, TaskGroup).

producers                            notification-engine                     sinks
─────────                            ───────────────────                     ─────
risk-monitor          ─┐                                             ┌─► ui: WS stream + REST inbox
settlement-engine     ─┤  internal    ┌──────────┐   ┌────────────┐  │   (dispatcher tails Postgres
api-gateway           ─┼─ publish ───►│ ingest / │──►│  Postgres  │──┤    on seq / state_seq)
admin GUI ─► api-gw    │  (HTTP/IPC)  │  router  │   │  outbox    │  ├─► email worker ──► provider API
   admin routes ───────┘              └──────────┘   │ deliveries │  ├─► telegram worker ► Bot API
                                            │        └────────────┘  └─► slack worker ──► chat.postMessage
                                            ▼               │
                                      rate limiter /        ▼ (batch, analytics-style)
                                      circuit breaker  ClickHouse notification_log
                                                       (append-only audit)

Redis is deliberately absent: every hop in the UI path is either Postgres or a connection whose failure forces the client to re-sync from Postgres. The existing Redis pub/sub pattern fits market data, where the next tick supersedes a lost message; notifications are the opposite shape — low frequency, every message matters — so the durable store is the transport.

Ingestion

Producers call POST /internal/publish on notification-engine (intra-VPC HTTP, same trust posture as other internal service calls), passing kind, params, audience, dedupe_key. The SDK-internal crate gets a small typed client (NotificationPublisher) so producers don't hand-roll requests. Publishing is synchronous up to the Postgres commit — once the producer gets 200, delivery is the engine's durable responsibility. Producers must treat publish as fire-and-forget after that; a producer must never block trading work on notification delivery.

Admin-authored announcements (downtime, dividends, features) come in through the admin GUI at admin.architect.co, via new /admin/notifications/* routes in api-gateway that proxy to notification-engine's internal API. Admin authn/authz stays where it already lives (api-gateway, users.is_admin), and the acting admin's user id is threaded through as the actor on every mutation — it lands in notifications.source and the audit log. See §Management surfaces for the full control panel.

Persistence (Postgres, in db/postgres/1.sql)

Six tables. IDs follow the existing CHAR(16) convention.

CREATE TABLE notifications (
    id CHAR(16) PRIMARY KEY,
    kind TEXT NOT NULL,
    severity TEXT NOT NULL,
    params JSONB NOT NULL DEFAULT '{}'::jsonb,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    source TEXT NOT NULL,                  -- producing service / admin user id
    dedupe_key TEXT UNIQUE,
    account_id CHAR(16) REFERENCES trading_accounts(id),  -- subject, if account-scoped
    publish_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE SEQUENCE notification_state_seq;

CREATE TABLE notification_recipients (
    seq BIGINT GENERATED ALWAYS AS IDENTITY,
    state_seq BIGINT NOT NULL DEFAULT 0,   -- engine sets nextval() on every state change
    notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
    user_id CHAR(16) NOT NULL REFERENCES users(id),
    seen_at TIMESTAMPTZ,
    dismissed_at TIMESTAMPTZ,
    acknowledged_at TIMESTAMPTZ,
    PRIMARY KEY (notification_id, user_id)
);
CREATE INDEX ON notification_recipients (user_id, seq);
CREATE INDEX ON notification_recipients (state_seq);

-- Created before notification_deliveries, which references it.
CREATE TABLE user_contact_points (
    id CHAR(16) PRIMARY KEY,
    user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    channel TEXT NOT NULL,
    address TEXT NOT NULL,                 -- email | telegram chat_id | slack channel id
    verified_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, channel, address)
);

CREATE TABLE notification_deliveries (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
    user_id CHAR(16) NOT NULL REFERENCES users(id),
    channel TEXT NOT NULL,                 -- 'email' | 'telegram' | 'slack'
    contact_point_id CHAR(16) REFERENCES user_contact_points(id),  -- NULL only when suppressed for a missing contact point
    status TEXT NOT NULL DEFAULT 'pending',-- pending|sent|failed|dead|suppressed
    attempts INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_error TEXT,
    provider_message_id TEXT,
    sent_at TIMESTAMPTZ,
    UNIQUE (notification_id, user_id, channel)
);
CREATE INDEX ON notification_deliveries (status, next_attempt_at)
    WHERE status = 'pending';

CREATE TABLE user_notification_preferences (
    user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    category TEXT NOT NULL,                -- account_activity|risk|announcements|security
    channel TEXT NOT NULL,
    enabled BOOLEAN NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, category, channel)
);

CREATE TABLE notification_channel_controls (
    channel TEXT PRIMARY KEY,              -- 'email' | 'telegram' | 'slack' | '*'
    paused BOOLEAN NOT NULL DEFAULT FALSE,
    reason TEXT,
    actor_user_id CHAR(16) REFERENCES users(id),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Design notes:

Fan-out-on-write. Publish resolves the audience to concrete notification_recipients rows inside the publish transaction. At AX's scale (institutional user base, O(10²–10³) users) a broadcast is a few thousand rows — trivially fine, and it buys simple single-key inbox reads (WHERE user_id = $1 ORDER BY seq), a clean per-user cursor, and point-in-time audience snapshots (later permission changes don't retroactively change who was notified — which is exactly what an audit wants). Revisit only if the user count grows two orders of magnitude.
seq is both the catch-up cursor and the realtime transport. It is a global identity column, indexed per user. Clients resume with ?since_seq=, and the WS dispatcher (below) tails the same cursor to discover new rows: no gaps, no timestamp skew, one ordering for both paths. state_seq plays the same role for updates (seen/dismissed/acknowledged). Identity columns only advance on insert, and the engine is the sole writer, so its UPDATE statements set state_seq = nextval('notification_state_seq') explicitly, with no triggers.
Channel pause flags live in notification_channel_controls, not env vars, so an admin can pause/resume a channel from the control panel without a redeploy; who paused it and why is part of the row. One global NOTIFICATIONS_DISABLED env var remains as the break-glass override.
Deliveries are an outbox. Rows are created in the same transaction as the notification, one per (recipient × routed external channel), after applying policy + preferences + contact-point existence. A suppressed send (rate limit, missing contact point, circuit breaker) is recorded as status = 'suppressed' with the reason in last_error — suppression is audited, not silent.
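
The fan-out-on-write step can be sketched as a pure function from audience to a deduplicated recipient set. This is an illustration under assumptions: Audience and resolve_recipients are hypothetical names, and the can_read map stands in for a query against account_permissions.can_read at publish time.

```rust
use std::collections::{BTreeSet, HashMap};

/// The three audience shapes from the domain model.
pub enum Audience {
    Users(Vec<String>),
    Accounts(Vec<String>),
    All,
}

/// Resolve an audience to concrete user ids, once, at publish time.
/// `can_read` maps account_id -> user ids that can read that account.
pub fn resolve_recipients(
    audience: &Audience,
    can_read: &HashMap<String, Vec<String>>,
    all_users: &[String],
) -> BTreeSet<String> {
    match audience {
        Audience::Users(ids) => ids.iter().cloned().collect(),
        Audience::Accounts(accounts) => accounts
            .iter()
            .filter_map(|account| can_read.get(account))
            .flatten()
            .cloned()
            .collect(), // the set removes users who can read several accounts
        Audience::All => all_users.iter().cloned().collect(),
    }
}
```

Each id in the returned set would become one notification_recipients row inside the publish transaction, which is what makes the snapshot immune to later permission changes.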

Delivery workers

One worker task per channel (spawned under TaskGroup, wrapped in run_durably with the standard DurableTaskConfig exponential backoff). Each loop claims work with the classic skip-locked poll:

SELECT ... FROM notification_deliveries
WHERE channel = $1 AND status = 'pending' AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT $batch
FOR UPDATE SKIP LOCKED;

Retry: exponential backoff (30s · 4ⁿ, capped at 1h), max 10 attempts, then status = 'dead' plus an ops alert (incident.io, same pattern as recon-engine). Dead deliveries are re-runnable by admin action.
At-least-once at the edge. A provider call can succeed after we recorded a timeout; provider_message_id is stored when available, and external channels accept the (rare) duplicate rather than risk a silent drop. The UI channel is exactly-once by construction (state lives in Postgres).
Senders:
- Email — an HTTP-API provider (Postmark or similar) rather than aws-sdk-ses: the AWS SDK crates are exact-pinned to match vendor-internal/ax-reporting, and adding an SES crate would force that cascade for no benefit. Plain reqwest to a provider API avoids it entirely.
- Telegram — Bot API sendMessage via reqwest; chat linked via deep-link start token.
- Slack — chat.postMessage into the client's shared channel; generalize settlement-engine/src/notifier.rs into a shared sdk-internal Slack client and have settlement-engine consume it too.
Content policy at the edge: external payloads carry minimal state — "Your account requires attention: margin utilization has exceeded the maintenance threshold. Log in to review." with a link, never balances or position details. Full detail renders only in the authenticated UI.
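
The retry schedule above works out as follows. This is a sketch, with one interpretation labelled as an assumption: n counts failed attempts minus one, so the first retry waits 30s. NextStep and next_step are hypothetical names.

```rust
use std::time::Duration;

pub const MAX_ATTEMPTS: u32 = 10;
const BASE: Duration = Duration::from_secs(30);
const CAP: Duration = Duration::from_secs(3600);

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    /// Set next_attempt_at = now + this delay and keep status = 'pending'.
    RetryAfter(Duration),
    /// Set status = 'dead' and raise the ops alert.
    Dead,
}

/// `failed_attempts` is the number of attempts that have failed so far (>= 1).
pub fn next_step(failed_attempts: u32) -> NextStep {
    if failed_attempts >= MAX_ATTEMPTS {
        return NextStep::Dead;
    }
    // Assumption: n = failed_attempts - 1, giving 30s, 2m, 8m, 32m, then 1h.
    let exponent = failed_attempts.saturating_sub(1);
    let factor = 4u64.checked_pow(exponent).unwrap_or(u64::MAX);
    let secs = BASE.as_secs().saturating_mul(factor).min(CAP.as_secs());
    NextStep::RetryAfter(Duration::from_secs(secs))
}
```

Under that reading the delays are 30s, 120s, 480s, 1920s, and then 3600s for every remaining retry, so a delivery that never succeeds is declared dead a little under six hours after its first attempt.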

Anti-spam: four layers

Source hysteresis. Producers of oscillating signals must debounce before publishing. For margin: notify on upward threshold crossing, then re-arm only after utilization falls 5 points below the threshold, with a per-(account, threshold) cooldown (1h). This logic moves server-side into risk-monitor — the existing client-side useMarginAlert thresholds become display-only.
Dedup. dedupe_key uniqueness makes producer retries free.
Per-(user, channel) token bucket in the router: e.g. email 10/hour burst 3, telegram 20/hour. Overflow → suppressed (audited), with the notification still always landing in the UI inbox, which is never rate limited.
Global circuit breaker. If pending external deliveries created in the last minute exceed a global ceiling (env-configured, e.g. 500), external sending pauses (a '*' row in notification_channel_controls), ops gets paged, and an admin resumes from the control panel after reviewing the queue. A runaway producer becomes an ops incident, not a customer-facing spam incident. A NOTIFICATIONS_DISABLED env var remains as break-glass.
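
Two of the layers above, source hysteresis and the per-(user, channel) token bucket, are small enough to sketch. Both types are hypothetical illustrations, not the risk-monitor or router code. One behaviour is an assumption: a crossing that lands inside the cooldown leaves the gate armed, so the notification fires once the cooldown expires if utilization is still at or above the threshold.

```rust
/// Hysteresis gate for one (account, threshold) pair.
pub struct MarginAlertGate {
    threshold_pct: f64,
    rearm_gap_pct: f64,
    cooldown_secs: u64,
    armed: bool,
    last_fired_at: Option<u64>,
}

impl MarginAlertGate {
    pub fn new(threshold_pct: f64, rearm_gap_pct: f64, cooldown_secs: u64) -> Self {
        MarginAlertGate { threshold_pct, rearm_gap_pct, cooldown_secs, armed: true, last_fired_at: None }
    }

    /// Returns true when a notification should be published.
    pub fn observe(&mut self, utilization_pct: f64, now_secs: u64) -> bool {
        if !self.armed {
            // Re-arm only after falling rearm_gap_pct points below the threshold.
            if utilization_pct <= self.threshold_pct - self.rearm_gap_pct {
                self.armed = true;
            }
            return false;
        }
        if utilization_pct < self.threshold_pct {
            return false;
        }
        let cooled_down = self
            .last_fired_at
            .map_or(true, |t| now_secs.saturating_sub(t) >= self.cooldown_secs);
        if !cooled_down {
            return false; // assumption: stay armed and fire after the cooldown
        }
        self.armed = false;
        self.last_fired_at = Some(now_secs);
        true
    }
}

/// Integer token bucket: one token costs 3600 units, and the bucket gains
/// `per_hour` units per elapsed second, so no floating-point drift.
pub struct TokenBucket {
    capacity_units: u64,
    units: u64,
    per_hour: u64,
    last_refill_secs: u64,
}

impl TokenBucket {
    const UNITS_PER_TOKEN: u64 = 3600;

    pub fn new(per_hour: u64, burst: u64, now_secs: u64) -> Self {
        let capacity_units = burst * Self::UNITS_PER_TOKEN;
        TokenBucket { capacity_units, units: capacity_units, per_hour, last_refill_secs: now_secs }
    }

    /// true = send; false = record the delivery as suppressed.
    pub fn try_take(&mut self, now_secs: u64) -> bool {
        let elapsed = now_secs.saturating_sub(self.last_refill_secs);
        let refill = elapsed.saturating_mul(self.per_hour);
        self.units = self.units.saturating_add(refill).min(self.capacity_units);
        self.last_refill_secs = self.last_refill_secs.max(now_secs);
        if self.units >= Self::UNITS_PER_TOKEN {
            self.units -= Self::UNITS_PER_TOKEN;
            true
        } else {
            false
        }
    }
}
```

With the example figures from the text, MarginAlertGate::new(80.0, 5.0, 3600) fires on the first reading at or above 80, stays silent while utilization oscillates between 76 and 85, and re-arms only at 75 or below. TokenBucket::new(10, 3, now) allows three immediate emails and then one more every 360 seconds.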

Read path: API + realtime

notification-engine hosts its own REST + WS endpoints (the order-gateway / marketdata-publisher pattern), authenticated with the same session/api-key middleware as other services:

GET /notifications?since_seq=&limit= — inbox, keyed on the authenticated user only.
POST /notifications/seen {up_to_seq} — bulk mark-seen.
POST /notifications/{id}/dismiss, POST /notifications/{id}/ack.
GET|PUT /notifications/preferences, contact-point CRUD + verification (email confirmation code, telegram link token).
WS /notifications/stream — pushes new notifications and state changes (so a dismiss in one tab clears the banner in another).

Realtime mechanism: tail Postgres, no broker. A single dispatcher task inside notification-engine polls notification_recipients on the existing indexes (WHERE seq > $cursor and WHERE state_seq > $cursor, ~1s interval) and routes new/changed rows to connected users in memory. That is one indexed query per second globally against a table growing a handful of rows per minute. Because the dispatcher only ever reads committed rows, there is no publish step that can be missed: a notification is at worst one poll-tick late, never lost. One caveat the dispatcher must handle: identity and sequence values are assigned when a row is written, not when its transaction commits, so under concurrent writers a lower seq can become visible after a higher one has already been read. The "never lost" property therefore depends on a guard, for example re-scanning a short trailing window below the cursor or serializing the writers. This also makes the engine multi-instance safe for free: each replica tails Postgres and serves its own connections, with no cross-instance fan-out to coordinate. If sub-second latency ever matters, layer LISTEN/NOTIFY on top purely as a dispatcher wake-up (NOTIFY fires transactionally on commit); the seq-cursor poll remains the correctness mechanism either way.
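
One poll tick of the dispatcher might look like the sketch below. It is deliberately simplified: the committed slice stands in for the result of the seq > $cursor query, per-user Vec queues stand in for WebSocket senders, rows are assumed to become visible in seq order, and all type names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RecipientRow {
    pub seq: u64,
    pub user_id: String,
    pub notification_id: String,
}

pub struct Dispatcher {
    cursor: u64,
    /// user_id -> outbound queue for that user's open sockets.
    connected: HashMap<String, Vec<RecipientRow>>,
}

impl Dispatcher {
    pub fn new(start_cursor: u64) -> Self {
        Dispatcher { cursor: start_cursor, connected: HashMap::new() }
    }

    pub fn connect(&mut self, user_id: &str) {
        self.connected.entry(user_id.to_string()).or_default();
    }

    pub fn cursor(&self) -> u64 {
        self.cursor
    }

    pub fn queued(&self, user_id: &str) -> usize {
        self.connected.get(user_id).map_or(0, Vec::len)
    }

    /// Route rows past the cursor to connected users and advance the cursor.
    /// Rows for users who are not connected are skipped here; those users
    /// pick them up through since_seq catch-up when they next connect.
    pub fn tick(&mut self, committed: &[RecipientRow]) -> usize {
        let start = self.cursor;
        let mut routed = 0;
        for row in committed.iter().filter(|r| r.seq > start) {
            if let Some(queue) = self.connected.get_mut(&row.user_id) {
                queue.push(row.clone());
                routed += 1;
            }
            self.cursor = self.cursor.max(row.seq);
        }
        routed
    }
}
```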

The public SDK (rs/sdk) gains protocol types for the read API only — generic naming, no internal service or datastore references, per the existing SDK hygiene rule.

Reconnect contract (this is connection-state machinery, so per repo policy: client disconnect, server-initiated disconnect, engine restart/crash, and reconnect/catch-up paths all get tests before rollout): the client runs since_seq catch-up on every (re)connect; pushed items are deduped by seq, so a re-push is harmless. The only server-side failure mode left is the engine process crashing — which necessarily drops every WS connection, which triggers client reconnect, which runs catch-up. The failure and its heal are the same event; there is no state in which a healthy socket sits in front of silently-dropped messages.
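
The client half of that contract is small. The sketch below is written in Rust for consistency with the other examples even though the real client is the GUI; ClientInbox and InboxItem are hypothetical names, and since_seq is assumed to be exclusive (the server returns rows with seq strictly greater).

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InboxItem {
    pub seq: u64,
    pub title: String,
}

/// Client-side inbox keyed by seq, so ordering and dedupe come for free.
#[derive(Default)]
pub struct ClientInbox {
    items: BTreeMap<u64, InboxItem>,
}

impl ClientInbox {
    /// Value to send as ?since_seq= on every (re)connect.
    pub fn since_seq(&self) -> u64 {
        self.items.keys().next_back().copied().unwrap_or(0)
    }

    /// Apply an item from either catch-up or a WS push.
    /// Returns false for a duplicate, which is ignored.
    pub fn apply(&mut self, item: InboxItem) -> bool {
        if self.items.contains_key(&item.seq) {
            return false;
        }
        self.items.insert(item.seq, item);
        true
    }

    /// Sequence numbers currently held, in ascending order.
    pub fn in_order(&self) -> Vec<u64> {
        self.items.keys().copied().collect()
    }
}
```

Because catch-up and push both go through apply, it does not matter which path delivers an item first or whether both do.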

GUI

Bell + inbox panel in the header (between MarketStatePill and the account dropdown): unseen badge from the cursor, panel lists inbox with mark-seen-on-open, per-item dismiss.
Banner strip under the header for active notifications with display = banner (downtime, margin warning). It replaces the bespoke GlobalMessages margin pills as the source of margin alerts; the pill UI can stay as the rendering.
Blocking modal for action_required with an explicit acknowledge button wired to the ack endpoint.
Toasts keep using the existing notification context; the stream provider (a sibling of OrdersPub.tsx built on the shared useWebsocket hook with its heartbeat/backoff handling) routes items with display = toast into pushNotification().
Settings page: category × channel preference grid, contact-point management with verification flows.

Management surfaces

Management splits into three tiers by blast radius:

Routing policy is code. The kind registry, default routing matrix, severities, and mandatory (non-opt-out) kinds live in the engine crate and change by PR. This is the highest-blast-radius configuration in the system; code review and git history are better controls than any dashboard form, and the policy can never drift from the renderers that depend on it.
The operational control plane is the admin GUI at admin.architect.co (gui/packages/admin, behind the existing admin auth + Cloudflare Access perimeter). New /admin/notifications/* routes in api-gateway proxy to notification-engine's internal API; the acting admin's user id travels with every mutation and lands in the ClickHouse audit log. The rule: any action that causes sends to customers or changes what customers receive executes as an authenticated exchange admin — "someone behind Cloudflare Access" is not an acceptable audit answer for margin-call infrastructure. Four pages:
- Announcements — compose (announcement kinds only: severity, title, markdown body, params), audience picker (all / users / accounts), publish_at/expires_at scheduling, per-channel preview, and a publish confirmation that displays the resolved recipient count. List of scheduled and active announcements with cancel (scheduled) and expire-now (active) actions.
- Deliveries — the dead-letter queue with full error detail and attempt history; retry one or bulk; lookup of any delivery by notification, user, or provider message id.
- Controls — per-channel pause/resume backed by notification_channel_controls (with required reason), circuit breaker state and resume, and a read-only view of current rate-limit and ceiling configuration.
- Customers — per-user contact points (view, remove), the user's effective preference matrix (defaults + overrides) with admin override, and a per-user delivery timeline queried from the ClickHouse audit log — the "show me everything we sent customer X and what they acknowledged" view.
Observability is read-only and lives in Grafana over the ClickHouse notification_log (delivery rates, failure/suppression counts, queue depth, breaker state). If interactive drill-down outgrows Grafana, a dedicated internal dashboard (the *.afintech.dev family) can render the same data — strictly read-only, with every mutating affordance linking into admin.architect.co.

Audit (ClickHouse)

Postgres holds operational state; ClickHouse holds the immutable audit trail, written analytics-style (batched inserts, modeled on the transactions table — ReplacingMergeTree, partitioned by month, keyed by delivery id + attempt):

CREATE TABLE notification_log (
    timestamp DateTime64(9) NOT NULL,
    notification_id String NOT NULL,
    user_id FixedString(16) NOT NULL,
    kind LowCardinality(String) NOT NULL,
    channel LowCardinality(String) NOT NULL,
    event LowCardinality(String) NOT NULL,   -- published|sent|failed|dead|suppressed|seen|dismissed|acknowledged
    attempt UInt8 NOT NULL,
    detail String NOT NULL DEFAULT '',       -- error text / suppression reason / provider msg id
    ...
)

Every state transition lands here, including recipient actions — so "customer X was margin-called at T, emailed at T+2s (provider id …), telegram'd at T+2s, saw it in the UI at T+9m, acknowledged at T+10m" is one query. This is the compliance artifact.

Dev/mock environment

Mirroring the EP3-mock philosophy: a DEBUG_NOTIFICATION_SINK mode routes all external sends to a log/file sink (and optionally one override email address), refused in production by the existing AX_ENVIRONMENT guard. Mock senders make the full pipeline testable in CI without provider credentials.
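
The startup guard could take roughly this shape. It is a sketch under stated assumptions: the literal "production" value for AX_ENVIRONMENT, the SinkMode type, and the select_sink_mode function are all hypothetical, and the existing guard may work differently.

```rust
/// Where external sends go for this process.
#[derive(Debug, PartialEq, Eq)]
pub enum SinkMode {
    /// Real provider calls.
    Live,
    /// Log/file sink, optionally with one override email address.
    Debug { override_email: Option<String> },
}

/// Decide the sink mode at startup from already-parsed configuration.
/// `ax_environment` is the value of AX_ENVIRONMENT; `debug_sink` is whether
/// DEBUG_NOTIFICATION_SINK was requested.
pub fn select_sink_mode(
    ax_environment: &str,
    debug_sink: bool,
    override_email: Option<&str>,
) -> Result<SinkMode, String> {
    if !debug_sink {
        return Ok(SinkMode::Live);
    }
    // Assumption: production is identified by the literal value "production".
    if ax_environment.eq_ignore_ascii_case("production") {
        return Err("DEBUG_NOTIFICATION_SINK is refused in production".to_string());
    }
    Ok(SinkMode::Debug { override_email: override_email.map(str::to_string) })
}
```

Returning an error rather than silently falling back to live sending means a misconfigured production deploy fails at startup instead of quietly diverting customer mail.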

Initial producers

producer	kinds	trigger
risk-monitor	margin_warning, margin_call, account_frozen	threshold crossing with hysteresis (new module alongside existing risk domains)
admin GUI	scheduled_downtime, feature_announcement, dividend_announcement	human-authored, scheduled
settlement-engine / transaction-engine	withdrawal_settled, deposit_credited	settlement events
api-gateway	security_new_api_key, security_new_session	auth events

Rollout

Phase 1 — spine + announcements. Schema, notification-engine (ingest, recipients, dispatcher + UI channel only), REST/WS read path, GUI bell/inbox/banner, and the admin Announcements page (api-gateway /admin/notifications/* routes + gui/packages/admin). Ships value immediately (downtime/feature/dividend announcements) with zero external-channel risk.

Phase 2 — risk + email. Contact points with email verification, preferences page, email worker + provider integration, rate limiter + circuit breaker, ClickHouse audit log, risk-monitor margin producer with hysteresis, action_required ack modal, and the remaining admin pages (Deliveries, Controls, Customers) — required before the first external channel goes live, since they are the operating tools for it. This is the phase with compliance weight; the connection-lifecycle test matrix gates it.

Phase 3 — telegram + slack + digests. Telegram linking flow and worker, Slack shared-channel worker (refactor settlement notifier into shared client), optional daily digest email for users who downgrade a category from instant to digest.

Open questions

Digest scheduling — per-user "daily digest" needs a scheduler keyed to user timezone (which we don't store). Punt to phase 3; default UTC?
Four-eyes on broadcasts — should audience: all announcements require second-admin approval, or is the confirmation-with-count screen enough for v1?
Entity-level audiences — onboarding has business entities; do we need audience: entity (all users of an org) before account/user audiences cover real use?
Retention — recipients/deliveries rows grow forever in Postgres; archive-to-ClickHouse-and-prune after N months, and what is N for compliance (margin-call records likely ≥ 7 years — keep CH as the permanent record and prune PG aggressively)?