RFC: Notifications

Date: 2026-06-10

Status: Draft. Nothing here is implemented; this document proposes the end-state design and a phased rollout.

Background

AX has no notifications system. What exists today is a set of disconnected fragments:

Meanwhile the things an exchange must tell its customers keep growing: scheduled downtime and upgrades, new product/feature launches, dividend and corporate-action announcements on equity index products (#2256 just added equities to the Pyth stream), settlement notices, deposit/withdrawal confirmations, security events (new API key, new session), and most critically margin calls, where "we told the customer, and here is the proof" is a compliance requirement, not a nicety.

Goals

  1. One pipeline from any internal producer to any customer-facing channel (UI inbox/banner/popup, email, Telegram, Slack), with per-kind routing policy and per-user preferences.
  2. Durable, auditable delivery. Every notification has a persistent record of who it was addressed to, which channels it went out on, when, whether it succeeded, and whether the user saw/acknowledged it.
  3. Anti-spam by construction. Dedup, coalescing, hysteresis at the source, rate limits at the sink, and a global circuit breaker so a buggy producer can never email every customer 1,000 times.
  4. Catch-up correctness. A client that reconnects (or logs in days later) sees exactly the notifications it missed, in order, with no gaps and no duplicates.

Non-goals (for now): SMS/mobile push (no mobile trading app yet), marketing campaigns/drip email, in-thread two-way chat over any channel.

Build vs. buy

Hosted notification platforms (Knock, Courier, Novu) solve template management and channel fan-out well, but they put customer identity, margin state, and announcement timing in a third party's hands, and none of them model our account-permission visibility rules. Self-hosted Novu was considered and rejected as a large operational surface (Mongo + Redis + several Node services) for a feature set we mostly don't need. We build a thin engine on the patterns we already run everywhere (Postgres outbox + per-service WebSocket), and use commodity providers only at the very edge (an email API, the Telegram Bot API, Slack chat.postMessage the settlement-engine notifier already shows the shape of that last one).

Domain model

Notification

A notification is a single logical message: what happened, addressed to an audience, with a severity and kind. It is immutable once published.

Per-recipient state

Each (notification, user) pair carries lifecycle state, advanced only by the recipient:

addressed ──► delivered(channel)* ──► seen ──► dismissed
                                        └────► acknowledged   (action_required only)

Channels and contact points

A contact point is a verified address on a channel: a confirmed email address, a Telegram chat_id captured via a t.me/<bot>?start=<token> linking flow, a Slack channel ID (AX already runs shared per-client Slack channels the bot posts into those). The ui channel is implicit and needs no contact point. Unverified addresses are never sent to.

Routing policy and preferences

Each kind declares a default routing matrix (which channels, at which severity). Users override per (category, channel) category, not kind, to keep the settings page tractable (account_activity, risk, announcements, security). Two hard rules sit above preferences:

  1. Mandatory kinds cannot be opted out of on the ui channel: margin_call, account_frozen, scheduled_downtime, security events. (Users may still disable the email/telegram copies of most of these; margin calls default-on for every channel with a verified contact point.)
  2. Nothing external is opt-out by accident: external channels are opt-in per category except risk and security, which are opt-out.
kind (examples) severity ui email telegram slack
margin_call action_required modal
margin_warning warning banner opt opt opt
scheduled_downtime warning banner opt
dividend_announcement info inbox opt opt opt
feature_announcement info inbox opt
withdrawal_settled info toast opt opt
security_new_api_key warning toast

( = default on, opt = default off, = never. Exact matrix lives in code next to the kind registry, not in this doc.)

Architecture

A new service, rs/notification-engine, following the standard skeleton (main.rs/config.rs/service.rs, env config via ax_sdk_internal::env_vars, HealthPublisher, TaskGroup).

producers                            notification-engine                     sinks
─────────                            ───────────────────                     ─────
risk-monitor          ─┐                                             ┌─► ui: WS stream + REST inbox
settlement-engine     ─┤  internal    ┌──────────┐   ┌────────────┐  │   (dispatcher tails Postgres
api-gateway           ─┼─ publish ───►│ ingest / │──►│  Postgres  │──┤    on seq / state_seq)
admin GUI ─► api-gw    │  (HTTP/IPC)  │  router  │   │  outbox    │  ├─► email worker ──► provider API
   admin routes ───────┘              └──────────┘   │ deliveries │  ├─► telegram worker ► Bot API
                                            │        └────────────┘  └─► slack worker ──► chat.postMessage
                                            ▼               │
                                      rate limiter /        ▼ (batch, analytics-style)
                                      circuit breaker  ClickHouse notification_log
                                                       (append-only audit)

Redis is deliberately absent: every hop in the UI path is either Postgres or a connection whose failure forces the client to re-sync from Postgres. The existing Redis pub/sub pattern fits market data, where the next tick supersedes a lost message; notifications are the opposite shape low frequency, every message matters so the durable store is the transport.

Ingestion

Producers call POST /internal/publish on notification-engine (intra-VPC HTTP, same trust posture as other internal service calls), passing kind, params, audience, dedupe_key. The SDK-internal crate gets a small typed client (NotificationPublisher) so producers don't hand-roll requests. Publishing is synchronous up to the Postgres commit once the producer gets 200, delivery is the engine's durable responsibility. Producers must treat publish as fire-and-forget after that; a producer must never block trading work on notification delivery.

Admin-authored announcements (downtime, dividends, features) come in through the admin GUI at admin.architect.co, via new /admin/notifications/* routes in api-gateway that proxy to notification-engine's internal API. Admin authn/authz stays where it already lives (api-gateway, users.is_admin), and the acting admin's user id is threaded through as the actor on every mutation it lands in notifications.source and the audit log. See §Management surfaces for the full control panel.

Persistence (Postgres, in db/postgres/1.sql)

Six tables. IDs follow the existing CHAR(16) convention.

CREATE TABLE notifications (
    id CHAR(16) PRIMARY KEY,
    kind TEXT NOT NULL,
    severity TEXT NOT NULL,
    params JSONB NOT NULL DEFAULT '{}'::jsonb,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    source TEXT NOT NULL,                  -- producing service / admin user id
    dedupe_key TEXT UNIQUE,
    account_id CHAR(16) REFERENCES trading_accounts(id),  -- subject, if account-scoped
    publish_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE SEQUENCE notification_state_seq;

CREATE TABLE notification_recipients (
    seq BIGINT GENERATED ALWAYS AS IDENTITY,
    state_seq BIGINT NOT NULL DEFAULT 0,   -- engine sets nextval() on every state change
    notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
    user_id CHAR(16) NOT NULL REFERENCES users(id),
    seen_at TIMESTAMPTZ,
    dismissed_at TIMESTAMPTZ,
    acknowledged_at TIMESTAMPTZ,
    PRIMARY KEY (notification_id, user_id)
);
CREATE INDEX ON notification_recipients (user_id, seq);
CREATE INDEX ON notification_recipients (state_seq);

CREATE TABLE notification_deliveries (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    notification_id CHAR(16) NOT NULL REFERENCES notifications(id),
    user_id CHAR(16) NOT NULL REFERENCES users(id),
    channel TEXT NOT NULL,                 -- 'email' | 'telegram' | 'slack'
    contact_point_id CHAR(16) NOT NULL REFERENCES user_contact_points(id),
    status TEXT NOT NULL DEFAULT 'pending',-- pending|sent|failed|dead|suppressed
    attempts INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_error TEXT,
    provider_message_id TEXT,
    sent_at TIMESTAMPTZ,
    UNIQUE (notification_id, user_id, channel)
);
CREATE INDEX ON notification_deliveries (status, next_attempt_at)
    WHERE status = 'pending';

CREATE TABLE user_contact_points (
    id CHAR(16) PRIMARY KEY,
    user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    channel TEXT NOT NULL,
    address TEXT NOT NULL,                 -- email | telegram chat_id | slack channel id
    verified_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, channel, address)
);

CREATE TABLE user_notification_preferences (
    user_id CHAR(16) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    category TEXT NOT NULL,                -- account_activity|risk|announcements|security
    channel TEXT NOT NULL,
    enabled BOOLEAN NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, category, channel)
);

CREATE TABLE notification_channel_controls (
    channel TEXT PRIMARY KEY,              -- 'email' | 'telegram' | 'slack' | '*'
    paused BOOLEAN NOT NULL DEFAULT FALSE,
    reason TEXT,
    actor_user_id CHAR(16) REFERENCES users(id),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Design notes:

Delivery workers

One worker task per channel (spawned under TaskGroup, wrapped in run_durably with the standard DurableTaskConfig exponential backoff). Each loop claims work with the classic skip-locked poll:

SELECT ... FROM notification_deliveries
WHERE channel = $1 AND status = 'pending' AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT $batch
FOR UPDATE SKIP LOCKED;

Anti-spam: four layers

  1. Source hysteresis. Producers of oscillating signals must debounce before publishing. For margin: notify on upward threshold crossing, then re-arm only after utilization falls 5 points below the threshold, with a per-(account, threshold) cooldown (1h). This logic moves server-side into risk-monitor the existing client-side useMarginAlert thresholds become display-only.
  2. Dedup. dedupe_key uniqueness makes producer retries free.
  3. Per-(user, channel) token bucket in the router: e.g. email 10/hour burst 3, telegram 20/hour. Overflow suppressed (audited), with the notification still always landing in the UI inbox, which is never rate limited.
  4. Global circuit breaker. If pending external deliveries created in the last minute exceed a global ceiling (env-configured, e.g. 500), external sending pauses (a '*' row in notification_channel_controls), ops gets paged, and an admin resumes from the control panel after reviewing the queue. A runaway producer becomes an ops incident, not a customer-facing spam incident. A NOTIFICATIONS_DISABLED env var remains as break-glass.

Read path: API + realtime

notification-engine hosts its own REST + WS endpoints (the order-gateway / marketdata-publisher pattern), authenticated with the same session/api-key middleware as other services:

Realtime mechanism: tail Postgres, no broker. A single dispatcher task inside notification-engine polls notification_recipients on the existing indexes WHERE seq > $cursor and WHERE state_seq > $cursor, ~1s interval and routes new/changed rows to connected users in memory. That is one indexed query per second globally against a table growing a handful of rows per minute. Because the dispatcher only ever reads committed rows, there is no publish step that can be missed: a notification is at worst one poll-tick late, never lost. This also makes the engine multi-instance safe for free each replica tails Postgres and serves its own connections, with no cross-instance fan-out to coordinate. If sub-second latency ever matters, layer LISTEN/NOTIFY on top purely as a dispatcher wake-up (NOTIFY fires transactionally on commit); the seq-cursor poll remains the correctness mechanism either way.

The public SDK (rs/sdk) gains protocol types for the read API only generic naming, no internal service or datastore references, per the existing SDK hygiene rule.

Reconnect contract (this is connection-state machinery, so per repo policy: client disconnect, server-initiated disconnect, engine restart/crash, and reconnect/catch-up paths all get tests before rollout): the client runs since_seq catch-up on every (re)connect; pushed items are deduped by seq, so a re-push is harmless. The only server-side failure mode left is the engine process crashing which necessarily drops every WS connection, which triggers client reconnect, which runs catch-up. The failure and its heal are the same event; there is no state in which a healthy socket sits in front of silently-dropped messages.

GUI

Management surfaces

Management splits into three tiers by blast radius:

  1. Routing policy is code. The kind registry, default routing matrix, severities, and mandatory (non-opt-out) kinds live in the engine crate and change by PR. This is the highest-blast-radius configuration in the system; code review and git history are better controls than any dashboard form, and the policy can never drift from the renderers that depend on it.

  2. The operational control plane is the admin GUI at admin.architect.co (gui/packages/admin, behind the existing admin auth + Cloudflare Access perimeter). New /admin/notifications/* routes in api-gateway proxy to notification-engine's internal API; the acting admin's user id travels with every mutation and lands in the ClickHouse audit log. The rule: any action that causes sends to customers or changes what customers receive executes as an authenticated exchange admin "someone behind Cloudflare Access" is not an acceptable audit answer for margin-call infrastructure. Four pages:

    • Announcements compose (announcement kinds only: severity, title, markdown body, params), audience picker (all / users / accounts), publish_at/expires_at scheduling, per-channel preview, and a publish confirmation that displays the resolved recipient count. List of scheduled and active announcements with cancel (scheduled) and expire-now (active) actions.
    • Deliveries the dead-letter queue with full error detail and attempt history; retry one or bulk; lookup of any delivery by notification, user, or provider message id.
    • Controls per-channel pause/resume backed by notification_channel_controls (with required reason), circuit breaker state and resume, and a read-only view of current rate-limit and ceiling configuration.
    • Customers per-user contact points (view, remove), the user's effective preference matrix (defaults + overrides) with admin override, and a per-user delivery timeline queried from the ClickHouse audit log the "show me everything we sent customer X and what they acknowledged" view.
  3. Observability is read-only and lives in Grafana over the ClickHouse notification_log (delivery rates, failure/suppression counts, queue depth, breaker state). If interactive drill-down outgrows Grafana, a dedicated internal dashboard (the *.afintech.dev family) can render the same data strictly read-only, with every mutating affordance linking into admin.architect.co.

Audit (ClickHouse)

Postgres holds operational state; ClickHouse holds the immutable audit trail, written analytics-style (batched inserts, modeled on the transactions table ReplacingMergeTree, partitioned by month, keyed by delivery id + attempt):

CREATE TABLE notification_log (
    timestamp DateTime64(9) NOT NULL,
    notification_id String NOT NULL,
    user_id FixedString(16) NOT NULL,
    kind LowCardinality(String) NOT NULL,
    channel LowCardinality(String) NOT NULL,
    event LowCardinality(String) NOT NULL,   -- published|sent|failed|dead|suppressed|seen|dismissed|acknowledged
    attempt UInt8 NOT NULL,
    detail String NOT NULL DEFAULT '',       -- error text / suppression reason / provider msg id
    ...
)

Every state transition lands here, including recipient actions so "customer X was margin-called at T, emailed at T+2s (provider id ), telegram'd at T+2s, saw it in the UI at T+9m, acknowledged at T+10m" is one query. This is the compliance artifact.

Dev/mock environment

Mirroring the EP3-mock philosophy: a DEBUG_NOTIFICATION_SINK mode routes all external sends to a log/file sink (and optionally one override email address), refused in production by the existing AX_ENVIRONMENT guard. Mock senders make the full pipeline testable in CI without provider credentials.

Initial producers

producer kinds trigger
risk-monitor margin_warning, margin_call, account_frozen threshold crossing with hysteresis (new module alongside existing risk domains)
admin GUI scheduled_downtime, feature_announcement, dividend_announcement human-authored, scheduled
settlement-engine / transaction-engine withdrawal_settled, deposit_credited settlement events
api-gateway security_new_api_key, security_new_session auth events

Rollout

Phase 1 spine + announcements. Schema, notification-engine (ingest, recipients, dispatcher + UI channel only), REST/WS read path, GUI bell/inbox/banner, and the admin Announcements page (api-gateway /admin/notifications/* routes + gui/packages/admin). Ships value immediately (downtime/feature/dividend announcements) with zero external-channel risk.

Phase 2 risk + email. Contact points with email verification, preferences page, email worker + provider integration, rate limiter + circuit breaker, ClickHouse audit log, risk-monitor margin producer with hysteresis, action_required ack modal, and the remaining admin pages (Deliveries, Controls, Customers) required before the first external channel goes live, since they are the operating tools for it. This is the phase with compliance weight; the connection-lifecycle test matrix gates it.

Phase 3 telegram + slack + digests. Telegram linking flow and worker, Slack shared-channel worker (refactor settlement notifier into shared client), optional daily digest email for users who downgrade a category from instant to digest.

Open questions

  1. Digest scheduling per-user "daily digest" needs a scheduler keyed to user timezone (which we don't store). Punt to phase 3; default UTC?
  2. Four-eyes on broadcasts should audience: all announcements require second-admin approval, or is the confirmation-with-count screen enough for v1?
  3. Entity-level audiences onboarding has business entities; do we need audience: entity (all users of an org) before account/user audiences cover real use?
  4. Retention recipients/deliveries rows grow forever in Postgres; archive-to-ClickHouse-and-prune after N months, and what is N for compliance (margin-call records likely 7 years keep CH as the permanent record and prune PG aggressively)?