RFC: Unbounded History Pagination

Date: 2026-06-09

Status: Adopted. Stage 1 shipped as PR #2498 (transactions, 2026-06-16). On 2026-06-22 we adopted the directive below to extend the reform to all timeseries endpoints; where the directive conflicts with the design sections that follow, the directive wins.

Adopted directive (2026-06-22)

The pagination reform shipped for transactions in PR #2498 worked well; we are now applying it to all timeseries endpoints. Three principles govern each endpoint as it migrates:

Accelerate every filter combination. Every combination of selects on an endpoint's ChXXXFilters struct must be accelerated, through some combination of skip indexes, primary-key selection, and alternate projections. Where it is not feasible to support all combinations, limit or disallow the combinations that are unreasonable to serve rather than serving them slowly.
Partial and empty pages are legal: return early. The pagination contract (docs/internal/overview/pagination.mdx) has been updated to allow a page to come back short or empty while still carrying a next_cursor, so a handler does not have to kill itself to fill a page: if the scan is running long, return early and hand back a scan-boundary cursor ({timestamp_ns}). Ensure every filter combination can return early. Be very careful with FINAL: if you don't need it, don't use it (FINAL forces a full merge-on-read and can reach back arbitrarily far in time).
Backstop every query with max_execution_time. Every ClickHouse query carries max_execution_time (SELECT_SETTINGS, currently 4s) as a last-resort defense; a timeout maps to a 400 telling the caller to narrow the range or add a filter, not a 500.
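
The return-early principle can be sketched as follows. This is a minimal illustration in Python rather than the gateway's own language; scan_page, Page, the chunk shape, and the injected clock are hypothetical names. Only the two cursor shapes ({timestamp_ns}:{id} and {timestamp_ns}) come from the directive.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple[int, int]  # (timestamp_ns, id); hypothetical row shape


@dataclass
class Page:
    rows: List[Row] = field(default_factory=list)
    next_cursor: Optional[str] = None  # absent means the range is exhausted


def scan_page(
    chunks: Iterable[Tuple[int, List[Row]]],
    limit: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Page:
    """Consume newest-first scan chunks until the page fills, the scan is
    exhausted, or the time budget runs out.

    Each chunk is (scanned_down_to_ns, matching_rows): once a chunk has been
    consumed, every timestamp at or after scanned_down_to_ns has been examined.
    """
    deadline = clock() + budget_s
    page = Page()
    for scanned_down_to_ns, rows in chunks:
        for row in rows:
            page.rows.append(row)
            if len(page.rows) == limit:
                ts, row_id = row
                page.next_cursor = f"{ts}:{row_id}"  # row cursor: a real row position
                return page
        if clock() >= deadline:
            # Out of budget: hand back a short or empty page plus a
            # scan-boundary cursor so the caller can resume from here.
            page.next_cursor = str(scanned_down_to_ns)
            return page
    return page  # scan exhausted: no next_cursor
```

A page that fills on the very last row still carries a row cursor; the follow-up request then returns an empty page with no next_cursor, which is legal under the directive.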

Where this supersedes the sections below: the target model's item 3 claimed an empty page can only occur at true exhaustion under fill-or-exhaust walking — under the directive, short and empty pages are legal non-terminal responses on any endpoint, and the scan-boundary cursor form (reserved in item 1) is now part of the documented cursor grammar. Termination is unchanged: the absence of next_cursor, and nothing else. Fill-or-exhaust slice-walking (§Design) is no longer the required mechanism — a handler should satisfy the contract with a single index-aligned keyset query where the schema allows (what transactions shipped), returning early with a boundary cursor only where it cannot.
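
To make the two-form cursor grammar concrete, here is a small parsing sketch in Python. The class and function names are hypothetical, and the id is kept as an opaque string because this document does not specify its type.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RowCursor:
    """{timestamp_ns}:{id}: resume strictly after a real row."""
    timestamp_ns: int
    row_id: str  # kept opaque; the id type is an assumption


@dataclass(frozen=True)
class BoundaryCursor:
    """{timestamp_ns}: resume from a scan boundary, no row attached."""
    timestamp_ns: int


Cursor = Union[RowCursor, BoundaryCursor]


def parse_cursor(raw: str) -> Cursor:
    """Parse either documented cursor form; raise ValueError on anything else."""
    ts_text, sep, id_text = raw.partition(":")
    if not (ts_text.isascii() and ts_text.isdigit()):
        raise ValueError(f"invalid cursor timestamp: {raw!r}")
    ts = int(ts_text)
    if not sep:
        return BoundaryCursor(ts)
    if not id_text:
        raise ValueError(f"row cursor is missing its id: {raw!r}")
    return RowCursor(ts, id_text)


def is_terminal(next_cursor: Optional[str]) -> bool:
    """Termination is the absence of next_cursor and nothing else."""
    return next_cursor is None
```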

Background

In early June we shipped a hard 7-day cap on all history endpoints (MAX_HISTORICAL_QUERY_WINDOW_NS, rs/sdk/src/protocol/time_range.rs): both time bounds are required, and any range wider than 7 days is rejected with a 400. The cap came out of two deliberate hotfixes:

A-3520 / PR #2168: the order-history endpoint allowed unbounded full table scans of historical_orders (~272M rows per query) and was bringing prod ClickHouse to its knees. Fix: bound the window to 7 days, and add a (timestamp_ns, user_id, order_id) projection so time-range queries prune.
A-3521 / PR #2172: generalized the bound to all history endpoints (fills, trades, transactions, funding-transactions), rejecting rather than silently clamping.

The cap stopped the bleeding, but on 2026-06-09 it produced its first customer-facing incident: a Flowdesk customer's withdrawals "disappeared" from the Admin UI because they were older than the UI's largest selectable window. "Show me all withdrawals for this customer" is now impossible to express against the API — by design.

Meanwhile the GUI has independently grown three client-side workarounds:

useWindowedCursorPagination / stepWindowedQuery (gui/packages/app/util/historyWindow.ts) walks 7-day windows backwards, skipping empty windows, capped at 90 days lookback and 16 steps.
Admin useTransactions (gui/packages/admin/src/hooks/useTransactions.ts) loops a cursor up to 20 pages within one normalized window.
PR #2331 adds 7-day-aware filter UX ("walk back 7 days") to admin history tabs.

Every external API consumer (e.g. Flowdesk's integration, ops curl scripts) must rebuild the same loop. When a client-side loop must be reimplemented by every caller of an API, the loop belongs on the server.

Problem statement

The 7-day cap bounds the wrong variable. Query cost in ClickHouse is not driven by the width of the requested time range; it is driven by how many granules the query reads, which is a function of how well the query's predicates align with the table's primary key, partitioning, and skip indexes. Concretely:

An account-scoped query over 2 years against an account-led index reads a handful of granules. Cheap, but currently rejected.
A sparse filter (e.g. transaction_types=withdrawal for one user) over a single hot 7-day window can still read every granule in the window, run FINAL merge-on-read over all of it, and issue a separate COUNT(*) ... FINAL for total_count, all on every page. Expensive, but currently allowed.

The cap therefore simultaneously over- and under-protects, while exporting a window-walking loop to every client and silently hiding data in UIs that don't walk.

Industry survey (condensed)

Stripe, Slack, Shopify, GitHub (GraphQL): opaque keyset cursors; no hard time-range caps on list endpoints. "All transactions ever" is fine because each page is a bounded index seek + LIMIT; the unbounded range is never scanned in one query. Slack explicitly made cursors opaque "to give the server the ability to encode additional information within the cursor."
GitHub Search API is the cautionary tale: a hard 1,000-result cap whose documented workaround is client-side date-slicing. It is a frequently criticized pagination design, and structurally what we shipped.
DynamoDB normalizes the contract we need: each request reads a bounded amount (≤1 MB pre-filter) and may return zero items plus a continuation key; clients continue until the key is absent, not until a page is empty.
Grafana Loki splits every range query into ~1-day slices server-side, executes newest-first, stops when the limit is met — and also keeps a generous hard backstop (~30 days). Split-walking and a backstop coexist.
Elasticsearch replaced scroll with search_after + point-in-time: keyset tuple + tiebreaker, snapshot anchoring so concurrent writes don't shift pages.
Binance is the one exchange-API precedent for hard time caps (24h on trade history) — and it pairs the cap with a fromId keyset escape hatch.

The consensus mechanism for "unbounded history without unbounded scans" is: cursor = bounded per-page work, server-side slicing for non-indexed predicates, scan budgets as the backstop.

Target semantic model

The current public contract (docs/internal/overview/pagination.mdx) is mostly right — it already documents the time range as optional. Stage 1 restores that promise (the 7-day hotfix violated it) and makes four things precise, without adding response fields:

1. The cursor format is retained: {timestamp_ns}:{id}, a real row position. The server walks slices to fill-or-exhaust (see Design), so a page is either full (anchored on its last row) or terminal; the cursor never has to encode a partial-scan boundary, and stays the documented row format. We still describe it to clients as "pass it back unchanged" so a later move to a versioned/opaque token (to pin an "as-of" upper bound or carry walking progress) doesn't break callers, but no such token ships now.
2. The time range is an optional filter, subordinate to the cursor. Both bounds may be omitted; an open-ended request scans from now back to the start of history. The client no longer owns a mandatory window; the cursor owns position.
3. Termination is the absence of next_cursor, never a short or empty page. A page with fewer than limit rows is not a stop signal; only a missing next_cursor is. No complete flag and no searched_until_ns watermark are needed: next_cursor: None carries exactly the "exhausted" signal, and fill-or-exhaust walking guarantees an empty page only occurs at true exhaustion (so it is always terminal). The current "look-ahead" wording equates a non-full page with the last page; that is wrong once the server walks slices and must be removed before third parties couple to it.
4. Counts are lower bounds or absent. total_count on cursor pages is semantically ambiguous (count of which scope?) and costs a COUNT(*) ... FINAL per page. Omit it for wide/unbounded ranges; where provided, it means "at least N."
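
The termination rule is easiest to see from the caller's side. The sketch below is a hypothetical client loop in Python; fetch_page and the "items" field name are assumptions, while next_cursor is the field named in the contract. The loop stops only when next_cursor is absent, so it is correct whether or not the server ever returns a short or empty page.

```python
from typing import Callable, Iterator, Optional


def iterate_all(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[object]:
    """Yield every item across pages.

    fetch_page(cursor) is a hypothetical function returning a dict with an
    "items" list and an optional "next_cursor" string.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        # A short or empty page is not a stop signal; keep whatever it holds.
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if cursor is None:
            return  # the only termination condition
```

The cursor is passed back unchanged and never inspected, which keeps the client compatible with a later move to an opaque token.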

Design

DB layer (ClickHouse)

Align the index with the dominant query. History queries are overwhelmingly "one entity, ordered by time."

historical_orders already has the proj_ts_user projection (A-3520).
transactions is ReplacingMergeTree, PRIMARY KEY (timestamp, account_id, sequence_number), queried with FINAL. Projections are not usable here: ClickHouse does not apply projections to SELECT ... FINAL queries (and ALTER TABLE ... ADD PROJECTION on ReplacingMergeTree requires opting into deduplicate_merge_projection_mode). The implementable alternative is a bloom_filter skip index on account_id (and user_id if we query it directly): granule pruning composes fine with FINAL. Bloom filters skip granules with zero matches, so they help sparse/cold accounts most and degrade for accounts active in every granule — acceptable, since hot accounts' rows cluster near the scan frontier anyway.
Cursor predicate: today's form (ts < $c) OR (ts = $c AND id < $i) relies on the optimizer's OR-analysis for PK pruning. Add the redundant range conjunct — ts <= $c AND (...) — so pruning is guaranteed, not inferred. (The server-side slice bounds below add ts >= slice_start AND ts < slice_end conjuncts naturally.)
Scan budgets as backstop: per-query max_execution_time (with timeout_overflow_mode left at throw) bounds any single slice query. max_rows_to_read / read_overflow_mode='break' is available but returns silently-truncated results with no resume position, and partial results can poison the query cache (ClickHouse #67476) — prefer slice-granular budgets (the cursor resumes at an exact window boundary) over row-granular breaks.
Audit FINAL + per-page COUNT(*): set do_not_merge_across_partitions_select_final=1 (partitions are monthly; merges need not cross them), and stop issuing the count query for wide/unbounded ranges per model item 4.
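
As an illustration of the cursor-predicate item above, the following Python sketch builds the guarded keyset clause and checks, on an in-memory grid, that the redundant conjunct does not change which rows match. The column names, the {name:Type} parameter placeholders, the UInt64 types, and the helper names are assumptions for illustration; the 4s timeout mirrors the value quoted in the directive.

```python
from typing import Tuple

Row = Tuple[int, int]  # (ts, id); hypothetical row shape


def keyset_where(ts_col: str = "timestamp_ns", id_col: str = "id") -> str:
    """Descending keyset predicate with the redundant range conjunct up front."""
    return (
        f"{ts_col} <= {{c:UInt64}} AND "
        f"({ts_col} < {{c:UInt64}} OR "
        f"({ts_col} = {{c:UInt64}} AND {id_col} < {{i:UInt64}}))"
    )


def page_query(table: str, limit: int, ts_col: str = "timestamp_ns", id_col: str = "id") -> str:
    """One page: keyset seek, limit + 1 look-ahead row, time-boxed."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {keyset_where(ts_col, id_col)} "
        f"ORDER BY {ts_col} DESC, {id_col} DESC "
        f"LIMIT {int(limit) + 1} "
        "SETTINGS max_execution_time = 4"
    )


def matches_original(row: Row, c: int, i: int) -> bool:
    """Today's form: (ts < c) OR (ts = c AND id < i)."""
    ts, rid = row
    return ts < c or (ts == c and rid < i)


def matches_guarded(row: Row, c: int, i: int) -> bool:
    """Same predicate with the explicit ts <= c range conjunct added."""
    ts, rid = row
    return ts <= c and (ts < c or (ts == c and rid < i))
```

Both forms are equivalent to the tuple comparison (ts, id) < (c, i); the extra conjunct only gives the planner an explicit range on the leading key column.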

API layer (api-gateway / order-gateway)

Accept unbounded ranges; walk windows server-side to fill-or-exhaust. The handler chunks the requested (or open-ended) range into slices — MAX_HISTORICAL_QUERY_WINDOW_NS becomes the slice size instead of a rejection threshold — and queries slice-by-slice in sort order until the page fills or the range is exhausted:

resolve range:  start = requested or 0; end = requested or now (pinned)
position     =  cursor position, else range edge per sort direction
loop:
    query slice [max(start, pos − slice), pos) with remaining limit + 1
    accumulate rows; advance pos to slice boundary
    stop when limit + 1 rows collected or range exhausted
respond:
    next_cursor = last row position (page full) | absent (range exhausted)

Because the loop only stops on a full page or true exhaustion, next_cursor is always a real row position or absent — never a partial-scan boundary. That is what keeps the {ts}:{id} cursor format and lets termination ride entirely on next_cursor: None, with no complete/searched_until_ns. The cost is unbounded request latency on very sparse open-ended scans: each slice is max_execution_time-boxed, but the number of slices a single request walks is not capped.
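
The loop above can be sketched in runnable form. This is a Python illustration for the newest-first direction only, not the gateway implementation; walk_page and the query_slice callback signature are hypothetical. query_slice(lo, hi, before, n) is assumed to return up to n rows, newest first, with lo <= ts < hi and, when before is given, (ts, id) strictly less than it.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, int]  # (timestamp_ns, id); hypothetical row shape
QuerySlice = Callable[[int, int, Optional[Row], int], List[Row]]


def walk_page(
    query_slice: QuerySlice,
    start_ns: int,
    end_ns: int,
    slice_ns: int,
    limit: int,
    cursor: Optional[Row] = None,
) -> Tuple[List[Row], Optional[Row]]:
    """Walk slices newest-first until limit + 1 rows are collected or the
    range [start_ns, end_ns) is exhausted. Returns (rows, next_cursor)."""
    # hi is exclusive; with a cursor, rows sharing its timestamp but with a
    # smaller id are still wanted, so the first slice ends just past it.
    hi = end_ns if cursor is None else min(end_ns, cursor[0] + 1)
    rows: List[Row] = []
    while hi > start_ns and len(rows) <= limit:
        lo = max(start_ns, hi - slice_ns)
        rows.extend(query_slice(lo, hi, cursor, limit + 1 - len(rows)))
        hi = lo  # advance to the slice boundary
    if len(rows) > limit:
        page = rows[:limit]  # drop the look-ahead row
        return page, page[-1]  # full page: cursor is the last returned row
    return rows, None  # range exhausted: no next_cursor
```

Note that the number of iterations depends on how sparse the data is, which is the unbounded-latency cost described above.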

The GUI's stepWindowedQuery is a working prototype of exactly this algorithm, written in the wrong tier; the reform ports it down the stack.

Backstops, not gates: per-slice max_execution_time bounds each query; a filter-aligned index (see DB layer above) keeps each slice cheap. A whole-ledger admin scan with no entity filter remains the one genuinely unprunable query; one option is to require an entity, symbol, or type filter for open-ended ranges on admin endpoints. Per-query cost protection is strictly stronger than the range cap (a 7-day window over a hot month is still a heavy query today), which also addresses the admin-vs-trading-traffic isolation concern.
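
A rough sketch of the two backstops at the handler level, in Python for illustration. Every name here is hypothetical (QueryTimeout stands in for whatever error the driver raises when max_execution_time is exceeded), and treating a request as open-ended when either bound is missing is an assumption, not a decision this RFC has made.

```python
from typing import Callable, Dict, Optional, Tuple


class QueryTimeout(Exception):
    """Hypothetical stand-in for the driver's max_execution_time error."""


def reject_unprunable(
    start_ns: Optional[int], end_ns: Optional[int], filters: Dict[str, object]
) -> Optional[str]:
    """Return a rejection reason for an open-ended request with no selective filter."""
    open_ended = start_ns is None or end_ns is None  # assumption: either bound missing
    selective = any(filters.get(key) for key in ("entity", "symbol", "type"))
    if open_ended and not selective:
        return "add a filter or bound the range"
    return None


def handle(
    run_query: Callable[[], object],
    start_ns: Optional[int],
    end_ns: Optional[int],
    filters: Dict[str, object],
) -> Tuple[int, object]:
    """Map both backstops to a 400 so callers know the request is fixable."""
    reason = reject_unprunable(start_ns, end_ns, filters)
    if reason is not None:
        return 400, reason
    try:
        return 200, run_query()
    except QueryTimeout:
        return 400, "query timed out; narrow the range or add a filter"
```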

If tail latency bites (a request walking years of empty slices), the escape hatch is a per-request MAX_SLICES_PER_REQUEST cap (e.g. 13 ≈ 90 days). A cap means a request can stop mid-scan and hand back a slice-boundary resume position on a possibly-empty page, which in turn requires the empty-page-with-cursor contract and a versioned/opaque cursor v2 ({position: row(ts, id) | boundary(ts), as_of_end_ns, …}, base64; legacy {ts}:{id} cursors still decode to a row position). We do not ship this now; it is recorded as the known next step if fill-or-exhaust proves too slow.
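
One possible shape for such a cursor v2, sketched in Python. The JSON field names ("v", "kind", "ts", "id"), the use of URL-safe base64 without padding, and both function names are assumptions for illustration; the RFC only fixes the position variants, as_of_end_ns, and the requirement that legacy cursors keep decoding.

```python
import base64
import json
from typing import Optional


def encode_cursor_v2(ts: int, row_id: Optional[str], as_of_end_ns: int) -> str:
    """Encode a row position (row_id given) or a slice boundary (row_id None)."""
    if row_id is None:
        position = {"kind": "boundary", "ts": ts}
    else:
        position = {"kind": "row", "ts": ts, "id": row_id}
    payload = {"v": 2, "position": position, "as_of_end_ns": as_of_end_ns}
    raw = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(token: str) -> dict:
    """Decode a v2 token, or a legacy {ts}:{id} cursor as a row position."""
    # URL-safe base64 never contains ':', so a colon marks a legacy cursor.
    if ":" in token:
        ts_text, _, id_text = token.partition(":")
        if not (ts_text.isascii() and ts_text.isdigit() and id_text):
            raise ValueError("malformed legacy cursor")
        position = {"kind": "row", "ts": int(ts_text), "id": id_text}
        return {"v": 1, "position": position, "as_of_end_ns": None}
    padded = token + "=" * (-len(token) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    if not isinstance(payload, dict) or payload.get("v") != 2:
        raise ValueError("unsupported cursor version")
    return payload
```

Clients would still treat the token as opaque and pass it back unchanged; only the server encodes and decodes it.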

GUI layer

Delete the client-side walkers (historyWindow.ts windowing, admin 20-page loops, 16-step empty-window heuristics) once the server walks. Hooks become plain useInfiniteQuery with getNextPageParam: (last) => last.next_cursor.
Time filters become optional refinement, not a required gate; "All time" returns as a legal option.
Show "1,000+" rather than exact totals where the count is omitted. With fill-or-exhaust walking there are no non-terminal partial pages, so the GUI needs no "keep searching" affordance — a request returns a full page or the last page. (A "searched back to ⟨date⟩ — keep searching" UX returns only if we later adopt the capped variant.)

Docs / public contract

Update docs/internal/overview/pagination.mdx (and any public derivative) to: remove the "Look-ahead pattern" section (it teaches "non-full page ⟹ done", which is wrong once the server walks slices — the limit + 1 trick survives in the implementation, just not as the documented termination rule); keep "Client-side iteration" (its "continue until next_cursor is absent" rule is already correct); state that the time range is optional and total_count may be omitted for wide/unbounded ranges. The cursor format and the rest of the contract are unchanged. The public version must describe iteration behaviorally, without naming the datastore.

Rollout

Transactions endpoints first (the incident surface) — shipped as PR #2498: 7-day cap lifted, plain keyset cursor (no slice-walking needed — the timestamp-led PK short-circuits ORDER BY/LIMIT), FINAL dropped in favor of event_id dedup on read, symbol skip index added, per-page COUNT(*) dropped, max_execution_time backstop with timeout mapped to 400.
All remaining timeseries endpoints (fills / trades / funding-rates / funding-transactions / orders) follow the same shape, per the adopted directive above. (historical_orders already has its projection; trades is symbol-led and may want a per-account projection or index of its own. Funding-rates started in PR #2099 — see that review for directive gaps to close: per-page COUNT(*) retained, no max_execution_time on its queries.)
GUI simplification per above, including reverting the load-bearing parts of PR #2331's filter clamping to optional UX.
Docs rewrite lands with stage 1, before third parties couple to the current cursor format and termination semantics.

What we are explicitly not committing to

Migrating every paginated endpoint to cursor v2 (limit/offset stays for small stable collections: users, risk profiles).
Removing total_count everywhere (offset-paginated admin tables keep it).
Removing the 7-day constant — it survives as the internal slice size and as a per-slice cost unit.
read_overflow_mode='break' row-granular partial results (revisit if slice-granular budgets prove too coarse).

Open questions

Should open-ended admin whole-ledger queries (no entity/symbol/type filter) be allowed at all, or 400 with "add a filter or bound the range"?
How slow does a fill-or-exhaust request get on the worst real sparse query, and at what latency do we adopt the capped variant (and with it cursor v2)?
trades table: what is the per-account access path? A projection would be viable only if the queries did not need FINAL, but trades is also a ReplacingMergeTree queried with FINAL, so it is likely bloom-filter territory as well.