RFC: Observability for CI/CD and DevOps

Date: 2026-05-25

Background

We want to record and visualize signals about how our engineering process is running over time, starting with two concrete questions:

  1. CI build times. How are rust-test and rust-clippy durations trending? Are caches still working? Did a recent change blow up compile times? (Defined in .github/workflows/code-review-slow.yml.)
  2. PR activity. How many PRs are getting merged over time? Per developer? What's the review-to-merge lead time? Where do PRs get stuck?

We already operate a working observability stack:

Nothing in this stack currently ingests CI or VCS data. This RFC is about which seams to add it on, without spinning up a third platform.

Two slices of data, two ingestion paths

The headline observation is that CI timing data and PR activity data have fundamentally different shapes, and trying to force one pipeline to handle both leads to a worse system than accepting the seam.

                     | Workflow / job / step timing                    | PR + review activity
Shape                | Trace (parent / child spans)                    | Long-lived entities with state
Ends?                | Each run completes once, immutable              | PRs reopen, reviews dismiss, labels churn
Webhook fits?        | Yes, workflow_* events are intrinsically traces | Awkward: webhooks give deltas, queries want state
Backfill?            | Rarely interesting (last 30 days is enough)     | Needed (12 months of history is a common ask)
Natural query        | "p50 of rust-clippy over 30d"                   | "PRs merged per dev in March"
OTel models it well? | Yes                                             | No, only as a generic event bus

The right answer is one pipeline per slice:

GitHub workflow_run / workflow_job  ──►  otelcol (githubreceiver)  ──►  otel_traces
GitHub PR / review / comment        ──►  bespoke poller            ──►  pr_events, pr_current

Both write to the same ClickHouse Cloud cluster and are visualized in the same Grafana on ax-grafana.

Part 1: CI timing via OTel Collector + githubreceiver

Why this shape

opentelemetry-collector-contrib includes a github receiver that ingests GitHub webhook events (workflow_run, workflow_job), verifies HMAC signatures natively, and emits OTel traces shaped as one trace per workflow run with child spans per job and per step. We get per-step timings (cache restore vs cargo clippy itself vs apt-get install, etc.) without writing parser code.

Data lands in the existing otel_traces table, which the existing collector deployment already exports to. Grafana queries it the same way it queries service traces.
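
For readers unfamiliar with the signature check the receiver performs, here is a minimal Python sketch of GitHub's documented scheme: the X-Hub-Signature-256 header carries "sha256=" followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret. The function names are hypothetical and this is not the receiver's actual code, only an illustration of what "verifies HMAC" means.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Build the header value GitHub would send for this body (handy for local tests)."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_github_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Return True if `header` (the X-Hub-Signature-256 value) matches `body`."""
    if not header.startswith("sha256="):
        return False
    expected = sign(secret, body)
    # compare_digest runs in constant time, so a mismatch position is not leaked.
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

The check must run over the raw bytes as received; re-serializing the JSON before hashing would change the digest.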

Network constraint and the Tailscale story

ax-grafana is tailnet-only. Grafana is fronted by Tailscale Serve; nginx listens on :80 on the tailnet (see configs/grafana/nginx.conf). GitHub.com webhooks are delivered from public GitHub IP ranges only; there is no way to receive them over a private network. So using githubreceiver requires a public ingress endpoint, full stop.

Three Tailscale-flavored ways to handle that, ranked:

  1. Tailscale Funnel + nginx IP allowlist + HMAC (recommended). Funnel exposes one path of ax-grafana to the public internet over HTTPS with a real cert. nginx restricts that path to GitHub's published webhook IP ranges (from https://api.github.com/meta, hooks array, refreshed periodically). The receiver verifies HMAC. Three layers: TLS + source IP + signature.

  2. tailscale/github-action in workflows. Each job joins the tailnet as an ephemeral node and pushes telemetry outbound to a private endpoint. Avoids any public exposure, but you lose githubreceiver (it expects webhook payloads, not OTLP from runners), so you'd hand-roll spans in each job. Defeats the purpose.

  3. Self-hosted runners inside the tailnet. Does not actually help: webhooks still come from GitHub.com, not the runners. Ignore.

Option 1 is the path.

Concrete deployment

Add to configs/grafana/compose.yml:

otelcol:
  image: otel/opentelemetry-collector-contrib:<pinned>
  container_name: otelcol_gh_ingest
  network_mode: host
  restart: unless-stopped
  volumes:
    - ./otelcol-config.yaml:/etc/otelcol/config.yaml:ro
  env_file:
    - .env
    - .env.secret

configs/grafana/otelcol-config.yaml:

receivers:
  github:
    # Webhook settings are nested under `webhook` in current contrib
    # releases; confirm against the README of the pinned version.
    webhook:
      endpoint: 127.0.0.1:8088
      path: /events
      secret: ${env:GITHUB_WEBHOOK_SECRET}
exporters:
  clickhouse:
    endpoint: tcp://${env:CH_HOST}:9440?secure=true
    database: default
    traces_table_name: otel_traces
    username: ${env:CH_USER}
    password: ${env:CH_PASSWORD}
service:
  pipelines:
    traces:
      receivers: [github]
      exporters: [clickhouse]

configs/grafana/nginx.conf gains a third location:

location = /gh-webhook {
    # Periodically refreshed from api.github.com/meta (hooks array).
    allow 140.82.112.0/20;
    allow 192.30.252.0/22;
    allow 185.199.108.0/22;
    # ... rest of GitHub's published webhook ranges
    deny all;
    proxy_pass http://127.0.0.1:8088/events;
}

Plus a small cron job that pulls /meta, regenerates the allowlist snippet, and reloads nginx if it changed (~50 lines of bash).
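
The RFC proposes bash for this job; purely to illustrate the logic, here is a minimal Python sketch of the same steps. It assumes the /meta response has a hooks array of CIDR strings (as referenced above). The function names and the idea of writing to an included snippet file are hypothetical, not an agreed design.

```python
import ipaddress
import json
import urllib.request
from pathlib import Path

META_URL = "https://api.github.com/meta"


def fetch_meta(url: str = META_URL) -> dict:
    """Download and decode the GitHub /meta document."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def render_allowlist(meta: dict) -> str:
    """Turn the `hooks` array into nginx allow lines followed by deny all."""
    # Parsing validates each CIDR; the set drops duplicates.
    networks = {ipaddress.ip_network(cidr) for cidr in meta["hooks"]}
    # IPv4 and IPv6 networks cannot be compared directly, so sort on a key.
    ordered = sorted(
        networks,
        key=lambda n: (n.version, int(n.network_address), n.prefixlen),
    )
    lines = [f"allow {net};" for net in ordered]
    lines.append("deny all;")
    return "\n".join(lines) + "\n"


def write_if_changed(path: Path, content: str) -> bool:
    """Write `content` to `path`; return True only if the file changed."""
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True
```

A cron wrapper would call write_if_changed(snippet_path, render_allowlist(fetch_meta())) and reload nginx only when it returns True. Sorting the output keeps the file stable between runs, so an unchanged /meta never triggers a reload.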

GitHub-side: repo Settings -> Webhooks -> Add. URL is the public Funnel hostname + /gh-webhook, content type JSON, secret is whatever GITHUB_WEBHOOK_SECRET resolves to in Doppler, events = "Workflow jobs" + "Workflow runs".

Tailscale Funnel: tailscale funnel --bg --set-path /gh-webhook http://localhost:80/gh-webhook (exact invocation TBD when implementing).

Grafana dashboard

New JSON in configs/grafana/grafana/dashboards/ci-timing.json. Sample panel query:

SELECT toStartOfHour(Timestamp) AS t,
       SpanName,
       quantile(0.5)(Duration) / 1e9 AS p50_s,
       quantile(0.95)(Duration) / 1e9 AS p95_s
FROM otel_traces
WHERE ServiceName = 'github-actions'
  AND SpanName IN ('rust-clippy', 'rust-test')
  AND Timestamp >= now() - INTERVAL 30 DAY
GROUP BY t, SpanName
ORDER BY t

We also want a per-step breakdown panel (select the step-level child spans, which land as their own rows in otel_traces, and group by step name) to catch the case where cache restore time dominates.

Part 2: PR + review activity via a small bespoke poller

Why not also OTel

We could shove every pull_request and pull_request_review webhook payload into otel_logs as structured events and reconstruct PR state via materialized views. It works, but:

  1. Backfill. Webhooks only fire from the moment the hook is enabled. To answer "PRs per week for the last 12 months," we need the GitHub API, not webhooks. githubreceiver doesn't backfill.
  2. Entity queries are awkward over event streams. "Show me open PRs older than 7 days" wants a row of current state, not a fold over events.
  3. Drift recovery. A missed webhook means permanent inconsistency until you rebuild from the API anyway.
  4. Enrichment. "PRs touching the risk-engine crate" and "PRs that closed a Linear issue" are bespoke transforms, not collector features.
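
To make points 2 and 3 concrete, here is a small Python sketch of what "a fold over events" looks like. The event dicts are a simplified, hypothetical projection of pull_request webhook payloads (number, action, merged, label), not the real payload shape.

```python
from dataclasses import dataclass


@dataclass
class PrState:
    number: int
    state: str = "open"
    labels: frozenset = frozenset()


def fold(events):
    """Replay events (assumed ordered by time) into current state per PR."""
    current = {}
    for ev in events:
        pr = current.setdefault(ev["number"], PrState(ev["number"]))
        action = ev["action"]
        if action in ("opened", "reopened"):
            pr.state = "open"
        elif action == "closed":
            pr.state = "merged" if ev.get("merged") else "closed"
        elif action == "labeled":
            pr.labels = pr.labels | {ev["label"]}
        elif action == "unlabeled":
            pr.labels = pr.labels - {ev["label"]}
    return current
```

Every "what is open right now" query has to run this replay over the full history, and dropping a single reopened event leaves the PR marked closed indefinitely. A pr_current row refreshed from the API avoids both problems.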

Proposed shape

A small periodic poller that calls the GitHub GraphQL API every ~5 minutes:

Backfill is just $last_sync = 12 months ago.

The poller can run as another container in configs/grafana/compose.yml on ax-grafana. It calls out to api.github.com, so no public ingress is needed. Auth is a GitHub App installation token (preferred over a PAT so it's scoped and rotatable).
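
A rough Python sketch of one polling pass, under stated assumptions: it uses the GraphQL search connection with an updated:>= qualifier to find PRs touched since the last sync, and the row keys are hypothetical column names for pr_current. The real poller may well walk repository.pullRequests instead. GitHub search also caps results at 1,000 per query, so a 12-month backfill would likely need to be split into smaller date windows.

```python
import json
import urllib.request
from datetime import datetime, timezone

GRAPHQL_URL = "https://api.github.com/graphql"

QUERY = """
query($q: String!, $cursor: String) {
  search(query: $q, type: ISSUE, first: 50, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes {
      ... on PullRequest {
        number state createdAt updatedAt mergedAt
        author { login }
      }
    }
  }
}
"""


def search_string(repo: str, last_sync: datetime) -> str:
    """Build the search qualifier for PRs updated at or after last_sync."""
    stamp = last_sync.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"repo:{repo} is:pr updated:>={stamp}"


def to_row(node: dict) -> dict:
    """Flatten one PullRequest node into a row (hypothetical column names)."""
    author = node.get("author") or {}  # author is null for deleted accounts
    return {
        "number": node["number"],
        "state": node["state"],
        "author": author.get("login", ""),
        "created_at": node["createdAt"],
        "updated_at": node["updatedAt"],
        "merged_at": node.get("mergedAt"),
    }


def fetch_updated(repo: str, last_sync: datetime, token: str):
    """Yield one row per PR updated since last_sync, following pagination."""
    cursor = None
    while True:
        variables = {"q": search_string(repo, last_sync), "cursor": cursor}
        payload = json.dumps({"query": QUERY, "variables": variables}).encode()
        req = urllib.request.Request(
            GRAPHQL_URL,
            data=payload,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            search = json.load(resp)["data"]["search"]
        for node in search["nodes"]:
            if node:  # skip empty nodes for non-PR results
                yield to_row(node)
        if not search["pageInfo"]["hasNextPage"]:
            return
        cursor = search["pageInfo"]["endCursor"]
```

Because each pass returns full current state for every touched PR, a missed cycle is repaired by the next one. This is the drift-recovery property the webhook path lacks.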

Why not put this in alerter

Alerter is framed as "evaluates alert rules and runs probes." Its README explicitly bounds scope: "if it's gone over 10,000 lines of code, we've lost our minds." Adding PR-activity ingestion crosses both the "first write path into CH" line and the "this isn't an alerting concern" line. Keep alerter focused; put the poller in a new small service (or a couple of cron'd gh CLI + clickhouse-client scripts as a v0 if we want to start dirt-cheap and see what sticks).

Alerter does benefit from this work in one way: once the CI/PR tables exist, we can write metric_threshold-style rules against them (e.g., "warn if rust-clippy p50 jumps +30% week-over-week", "warn if median review-to-merge time exceeds 48h"). That's exactly what alerter is for.
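
The arithmetic behind the first example rule is simple enough to sketch in Python. This is not alerter's rule format (which this RFC does not specify); the function name and the choice to stay silent on empty windows are assumptions.

```python
from statistics import median


def p50_jumped(prev_week, this_week, threshold=0.30):
    """True if this week's median duration exceeds last week's by more than threshold."""
    if not prev_week or not this_week:
        return False  # not enough data to compare; assumed to mean "don't warn"
    baseline = median(prev_week)
    if baseline <= 0:
        return False
    return (median(this_week) - baseline) / baseline > threshold
```

Comparing medians rather than means keeps one pathological run (a cold cache, a stuck runner) from tripping the rule by itself.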

What we get for the effort

Probably ~300-500 lines of Rust (or a Go binary, or a Python script for v0). Polls GraphQL, writes CH, exposes nothing.

DORA-metric-style dashboards become straightforward queries against pr_current / pr_events: deployment frequency, lead time for changes, change-failure rate (joined with incident.io data), mean-time-to-restore.
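
As an illustration of how thin these queries become once entity rows exist, here is a Python sketch over rows shaped like a hypothetical pr_current (author, created_at, merged_at as ISO-8601 strings). In practice these would be ClickHouse queries, and lead time here is measured from PR creation to merge, which is only one of several possible definitions.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def parse(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def merged_per_author(rows, year: int, month: int) -> dict:
    """Count PRs merged in the given month, keyed by author."""
    counts = Counter()
    for row in rows:
        if row["merged_at"]:
            merged = parse(row["merged_at"])
            if (merged.year, merged.month) == (year, month):
                counts[row["author"]] += 1
    return dict(counts)


def median_lead_time_hours(rows):
    """Median hours from creation to merge over merged PRs, or None if there are none."""
    hours = [
        (parse(r["merged_at"]) - parse(r["created_at"])).total_seconds() / 3600
        for r in rows
        if r["merged_at"]
    ]
    return median(hours) if hours else None
```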

Prior art worth glancing at first

This is DORA-metrics / engineering-productivity territory and is well-trodden:

We may want to buy the PR-activity dashboard from one of these and only self-host the CI-timing slice (which they generally don't cover as well as a bespoke OTel pipeline does).

Recommendation

  1. Build the CI-timing slice first. otelcol + githubreceiver + Funnel + nginx IP allowlist + HMAC. Lands data in otel_traces, one new dashboard. Estimated effort: a day or two end-to-end.
  2. Survey commercial DORA tools before building the PR-activity slice. If something fits, plug it into Grafana (most have an API) and stop.
  3. Only if we decide to build PR-activity ourselves, write a small poller as a new service, keep it out of alerter, design the CH schema for entity queries rather than event streams.
  4. Eventually wire metric_threshold rules in alerter against both new data sets so regressions page rather than waiting to be noticed in a dashboard.

Open questions