Date: 2026-05-25
We want to record and visualize signals about how our engineering process is running over time, starting with two concrete questions:
- How are `rust-test` and `rust-clippy` durations trending? Are caches still working? Did a recent change blow up compile times? (Defined in `.github/workflows/code-review-slow.yml`.)
We already operate a working observability stack:
- ClickHouse Cloud (`ax-logs` Doppler project, `demo` and `prod` configs) is the durable store. The `otel_*` tables (`otel_logs` / `otel_traces` / `otel_metrics_*`) are populated by the existing pipeline. (See `docs/internal/overview/observability/logs.mdx`.)
- Grafana runs on the `ax-grafana` host (tailnet-only, fronted by Tailscale Serve on `:443`). The ClickHouse datasource is provisioned in `configs/grafana/grafana/provisioning/datasources/clickhouse.yml` and dashboards live alongside as JSON.
- The alerter service (axum) uses bearer-token auth and receives inbound heartbeat POSTs from services.
Nothing in this stack currently ingests CI or VCS data. This RFC is about which seams to add it on, without spinning up a third platform.
The headline observation is that CI timing data and PR activity data have fundamentally different shapes, and trying to force one pipeline to handle both leads to a worse system than accepting the seam.
| | Workflow / job / step timing | PR + review activity |
|---|---|---|
| Shape | Trace (parent / child spans) | Long-lived entities with state |
| Ends? | Each run completes once, immutable | PRs reopen, reviews dismiss, labels churn |
| Webhook fits? | Yes, `workflow_*` events are intrinsically traces | Awkward: webhooks give deltas, queries want state |
| Backfill? | Rarely interesting (last 30 days is enough) | Needed (12 months of history is a common ask) |
| Natural query | "p50 of `rust-clippy` over 30d" | "PRs merged per dev in March" |
| OTel models it well? | Yes | No, only as a generic event bus |
The right answer is one pipeline per slice:

    GitHub workflow_run / workflow_job  ──►  otelcol (githubreceiver)  ──►  otel_traces
    GitHub PR / review / comment        ──►  bespoke poller            ──►  pr_events, pr_current

Both write to the same ClickHouse Cloud cluster and are visualized in the same Grafana on ax-grafana.
`opentelemetry-collector-contrib` includes a `github` receiver (`githubreceiver`) that ingests GitHub webhook events (`workflow_run`, `workflow_job`), verifies HMAC signatures natively, and emits OTel traces shaped as one trace per workflow run with child spans per job and per step. We get per-step timings (cache restore vs `cargo clippy` itself vs `apt-get install`, etc.) without writing parser code.

Data lands in the existing `otel_traces` table, which the existing collector deployment already exports to. Grafana queries it the same way it queries service traces.
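
For reference, the HMAC check the receiver performs natively can be sketched in a few lines. This is a minimal illustration of GitHub's `X-Hub-Signature-256` scheme (HMAC-SHA256 over the raw request body, hex-encoded, prefixed with `sha256=`), not the receiver's actual code. The function names `verify_github_signature` and `sign` are made up for this example.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Build the header value GitHub would send for this body (test helper)."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = sign(secret, body)
    # Compare as bytes in constant time so the check does not leak how many
    # characters matched and never raises on non-ASCII header input.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The signature covers the exact bytes of the body, so anything that rewrites the payload in transit (re-serializing the JSON, for example) would break verification. A plain `proxy_pass` forwards the body unchanged.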
`ax-grafana` is tailnet-only. Grafana is fronted by Tailscale Serve; nginx listens on `:80` on the tailnet (see `configs/grafana/nginx.conf`). GitHub.com webhooks deliver from public GitHub IP ranges only, so there is no way to deliver them over a private network. Using `githubreceiver` therefore requires a public ingress endpoint, full stop.
Three Tailscale-flavored ways to handle that, ranked:
1. Tailscale Funnel + nginx IP allowlist + HMAC (recommended). Funnel exposes one path of `ax-grafana` to the public internet over HTTPS with a real cert. nginx restricts that path to GitHub's published webhook IP ranges (from `https://api.github.com/meta`, `hooks` array, refreshed periodically). The receiver verifies HMAC. Three layers: TLS + source IP + signature.
2. `tailscale/github-action` in workflows. Each job joins the tailnet as an ephemeral node and pushes telemetry outbound to a private endpoint. Avoids any public exposure, but you lose `githubreceiver` (it expects webhook payloads, not OTLP from runners), so you'd hand-roll spans in each job. Defeats the purpose.
3. Self-hosted runners inside the tailnet. Does not actually help: webhooks still come from GitHub.com, not the runners. Ignore.
Option 1 is the path.
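Add to `configs/grafana/compose.yml`:

    otelcol:
      image: otel/opentelemetry-collector-contrib:<pinned>
      container_name: otelcol_gh_ingest
      network_mode: host
      restart: unless-stopped
      # The contrib image defaults to /etc/otelcol-contrib/config.yaml,
      # so point it at the mounted path explicitly.
      command: ["--config=/etc/otelcol/config.yaml"]
      volumes:
        - ./otelcol-config.yaml:/etc/otelcol/config.yaml:ro
      env_file:
        - .env
        - .env.secret

`configs/grafana/otelcol-config.yaml`: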

    receivers:
      github:
        # Schema varies by release (some nest these fields under `webhook:`);
        # verify against the pinned version.
        endpoint: 127.0.0.1:8088
        path: /events
        secret: ${env:GITHUB_WEBHOOK_SECRET}
    exporters:
      clickhouse:
        endpoint: tcp://${env:CH_HOST}:9440?secure=true
        database: default
        traces_table_name: otel_traces
        username: ${env:CH_USER}
        password: ${env:CH_PASSWORD}
    service:
      pipelines:
        traces:
          receivers: [github]
          exporters: [clickhouse]

`configs/grafana/nginx.conf` gains a third location:

    location = /gh-webhook {
        # Periodically refreshed from api.github.com/meta (hooks array).
        allow 140.82.112.0/20;
        allow 192.30.252.0/22;
        allow 185.199.108.0/22;
        # ... rest of GitHub's published webhook ranges
        deny all;
        proxy_pass http://127.0.0.1:8088/events;
    }

Plus a small cron job that pulls `/meta`, regenerates the allowlist snippet, and reloads nginx if it changed (~50 lines of bash).
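
The regeneration step of that cron job could look roughly like the sketch below. It is written in Python rather than bash purely for illustration. It assumes the `/meta` response has already been fetched and parsed into a dict, and it leaves out writing the file and reloading nginx. The names `render_allowlist` and `needs_reload` are hypothetical.

```python
import ipaddress


def render_allowlist(meta: dict) -> str:
    """Render nginx allow/deny directives from the `hooks` array of /meta."""
    # ip_network raises ValueError on malformed CIDRs, so a bad response
    # fails here instead of producing a broken nginx snippet.
    networks = {ipaddress.ip_network(cidr) for cidr in meta["hooks"]}
    # IPv4 before IPv6, then by address, so the output is stable across runs
    # and a plain string comparison detects real changes.
    ordered = sorted(
        networks,
        key=lambda n: (n.version, int(n.network_address), n.prefixlen),
    )
    lines = [f"allow {net};" for net in ordered]
    lines.append("deny all;")
    return "\n".join(lines) + "\n"


def needs_reload(new_snippet: str, current_snippet: str) -> bool:
    """True if the rendered snippet differs from what nginx currently uses."""
    return new_snippet != current_snippet
```

Sorting and de-duplicating before rendering means nginx reloads only when GitHub's published ranges actually change, not when the API returns them in a different order.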
GitHub-side: repo Settings -> Webhooks -> Add. URL is the public Funnel hostname + `/gh-webhook`, content type JSON, secret is whatever `GITHUB_WEBHOOK_SECRET` resolves to in Doppler, events = "Workflow jobs" + "Workflow runs".

Tailscale Funnel: `tailscale funnel --bg --set-path /gh-webhook http://localhost:80/gh-webhook` (exact invocation TBD when implementing).
New JSON in `configs/grafana/grafana/dashboards/ci-timing.json`. Sample panel query:

    SELECT toStartOfHour(Timestamp) AS t,
           SpanName,
           quantile(0.5)(Duration) / 1e9 AS p50_s,
           quantile(0.95)(Duration) / 1e9 AS p95_s
    FROM otel_traces
    WHERE ServiceName = 'github-actions'
      AND SpanName IN ('rust-clippy', 'rust-test')
      AND Timestamp >= now() - INTERVAL 30 DAY
    GROUP BY t, SpanName
    ORDER BY t

We also want a per-step breakdown panel (`arrayJoin` the step spans, group by step name) to catch the case where cache restore time dominates.
We could shove every `pull_request` and `pull_request_review` webhook payload into `otel_logs` as structured events and reconstruct PR state via materialized views. It works, but:

- `githubreceiver` doesn't backfill.

A small periodic poller that calls the GitHub GraphQL API every ~5 minutes:
- Fetches PRs updated since `$last_sync`, plus their reviews, comments, commits, files-changed count.
- `pr_current`: one row per PR with current state (`ReplacingMergeTree` keyed on `(repo, number)`).
- `pr_events`: append-only log of state transitions (`MergeTree` ordered by `(repo, number, event_ts)`).

Backfill is just `$last_sync = 12 months ago`.
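
The fetch loop above can be sketched as follows. This is a minimal sketch under stated assumptions, not a finished poller. `run_query` is a hypothetical callable that sends a GraphQL query with variables to the GitHub API and returns the response's `data` object. The selected fields are only a subset of what the poller would need. Because `pullRequests` has no "updated since" filter, the sketch orders by `UPDATED_AT` descending and stops at the first PR older than `last_sync`.

```python
from datetime import datetime, timezone
from typing import Callable, Iterator

PR_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 50, after: $cursor,
                 orderBy: {field: UPDATED_AT, direction: DESC}) {
      nodes { number state updatedAt mergedAt changedFiles }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""


def parse_ts(value: str) -> datetime:
    """Parse GitHub's ISO 8601 UTC timestamps (trailing "Z")."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def prs_updated_since(
    run_query: Callable[[str, dict], dict],
    owner: str,
    name: str,
    last_sync: datetime,
) -> Iterator[dict]:
    """Yield PR nodes updated at or after last_sync, newest first."""
    cursor = None
    while True:
        data = run_query(PR_QUERY, {"owner": owner, "name": name, "cursor": cursor})
        page = data["repository"]["pullRequests"]
        for pr in page["nodes"]:
            # Results are newest-first, so the first stale PR ends the scan.
            if parse_ts(pr["updatedAt"]) < last_sync:
                return
            yield pr
        if not page["pageInfo"]["hasNextPage"]:
            return
        cursor = page["pageInfo"]["endCursor"]
```

With this shape, backfill and steady-state polling are the same code path: only the value of `last_sync` differs.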
The poller can run as another container in `configs/grafana/compose.yml` on `ax-grafana`. It calls out to `api.github.com`, so no public ingress is needed. Auth is a GitHub App installation token (preferred over a PAT so it's scoped and rotatable).
Alerter is framed as "evaluates alert rules and runs probes." Its README explicitly bounds scope: "if it's gone over 10,000 lines of code, we've lost our minds." Adding PR-activity ingestion crosses both the "first write path into CH" line and the "this isn't an alerting concern" line. Keep alerter focused; put the poller in a new small service (or a couple of cron'd `gh` CLI + `clickhouse-client` scripts as a v0 if we want to start dirt-cheap and see what sticks).
Alerter does benefit from this work in one way: once the CI/PR tables exist, we can write `metric_threshold`-style rules against them (e.g., "warn if `rust-clippy` p50 jumps +30% week-over-week", "warn if median review-to-merge time exceeds 48h"). That's exactly what alerter is for.
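
The week-over-week rule reduces to a small comparison. The sketch below shows only that logic; alerter's actual rule format is not described here, so the function name `p50_regressed` and its inputs (lists of job durations in seconds for each week) are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def p50_regressed(
    previous_week_s: Sequence[float],
    current_week_s: Sequence[float],
    threshold: float = 0.30,
) -> bool:
    """True if this week's p50 exceeds last week's by more than `threshold`."""
    if not previous_week_s or not current_week_s:
        return False  # not enough data to compare
    baseline = median(previous_week_s)
    if baseline <= 0:
        return False  # avoid dividing by a zero or nonsensical baseline
    change = (median(current_week_s) - baseline) / baseline
    return change > threshold
```

Returning `False` on missing data keeps a quiet week (no runs) from paging anyone. Whether that is the right default is a judgment call for the rule author.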
Probably ~300-500 lines of Rust (or a Go binary, or a Python script for v0). Polls GraphQL, writes CH, exposes nothing.
DORA-metric-style dashboards become straightforward queries against `pr_current` / `pr_events`: deployment frequency, lead time for changes, change-failure rate (joined with incident.io data), mean-time-to-restore.
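
As an illustration of how simple these become once `pr_current` exists, here is a sketch of a median open-to-merge time, a common proxy for lead time (DORA's own definition runs from commit to deploy). The row shape is an assumption: the `created_at` and `merged_at` column names are hypothetical, since the table schema is not defined in this document.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Optional


def median_open_to_merge(rows: Iterable[dict]) -> Optional[timedelta]:
    """Median time from PR creation to merge, ignoring unmerged PRs."""
    durations = [
        (row["merged_at"] - row["created_at"]).total_seconds()
        for row in rows
        if row.get("merged_at") is not None
    ]
    if not durations:
        return None  # no merged PRs in the window
    return timedelta(seconds=median(durations))
```

In practice the same aggregation would more likely be a ClickHouse query behind a Grafana panel; the Python version only makes the definition explicit.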
This is DORA-metrics / engineering-productivity territory and is well-trodden:
- `dora-team/fourkeys`: Google's reference impl for the four DORA metrics, ingests via webhooks.
- OSS-Compass, oss-review-toolkit: GraphQL-based collectors.
We may want to buy the PR-activity dashboard from one of these and only self-host the CI-timing slice (which they generally don't cover as well as a bespoke OTel pipeline does).
- `githubreceiver` + Funnel + nginx IP allowlist + HMAC. Lands data in `otel_traces`, one new dashboard. Estimated effort: a day or two end-to-end.
- `metric_threshold` rules in alerter against both new data sets so regressions page rather than waiting to be noticed in a dashboard.
- Which version of `otel/opentelemetry-collector-contrib` should we pin? The `github` receiver was in alpha at the time of writing; verify the config schema (`endpoint` / `path` / `secret` field names have shifted across releases) against the version we choose.
- `demo` and `prod` repos' CI data, or one per environment? Probably one, since CI is per-repo, not per-env.
- `otel_collector` user has INSERT on the `otel_*` tables; confirm before pointing the new collector at it.