DREW
Dependency Remediation & Engineering Worker — an autonomous remediation agent that polls the repo's Dependabot alerts, picks the highest-severity open one, bumps the vulnerable dependency (fixing any breaking changes the bump introduces), opens a PR, and shepherds it through CI to merge — throttled to one open PR at a time. The trunk is live as a supervised prototype (7 merged bumps); three tracks branch off it: CI/CD pipeline watch, the Adjudicator review bot, and the deferred incident-triage lane.
§0 Principles
DREW is a coding agent — use the harness, don't rebuild it
BARD is a read-only analyst and hand-rolls a tiny tool loop on the raw Anthropic SDK (py/bard/bard/agent.py). That's right for SQL-only work. DREW must clone a repo, edit manifests, run cargo/just, fix the call-site breakage a version bump introduces, and drive git/gh through CI — which is the Claude Code loop. A plain version-bump bot can't repair breaking changes; the harness can. Reimplementing it on the raw SDK would be rebuilding Claude Code, badly. DREW's brain is the harness; only the orchestration around it is custom. → §2a, §2b
Automating PRs without automating review just moves the bottleneck
A bot that opens PRs but stops there hasn't removed human toil — it has relocated it to the scarcest resource, the reviewer: every parked drew/deps/* PR waits on exactly the person DREW was built to relieve. So the deliverable is the full path to merge. On the trunk, safety comes from scope, not a per-PR human — a mechanical, revertable change class (manifests/lockfiles + minimal call-site fixes), clamped capabilities (repo-scoped App token, network-layer egress allowlist), and merge gated on full CI green plus a deterministic out-of-agent verifier. Where a required human review remains, the Adjudicator track exists to satisfy it under conservative, defer-by-default thresholds. The worst case stays one mechanical, CI-passing bump that a single git revert undoes. → §2g, §2e, §2d
Throttled, ephemeral, auditable
DREW holds one open PR at a time and works the backlog highest-severity-first, so at most one unattended merge is ever in flight. Each remediation runs in a fresh, disposable, hard-killable container with a clean clone — nothing persists — and every tool call + full transcript lands in #drew-audit + S3. Trust is earned in phases: dry-run → supervised live runs → lights-out merge on the safest dependency class, widening from there. → §2c, §1
§1 Progress / Tracker
Live snapshot — phase status and the bar below are computed from GitHub PR state at build time (2026-06-10). 1/12 phases done · 8%.
Trunk — Dependabot remediation (v1)
The proving ground. Staged so that merge authority is the last thing granted — the supervisor and the cage came first with no write token, the bump engine ran dry, live runs are operator-triggered and human-approved, and only phase 5 grants lights-out merge.
1 · Supervisor — poll, rank, throttle, guards · in progress
A supervisor that polls the Dependabot alerts API
(GET /repos/architect-xyz/ax/dependabot/alerts?state=open), ranks open
alerts highest-severity-first (critical → high → medium → low,
tie-broken by CVSS then age, grouped by resolution), and enforces the core
throttle: if the open DREW-PR count is at its cap (DREW_MAX_OPEN_PRS,
default 1), do nothing this cycle. It dedups (skip alerts that already
have a drew/deps/* branch), spawns one worker for the top group, and
owns the audit trail. It does not think — it routes.
Built (prototype, #2300): the local-first CLI form — just alerts /
just next / just remediate / just resume — authenticating as the
ax-drew GitHub App (JWT → ~1h installation token, never the
operator's gh auth), with artifacts under .drew-audit/. Remaining
for prod: the long-lived daemon posture — scheduled polling, Redis
guards (per-alert lock, concurrency gate), secrets from AWS Secrets
Manager, and the #drew-audit Slack + S3 trail (BARD's guard/ skeleton).
- #2300 DREW prototype — supervisor CLI, ranking, resolution groups, open-PR throttle, App auth
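The ranking and throttle rules for this phase can be sketched as below. This is a minimal illustration under stated assumptions, not the prototype's code: the `Alert` record, `rank`, `pick_next`, and the `branches_in_flight` set are hypothetical names, resolution grouping is left out, and "age" is assumed to mean older alerts first.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Lower number sorts first: critical -> high -> medium -> low.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Alert:
    """Hypothetical, trimmed view of one open Dependabot alert."""
    number: int
    severity: str          # "critical" | "high" | "medium" | "low"
    cvss: float            # assumed 0.0 when the advisory has no score
    created_at: datetime


def rank(alerts):
    """Highest severity first, then higher CVSS, then older alert."""
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_ORDER[a.severity], -a.cvss, a.created_at),
    )


def pick_next(alerts, open_drew_prs, branches_in_flight, max_open_prs=1):
    """Return the alert to work this cycle, or None when throttled or idle.

    branches_in_flight is an assumed set of alert numbers that already
    have a drew/deps/* branch (the dedup rule).
    """
    if open_drew_prs >= max_open_prs:
        return None  # throttle: do nothing this cycle
    for alert in rank(alerts):
        if alert.number not in branches_in_flight:
            return alert
    return None
```

The throttle is checked before any ranking work, which matches the "do nothing this cycle" rule: a full PR slot short-circuits the whole cycle.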
2 · Worker cage + egress jail + harness wiring · in progress
The disposable unit: a container that takes a fresh shallow clone and runs
claude -p (headless) on a metered Anthropic API key
(§3.1). Its only route to the network is
iron-proxy — default-deny,
permitting api.anthropic.com, github.com/api.github.com, and the
package registries a bump needs (static.crates.io + index,
registry.npmjs.org) and nothing else. Tool surface constrained via
--allowedTools + --permission-mode; PreToolUse blocks dangerous
bash, PostToolUse audits every call; output captured as
--output-format stream-json.
Built (prototype, #2300): the worker container (non-root,
--cap-drop ALL, no-new-privileges, --init, cpu/mem caps, named +
torn down on Ctrl-C) and the mandatory egress jail (py/drew/egress/)
— doctor/remediate refuse to run unless the jail is up; ships in
warn mode. Remaining: flip warn → enforce after a clean run,
boundary credential-swap (GitHub token + x-api-key), and read-only
rootfs / tmpfs-noexec hardening (§2e).
- #2300 worker container, hardening flags, mandatory iron-proxy egress jail (warn mode)
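The container hardening flags listed for the prototype could be assembled into a `docker run` invocation along these lines. A sketch only: the function name, image name, network name, user id, and resource values are placeholders chosen for illustration, and the read-only rootfs and tmpfs hardening that are still to land are deliberately absent.

```python
def worker_docker_argv(name, image, env_file,
                       network="drew-sandbox",   # hypothetical bridge name
                       cpus="4", memory="8g",    # illustrative caps
                       uid_gid="1000:1000"):     # any non-root user
    """Build the argv for one disposable, hardened worker container."""
    return [
        "docker", "run",
        "--rm",                                   # destroy on exit
        "--init",                                 # reap zombies, forward signals
        "--name", name,                           # named so it can be torn down
        "--user", uid_gid,                        # non-root
        "--cap-drop", "ALL",                      # drop every capability
        "--security-opt", "no-new-privileges",
        "--cpus", cpus,
        "--memory", memory,
        "--network", network,                     # only route out is the proxy
        "--env-file", env_file,                   # creds never baked into image
        image,
    ]
```

Returning an argv list rather than a shell string keeps the supervisor free of quoting bugs when it hands the command to `subprocess.run`.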
3 · Remediation engine — bump + build + fix, dry-run · done
The core skill, built to run with nothing pushed. Given one alert
group, the worker resolves the fix
(security_vulnerability.first_patched_version,
§3.11), decides direct-vs-transitive, edits the
manifest if direct, runs the precise lockfile bump, then
just rs/format + cargo check — and, when the bump breaks a call
site, edits source until it compiles (the coding-agent value a plain
bump bot can't deliver, §2d). Output is a diff + build result —
no branch, no PR. just remediate runs exactly this; dry-run is the
default everywhere.
- #2300 dry-run remediation engine — resolve, bump, build, fix; diff + build result, nothing pushed
4 · Supervised live runs — operator-triggered, human-gated · in progress
First live writes, with a human in the loop twice over: every run is
operator-triggered (just remediate --live) and every merge passes
the repo's required review. The worker opens a ready PR,
immediately arms auto-merge (gh pr merge --auto), shepherds only
to drive CI green (≤ max_shepherd_attempts), then exits with a
parseable hand-off — DREW_STATUS: MERGED | PARKED | BAILED — and the
supervisor re-attaches to parked PRs via drew resume, picking the
shepherd loop back up in medias res. This phase replaced the planned
draft-PR shadow: ready PRs + required review give the same calibration
with a real merge path.
The live track record below is the calibration data. Two items remain before phase 5. First, the deterministic out-of-agent verifier, built and validated against these very diffs (it must pass clean bumps and flag scope violations), because in phase 5 that checker, not a human, is the last gate before merge. Second, the tuning run for the context-window kill threshold, which starts at 50% (§3.6).
5 · Lights-out — scheduled supervisor, merge on green · not started
Grant unattended operation: the supervisor runs on a schedule (no
human trigger), and merge completes without a per-PR human. Mechanically
auto-merge is already armed today — what changes is the gate: either
the required-review rule is relaxed for the drew/deps/* class
(checks-only branch protection), or the Adjudicator track supplies
the approving review under its conservative thresholds
(§3.12 — this is now the adjudicator's job to earn).
It lands narrow — lockfile-only / patch-level bumps first (5.1),
widening to minor, then major / breaking-change fixes (5.2) as the record
justifies. Exhaust the shepherd attempts or trip the verifier → bail:
leave the PR open, comment why, post to #drew-audit, and move on.
Hard prerequisites: the phase-1 daemon posture, the phase-2 enforce-mode
jail + credential-swap, and the phase-4 verifier.
CI/CD — pipeline watch & speed
Monitor and improve the CI/CD pipelines — make rust-test and rust-clippy faster while provably maintaining coverage. Ambient benefit to every developer on every PR. Needs only the cage, the App's existing Actions: read, and the audit trail — not merge authority.
branches from phase 2
C1 · Pipeline telemetry — measure before touching · not started
Read-only first: the supervisor pulls per-workflow / per-job / per-step
timings, queue waits, and cache hit rates from the Actions API
(Actions: read — the App already has it), persists the series, and
builds the picture: where do rust-test and rust-clippy actually spend
their time, which steps regressed and when, what's the p50/p95 wall-clock
per PR. Deliverable: a trend report in #drew-audit and a ranked hotspot
list (cold caches, redundant rebuilds, serial bottlenecks, oversized
runners idling). No opinion without data — this phase is the data.
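As an illustration of the p50/p95 picture this phase is meant to produce, the sketch below groups job durations and ranks jobs by their p95. The input shape (pairs of job name and duration in seconds) is an assumption about how the persisted series might look, and ranking by p95 is one reasonable choice among several.

```python
from collections import defaultdict
from statistics import quantiles


def job_percentiles(samples):
    """samples: iterable of (job_name, duration_seconds) -> {job: (p50, p95)}."""
    by_job = defaultdict(list)
    for job, seconds in samples:
        by_job[job].append(seconds)

    stats = {}
    for job, durations in by_job.items():
        if len(durations) < 2:
            continue  # quantiles() needs at least two data points
        cuts = quantiles(durations, n=100, method="inclusive")
        stats[job] = (cuts[49], cuts[94])  # 50th and 95th percentile
    return stats


def hotspots(samples):
    """Job names ordered slowest-first by p95 wall-clock."""
    stats = job_percentiles(samples)
    return sorted(stats, key=lambda job: stats[job][1], reverse=True)
```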
C2 · Optimization PRs — human-merged, coverage-proven · not started
DREW proposes pipeline improvements as PRs — cache keying, test sharding
/ cargo nextest, clippy invocation scope, runner sizing, dependency
pre-builds — each carrying before/after timing evidence from C1 and a
coverage proof: a deterministic diff showing the executed-test set
and enabled-lint set did not shrink (§3.18). Every CI
change is human-merged — pipeline definitions gate the whole repo, and
the ax-drew App deliberately lacks Workflows: write
(§3.17 — open), so this lane cannot even push a workflow
edit without an explicit new grant. Speed claims are verified on main
after merge, not just on the PR branch.
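The coverage proof reduces to a set comparison: nothing that ran before the change may be missing after it. A minimal sketch, assuming the executed-test names and enabled-lint names have already been collected into plain lists (how they are collected is left open, as in §3.18); the function names are hypothetical.

```python
def missing_after(before, after):
    """Items present before the change but absent after it, sorted for a stable diff."""
    return sorted(set(before) - set(after))


def coverage_proof(tests_before, tests_after, lints_before, lints_after):
    """Deterministic check that neither the test set nor the lint set shrank."""
    missing_tests = missing_after(tests_before, tests_after)
    missing_lints = missing_after(lints_before, lints_after)
    return {
        "ok": not missing_tests and not missing_lints,
        "missing_tests": missing_tests,
        "missing_lints": missing_lints,
    }
```

Additions are allowed by construction: only removals fail the proof, so a PR that adds tests while speeding up the pipeline still passes.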
C3 · Regression watchdog — ratchet the wins · not started
Continuous monitoring so improvements don't silently erode: alert
#drew-audit when main's CI wall-clock regresses past a sustained
threshold, bisect to the offending commit, and file the issue with the
evidence attached. The ratchet — not one-off optimization — is what makes
this track ambient infrastructure rather than a single project.
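One way to read "regresses past a sustained threshold" is sketched below: alert only when each of the last few runs on main exceeds the baseline median by a tolerance. The 15% tolerance and the five-run window are placeholder values, not figures from this design, and the function name is hypothetical.

```python
from statistics import median


def sustained_regression(baseline_s, recent_s, tolerance=0.15, min_runs=5):
    """True when the last `min_runs` durations all exceed baseline by `tolerance`.

    baseline_s: wall-clock seconds from the reference period.
    recent_s:   wall-clock seconds for recent runs on main, oldest first.
    """
    if not baseline_s or len(recent_s) < min_runs:
        return False  # not enough evidence to call it sustained
    limit = median(baseline_s) * (1 + tolerance)
    return all(duration > limit for duration in recent_s[-min_runs:])
```

Requiring every run in the window to exceed the limit trades detection speed for fewer false alarms from a single slow runner.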
Adjudicator — autonomous PR review
The answer to the moved bottleneck (principle 2). The adjudicator is a separate review bot: it reviews PRs and auto-approves only under very conservative thresholds — if there is any doubt, it must defer to a human. Approval authority is earned the same way merge authority was: shadow first.
branches from phase 2
A1 · Shadow reviewer — advisory comments, no authority · not started
The adjudicator reviews every DREW PR — multi-pass: correctness of the breaking-change fixes, diff-scope discipline, supply-chain checks (new transitive deps, install scripts, typosquats) — and posts its findings as comments only. It cannot approve, request changes, or merge. Calibration is the whole point: every verdict is compared against the human reviewer's eventual decision, and the number that matters is the false-pass rate — how often the adjudicator would have approved something a human caught. No authority is requested until that rate is measured, not estimated (§2g).
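The false-pass rate can be computed from paired verdicts as in this sketch. The denominator here is the set of PRs the adjudicator would have approved, which is one plausible reading of the metric; the pair layout and the function name are assumptions.

```python
def false_pass_rate(verdicts):
    """verdicts: iterable of (adjudicator_would_approve, human_approved) booleans.

    Returns the share of would-approve verdicts that the human did not
    approve, or None when the adjudicator would have approved nothing.
    """
    would_approve = [human for adjudicator, human in verdicts if adjudicator]
    if not would_approve:
        return None  # rate is unmeasured, not zero
    false_passes = sum(1 for human in would_approve if not human)
    return false_passes / len(would_approve)
```

Returning `None` for an empty sample keeps "unmeasured" distinct from "zero", which matters because authority is only requested once the rate is measured.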
A2 · Conservative auto-approve — DREW's bounded class only · not started
A separate GitHub App identity (ax-adjudicator,
§3.14) gains Pull requests: write and may submit an
approving review — but only when every threshold passes: the diff
is in the bounded mechanical class, the deterministic verifier passed,
full CI is green, the review found zero findings, and the shadow-phase
false-pass rate is below target. Any doubt → defer: the adjudicator
posts an explicit "deferred to human" comment with its analysis and does
nothing else. Approval is the only authority granted — merge still
flows through branch protection + DREW's armed auto-merge, so the human
review gate on phase 5 is satisfied without being deleted.
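The all-thresholds-or-defer rule can be written as a pure function over the evidence. A sketch under assumptions: the `ReviewEvidence` fields and `decide` are hypothetical names, and the target false-pass rate is a required argument because this design does not fix a number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewEvidence:
    in_bounded_class: bool                    # diff is in the mechanical class
    verifier_passed: bool                     # deterministic verifier result
    ci_green: bool                            # full CI, not a subset
    findings: int                             # review findings not dismissed
    shadow_false_pass_rate: Optional[float]   # None = not yet measured


def decide(evidence, target_false_pass_rate):
    """Return ("approve", []) only if every threshold passes, else ("defer", reasons)."""
    reasons = []
    if not evidence.in_bounded_class:
        reasons.append("diff outside the bounded mechanical class")
    if not evidence.verifier_passed:
        reasons.append("deterministic verifier did not pass")
    if not evidence.ci_green:
        reasons.append("CI is not fully green")
    if evidence.findings > 0:
        reasons.append(f"{evidence.findings} review finding(s)")
    rate = evidence.shadow_false_pass_rate
    if rate is None or rate >= target_false_pass_rate:
        reasons.append("shadow false-pass rate unmeasured or not below target")
    return ("defer", reasons) if reasons else ("approve", [])
```

The reasons list doubles as the body of the "deferred to human" comment, so a deferral always explains itself.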
A3 · Widen — other mechanical PR classes, defer-by-default · not started
Extend beyond DREW's own PRs to other mechanical, low-blast-radius classes — generated-code refreshes, docs-only changes, config bumps with schema-validated diffs — one class at a time, each with its own shadow calibration before any approval authority. The stance stays defer-by-default: the adjudicator exists to clear the obvious, not to replace review. Human-authored feature work is out of scope indefinitely.
Incident triage — diagnose & draft a fix
The original DREW mission, deferred behind the Dependabot lane because it is the harder, higher-blast-radius job. Inverted posture: draft-only, human-gated, never auto-merged — untrusted prod logs are an injection surface the dependency lane doesn't have.
branches from phase 2
T1 · Incident triage — diagnose & draft a fix (future) · not started
A Slack Socket-Mode listener on #ax-incidents, read-only senses
(incident.io / ClickHouse logs / Sentry), an automated triage+diagnose
pipeline (the incident-demo/incident-prod methodology), and — crucially
— a posture inverted from the Dependabot lane: open-ended root cause over
untrusted prod logs, so it is draft-only, human-gated (✅
in-thread), and never auto-merged. It reuses the supervisor / worker /
egress-jail / audit stack proven on the trunk; what it adds is the
read-only MCP senses, the diagnosis pipeline, and a shadow-triage
calibration run. Mind the md_pub ↔ marketdata_publisher name
translation, and the stack-wide-vs-local gate that keeps DREW from
"fixing" code for an infra hiccup. Egress re-adds incident.io + Sentry +
the ClickHouse log MCPs for this lane only.
Notebook
Reference design — the detailed mechanics behind the tracker.
§2a Why DREW is not BARD-shaped
This framing drives every downstream decision, so it comes first. BARD and DREW look like siblings — both are LLM bots wired into the AX stack — but they sit on opposite sides of one line: read-only analysis vs. autonomous code change.
| | BARD analyst | DREW engineer |
|---|---|---|
| Job | Answer BI questions over Postgres/ClickHouse | Remediate a vulnerable dependency end-to-end |
| Tool surface | 5 narrow, fully-constrained read tools | Filesystem, grep, build, git, gh — open-ended |
| LLM plumbing | Hand-rolled loop on raw Anthropic SDK (agent.py) | The Claude Code harness (it already is this loop) |
| Side effects | None — sql_safety rejects non-SELECT | Edits manifests, opens a PR, merges it on green CI |
For BARD's job, a bespoke loop on the raw SDK is exactly right: the tool surface is tiny and every tool is independently guarded. For DREW's job, the tool surface is a coding agent — edit a manifest, run the build, fix the breakage a bump introduces, drive a PR through CI. That loop, with its context management, permissioning, and sandbox, is precisely what Claude Code already implements. The bump itself a script could do; repairing the call sites a major-version bump breaks is the part that needs the agent.
Two earlier RFCs, CARL and GOPHER (alee/carl-rfc → PR #1760; alee/gopher-design-doc → PR #1807, both closed unmerged, Apr–May 2026), designed the same supervisor + sandboxed-headless-Claude-Code shape for exactly this job — Dependabot remediation — and both deferred incident triage to "a separate RFC". DREW v1 is that Dependabot bot, realized; incident triage is the deferred RFC, now the incident-triage track. They independently reached DREW's harness conclusion (headless CLI, not an SDK loop), worked out the sandbox/egress/budget machinery in detail, and auto-merged safe bumps — the posture DREW adopts. Their work is mined into §2e and §2f and catalogued in §3.10. Like them, DREW bills a metered API key (the Max-subscription path DREW first chose lost its flat-rate edge once Anthropic began metering headless subscription use — §3.1). Where DREW diverges: it rides the harness (not a raw loop), runs on a Mac mini (they assumed EC2), and adds the coding-agent ability to fix the breaking changes a bump introduces rather than only landing already-compatible versions.
§2b The harness decision
Given §2a, the real question is not "raw SDK vs harness" (the harness wins for a coding agent) but which form of the harness, and how to wrap it. Three candidates:
| Approach | Coding loop | Isolation | Billing / ToS |
|---|---|---|---|
| Raw Anthropic SDK (BARD's way) | You build it all by hand | In-process | API key only |
| Claude Code CLI (claude -p) | Built-in | Subprocess — fresh, hard-killable, resource-capped | API key or Claude subscription; both are ToS-clean |
| Claude Agent SDK | Built-in (same engine) | In-process (a hang risks the supervisor) | API key only — subscription/OAuth tokens are a ToS violation |
Two findings dominate, and they both point the same way for DREW:
- Billing no longer forces the hand — isolation does. The original plan rode a Claude Max subscription, which is ToS-clean only with the CLI (not the Agent SDK), making auth the deciding fork. But as of Anthropic's 2026-06-15 change, subscription headless use (claude -p / Agent SDK) draws from a monthly Agent-SDK credit metered at the same per-MTok rates as the API — the flat-rate edge is gone (§3.1). DREW therefore runs on a metered API key, which is ToS-clean with both the CLI and the Agent SDK. So the choice falls to finding #2 — subprocess isolation — plus the API key being a swappable header credential the egress proxy can rewrite (§2e). Decided: API key → CLI on isolation grounds.
- For a security-sensitive autonomous agent, subprocess isolation is a feature. The Agent SDK's edge is in-process control (dynamic permission callbacks, typed messages). But what DREW most needs is to spawn each remediation as a throwaway, network-jailed, resource-capped, hard-killable unit and tear it down — so the supervisor stays clean and can docker kill a runaway. That argues for the CLI-as-subprocess inside a fresh container. The "dynamic permission" capability is recovered with PreToolUse hooks (which run in headless mode) plus a scoped GitHub token, so the in-process callback isn't needed.
Decision: the Claude Code CLI in headless mode (claude -p --output-format stream-json), one fresh sandboxed container per remediation, on a metered Anthropic API key. The raw SDK is rejected (rebuilds the loop); the Agent SDK is now ToS-viable on the API key but still held in reserve — it trades away the subprocess isolation the security model leans on, so it's only worth it if DREW ever needs tight in-process orchestration.
§2c Architecture: supervisor / worker
The same split BARD and NATE already use — a cheap, long-lived service that does the constrained routing, and an expensive, disposable brain that does the open-ended work. The supervisor is dumb and never dies; the worker is smart and always dies.
Dependabot alerts API ──(poll, every N min)──┐
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ DREW SUPERVISOR (long-lived Python service, like `bard slackbot`) │
│ • poll open alerts; rank by severity → CVSS → age │
│ • THROTTLE: open DREW deps PRs at cap? → wait this cycle │
│ • Redis: per-alert lock, concurrency gate = 1, wall-clock timers │
│ • dedup: alert already has a drew/deps/* branch → skip │
│ • spawns ONE worker container for the top alert group │
│ • relays PR link + merged/parked/bailed status to #drew-audit │
└───────────────────────────────┬──────────────────────────────────────────┘
│ docker run --rm (fresh, jailed)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ DREW WORKER (ephemeral container = `claude -p` harness) │
│ fresh shallow clone of adx @ main │
│ bump → just rs/format → cargo check → fix call-site breakage │
│ deterministic verifier → gh pr create → arm auto-merge → /shepherd │
│ exits with DREW_STATUS: MERGED | PARKED | BAILED │
│ tools: Read/Grep/Glob, Bash(git/cargo/just/gh), Edit │
│ egress: allowlist proxy (Anthropic · GitHub · crates.io/npm) │
│ GitHub App token: contents+PR write, merge via branch protection │
└──────────────────────────────────────────────────────────────────────────┘
The supervisor holds all durable state and all secrets; the worker
receives a narrowly-scoped slice for the duration of one remediation and
is destroyed (--rm) afterward. The worker also decides its own stop
— /shepherd has no natural end, so the worker exits with a parseable
DREW_STATUS line (MERGED / PARKED / BAILED) and that exit is the event
the supervisor reacts to; on PARKED it schedules drew resume, which
re-attaches to the parked PR's branch and picks the loop back up
in medias res. The throttle, concurrency gate, timers, and dedup live in
Redis in prod, following BARD's guard/ module.
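The hand-off contract described above can be parsed with a small helper. A sketch under assumptions: the status line format `DREW_STATUS: <VALUE>` is taken from the text, while taking the last match and treating a missing status as an escalation are illustrative choices, and the action names are hypothetical.

```python
import re

# One status per line; trailing spaces tolerated, anything else rejected.
STATUS_RE = re.compile(
    r"^DREW_STATUS:[ \t]*(MERGED|PARKED|BAILED)[ \t]*$", re.MULTILINE
)


def parse_status(worker_output):
    """Return the last DREW_STATUS value in the worker's output, or None."""
    matches = STATUS_RE.findall(worker_output)
    return matches[-1] if matches else None


def next_action(status):
    """Map the worker's exit status to what the supervisor does next."""
    actions = {
        "MERGED": "report",
        "PARKED": "schedule_resume",   # re-attach later via drew resume
        "BAILED": "escalate",
    }
    # Assumption: no parseable status (hang, crash, kill) is handled like a bail.
    return actions.get(status, "escalate")
```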
§2d The alert → merged-PR pipeline
One alert group, one worker, one PR — the supervisor has already picked the highest-severity open alert and confirmed the open-PR throttle has room. The worker's job is to turn that alert into a merged, CI-green bump. The agent's leverage is steps 4 and 6 — fixing the breakage a bump introduces, and driving CI to green; a script handles the rest.
| # | Step | What happens |
|---|---|---|
| 1 | Resolve | Read the alert: ecosystem / package.name / manifest_path / severity and security_vulnerability.first_patched_version. Decide direct vs transitive dependency; the target version is the advisory's first patched version (§3.11), not the newest release. |
| 2 | Branch | drew/deps/<crate>-<ver> off main in the fresh clone. |
| 3 | Bump | Direct → edit Cargo.toml then cargo update -p <crate> --precise <ver>; transitive → cargo update -p <crate> --precise <ver> (lockfile only). just rs/format. |
| 4 | Build & fix | cargo check (warm cache, §2f). If the bump broke call sites, edit source until it compiles — the coding-agent step. Bumps needing real redesign hit the ceiling → bail & escalate. |
| 5 | Verify | (deterministic, out-of-agent) diff touches only the manifest/lockfile + plausibly-related source; no unexpected new dependency or secret; cargo check passed. Fail → bail & escalate. |
| 6 | PR & shepherd | gh pr create (ready, repo-convention body), arm auto-merge, then /shepherd: watch CI, fix failures (≤ N attempts), push, repeat — never sit polling pending checks or a reviewer. Exit MERGED / PARKED / BAILED. |
| 7 | Report | One line to #drew-audit: alert, CVE/GHSA, bump, PR link, and merged / parked / escalated — needs a human. |
If cargo check won't come clean within budget, DREW leaves the PR open with a comment explaining how far it got and escalates, rather than forcing a bad merge.
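The diff-scope part of the Verify step (step 5) could start as a path filter like the one below. It is a rough sketch: "plausibly-related source" is approximated as any `.rs` file, which is looser than a real verifier would want, and the file-name sets and function name are assumptions. Checks for new dependencies and secrets are not covered.

```python
from pathlib import PurePosixPath

# Files a dependency bump is expected to touch (illustrative set).
MANIFEST_NAMES = {"Cargo.toml", "Cargo.lock", "package.json", "package-lock.json"}
# Source files eligible for call-site fixes (approximation of "plausibly related").
SOURCE_SUFFIXES = {".rs"}


def scope_violations(changed_paths, allow_source_fixes=True):
    """Return the changed paths that fall outside the bounded change class."""
    violations = []
    for path in changed_paths:
        parsed = PurePosixPath(path)
        if parsed.name in MANIFEST_NAMES:
            continue
        if allow_source_fixes and parsed.suffix in SOURCE_SUFFIXES:
            continue
        violations.append(path)
    return violations
```

An empty result means the diff stays in class; any entry is grounds to bail and escalate. Setting `allow_source_fixes=False` gives the stricter lockfile-only check.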
§2e Security model & blast radius
An autonomous agent with shell + gh write access to a trading-system repo that merges its own PRs is a serious blast-radius question. The injection surface is far smaller than the incident lane — DREW reads dependency metadata and advisories, not untrusted prod logs — but the merge authority is larger. The defenses are layered so that no single failure is catastrophic.
| Threat | Control |
|---|---|
| Exfiltration / C2 over the network | Network-layer egress allowlist via iron-proxy (lifted from the CARL/GOPHER RFCs) — the worker's only route out is a default-deny MITM proxy permitting Anthropic, GitHub, and the package registries (static.crates.io + index, registry.npmjs.org) and nothing else; it TLS-terminates, so rules can be path-scoped (e.g. only /repos/architect-xyz/ax/* on api.github.com), and it filters MCP tools/list so denied tools never reach the agent. Built and mandatory in the prototype (py/drew/egress/) — ships in warn mode to learn the long tail, then enforce. The harness sandbox's hostname filter is defense-in-depth, not the wall. |
| Unreviewed code landing on main | DREW does merge — so safety is scope + gate, not a human on each PR. Bounded change class (dependency manifest/lockfile + minimal call-site fixes), full CI as the gate (not best-effort cargo check), the deterministic out-of-agent verifier before merge (§2d), and the open-PR throttle so at most one unattended merge is ever in flight — trivially git revert-able. Merge is gated by branch protection (required status checks), not granted by the token. Phased rollout (supervised live runs → patch bumps → wider) caps early exposure. |
| Malicious / hijacked upstream package (supply chain) | Pin to the advisory's first_patched_version, never "latest" — DREW won't pull an unrelated newer release. The bumped code only ever executes inside CI's own sandbox, never on the worker host; the worker's egress jail means a hostile crate's build script can't phone home from DREW's box either. New transitive dependencies introduced by a bump are flagged by the verifier — and are a first-class check in the Adjudicator's review pass (§2g). |
| Prompt injection via advisory / package metadata | Smaller surface than the incident lane, but advisory text and changelogs are still untrusted input to a shell-capable agent. Contained by: the bounded token (no workflows/admin), the egress jail, the ephemeral clone (can't reach other repos or secrets), a system-prompt rule to treat all fetched data as evidence, never instructions, and iron-proxy credential-swap — the worker holds only opaque proxy tokens that iron-proxy rewrites to the real GitHub token and Anthropic API key at egress (both are swappable header credentials, with require: true), so an exfiltrated env yields tokens useless outside the sandbox. |
| Runaway loop / slow CI | Wall-clock timeout via docker kill plus a cap on shepherd fix-attempts (§3.6) — though the normal stop is the worker's own DREW_STATUS exit, with the timeout as backstop for a hang. The open-PR throttle already means there is never more than a single worker. Billing is a metered API key, so USD is a real per-remediation cost — DREW meters its own estimate + realized spend (§3.6) — but wall-clock and the attempt cap, not dollars, remain the runaway controls. |
| Container escape / host pivot | Hardened worker container (CARL §4 recipe): --rm, non-root, drop all caps + --security-opt no-new-privileges, --init, cpu/mem limits, named container torn down on Ctrl-C, dedicated sandbox bridge only (no metadata service, no Tailscale, no Redis). Still to land: read-only root fs, tmpfs /tmp·/home·/workspace with noexec, no docker.sock audit. Creds injected via --env-file that the supervisor unlinks after start — never baked into the image. |
| Agent going off the rails (degraded loop) | Context-window monitor (GOPHER §9.5, reframed): parse stream-json usage, track peak context-%, and hard-kill at 50% of the model's window to start (the conservative edge of RULER's ~50–65% usable band — §3.6). The supervised-live phase (phase 4) also logs context-% vs. outcome so the 50% starting point can be tuned from DREW's own data rather than GOPHER's assumed 33%. |
| Lingering state / lack of audit trail | Fresh --rm container + fresh clone per remediation; nothing persists. Every tool call (PostToolUse hook) plus the gzipped stream-json transcript lands under .drew-audit/ today, S3 + a one-line #drew-audit notice in prod — alert, bump, PR link, merged/parked/escalated — so every merge is visible. (No /drew pause kill switch — the timers and open-PR throttle are sufficient containment.) |
On the earlier subscription path, the Anthropic credential was an OAuth login kept on disk (~/.claude), which isn't a swappable header key — so credential-swap could only cover the GitHub token, and the Anthropic side fell back to jailing egress to api.anthropic.com plus guarding ~/.claude. Moving to a metered ANTHROPIC_API_KEY (§3.1) closes that gap: like CARL/GOPHER, iron-proxy now swaps the x-api-key header too, so the worker holds opaque proxy tokens for both credentials and an exfiltrated env is useless outside the sandbox.
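The context-window monitor from the threat table can be sketched as a scan over the worker's stream-json lines. The assumption to check against real output is that token usage appears under `message.usage` with the Anthropic API's field names; the window size is passed in rather than guessed, and the function names are hypothetical.

```python
import json

# Input-side token counters that together approximate context occupancy.
CONTEXT_FIELDS = (
    "input_tokens",
    "cache_read_input_tokens",
    "cache_creation_input_tokens",
)


def peak_context_fraction(stream_lines, window_tokens):
    """Highest observed share of the context window across all events."""
    peak = 0.0
    for line in stream_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or non-JSON lines
        message = event.get("message") if isinstance(event, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        used = sum(usage.get(field) or 0 for field in CONTEXT_FIELDS)
        peak = max(peak, used / window_tokens)
    return peak


def should_kill(stream_lines, window_tokens, threshold=0.50):
    """True once peak context use reaches the kill threshold (50% to start)."""
    return peak_context_fraction(stream_lines, window_tokens) >= threshold
```

A production monitor would evaluate this incrementally as lines arrive so that `docker kill` can fire mid-run; the batch form here only shows the arithmetic.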
§2f Deployment on the Mac mini
The deployment borrows ax-bard's software story (Tailscale + AWS Secrets Manager + Docker) but on a single Mac mini (M4, 10-core, 16 GB, 512 GB) rather than EC2 — see the host trade-off in §3.9.
- Runtime: Docker Desktop (or colima). The supervisor is an always-on container; workers are docker run --rm siblings on an isolated bridge network whose default route is the egress proxy.
- Auth: a metered Anthropic API key injected into the worker (claude -p uses it automatically) — from Secrets Manager in prod, a local .env in dev. A swappable header credential, so iron-proxy can credential-swap it at egress like the GitHub token (§3.1, §2e).
- Secrets: loaded from AWS Secrets Manager at supervisor startup (BARD-style) — the GitHub App credentials from which a per-remediation installation token is minted — and handed to a worker only for the life of one remediation.
- Admin access: Tailscale for inbound/management, mirroring ax-bard.
- Lifecycle: the supervisor runs under launchd (Mac-native) or as a restart-always container.
- Build cache: a warm CARGO_HOME registry + shared target/ via sccache, as named volumes mounted read-mostly into the ephemeral worker (GOPHER §7). Doubly central here — every remediation runs cargo update + a rebuild, so a cold cache would dominate the budget. Key it on (rust-toolchain hash, Cargo.lock hash) and accept a rebuild when either moves — note a dependency bump moves Cargo.lock by design, so DREW pays a partial-rebuild on each fix; size the timer for that. Directly relieves the 16 GB pressure below.
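The build-cache key described in the last bullet can be derived by hashing the two inputs separately, so it is visible which half moved. A sketch: SHA-256 and the 16-character truncation are arbitrary choices, and the function name is hypothetical.

```python
import hashlib


def cache_key(toolchain_bytes, lockfile_bytes):
    """Key the build cache on (rust-toolchain hash, Cargo.lock hash)."""
    def short_digest(data):
        return hashlib.sha256(data).hexdigest()[:16]

    return f"{short_digest(toolchain_bytes)}-{short_digest(lockfile_bytes)}"
```

Because a dependency bump changes `Cargo.lock` by design, only the second half of the key moves on a typical remediation, while a toolchain upgrade changes the first.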
… cargo check, and shell out to git/gh. The M4's 10 cores handle that comfortably, and its Docker-Linux-VM gives a clean place to put the egress jail.
… cargo check on the ax workspace inside the Docker Linux VM (give the VM ~12 GB, leaving ~4 GB for macOS). One worker fits; two concurrent would thrash. Two mitigations make 16 GB sufficient for v1: (1) the open-PR throttle already pins concurrency at one, so a second worker never starts; (2) CI is the real compile authority — the PR runs full CI on push and is the merge gate, so DREW's local cargo check is a best-effort pre-flight that fails fast before opening a doomed PR. Step up to the 24 GB model only if you later want parallel workers or fuller local builds.
§2g The Adjudicator direction — moving the bottleneck, not hiding it
DREW solved "who does the mechanical work" and immediately exposed the next constraint: who reviews it. Every parked DREW PR waits on the same scarce resource the bot was built to relieve. Two bad answers bracket the design space: delete the review gate (unacceptable — branch protection's required review is doing real work) and keep a human on every bump forever (the bottleneck DREW exists to remove).
The adjudicator is the third answer: a second, independent agent whose only job is review. Three design commitments, in priority order:
- Defer-by-default. The adjudicator's contract is asymmetric: a wrong defer costs one human review (the status quo); a wrong approve lands unreviewed code on main. So the thresholds are deliberately one-sided — it approves only when every check passes (bounded diff class, deterministic verifier, full CI green, zero review findings, calibrated false-pass rate below target), and any doubt — a finding it can't dismiss, a diff outside the class, a confidence wobble — produces an explicit "deferred to human" comment. It must never be cheaper for the adjudicator to approve than to defer.
- Separation of duties. The adjudicator is a different GitHub App identity (ax-adjudicator) from the author bot — GitHub already refuses self-approval, and the security property is worth stating: the agent that wrote the diff and the agent that judges it share no token, no container, no transcript (§3.14). They can disagree; that disagreement is signal, posted to #drew-audit.
- Calibration before authority. Like the trunk's dry-run → supervised → lights-out ladder, the adjudicator ships as a shadow reviewer (A1) whose verdicts are scored against human outcomes. The promotion criterion is a measured false-pass rate on a meaningful sample — not a vibe that the comments look smart.
Scope grows the same way trust did on the trunk: DREW's own drew/deps/*
PRs first (the most mechanical, best-understood class — and the one whose
author's bounded scope the adjudicator can verify deterministically),
then other mechanical classes (A3). Human feature work stays out of
scope indefinitely — the adjudicator clears the obvious so humans can
spend review where judgment is actually needed.
§2h The CI/CD direction — ambient leverage
The second growth direction points DREW at the pipelines themselves:
rust-test and rust-clippy sit on every PR's critical path, so minutes
shaved there compound across every developer, every day — including
DREW's own shepherd loop, which waits on the same CI to go green. The
lane is deliberately shaped like the trunk's trust ladder:
- Measure first (C1). Read-only telemetry from the Actions API the App can already reach. No optimization is proposed without a baseline and a hotspot ranking — the failure mode of "CI tuning" is cargo-culted cache tweaks that help nothing and break subtly.
- Coverage is the invariant (C2). Every optimization PR must carry a machine-checkable proof that the executed-test set and enabled-lint set did not shrink (§3.18). "Faster because it does less" is a regression wearing a speedup's clothes — the proof is what keeps the lane honest, and it's deterministic, so it can live in CI itself.
- Humans merge pipeline changes, structurally. Workflow definitions gate everything else in the repo, including DREW's own merge path, so this lane never gets auto-merge. Better: the ax-drew App deliberately lacks Workflows: write, so the cage cannot push a workflow edit even if the agent tries. How C2's PRs get authored at all is an open permissions question (§3.17).
- Then ratchet (C3). One-off wins erode; the durable value is the watchdog that notices main's wall-clock regressing, bisects to the commit, and files the issue with evidence. That's the "monitor" half of the mission, and it runs forever.
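One way the C3 watchdog could flag a regression, assuming per-commit wall-clock timings for main are already available from the C1 telemetry. The find_regression name, the baseline window of 5 runs, and the 15% tolerance are illustrative assumptions. Note that this is a linear scan over recorded timings; a real bisect is only needed when some commits have no timing data.

```python
from statistics import median
from typing import List, Optional, Tuple


def find_regression(
    runs: List[Tuple[str, float]],
    baseline_n: int = 5,
    tolerance: float = 0.15,
) -> Optional[str]:
    """Return the first commit of a sustained slowdown, or None.

    runs: (commit_sha, wall_clock_seconds) pairs in commit order on main.
    The baseline is the median of the first baseline_n runs. A commit is a
    suspect if it and every later run exceed baseline * (1 + tolerance).
    """
    if len(runs) <= baseline_n:
        return None
    baseline = median(sec for _, sec in runs[:baseline_n])
    limit = baseline * (1 + tolerance)
    suspect = None
    for sha, sec in runs[baseline_n:]:
        if sec > limit:
            suspect = suspect or sha  # keep the earliest slow commit
        else:
            suspect = None  # recovered: treat the earlier spike as noise
    return suspect
```

Requiring the slowdown to persist to the latest run keeps a single noisy job from filing an issue.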
§3Design Questions
- Auth / billing: subscription or metered API key?
  - Answered (affirmative): Metered API key. The earlier choice (a Max subscription, for flat-rate cost predictability) was overtaken by Anthropic's 2026-06-15 change: subscription headless use (claude -p / Agent SDK) now draws from a monthly Agent-SDK credit metered at the same per-MTok rates as the API, so the flat-rate edge is gone. With that gone, the API key wins on three counts: it's simpler to operate (no claude setup-token, no 1-year OAuth expiry, no macOS-Keychain→Linux bridge); it's a swappable header credential, so iron-proxy can credential-swap it at egress exactly like the GitHub token (§2e), closing a gap the OAuth path left open; and it's ToS-clean with both the CLI and the Agent SDK, so billing no longer dictates the harness form (the CLI is chosen on subprocess-isolation grounds, §2b). Metered means real per-remediation cost, so DREW reports its own estimate + realized spend (§3.6), BARD-style.
- Autonomy: how far does DREW go unsupervised?
  - Answered (affirmative): Shepherd → auto-merge, earned in stages. DREW opens a ready PR, arms auto-merge, and shepherds CI to green; no human drives the mechanics. Today (phase 4) two human gates remain: the operator triggers each run, and branch protection's required review approves each merge. Phase 5 removes the trigger (scheduled supervisor); the required-review gate is resolved either by checks-only protection on the drew/deps/* class or by the Adjudicator track supplying the approving review under its conservative thresholds (§3.12). Defensible because the change class is bounded and mechanical, full CI is the gate, the deterministic verifier backstops it, and the open-PR throttle caps a bad merge at a single git revert (§2e). The incident-triage track keeps the opposite posture (draft-only, human-gated ✅, never auto-merged) because its blast radius and injection surface are far larger.
- Egress enforcement: harness sandbox, or network-layer proxy?
  - Answered (affirmative): iron-proxy at the container layer is the real boundary (the harness sandbox is defense-in-depth only: it's hostname-based and doesn't terminate TLS). Chosen over hand-rolled squid/mitmproxy because it ships default-deny allowlisting, TLS-terminating path-scoped rules, MCP tools/list filtering, and boundary credential-swap out of the box, and because both CARL and GOPHER already converged on it. Built and mandatory in the prototype (py/drew/egress/; doctor/remediate refuse to run without it), currently in warn mode. Allowlist for the Dependabot lane: Anthropic, GitHub, and the package registries (static.crates.io + index, registry.npmjs.org); no incident.io/Sentry/ClickHouse (those return for the triage track). Pin a tagged release; flip warn → enforce after a clean run (§2e).
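The default-deny rule and the warn → enforce switch can be illustrated in a few lines. This is a sketch of the decision logic only, not iron-proxy's configuration format or API. The concrete hostnames for Anthropic, GitHub, and the crates.io index are assumptions, since the text names only static.crates.io and registry.npmjs.org.

```python
ALLOWED_HOSTS = frozenset({
    "api.anthropic.com",   # assumed hostname for "Anthropic"
    "api.github.com",      # assumed hostnames for "GitHub"
    "github.com",
    "static.crates.io",    # named in the allowlist
    "index.crates.io",     # assumed hostname for the crates.io index
    "registry.npmjs.org",  # named in the allowlist
})


def decide(host: str, mode: str = "warn") -> str:
    """Default-deny egress decision.

    Returns "allow" for allowlisted hosts. For anything else, returns
    "warn" (let through, but log) in warn mode and "deny" in enforce mode.
    """
    if mode not in ("warn", "enforce"):
        raise ValueError(f"unknown mode: {mode!r}")
    if host.lower().rstrip(".") in ALLOWED_HOSTS:
        return "allow"
    return "warn" if mode == "warn" else "deny"
```

A clean run in warn mode is one that produces no "warn" decisions, which is the signal for flipping to enforce.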
- GitHub credential: fine-grained PAT or a GitHub App?
  - Answered (affirmative): GitHub App ax-drew (CARL §3 gives the template), built; DREW opens PRs as app/afintech-drew, never as the operator. Org-installed but repo-filtered to architect-xyz/ax. Minimal repo permissions: Contents: write, Pull requests: write, Checks: read, Commit statuses: read + Actions: read (the shepherd must read the status rollup and failing workflow-job logs; without these, gh pr checks / gh run view --log-failed return 403 "Resource not accessible by integration"; both are read, distinct from the withheld Actions/Workflows: write), Dependabot alerts: read (the trigger feed), Issues: write (escalation comments), Metadata: read. Explicitly not granted: Administration, Workflows, Actions: write, Members, Secrets. Merge needs no extra permission; it is gated by branch protection, not the token (§3.12). Withholding Workflows means GitHub-Actions-ecosystem bumps (which edit .github/) are out of scope for v1 (cargo/npm only), and it constrains the CI/CD track (§3.17). The supervisor mints a ~1h installation token per remediation (never long-lived in the worker); enable App GPG-signed "Verified" commits in prod.
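A permission set like this is easy to drift from, so a small audit against the intended grants may be worth having. The sketch below compares a permissions mapping (for example, the permissions object GitHub reports for an App installation) against the list above. The snake_case keys are believed to match GitHub's permission names, with vulnerability_alerts standing for Dependabot alerts, but treat them as assumptions to verify. The audit_permissions helper is hypothetical.

```python
from typing import Dict, List

LEVELS = {"read": 1, "write": 2}

# Intended grants for ax-drew, transcribed from the list above.
EXPECTED = {
    "contents": "write",
    "pull_requests": "write",
    "checks": "read",
    "statuses": "read",
    "actions": "read",
    "vulnerability_alerts": "read",  # Dependabot alerts (assumed key name)
    "issues": "write",
    "metadata": "read",
}

# Permissions that must never appear, at any level.
WITHHELD = frozenset({"administration", "workflows", "members", "secrets"})


def audit_permissions(actual: Dict[str, str]) -> List[str]:
    """Return human-readable violations; an empty list means the set is clean."""
    problems = []
    for name, level in sorted(actual.items()):
        if name in WITHHELD:
            problems.append(f"{name}: must not be granted")
        elif name not in EXPECTED:
            problems.append(f"{name}: unexpected permission")
        elif LEVELS.get(level, 99) > LEVELS[EXPECTED[name]]:
            problems.append(f"{name}: {level} exceeds {EXPECTED[name]}")
    return problems
```

The check flags over-grants only; a missing read permission shows up on its own as the 403 described above.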
- Ingest mechanism: poll the alerts API, or a webhook?
  - Answered (affirmative): Poll the Dependabot alerts API (GET /repos/architect-xyz/ax/dependabot/alerts?state=open) on an interval and rank the result. Polling needs no inbound endpoint (outbound only, so friendly to the egress jail), is naturally idempotent against the open-PR throttle, and a few-minutes lag is irrelevant for vulnerability remediation. The dependabot_alert webhook is a latency optimization for later, not v1. #drew-audit is output-only: no Slack trigger in this lane (the Socket-Mode listener returns for the triage track).
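The "poll, then rank" step might look like the sketch below, which works on already-fetched alert objects so it stays independent of any HTTP client. It assumes the alert JSON carries state, number, and security_advisory.severity, as the Dependabot alerts API does. The pick_next_alert name and the tie-break on lowest alert number are illustrative choices, not the supervisor's documented behavior.

```python
from typing import Any, Dict, List, Optional

# Lower rank sorts first; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def pick_next_alert(
    alerts: List[Dict[str, Any]], open_drew_prs: int
) -> Optional[Dict[str, Any]]:
    """Pick the highest-severity open alert, or None if throttled or idle.

    open_drew_prs is the number of DREW-authored PRs currently open; the
    one-open-PR throttle means any value above zero blocks a new pick.
    """
    if open_drew_prs > 0:
        return None
    open_alerts = [a for a in alerts if a.get("state") == "open"]
    if not open_alerts:
        return None

    def sort_key(alert: Dict[str, Any]):
        severity = alert["security_advisory"]["severity"]
        return (SEVERITY_RANK.get(severity, len(SEVERITY_RANK)), alert["number"])

    return min(open_alerts, key=sort_key)
```

Because the function is a pure ranking over the current poll result, running it twice on the same state picks the same alert, which is what makes polling idempotent against the throttle.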
- Per-remediation budget: what are the caps?
  - Answered (affirmative): Meter + report; wall-clock + attempt cap. On a metered API key, dollars are a real per-remediation cost, so DREW meters itself like BARD: the supervisor logs an up-front estimate and the realized per-MTok spend (parsed from the worker's stream-json usage) for every remediation, persists both to the audit trail, and calibrates the estimate from realized data. That's reporting, not a hard cap. The actual runaway guards stay time-based: a per-remediation wall-clock timeout enforced by docker kill (now a backstop for a hung worker; the normal stop is the worker's own DREW_STATUS exit) and a cap on shepherd fix-attempts (≤ N CI-failure → fix → push cycles). Exhausting either bails to "escalated — needs a human" with the PR left open. The open-PR throttle already pins concurrency at one. A third lever, context-window degradation, is real but uncalibrated: research shows quality drops well before the window fills (RULER finds only ~50–65% of advertised context is reliably usable for multi-hop work; Chroma Context Rot and "Context Length Alone Hurts…" show 14–85% degradation as input grows even with perfect retrieval), but there is no published agentic-coding threshold and GOPHER's "33%" was an internal guess. So DREW starts with a hard-kill at 50% of the window (generous for a bump task, so it fires only on genuine runaways) and logs peak context-% vs. outcome in the supervised phase (4) to tune that number later.
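The two numeric pieces here, realized spend and the context hard-kill, reduce to simple arithmetic. The sketch below assumes a usage record with input_tokens and output_tokens counts and takes the per-MTok rates as parameters rather than hard-coding any price. Function names are hypothetical, and cache-read or cache-write token classes are ignored for brevity.

```python
from typing import Dict


def realized_spend_usd(
    usage: Dict[str, int], usd_per_mtok_in: float, usd_per_mtok_out: float
) -> float:
    """Dollar cost of one remediation from token counts and per-MTok rates."""
    cost = (
        usage["input_tokens"] * usd_per_mtok_in
        + usage["output_tokens"] * usd_per_mtok_out
    )
    return cost / 1_000_000


def should_hard_kill(
    context_tokens: int, window_tokens: int, threshold: float = 0.50
) -> bool:
    """True once the context fills to the threshold fraction of the window."""
    return context_tokens / window_tokens >= threshold
```

With illustrative rates of 3 and 15 dollars per MTok, a run that used 200,000 input and 10,000 output tokens costs 0.60 + 0.15 = 0.75 dollars. Logging the peak context fraction alongside that figure gives the supervised phase the data it needs to tune the 50% threshold.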
- Repo working copy: fresh clone per remediation, or a warm cache?
  - Answered (affirmative): Fresh shallow clone per remediation, destroyed with the container. Ephemerality is a security property (no cross-run state, no lingering secrets). If clone latency becomes a problem, a read-only reference cache (git clone --reference) is an optimization that preserves the property, but default to fresh. Note this is the source clone; the warm CARGO_HOME/target build cache (§2f) is a separate, read-mostly volume.
- Scope: which bumps does DREW take, and in what order?
  - Answered (affirmative): Widen by risk class. Unattended merge authority lands on the safest class first: lockfile-only / patch-level bumps with no manifest range change and no source edits (phase 5.1), then widens to minor, then major / breaking-change fixes (5.2) as the track record earns it. Ecosystem scope is cargo + npm; GitHub-Actions/workflow bumps are excluded (they need Workflows: write, §3.4). The supervised phase (4) attempts every class from the start since a human approves each merge; the live record so far spans npm and cargo bumps.
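A risk-class gate along these lines could be sketched as follows. It handles plain numeric MAJOR.MINOR.PATCH strings only (pre-release tags are not parsed), and the function names are hypothetical. One caveat the sketch does not encode: for 0.x crates cargo treats a minor bump as potentially breaking, so a real classifier would likely need to treat those as major.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse "1.2.3" (or "1.2") into a padded (major, minor, patch) tuple."""
    nums = [int(part) for part in version.split(".")[:3]]
    nums += [0] * (3 - len(nums))
    return (nums[0], nums[1], nums[2])


def bump_class(current: str, target: str) -> str:
    """Classify an upgrade as "patch", "minor", or "major"."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt <= cur:
        raise ValueError(f"{target} is not an upgrade from {current}")
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    return "patch"


def eligible_for_phase_5_1(
    current: str, target: str, manifest_range_changed: bool, source_edited: bool
) -> bool:
    """Phase 5.1: patch-level, no manifest range change, no source edits."""
    return (
        bump_class(current, target) == "patch"
        and not manifest_range_changed
        and not source_edited
    )
```

Anything that fails the gate is not rejected, only routed to the supervised path where a human approves the merge.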
- Host: Mac mini or EC2 (like BARD)?
  - Answered (affirmative): A single Mac mini (M4, 16 GB). EC2 was evaluated and would mirror ax-bard exactly (Tailscale + Secrets Manager + VPC egress jail + AMI workflow), but the dollar spread is small and the Mac mini wins on flat capex (~$799 one-time vs ~$256/mo for an always-on m7g.2xlarge), M4 build speed, and full ownership. EC2 Mac (mac2.metal) is ruled out: no macOS need and ~$474/mo with a 24h host minimum. Accepted trade-offs: ops/uptime are on us (a physical SPOF), and the egress jail is built Linux-side in the Docker VM rather than VPC-native. The 16 GB cargo check concern is handled by the one-at-a-time throttle + CI-as-compile-authority (§2f).
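For reference, the capex comparison works out as below, using only the approximate figures quoted above. Power, replacement hardware, and operator time are deliberately left out, so this is a lower bound on the Mac mini's real cost rather than a full comparison.

```python
MAC_MINI_CAPEX_USD = 799   # approximate one-time cost quoted above
EC2_MONTHLY_USD = 256      # approximate always-on m7g.2xlarge cost quoted above


def break_even_months(capex_usd: float, monthly_usd: float) -> float:
    """Months of EC2 spend that equal the one-time hardware cost."""
    return capex_usd / monthly_usd


months = break_even_months(MAC_MINI_CAPEX_USD, EC2_MONTHLY_USD)
print(f"break-even after about {months:.1f} months")
```

On these numbers the hardware pays for itself in a little over three months of always-on EC2.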
- Prior art: what's worth lifting from CARL & GOPHER?
  - Answered (affirmative): Mined. The two closed Dependabot-bot RFCs (§2a) targeted exactly DREW's v1 job, so the overlap is near-total.
    - Adopted: Dependabot-alert polling + severity ranking; auto-merge of green bumps; iron-proxy egress + credential-swap + MCP tool-filtering (§2e); the container-hardening recipe (read-only fs, tmpfs noexec, no docker.sock, drop-caps, env-file-then-unlink); the GitHub App minimal-permission set + signed commits (§3.4); the context-window monitor, reframed from GOPHER's 33% to a 50% hard-kill, tuned from live data (§3.6); warm cargo/sccache keyed on toolchain+lock hash (§2f); S3 transcripts + #drew-audit; and metered-API-key billing with per-run self-metering (their BudgetTracker shape), plus the shared guard/ConcurrencyGate.
    - Promoted (was deferred): the deterministic out-of-agent verifier (§2d). With unattended merge there is no human reviewing each PR, so the dumb checker becomes load-bearing rather than a nicety.
    - Dropped: CARL/GOPHER's /drew pause + auto-pause. The timers and open-PR throttle already contain a runaway.
    - Diverged: the harness (not a raw SDK loop), the Mac mini host (not EC2), and the agent's ability to fix breaking changes (they bumped only compatible versions).
    - Worth re-reading: GOPHER §11 "things to flag". The cargo-cache-invalidation and transitive-advisory gotchas apply directly.
- Target version: the advisory's first patched version, or latest?
  - Answered (affirmative): Pin to security_vulnerability.first_patched_version, the minimal bump that clears the advisory, not the newest release. Jumping to "latest" maximizes the surface of unrelated breaking changes (more call-site churn, more CI risk) and can pull in a newer release that itself carries a fresh, not-yet-flagged vulnerability. The minimal bump is the smallest reviewable diff and the most likely to pass CI untouched. If the minimal patched version is yanked or itself unbuildable, step up to the next viable release and note it in the PR body.
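The "minimal bump, step up only if forced" rule can be sketched as a selection over the known releases. The choose_target name and its inputs are hypothetical. Versions are assumed to be plain numeric strings. "Unbuildable" is folded into the same unusable set as yanked releases for simplicity.

```python
from typing import FrozenSet, List, Optional, Tuple


def _key(version: str) -> Tuple[int, ...]:
    """Numeric sort key for plain dotted versions such as "1.2.4"."""
    return tuple(int(part) for part in version.split("."))


def choose_target(
    first_patched: str,
    available: List[str],
    unusable: FrozenSet[str] = frozenset(),
) -> Optional[Tuple[str, bool]]:
    """Pick the lowest usable release at or above the first patched version.

    Returns (target, stepped_up), where stepped_up is True when the target
    is not the advisory's first patched version and the PR body should say
    so. Returns None when no usable patched release exists.
    """
    candidates = [
        v for v in available
        if _key(v) >= _key(first_patched) and v not in unusable
    ]
    if not candidates:
        return None
    target = min(candidates, key=_key)
    return target, target != first_patched
```

Choosing the minimum rather than the maximum of the candidates is the whole policy: the diff stays as small as the advisory allows.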
- Merge gate: how does auto-merge reconcile with branch protection?
  - Not yet answered: Merge is whatever branch protection on main already allows; DREW gets no bypass. The worker arms gh pr merge --auto at PR-open, so the merge completes the moment every required gate passes; where a required human review is configured (today's posture), the PR parks until a human approves, which degrades cleanly to supervised operation rather than failing. The open decision has sharpened into the Adjudicator track: the original options were (a) keep a required reviewer on DREW's PRs (soft human gate, slower) or (b) checks-only protection for the drew/deps/* class (true autonomy, no review at all). The adjudicator offers (c): keep the required-review rule, let a calibrated bot satisfy it under conservative thresholds, defer to a human on any doubt (§3.16). Start with (a), the status quo, and let the adjudicator's shadow record (A1) decide between (b) and (c).
- Native Dependabot PRs: own the lane, or adopt them?
  - Not yet answered: GitHub's own Dependabot security-update PRs would duplicate and race DREW. Two clean options: (a) disable native security-update PRs so DREW is the sole remediator and opens its own drew/deps/* branch (simplest, and DREW's value-add, fixing breaking changes + shepherding to merge, is fully expressed); or (b) DREW adopts the existing Dependabot branch, taking over from where the native bot stops. Leaning (a) for a clean ownership boundary; revisit if the team wants to keep native PRs visible. Either way DREW reads the same alerts API, and the open-PR throttle keys on DREW-authored PRs only.
- Adjudicator identity: can DREW review its own PRs?
  - Answered (affirmative): No. The adjudicator is a separate GitHub App (ax-adjudicator). GitHub refuses self-approval (a PR's author cannot approve it), so the ax-drew identity cannot satisfy a required review on its own PRs even if we wanted it to; the reviewer must be a second identity. The constraint is welcome: it enforces separation of duties (author bot and reviewer bot share no token, no container, no transcript), and their disagreements become auditable signal in #drew-audit rather than an internal contradiction silently resolved (§2g). Permissions for ax-adjudicator: Pull requests: write (submit reviews + comments), Contents: read, Checks/Statuses/Actions: read. No write to contents, no merge, no workflows.
- Adjudicator thresholds: what does "very conservative" mean concretely?
  - Not yet answered: The promotion gate from A1 → A2 needs numbers, not adjectives. Candidate shape: auto-approve only when (1) the diff is in a deterministically recognizable mechanical class (for drew/deps/*: manifest/lockfile + verifier-approved call-site fixes, no new transitive deps unaccounted for); (2) the deterministic verifier passed; (3) full CI green; (4) the adjudicator's own multi-pass review produced zero findings; and (5) the shadow-phase false-pass rate is below a target (e.g. zero approvals-a-human-would-have-blocked over the most recent N ≥ 50 shadow reviews). Open: the value of N, the target rate, whether "stale calibration" (model or pipeline changed) resets the clock, and whether severity-critical bumps always defer regardless of thresholds.
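Condition (5) is the one that needs bookkeeping, so here is one possible shape for the A1 → A2 promotion check. The record format, the promotion_ready name, and the defaults (N = 50, zero tolerated false passes) are assumptions taken from the example above, not decided values.

```python
from typing import List, Tuple


def promotion_ready(
    shadow_reviews: List[Tuple[str, str]],
    n: int = 50,
    max_false_passes: int = 0,
) -> bool:
    """Check the shadow-phase false-pass gate over the most recent n reviews.

    shadow_reviews is chronological; each entry is (bot_verdict, human_outcome)
    with bot_verdict in {"approve", "defer"} and human_outcome in
    {"approved", "blocked"}. A false pass is a bot "approve" on a PR that a
    human blocked. Too small a sample is never enough to promote.
    """
    if len(shadow_reviews) < n:
        return False
    recent = shadow_reviews[-n:]
    false_passes = sum(
        1 for bot, human in recent if bot == "approve" and human == "blocked"
    )
    return false_passes <= max_false_passes
```

A "stale calibration" reset, if adopted, would amount to truncating shadow_reviews at the point where the model or pipeline changed.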
- Adjudicator approval: does a bot review satisfy branch protection, and do we want it to?
  - Not yet answered: Mechanically, a GitHub App with Pull requests: write can submit an approving review, and plain "require 1 approval" branch protection counts it, but rulesets can be configured otherwise (require Code Owners, restrict who counts as a reviewer), and the org's settings need auditing before this is assumed. The policy question is separate: even if it counts, do we want DREW's merges gated on a bot approval (option c in §3.12) versus dropping the review requirement for the bounded class (option b)? (c) preserves an independent second judgment on every merge and keeps one rule for the whole repo; (b) is more honest about where the real gate is (CI + verifier). Decide after A1 produces a false-pass record worth arguing from.
- CI/CD lane permissions: who can edit .github/workflows?
  - Not yet answered: The trunk's security model deliberately withholds Workflows: write from the ax-drew App (§3.4), which means the CI/CD track's optimization PRs (C2) cannot even push a branch that edits .github/workflows/*. Options: (a) a separate, narrowly-granted App (ax-drew-ci) holding Workflows: write, used only by the C2 lane, with every PR human-merged (the write is to a branch; the gate is the merge); (b) DREW drafts the workflow diff as an issue/comment artifact and a human applies and opens the PR (zero new grants, more friction); (c) scope C2 to non-workflow speedups only (cache configs, nextest adoption in justfile, test code) and leave workflow edits to humans guided by C1's telemetry. Leaning (b) to start, (a) if the lane proves out; the trunk's posture that workflow write is radioactive shouldn't be quietly reversed by a side lane.
- CI/CD coverage invariant: how do we prove "same coverage, faster"?
  - Not yet answered: "Faster" is easy to measure; "without losing coverage" is the hard, load-bearing half. Candidate: a deterministic coverage-set diff. Enumerate the executed test set (e.g. cargo nextest list / per-job test manifests) and the enabled lint set (cargo clippy's effective lint table) on main vs the optimization branch, and require the diff to be empty-or-additive as a CI check on every C2 PR. Open questions: flaky-test quarantine interaction (removing a flaky test is a coverage change and must be its own reviewed decision, never a side effect of a speedup), whether sharding changes execution order in ways that mask order-dependent tests, and whether feature-flag matrix reductions count as shrinkage (they do; the matrix is part of the set).
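The "empty-or-additive" requirement is a plain set comparison once the two sets have been enumerated. The sketch below assumes the test and lint identifiers have already been collected into lists of strings for main and for the branch. How they are collected (nextest listing, lint table dump) is out of scope, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def set_diff(main: Iterable[str], branch: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (removed, added) identifiers going from main to the branch."""
    main_set, branch_set = set(main), set(branch)
    return sorted(main_set - branch_set), sorted(branch_set - main_set)


def coverage_invariant_holds(
    main_tests: Iterable[str],
    branch_tests: Iterable[str],
    main_lints: Iterable[str],
    branch_lints: Iterable[str],
) -> Tuple[bool, List[str]]:
    """Pass only if nothing was removed; additions are allowed.

    Returns (ok, removed), where removed lists every test or lint present
    on main but missing on the branch, prefixed by its kind.
    """
    removed_tests, _ = set_diff(main_tests, branch_tests)
    removed_lints, _ = set_diff(main_lints, branch_lints)
    removed = [f"test:{t}" for t in removed_tests] + [f"lint:{l}" for l in removed_lints]
    return (not removed, removed)
```

To cover the feature-flag matrix, each identifier would need to include its matrix cell (for example "features=default::tests::parse"), so that dropping a cell shows up as removed entries rather than passing silently.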