DREW

Dependency Remediation & Engineering Worker — an autonomous remediation agent that polls the repo's Dependabot alerts, picks the highest-severity open one, bumps the vulnerable dependency (fixing any breaking changes the bump introduces), opens a PR, and shepherds it through CI to merge — throttled to one open PR at a time. The trunk is live as a supervised prototype (7 merged bumps); three tracks branch off it: CI/CD pipeline watch, the Adjudicator review bot, and the deferred incident-triage lane.

Status: Prototype live (supervised) — PR #2300
v1 job: Dependabot remediation
Brain: Claude Code harness (headless CLI)
Runs on: Mac mini (M4, 16 GB) · Docker
Track record: 8 PRs opened · 7 merged
Status snapshot: 2026-06-10

§0 Principles

DREW is a coding agent — use the harness, don't rebuild it

BARD is a read-only analyst and hand-rolls a tiny tool loop on the raw Anthropic SDK (py/bard/bard/agent.py). That's right for SQL-only work. DREW must clone a repo, edit manifests, run cargo/just, fix the call-site breakage a version bump introduces, and drive git/gh through CI — which is the Claude Code loop. A plain version-bump bot can't repair breaking changes; the harness can. Reimplementing it on the raw SDK would be rebuilding Claude Code, badly. DREW's brain is the harness; only the orchestration around it is custom. → §2a, §2b

Automating PRs without automating review just moves the bottleneck

A bot that opens PRs but stops there hasn't removed human toil — it has relocated it to the scarcest resource, the reviewer: every parked drew/deps/* PR waits on exactly the person DREW was built to relieve. So the deliverable is the full path to merge. On the trunk, safety comes from scope, not a per-PR human — a mechanical, revertable change class (manifests/lockfiles + minimal call-site fixes), clamped capabilities (repo-scoped App token, network-layer egress allowlist), and merge gated on full CI green plus a deterministic out-of-agent verifier. Where a required human review remains, the Adjudicator track exists to satisfy it under conservative, defer-by-default thresholds. The worst case stays one mechanical, CI-passing bump that a single git revert undoes. → §2g, §2e, §2d

Throttled, ephemeral, auditable

DREW holds one open PR at a time and works the backlog highest-severity-first, so at most one unattended merge is ever in flight. Each remediation runs in a fresh, disposable, hard-killable container with a clean clone — nothing persists — and every tool call + full transcript lands in #drew-audit + S3. Trust is earned in phases: dry-run → supervised live runs → lights-out merge on the safest dependency class, widening from there. → §2c, §1

§1 Progress / Tracker

Live snapshot — phase status and the bar below are computed from GitHub PR state at build time (2026-06-10). 1/12 phases done · 8%.

Done: 1 · In progress: 3 · Not started: 8

[Tracker diagram: phases by track]

TRUNK
  • 1 Supervisor — poll, rank, throttle, guards
  • 2 Worker cage + egress jail + harness wiring
  • 3 Remediation engine — bump + build + fix, dry-run
  • 4 Supervised live runs — operator-triggered, human-gated
  • 5 Lights-out — scheduled supervisor, merge on green

CICD
  • C1 Pipeline telemetry — measure before touching
  • C2 Optimization PRs — human-merged, coverage-proven
  • C3 Regression watchdog — ratchet the wins

ADJUDICATOR
  • A1 Shadow reviewer — advisory comments, no authority
  • A2 Conservative auto-approve — DREW's bounded class only
  • A3 Widen — other mechanical PR classes, defer-by-default

TRIAGE
  • T1 Incident triage — diagnose & draft a fix (future)

Trunk — Dependabot remediation (v1)

The proving ground. Staged so that merge authority is the last thing granted — the supervisor and the cage came first with no write token, the bump engine ran dry, live runs are operator-triggered and human-approved, and only phase 5 grants lights-out merge.

1 Supervisor — poll, rank, throttle, guards · in progress

A supervisor that polls the Dependabot alerts API (GET /repos/architect-xyz/ax/dependabot/alerts?state=open), ranks open alerts highest-severity-first (critical → high → medium → low, tie-broken by CVSS then age, grouped by resolution), and enforces the core throttle: if the open DREW-PR count is at its cap (DREW_MAX_OPEN_PRS, default 1), do nothing this cycle. It dedups (skip alerts that already have a drew/deps/* branch), spawns one worker for the top group, and owns the audit trail. It does not think — it routes.
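The ranking, dedup, and throttle rules above can be sketched as pure functions over alert payloads. This is a minimal sketch under stated assumptions, not the prototype's code: `rank_key`, `branch_for`, and `pick_next` are hypothetical names, the field paths follow the public Dependabot alerts payload (`security_advisory.severity`, `security_advisory.cvss.score`, `created_at`), and grouping by resolution is left out.

```python
from __future__ import annotations

from datetime import datetime

# Severity order from the text: critical first, low last.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def rank_key(alert: dict) -> tuple:
    """Sort key: severity, then higher CVSS first, then older alert first."""
    advisory = alert["security_advisory"]
    severity = SEVERITY_RANK.get(advisory["severity"], len(SEVERITY_RANK))
    cvss = (advisory.get("cvss") or {}).get("score") or 0.0
    created = datetime.fromisoformat(alert["created_at"].replace("Z", "+00:00"))
    return (severity, -cvss, created)


def branch_for(alert: dict) -> str:
    """Hypothetical branch name in the drew/deps/<package>-<version> shape."""
    vuln = alert["security_vulnerability"]
    patched = (vuln.get("first_patched_version") or {}).get("identifier", "unknown")
    return f"drew/deps/{vuln['package']['name']}-{patched}"


def pick_next(
    alerts: list[dict],
    open_drew_prs: int,
    existing_branches: frozenset[str] = frozenset(),
    max_open_prs: int = 1,  # mirrors DREW_MAX_OPEN_PRS, default 1
) -> dict | None:
    """Return the alert to work on this cycle, or None if throttled or empty."""
    if open_drew_prs >= max_open_prs:
        return None  # throttle: at cap, do nothing this cycle
    # Dedup: skip alerts that already have a drew/deps/* branch.
    candidates = [a for a in alerts if branch_for(a) not in existing_branches]
    return min(candidates, key=rank_key, default=None)
```

Keeping the selection a pure function means the routing can be unit-tested without a GitHub token; the polling and the open-PR count come from outside.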

Built (prototype, #2300): the local-first CLI form — just alerts / just next / just remediate / just resume — authenticating as the ax-drew GitHub App (JWT → ~1h installation token, never the operator's gh auth), with artifacts under .drew-audit/. Remaining for prod: the long-lived daemon posture — scheduled polling, Redis guards (per-alert lock, concurrency gate), secrets from AWS Secrets Manager, and the #drew-audit Slack + S3 trail (BARD's guard/ skeleton).

awaiting sign-off: prod daemon posture — scheduled supervisor, Redis guards, Secrets Manager, #drew-audit + S3 trail
  • #2300 DREW prototype — supervisor CLI, ranking, resolution groups, open-PR throttle, App auth
2 Worker cage + egress jail + harness wiring · in progress

The disposable unit: a container that takes a fresh shallow clone and runs claude -p (headless) on a metered Anthropic API key (§3.1). Its only route to the network is iron-proxy — default-deny, permitting api.anthropic.com, github.com/api.github.com, and the package registries a bump needs (static.crates.io + index, registry.npmjs.org) and nothing else. Tool surface constrained via --allowedTools + --permission-mode; PreToolUse blocks dangerous bash, PostToolUse audits every call; output captured as --output-format stream-json.

Built (prototype, #2300): the worker container (non-root, --cap-drop ALL, no-new-privileges, --init, cpu/mem caps, named + torn down on Ctrl-C) and the mandatory egress jail (py/drew/egress/) — doctor/remediate refuse to run unless the jail is up; ships in warn mode. Remaining: flip warn → enforce after a clean run, boundary credential-swap (GitHub token + x-api-key), and read-only rootfs / tmpfs-noexec hardening (§2e).
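The hardening flags listed above map onto a `docker run` argument vector. The sketch below only builds that vector and runs nothing. `worker_argv` and the uid, cpu, and memory defaults are hypothetical; the flags themselves are standard Docker CLI options.

```python
from __future__ import annotations


def worker_argv(
    name: str,
    image: str,
    env_file: str,
    network: str,
    command: list[str],
    cpus: str = "4",      # assumed cap, not taken from the prototype
    memory: str = "8g",   # assumed cap, not taken from the prototype
) -> list[str]:
    """Build the argv for one disposable, hardened worker container."""
    return [
        "docker", "run",
        "--rm",                                   # destroyed after the run
        "--init",                                 # reap zombies, forward signals
        "--name", name,                           # named, so it can be killed
        "--user", "1000:1000",                    # non-root (assumed uid:gid)
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--cpus", cpus,
        "--memory", memory,
        "--network", network,                     # sandbox bridge behind the proxy
        "--env-file", env_file,                   # unlinked by the supervisor after start
        image,
        *command,
    ]
```

Passing an argument list to `subprocess.run` (rather than a shell string) keeps alert-derived values such as the container name out of shell parsing.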

awaiting sign-off: egress jail flipped warn → enforce after a clean remediation
awaiting sign-off: credential-swap live — worker holds only opaque proxy tokens for GitHub + Anthropic
  • #2300 worker container, hardening flags, mandatory iron-proxy egress jail (warn mode)
3 Remediation engine — bump + build + fix, dry-run · done

The core skill, built to run with nothing pushed. Given one alert group, the worker resolves the fix (security_vulnerability.first_patched_version, §3.11), decides direct-vs-transitive, edits the manifest if direct, runs the precise lockfile bump, then just rs/format + cargo check — and, when the bump breaks a call site, edits source until it compiles (the coding-agent value a plain bump bot can't deliver, §2d). Output is a diff + build result — no branch, no PR. just remediate runs exactly this; dry-run is the default everywhere.

  • #2300 dry-run remediation engine — resolve, bump, build, fix; diff + build result, nothing pushed
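The cargo side of the resolve-and-bump step can be sketched as a plan builder. This is a hedged sketch: `BumpPlan` and `plan_cargo_bump` are hypothetical names, the direct-vs-transitive decision is passed in rather than computed, and raising when the advisory carries no patched version is an assumption about how that case would be handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BumpPlan:
    package: str
    target_version: str
    edit_manifest: bool                      # True only for direct dependencies
    commands: tuple[tuple[str, ...], ...]    # run in order, nothing pushed


def plan_cargo_bump(alert: dict, is_direct: bool) -> BumpPlan:
    """Turn one alert into the dry-run command sequence for a cargo bump."""
    vuln = alert["security_vulnerability"]
    patched = vuln.get("first_patched_version")
    if not patched:
        # Assumption: with no patched version there is nothing to pin to.
        raise ValueError("advisory has no first_patched_version")
    name = vuln["package"]["name"]
    version = patched["identifier"]
    commands = (
        ("cargo", "update", "-p", name, "--precise", version),
        ("just", "rs/format"),
        ("cargo", "check"),
    )
    return BumpPlan(name, version, is_direct, commands)
```

The target is always the advisory's first patched version, so the plan never reaches for a newer release than the fix requires.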
4 Supervised live runs — operator-triggered, human-gated · in progress

First live writes, with a human in the loop twice over: every run is operator-triggered (just remediate --live) and every merge passes the repo's required review. The worker opens a ready PR, immediately arms auto-merge (gh pr merge --auto), shepherds only to drive CI green (≤ max_shepherd_attempts), then exits with a parseable hand-off — DREW_STATUS: MERGED | PARKED | BAILED — and the supervisor re-attaches to parked PRs via drew resume, picking the shepherd loop back up in medias res. This phase replaced the planned draft-PR shadow: ready PRs + required review give the same calibration with a real merge path.

The live track record below is the calibration data. Remaining before phase 5: the deterministic out-of-agent verifier — built and validated against these very diffs (it must pass clean bumps and flag scope violations), because in phase 5 that checker, not a human, is the last gate before merge. Also the tuning run for the context-window kill threshold, which starts at 50% (§3.6).

awaiting sign-off: deterministic out-of-agent verifier built + validated against the live diffs (no false-passes)
  • #2305 bump axios to 1.16.0 — first live remediation
  • #2306 bump rustls-webpki to 0.103.13 — first cargo-side bump
  • #2307 bump js-cookie to 3.0.7
  • #2308 bump uuid to 11.1.1
  • #2311 bump protocol-buffers-schema to 3.6.1
  • #2313 bump postcss to 8.5.10
  • #2314 bump esbuild to 0.25.0
  • #2315 bump ws to 8.20.1
5 Lights-out — scheduled supervisor, merge on green · not started

Grant unattended operation: the supervisor runs on a schedule (no human trigger), and merge completes without a per-PR human. Mechanically auto-merge is already armed today — what changes is the gate: either the required-review rule is relaxed for the drew/deps/* class (checks-only branch protection), or the Adjudicator track supplies the approving review under its conservative thresholds (§3.12 — this is now the adjudicator's job to earn). It lands narrow — lockfile-only / patch-level bumps first (5.1), widening to minor, then major / breaking-change fixes (5.2) as the record justifies. Exhaust the shepherd attempts or trip the verifier → bail: leave the PR open, comment why, post to #drew-audit, and move on. Hard prerequisites: the phase-1 daemon posture, the phase-2 enforce-mode jail + credential-swap, and the phase-4 verifier.
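Rolling out by bump class (patch first, then minor, then major) needs a deterministic way to classify a bump. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` versions and the caret-compatibility convention that Cargo and npm share, under which a `0.x` minor change (and a `0.0.x` patch change) counts as potentially breaking. `bump_level` is a hypothetical helper, not part of the prototype.

```python
import re

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def _parse(version: str) -> tuple:
    """Extract (major, minor, patch); pre-release and build tags are ignored."""
    match = _VERSION_RE.match(version.strip())
    if not match:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def bump_level(current: str, target: str) -> str:
    """Classify a bump as 'none', 'patch', 'minor', or 'major'."""
    cur, tgt = _parse(current), _parse(target)
    if cur[0] != tgt[0]:
        return "major"
    if cur[1] != tgt[1]:
        # Under caret rules a 0.x minor change is not guaranteed compatible.
        return "major" if cur[0] == 0 else "minor"
    if cur[2] != tgt[2]:
        # Likewise for 0.0.x, where every release may break.
        return "major" if cur[:2] == (0, 0) else "patch"
    return "none"
```

A supervisor could then admit only `patch` results in 5.1 and widen the accepted set as the record justifies.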

CI/CD — pipeline watch & speed

Monitor and improve the CI/CD pipelines — make rust-test and rust-clippy faster while provably maintaining coverage. Ambient benefit to every developer on every PR. Needs only the cage, the App's existing Actions: read, and the audit trail — not merge authority. Branches from phase 2.

C1 Pipeline telemetry — measure before touching · not started

Read-only first: the supervisor pulls per-workflow / per-job / per-step timings, queue waits, and cache hit rates from the Actions API (Actions: read — the App already has it), persists the series, and builds the picture: where do rust-test and rust-clippy actually spend their time, which steps regressed and when, what's the p50/p95 wall-clock per PR. Deliverable: a trend report in #drew-audit and a ranked hotspot list (cold caches, redundant rebuilds, serial bottlenecks, oversized runners idling). No opinion without data — this phase is the data.
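The p50/p95 picture can be computed from the persisted timings with the standard library. A minimal sketch, assuming the series is already available as `(job_name, seconds)` pairs; `job_percentiles` and `hotspots` are hypothetical names, and fetching from the Actions API is out of scope here.

```python
import statistics
from collections import defaultdict


def job_percentiles(samples):
    """Group (job, seconds) samples by job and compute p50/p95 wall-clock."""
    by_job = defaultdict(list)
    for job, seconds in samples:
        by_job[job].append(seconds)
    report = {}
    for job, durations in by_job.items():
        if len(durations) < 2:
            continue  # not enough data to estimate percentiles
        # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = statistics.quantiles(durations, n=100, method="inclusive")
        report[job] = {"p50": cuts[49], "p95": cuts[94], "runs": len(durations)}
    return report


def hotspots(report):
    """Rank jobs slowest-first by p95."""
    return sorted(report, key=lambda job: report[job]["p95"], reverse=True)
```

Ranking on p95 rather than the mean keeps occasional cold-cache runs visible instead of averaging them away.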

C2 Optimization PRs — human-merged, coverage-proven · not started

DREW proposes pipeline improvements as PRs — cache keying, test sharding / cargo nextest, clippy invocation scope, runner sizing, dependency pre-builds — each carrying before/after timing evidence from C1 and a coverage proof: a deterministic diff showing the executed-test set and enabled-lint set did not shrink (§3.18). Every CI change is human-merged — pipeline definitions gate the whole repo, and the ax-drew App deliberately lacks Workflows: write (§3.17 — open), so this lane cannot even push a workflow edit without an explicit new grant. Speed claims are verified on main after merge, not just on the PR branch.
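The coverage proof reduces to two set differences: nothing that ran before may be missing after. A minimal sketch, assuming the executed-test names and enabled-lint names have already been extracted into collections; `coverage_proof` is a hypothetical helper, and how the names are collected (§3.18) is not shown.

```python
def coverage_proof(tests_before, tests_after, lints_before, lints_after):
    """Report any test or lint present before the change but absent after it."""
    dropped_tests = sorted(set(tests_before) - set(tests_after))
    dropped_lints = sorted(set(lints_before) - set(lints_after))
    return {
        "dropped_tests": dropped_tests,
        "dropped_lints": dropped_lints,
        # The proof holds only if neither set shrank; additions are fine.
        "ok": not dropped_tests and not dropped_lints,
    }
```

Because the check is a plain set comparison, the same inputs always give the same verdict, which is what makes it usable as evidence on a PR.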

C3 Regression watchdog — ratchet the wins · not started

Continuous monitoring so improvements don't silently erode: alert #drew-audit when main's CI wall-clock regresses past a sustained threshold, bisect to the offending commit, and file the issue with the evidence attached. The ratchet — not one-off optimization — is what makes this track ambient infrastructure rather than a single project.
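One way to read "regresses past a sustained threshold" is that every run in a recent window exceeds the baseline by some tolerance, so a single slow run does not page anyone. The sketch below assumes that reading; `sustained_regression`, the 10% tolerance, and the five-run window are hypothetical choices, not values from the design.

```python
def sustained_regression(baseline_s, recent_s, tolerance=0.10, window=5):
    """True if the last `window` runs on main all exceed baseline * (1 + tolerance)."""
    if len(recent_s) < window:
        return False  # not enough runs yet to call it sustained
    limit = baseline_s * (1 + tolerance)
    return all(duration > limit for duration in recent_s[-window:])
```

When this returns True, the bisect and the issue filing described above would start from the first commit inside the window.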

Adjudicator — autonomous PR review

The answer to the moved bottleneck (principle 2). The adjudicator is a separate review bot: it reviews PRs and auto-approves only under very conservative thresholds — if there is any doubt, it must defer to a human. Approval authority is earned the same way merge authority was: shadow first. Branches from phase 2.

A1 Shadow reviewer — advisory comments, no authority · not started

The adjudicator reviews every DREW PR — multi-pass: correctness of the breaking-change fixes, diff-scope discipline, supply-chain checks (new transitive deps, install scripts, typosquats) — and posts its findings as comments only. It cannot approve, request changes, or merge. Calibration is the whole point: every verdict is compared against the human reviewer's eventual decision, and the number that matters is the false-pass rate — how often the adjudicator would have approved something a human caught. No authority is requested until that rate is measured, not estimated (§2g).
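The false-pass rate can be measured directly from the shadow log. A minimal sketch, assuming each record pairs the adjudicator's would-be verdict with whether the human reviewer caught a problem, and assuming the denominator is the set of would-approve verdicts (the text does not pin this down). `false_pass_rate` is a hypothetical name.

```python
def false_pass_rate(records):
    """records: iterable of (would_approve, human_caught_problem) booleans.

    Returns the share of would-approve verdicts where a human caught a
    problem, or None when the adjudicator would not have approved anything.
    """
    would_approve = [caught for approved, caught in records if approved]
    if not would_approve:
        return None  # nothing measured yet, so no rate to report
    return sum(1 for caught in would_approve if caught) / len(would_approve)
```

Returning None rather than 0.0 for an empty sample keeps "not yet measured" distinct from "measured and clean".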

A2 Conservative auto-approve — DREW's bounded class only · not started

A separate GitHub App identity (ax-adjudicator, §3.14) gains Pull requests: write and may submit an approving review — but only when every threshold passes: the diff is in the bounded mechanical class, the deterministic verifier passed, full CI is green, the review found zero findings, and the shadow-phase false-pass rate is below target. Any doubt → defer: the adjudicator posts an explicit "deferred to human" comment with its analysis and does nothing else. Approval is the only authority granted — merge still flows through branch protection + DREW's armed auto-merge, so the human review gate on phase 5 is satisfied without being deleted.
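The "every threshold passes, otherwise defer" rule can be sketched as a gate that collects reasons and approves only when there are none. `ReviewEvidence` and `decide` are hypothetical names, and the false-pass target is a parameter because the design does not state a number.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewEvidence:
    in_bounded_class: bool
    verifier_passed: bool
    ci_green: bool
    findings: int
    shadow_false_pass_rate: float | None  # None means not yet measured


def decide(evidence: ReviewEvidence, max_false_pass_rate: float):
    """Return ('approve', []) only if every threshold passes, else ('defer', reasons)."""
    reasons = []
    if not evidence.in_bounded_class:
        reasons.append("diff is outside the bounded mechanical class")
    if not evidence.verifier_passed:
        reasons.append("deterministic verifier did not pass")
    if not evidence.ci_green:
        reasons.append("CI is not fully green")
    if evidence.findings > 0:
        reasons.append(f"review produced {evidence.findings} finding(s)")
    rate = evidence.shadow_false_pass_rate
    if rate is None or rate >= max_false_pass_rate:
        reasons.append("shadow false-pass rate is unmeasured or not below target")
    return ("defer", reasons) if reasons else ("approve", [])
```

The reasons list doubles as the body of the "deferred to human" comment, so a deferral always explains itself.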

A3 Widen — other mechanical PR classes, defer-by-default · not started

Extend beyond DREW's own PRs to other mechanical, low-blast-radius classes — generated-code refreshes, docs-only changes, config bumps with schema-validated diffs — one class at a time, each with its own shadow calibration before any approval authority. The stance stays defer-by-default: the adjudicator exists to clear the obvious, not to replace review. Human-authored feature work is out of scope indefinitely.

Incident triage — diagnose & draft a fix

The original DREW mission, deferred behind the Dependabot lane because it is the harder, higher-blast-radius job. Inverted posture: draft-only, human-gated, never auto-merged — untrusted prod logs are an injection surface the dependency lane doesn't have. Branches from phase 2.

T1 Incident triage — diagnose & draft a fix (future) · not started

A Slack Socket-Mode listener on #ax-incidents, read-only senses (incident.io / ClickHouse logs / Sentry), an automated triage+diagnose pipeline (the incident-demo/incident-prod methodology), and — crucially — a posture inverted from the Dependabot lane: open-ended root cause over untrusted prod logs, so it is draft-only, human-gated (✅ in-thread), and never auto-merged. It reuses the supervisor / worker / egress-jail / audit stack proven on the trunk; what it adds is the read-only MCP senses, the diagnosis pipeline, and a shadow-triage calibration run. Mind the md_pubmarketdata_publisher name translation, and the stack-wide-vs-local gate that keeps DREW from "fixing" code for an infra hiccup. Egress re-adds incident.io + Sentry + the ClickHouse log MCPs for this lane only.

Notebook

Reference design — the detailed mechanics behind the tracker.

§2a Why DREW is not BARD-shaped

This framing drives every downstream decision, so it comes first. BARD and DREW look like siblings — both are LLM bots wired into the AX stack — but they sit on opposite sides of one line: read-only analysis vs. autonomous code change.

Aspect | BARD (analyst) | DREW (engineer)
Job | Answer BI questions over Postgres/ClickHouse | Remediate a vulnerable dependency end-to-end
Tool surface | 5 narrow, fully-constrained read tools | Filesystem, grep, build, git, gh — open-ended
LLM plumbing | Hand-rolled loop on raw Anthropic SDK (agent.py) | The Claude Code harness (it already is this loop)
Side effects | None — sql_safety rejects non-SELECT | Edits manifests, opens a PR, merges it on green CI

For BARD's job, a bespoke loop on the raw SDK is exactly right: the tool surface is tiny and every tool is independently guarded. For DREW's job, the tool surface is a coding agent — edit a manifest, run the build, fix the breakage a bump introduces, drive a PR through CI. That loop, with its context management, permissioning, and sandbox, is precisely what Claude Code already implements. The bump itself a script could do; repairing the call sites a major-version bump breaks is the part that needs the agent.

The takeaway. DREW borrows BARD's operational skeleton — a supervisor service, Redis concurrency/timer guards, secrets from AWS, firewalled posture, container deploy — but its brain is the harness, not a raw message loop. Hand-rolling the coding loop would be reimplementing Claude Code badly. The only custom code is the orchestration around the harness.
Prior art — CARL & GOPHER. Two earlier RFCs (alee/carl-rfc → PR #1760; alee/gopher-design-doc → PR #1807, both closed unmerged, Apr–May 2026) designed the same supervisor + sandboxed-headless-Claude-Code shape for exactly this job — Dependabot remediation — and both deferred incident triage to "a separate RFC". DREW v1 is that Dependabot bot, realized; incident triage is the deferred RFC, now track triage. They independently reached DREW's harness conclusion (headless CLI, not an SDK loop), worked out the sandbox/egress/budget machinery in detail, and auto-merged safe bumps — the posture DREW adopts. Mined into §2e, §2f, and catalogued in §3.10. Like them it bills a metered API key (the Max-subscription path DREW first chose lost its flat-rate edge once Anthropic began metering headless subscription use — §3.1). Where DREW diverges: it runs on a Mac mini (they assumed EC2) and adds the coding-agent ability to fix the breaking changes a bump introduces rather than only landing already-compatible versions.

§2b The harness decision

Given §2a, the real question is not "raw SDK vs harness" (the harness wins for a coding agent) but which form of the harness, and how to wrap it. Three candidates:

Approach | Coding loop | Isolation | Billing / ToS
Raw Anthropic SDK (BARD's way) | You build it all by hand | In-process | API key only
Claude Code CLI (claude -p) | Built-in | Subprocess — fresh, hard-killable, resource-capped | API key or Claude subscription (both ToS-clean)
Claude Agent SDK | Built-in (same engine) | In-process (a hang risks the supervisor) | API key only — subscription/OAuth tokens are a ToS violation

Two findings dominate, and they both point the same way for DREW:

  1. Billing no longer forces the hand — isolation does. The original plan rode a Claude Max subscription, which is ToS-clean only with the CLI (not the Agent SDK), making auth the deciding fork. But as of Anthropic's 2026-06-15 change, subscription headless use (claude -p / Agent SDK) draws from a monthly Agent-SDK credit metered at the same per-MTok rates as the API — the flat-rate edge is gone (§3.1). DREW therefore runs on a metered API key, which is ToS-clean with both the CLI and the Agent SDK. So the choice falls to finding #2 — subprocess isolation — plus the API key being a swappable header credential the egress proxy can rewrite (§2e). Decided: API key → CLI on isolation grounds.
  2. For a security-sensitive autonomous agent, subprocess isolation is a feature. The Agent SDK's edge is in-process control (dynamic permission callbacks, typed messages). But what DREW most needs is to spawn each remediation as a throwaway, network-jailed, resource-capped, hard-killable unit and tear it down — so the supervisor stays clean and can docker kill a runaway. That argues for the CLI-as-subprocess inside a fresh container. The "dynamic permission" capability is recovered with PreToolUse hooks (which run in headless mode) plus a scoped GitHub token, so the in-process callback isn't needed.
Decision. A thin custom supervisor (Python, BARD-style) invokes the Claude Code CLI in headless mode (claude -p --output-format stream-json), one fresh sandboxed container per remediation, on a metered Anthropic API key. The raw SDK is rejected (rebuilds the loop); the Agent SDK is now ToS-viable on the API key but still held in reserve — it trades away the subprocess isolation the security model leans on, so it's only worth it if DREW ever needs tight in-process orchestration.

§2c Architecture: supervisor / worker

The same split BARD and NATE already use — a cheap, long-lived service that does the constrained routing, and an expensive, disposable brain that does the open-ended work. The supervisor is dumb and never dies; the worker is smart and always dies.

Dependabot alerts API ──(poll, every N min)──┐
                                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  DREW SUPERVISOR   (long-lived Python service, like `bard slackbot`)     │
│   • poll open alerts; rank by severity → CVSS → age                      │
│   • THROTTLE: open DREW deps PRs at cap?  → wait this cycle              │
│   • Redis: per-alert lock, concurrency gate = 1, wall-clock timers       │
│   • dedup: alert already has a drew/deps/* branch → skip                 │
│   • spawns ONE worker container for the top alert group                  │
│   • relays PR link + merged/parked/bailed status to #drew-audit          │
└───────────────────────────────┬──────────────────────────────────────────┘
                                │ docker run --rm  (fresh, jailed)
                                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  DREW WORKER   (ephemeral container = `claude -p` harness)               │
│   fresh shallow clone of ax @ main                                       │
│   bump → just rs/format → cargo check → fix call-site breakage           │
│   deterministic verifier → gh pr create → arm auto-merge → /shepherd     │
│   exits with DREW_STATUS: MERGED | PARKED | BAILED                       │
│   tools: Read/Grep/Glob, Bash(git/cargo/just/gh), Edit                   │
│   egress: allowlist proxy (Anthropic · GitHub · crates.io/npm)           │
│   GitHub App token: contents+PR write, merge via branch protection       │
└──────────────────────────────────────────────────────────────────────────┘

The supervisor holds all durable state and all secrets; the worker receives a narrowly-scoped slice for the duration of one remediation and is destroyed (--rm) afterward. The worker also decides its own stop: /shepherd has no natural end, so the worker exits with a parseable DREW_STATUS line (MERGED / PARKED / BAILED), and that exit is the event the supervisor reacts to. On PARKED it schedules drew resume, which re-attaches to the parked PR's branch and picks the loop back up in medias res. The throttle, concurrency gate, timers, and dedup live in Redis in prod, following BARD's guard/ module.
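The hand-off can be sketched as a parser for the status line plus a dispatch on the result. This is a sketch under assumptions: the worker's final text is taken to be available as a plain string (already extracted from the stream-json transcript), `parse_status` and `next_action` are hypothetical names, and treating a missing status line as a bail is an assumed default.

```python
from __future__ import annotations

import re

# Matches a line that is exactly "DREW_STATUS: <one of the three outcomes>".
STATUS_RE = re.compile(r"^DREW_STATUS:\s*(MERGED|PARKED|BAILED)\s*$", re.MULTILINE)


def parse_status(worker_output: str) -> str | None:
    """Return the last DREW_STATUS value in the output, or None if absent."""
    matches = STATUS_RE.findall(worker_output)
    return matches[-1] if matches else None


def next_action(status: str | None) -> str:
    """Map the worker's exit status to what the supervisor does next."""
    if status == "MERGED":
        return "report"
    if status == "PARKED":
        return "schedule_resume"
    # BAILED, or no status at all (assumed to mean a hang or crash).
    return "escalate"
```

Taking the last match means an early status line that the worker later supersedes is ignored.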

§2d The alert → merged-PR pipeline

One alert group, one worker, one PR — the supervisor has already picked the highest-severity open alert and confirmed the open-PR throttle has room. The worker's job is to turn that alert into a merged, CI-green bump. The agent's leverage is steps 4 and 6 — fixing the breakage a bump introduces, and driving CI to green; a script handles the rest.

# | Step | What happens
1 | Resolve | Read the alert: ecosystem / package.name / manifest_path / severity and security_vulnerability.first_patched_version. Decide direct vs transitive dependency; the target version is the advisory's first patched version (§3.11), not the newest release.
2 | Branch | drew/deps/<crate>-<ver> off main in the fresh clone.
3 | Bump | Direct → edit Cargo.toml then cargo update -p <crate> --precise <ver>; transitive → cargo update -p <crate> --precise <ver> (lockfile only). just rs/format.
4 | Build & fix | cargo check (warm cache, §2f). If the bump broke call sites, edit source until it compiles — the coding-agent step. Bumps needing real redesign hit the ceiling → bail & escalate.
5 | Verify | (deterministic, out-of-agent) diff touches only the manifest/lockfile + plausibly-related source; no unexpected new dependency or secret; cargo check passed. Fail → bail & escalate.
6 | PR & shepherd | gh pr create (ready, repo-convention body), arm auto-merge, then /shepherd: watch CI, fix failures (≤ N attempts), push, repeat — never sit polling pending checks or a reviewer. Exit MERGED / PARKED / BAILED.
7 | Report | One line to #drew-audit: alert, CVE/GHSA, bump, PR link, and merged / parked / escalated — needs a human.
Breaking-change ceiling. DREW fixes the call-site breakage a bump introduces — that is the whole point of using the harness — but it does not chase a major version that needs real redesign. When the shepherd loop exhausts its attempts or cargo check won't come clean within budget, DREW leaves the PR open with a comment explaining how far it got and escalates, rather than forcing a bad merge.
Native Dependabot — own the lane. GitHub's own Dependabot security-update PRs would duplicate (and race) DREW. Disable native security-update PRs so DREW is the sole remediator — or have DREW adopt an existing Dependabot branch instead of opening its own (§3.13 — open). It still reads the same alerts API either way.
Deterministic verifier — required before lights-out (GOPHER §5.8), not deferred. In the incident design this was a later nicety because a human reviewed every draft. Once no human reviews before merge, the out-of-agent checker is the last gate and is load-bearing: the agent can be confidently wrong, a dumb checker cannot. It runs in step 5 and again before merge, and a failure blocks the merge outright. It is built and validated in phase 4 against DREW's own live diffs before any unattended merge authority is granted.
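The scope check at the heart of the verifier can be sketched as a function from a diff summary to a list of violations, where an empty list means pass. Everything concrete here is an assumption for illustration: the manifest and lockfile names, the source suffixes, the cap on touched source files as a stand-in for "plausibly-related", and the name `verify_diff`. Secret scanning is omitted.

```python
from pathlib import PurePosixPath

# Assumed file names; the repo's actual manifests and lockfiles may differ.
MANIFESTS = {"Cargo.toml", "Cargo.lock", "package.json", "package-lock.json"}
SOURCE_SUFFIXES = {".rs", ".ts", ".tsx", ".js"}


def verify_diff(changed_paths, deps_before, deps_after, build_ok, max_source_files=10):
    """Return a list of violations; an empty list means the diff is in scope."""
    violations = []
    source_files = 0
    for path in changed_paths:
        p = PurePosixPath(path)
        if p.name in MANIFESTS:
            continue
        if p.suffix in SOURCE_SUFFIXES:
            source_files += 1
            continue
        violations.append(f"out-of-scope path: {path}")
    if source_files > max_source_files:
        violations.append(f"too many source files touched: {source_files}")
    # A version bump keeps package names stable, so any new name is flagged.
    for name in sorted(set(deps_after) - set(deps_before)):
        violations.append(f"unexpected new dependency: {name}")
    if not build_ok:
        violations.append("cargo check did not pass")
    return violations
```

None of this involves the model, which is the point: the checker can be dumb, but it cannot be talked into a pass.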

§2e Security model & blast radius

An autonomous agent with shell + gh write access to a trading-system repo that merges its own PRs is a serious blast-radius question. The injection surface is far smaller than the incident lane — DREW reads dependency metadata and advisories, not untrusted prod logs — but the merge authority is larger. The defenses are layered so that no single failure is catastrophic.

Threat | Control
Exfiltration / C2 over the network | Network-layer egress allowlist via iron-proxy (lifted from the CARL/GOPHER RFCs) — the worker's only route out is a default-deny MITM proxy permitting Anthropic, GitHub, and the package registries (static.crates.io + index, registry.npmjs.org) and nothing else; it TLS-terminates, so rules can be path-scoped (e.g. only /repos/architect-xyz/ax/* on api.github.com), and it filters MCP tools/list so denied tools never reach the agent. Built and mandatory in the prototype (py/drew/egress/) — ships in warn mode to learn the long tail, then enforce. The harness sandbox's hostname filter is defense-in-depth, not the wall.
Unreviewed code landing on main | DREW does merge — so safety is scope + gate, not a human on each PR. Bounded change class (dependency manifest/lockfile + minimal call-site fixes), full CI as the gate (not best-effort cargo check), the deterministic out-of-agent verifier before merge (§2d), and the open-PR throttle so at most one unattended merge is ever in flight — trivially git revert-able. Merge is gated by branch protection (required status checks), not granted by the token. Phased rollout (supervised live runs → patch bumps → wider) caps early exposure.
Malicious / hijacked upstream package (supply chain) | Pin to the advisory's first_patched_version, never "latest" — DREW won't pull an unrelated newer release. The bumped code only ever executes inside CI's own sandbox, never on the worker host; the worker's egress jail means a hostile crate's build script can't phone home from DREW's box either. New transitive dependencies introduced by a bump are flagged by the verifier — and are a first-class check in the Adjudicator's review pass (§2g).
Prompt injection via advisory / package metadata | Smaller surface than the incident lane, but advisory text and changelogs are still untrusted input to a shell-capable agent. Contained by: the bounded token (no workflows/admin), the egress jail, the ephemeral clone (can't reach other repos or secrets), a system-prompt rule to treat all fetched data as evidence, never instructions, and iron-proxy credential-swap — the worker holds only opaque proxy tokens that iron-proxy rewrites to the real GitHub token and Anthropic API key at egress (both are swappable header credentials, with require: true), so an exfiltrated env yields tokens useless outside the sandbox.
Runaway loop / slow CI | Wall-clock timeout via docker kill plus a cap on shepherd fix-attempts (§3.6) — though the normal stop is the worker's own DREW_STATUS exit, with the timeout as backstop for a hang. The open-PR throttle already means there is never more than a single worker. Billing is a metered API key, so USD is a real per-remediation cost — DREW meters its own estimate + realized spend (§3.6) — but wall-clock and the attempt cap, not dollars, remain the runaway controls.
Container escape / host pivot | Hardened worker container (CARL §4 recipe): --rm, non-root, drop all caps + --security-opt no-new-privileges, --init, cpu/mem limits, named container torn down on Ctrl-C, dedicated sandbox bridge only (no metadata service, no Tailscale, no Redis). Still to land: read-only root fs, tmpfs /tmp·/home·/workspace with noexec, no docker.sock audit. Creds injected via --env-file that the supervisor unlinks after start — never baked into the image.
Agent going off the rails (degraded loop) | Context-window monitor (GOPHER §9.5, reframed): parse stream-json usage, track peak context-%, and hard-kill at 50% of the model's window to start (the conservative edge of RULER's ~50–65% usable band — §3.6). The supervised-live phase (phase 4) also logs context-% vs. outcome so the 50% starting point can be tuned from DREW's own data rather than GOPHER's assumed 33%.
Lingering state / lack of audit trail | Fresh --rm container + fresh clone per remediation; nothing persists. Every tool call (PostToolUse hook) plus the gzipped stream-json transcript lands under .drew-audit/ today, S3 + a one-line #drew-audit notice in prod — alert, bump, PR link, merged/parked/escalated — so every merge is visible. (No /drew pause kill switch — the timers and open-PR throttle are sufficient containment.)
Don't rely on the harness sandbox for the network boundary. Claude Code's built-in network filter is hostname-based and does not inspect TLS. The trustworthy egress wall is iron-proxy at the container/VM layer with an explicit allowlist + credential-swap. On a Mac mini, Docker runs a Linux VM — do the jailing Linux-side, where namespaces and the proxy give a real boundary.
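The default-deny, path-scoped rule shape can be illustrated in a few lines. This is not iron-proxy's configuration format or code, only a sketch of the decision it is described as making. The table `ALLOWLIST`, the function `egress_allowed`, the `index.crates.io` host (standing in for "+ index"), and the github.com path prefixes are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# host -> allowed path prefixes; "" means any path on that host.
ALLOWLIST = {
    "api.anthropic.com": ("",),
    "github.com": ("/architect-xyz/ax", "/architect-xyz/ax.git"),
    "api.github.com": ("/repos/architect-xyz/ax",),
    "static.crates.io": ("",),
    "index.crates.io": ("",),
    "registry.npmjs.org": ("",),
}


def egress_allowed(url: str) -> bool:
    """Default-deny: allow only HTTPS to a listed host under a listed path."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    prefixes = ALLOWLIST.get(parts.hostname or "")
    if prefixes is None:
        return False
    # Collapse "." and ".." segments so traversal cannot escape a prefix.
    path = posixpath.normpath(parts.path or "/")
    return any(
        prefix == "" or path == prefix or path.startswith(prefix + "/")
        for prefix in prefixes
    )
```

Path scoping like this only works where the proxy terminates TLS and sees the request path, which a hostname filter cannot do. The sketch does not decode percent-encoded segments; a real proxy would need to normalize those too.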
Why the API key strengthens the jail. An earlier draft ran on a Max OAuth credential (~/.claude), which isn't a swappable header key — so credential-swap could only cover the GitHub token, and the Anthropic side fell back to jailing egress to api.anthropic.com plus guarding ~/.claude. Moving to a metered ANTHROPIC_API_KEY (§3.1) closes that gap: like CARL/GOPHER, iron-proxy now swaps the x-api-key header too, so the worker holds opaque proxy tokens for both credentials and an exfiltrated env is useless outside the sandbox.
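The context-window monitor from the threat table can be sketched as a scan over the transcript lines. The event shape is an assumption: the sketch expects each stream-json line to be a JSON object whose `message.usage` carries Anthropic-style token counts, and it counts input plus cache tokens as "context in use". The window size is a parameter, and `peak_context_fraction` and `should_kill` are hypothetical names.

```python
import json

# Assumed usage fields that together approximate the prompt size.
_CONTEXT_FIELDS = ("input_tokens", "cache_read_input_tokens", "cache_creation_input_tokens")


def peak_context_fraction(stream_lines, window_tokens):
    """Return the peak share of the context window seen across the transcript."""
    peak = 0
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        message = event.get("message") if isinstance(event, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if isinstance(usage, dict):
            used = sum(int(usage.get(field) or 0) for field in _CONTEXT_FIELDS)
            peak = max(peak, used)
    return peak / window_tokens


def should_kill(stream_lines, window_tokens, threshold=0.50):
    """True once peak context use reaches the kill threshold (50% to start)."""
    return peak_context_fraction(stream_lines, window_tokens) >= threshold
```

Logging the returned fraction alongside each run's outcome is what would let the 50% starting point be tuned from DREW's own data.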

§2f Deployment on the Mac mini

The deployment borrows ax-bard's software story (Tailscale + AWS Secrets Manager + Docker) but on a single Mac mini (M4, 10-core, 16 GB, 512 GB) rather than EC2 — see the host trade-off in §3.9.

  • Runtime: Docker Desktop (or colima). The supervisor is an always-on container; workers are docker run --rm siblings on an isolated bridge network whose default route is the egress proxy.
  • Auth: a metered Anthropic API key injected into the worker (claude -p uses it automatically) — from Secrets Manager in prod, a local .env in dev. A swappable header credential, so iron-proxy can credential-swap it at egress like the GitHub token (§3.1, §2e).
  • Secrets: loaded from AWS Secrets Manager at supervisor startup (BARD-style) — the GitHub App credentials from which a per-remediation installation token is minted — and handed to a worker only for the life of one remediation.
  • Admin access: Tailscale for inbound/management, mirroring ax-bard.
  • Lifecycle: the supervisor runs under launchd (Mac-native) or as a restart-always container.
  • Build cache: a warm CARGO_HOME registry + shared target/ via sccache, as named volumes mounted read-mostly into the ephemeral worker (GOPHER §7). Doubly central here: every remediation runs cargo update + a rebuild, so a cold cache would dominate the budget. Key it on (rust-toolchain hash, Cargo.lock hash) and accept a rebuild when either moves. Note that a dependency bump moves Cargo.lock by design, so DREW pays a partial rebuild on each fix; size the timer for that. Directly relieves the 16 GB pressure below.
Why a Mac mini is fine. The brain is the Claude Code harness talking to Anthropic; the local box only needs to clone the repo, run cargo check, and shell out to git/gh. The M4's 10 cores handle that comfortably, and the Linux VM that Docker runs in gives a clean place to put the egress jail.
Watch the 16 GB under Docker. DREW's one memory-heavy local act is cargo check on the ax workspace inside the Docker Linux VM (give the VM ~12 GB, leaving ~4 GB for macOS). One worker fits; two concurrent would thrash. Two mitigations make 16 GB sufficient for v1: (1) the open-PR throttle already pins concurrency at one, so a second worker never starts; (2) CI is the real compile authority — the PR runs full CI on push and is the merge gate, so DREW's local cargo check is a best-effort pre-flight that fails fast before opening a doomed PR. Step up to the 24 GB model only if you later want parallel workers or fuller local builds.
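The build-cache keying described in this section (toolchain hash plus Cargo.lock hash) might be sketched as follows. The function names and the truncated-hash key format are assumptions, not the prototype's actual scheme; the point is only that the key moves whenever either input moves.

```python
import hashlib


def cache_key(toolchain_bytes: bytes, lockfile_bytes: bytes) -> str:
    """Derive a cache key from the toolchain file and Cargo.lock contents."""
    toolchain_hash = hashlib.sha256(toolchain_bytes).hexdigest()
    lock_hash = hashlib.sha256(lockfile_bytes).hexdigest()
    # Truncated for readability; the 12-character length is an arbitrary choice.
    return f"{toolchain_hash[:12]}-{lock_hash[:12]}"


def needs_rebuild(cached_key: str, toolchain_bytes: bytes, lockfile_bytes: bytes) -> bool:
    """True when either input changed since the cache was populated."""
    return cached_key != cache_key(toolchain_bytes, lockfile_bytes)
```

Because a dependency bump rewrites Cargo.lock, needs_rebuild returns True on every remediation by construction, which is why the text sizes the timer for a partial rebuild rather than a fully warm one.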

§2g The Adjudicator direction: moving the bottleneck, not hiding it

DREW solved "who does the mechanical work" and immediately exposed the next constraint: who reviews it. Every parked DREW PR waits on the same scarce resource the bot was built to relieve. Two bad answers bracket the design space: delete the review gate (unacceptable — branch protection's required review is doing real work) and keep a human on every bump forever (the bottleneck DREW exists to remove).

The adjudicator is the third answer: a second, independent agent whose only job is review. Three design commitments, in priority order:

  1. Defer-by-default. The adjudicator's contract is asymmetric: a wrong defer costs one human review (the status quo); a wrong approve lands unreviewed code on main. So the thresholds are deliberately one-sided — it approves only when every check passes (bounded diff class, deterministic verifier, full CI green, zero review findings, calibrated false-pass rate below target), and any doubt — a finding it can't dismiss, a diff outside the class, a confidence wobble — produces an explicit "deferred to human" comment. It must never be cheaper for the adjudicator to approve than to defer.
  2. Separation of duties. The adjudicator is a different GitHub App identity (ax-adjudicator) from the author bot — GitHub already refuses self-approval, and the security property is worth stating: the agent that wrote the diff and the agent that judges it share no token, no container, no transcript (§3.14). They can disagree; that disagreement is signal, posted to #drew-audit.
  3. Calibration before authority. Like the trunk's dry-run → supervised → lights-out ladder, the adjudicator ships as a shadow reviewer (A1) whose verdicts are scored against human outcomes. The promotion criterion is a measured false-pass rate on a meaningful sample — not a vibe that the comments look smart.

Scope grows the same way trust did on the trunk: DREW's own drew/deps/* PRs first (the most mechanical, best-understood class — and the one whose author's bounded scope the adjudicator can verify deterministically), then other mechanical classes (A3). Human feature work stays out of scope indefinitely — the adjudicator clears the obvious so humans can spend review where judgment is actually needed.
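The defer-by-default contract can be expressed as a small decision function. This is a sketch under stated assumptions: the field names are hypothetical, and the false-pass target is a parameter because the document leaves its value open. Every listed check must pass for an approval; anything else yields a deferral with the reasons attached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewInputs:
    diff_in_bounded_class: bool
    verifier_passed: bool
    ci_green: bool
    review_findings: int
    shadow_false_pass_rate: float  # measured during the shadow phase
    false_pass_target: float       # hypothetical; the real target is undecided


def adjudicate(inputs: ReviewInputs) -> tuple[str, list[str]]:
    """Return ("approve", []) only if every check passes, else ("defer", reasons)."""
    reasons = []
    if not inputs.diff_in_bounded_class:
        reasons.append("diff outside the bounded class")
    if not inputs.verifier_passed:
        reasons.append("deterministic verifier did not pass")
    if not inputs.ci_green:
        reasons.append("CI not fully green")
    if inputs.review_findings > 0:
        reasons.append(f"{inputs.review_findings} review finding(s)")
    if not inputs.shadow_false_pass_rate < inputs.false_pass_target:
        reasons.append("false-pass rate not below target")
    return ("approve", []) if not reasons else ("defer", reasons)
```

The asymmetry lives in the structure: there is exactly one path to "approve" and many to "defer", so adding a new check can only make the adjudicator more conservative.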

§2h The CI/CD direction: ambient leverage

The second growth direction points DREW at the pipelines themselves: rust-test and rust-clippy sit on every PR's critical path, so minutes shaved there compound across every developer, every day — including DREW's own shepherd loop, which waits on the same CI to go green. The lane is deliberately shaped like the trunk's trust ladder:

  • Measure first (C1). Read-only telemetry from the Actions API the App can already reach. No optimization is proposed without a baseline and a hotspot ranking — the failure mode of "CI tuning" is cargo-culted cache tweaks that help nothing and break subtly.
  • Coverage is the invariant (C2). Every optimization PR must carry a machine-checkable proof that the executed-test set and enabled-lint set did not shrink (§3.18). "Faster because it does less" is a regression wearing a speedup's clothes — the proof is what keeps the lane honest, and it's deterministic, so it can live in CI itself.
  • Humans merge pipeline changes — structurally. Workflow definitions gate everything else in the repo, including DREW's own merge path, so this lane never gets auto-merge. Better: the ax-drew App deliberately lacks Workflows: write, so the cage cannot push a workflow edit even if the agent tries — how C2's PRs get authored at all is an open permissions question (§3.17).
  • Then ratchet (C3). One-off wins erode; the durable value is the watchdog that notices main's wall-clock regressing, bisects to the commit, and files the issue with evidence. That's the "monitor" half of the mission, and it runs forever.

§3 Design Questions

  1. Auth / billing — subscription or metered API key?
    • Answered — affirmative Metered API key. The earlier choice — a Max subscription, for flat-rate cost predictability — was overtaken by Anthropic's 2026-06-15 change: subscription headless use (claude -p / Agent SDK) now draws from a monthly Agent-SDK credit metered at the same per-MTok rates as the API, so the flat-rate edge is gone. With that gone, the API key wins on three counts: it's simpler to operate (no claude setup-token, no 1-year OAuth expiry, no macOS-Keychain→Linux bridge); it's a swappable header credential, so iron-proxy can credential-swap it at egress exactly like the GitHub token (§2e), closing a gap the OAuth path left open; and it's ToS-clean with both the CLI and the Agent SDK, so billing no longer dictates the harness form — the CLI is chosen on subprocess-isolation grounds (§2b). Metered means real per-remediation cost, so DREW reports its own estimate + realized spend (§3.6), BARD-style.
  2. Autonomy — how far does DREW go unsupervised?
    • Answered — affirmative Shepherd → auto-merge, earned in stages. DREW opens a ready PR, arms auto-merge, and shepherds CI to green — no human drives the mechanics. Today (phase 4) two human gates remain: the operator triggers each run, and branch protection's required review approves each merge. Phase 5 removes the trigger (scheduled supervisor); the required-review gate is resolved either by checks-only protection on the drew/deps/* class or by the Adjudicator track supplying the approving review under its conservative thresholds (§3.12). Defensible because the change class is bounded and mechanical, full CI is the gate, the deterministic verifier backstops it, and the open-PR throttle caps a bad merge at a single git revert (§2e). The incident-triage track keeps the opposite posture — draft-only, human-gated ✅, never auto-merged — because its blast radius and injection surface are far larger.
  3. Egress enforcement — harness sandbox, or network-layer proxy?
    • Answered — affirmative iron-proxy at the container layer is the real boundary (the harness sandbox is defense-in-depth only — it's hostname-based and doesn't terminate TLS). Chosen over hand-rolled squid/mitmproxy because it ships default-deny allowlisting, TLS-terminating path-scoped rules, MCP tools/list filtering, and boundary credential-swap out of the box — and both CARL and GOPHER already converged on it. Built and mandatory in the prototype (py/drew/egress/doctor/remediate refuse to run without it), currently in warn mode. Allowlist for the Dependabot lane: Anthropic, GitHub, and the package registries (static.crates.io + index, registry.npmjs.org) — no incident.io/Sentry/ClickHouse (those return for the triage track). Pin a tagged release; flip warn → enforce after a clean run (§2e).
  4. GitHub credential — fine-grained PAT or a GitHub App?
    • Answered — affirmative GitHub App ax-drew (CARL §3 gives the template) — built; DREW opens PRs as app/afintech-drew, never as the operator. Org-installed but repo-filtered to architect-xyz/ax. Minimal repo permissions: Contents: write, Pull requests: write, Checks: read, Commit statuses: read + Actions: read (the shepherd must read the status rollup and failing workflow-job logs — without these gh pr checks / gh run view --log-failed return 403 "Resource not accessible by integration"; both are read, distinct from the withheld Actions/Workflows: write), Dependabot alerts: read (the trigger feed), Issues: write (escalation comments), Metadata: read — and explicitly not Administration, Workflows, Actions: write, Members, Secrets. Merge needs no extra permission; it is gated by branch protection, not the token (§3.12). Withholding Workflows means GitHub-Actions-ecosystem bumps (which edit .github/) are out of scope for v1 — cargo/npm only — and it constrains the CI/CD track (§3.17). The supervisor mints a ~1h installation token per remediation (never long-lived in the worker); enable App GPG-signed "Verified" commits in prod.
  5. Ingest mechanism — poll the alerts API, or a webhook?
    • Answered — affirmative Poll the Dependabot alerts API (GET /repos/architect-xyz/ax/dependabot/alerts?state=open) on an interval and rank the result. Polling needs no inbound endpoint (outbound only — friendly to the egress jail), is naturally idempotent against the open-PR throttle, and a few-minutes lag is irrelevant for vulnerability remediation. The dependabot_alert webhook is a latency optimization for later, not v1. #drew-audit is output-only — no Slack trigger in this lane (the Socket-Mode listener returns for the triage track).
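A minimal sketch of the rank step, assuming the alerts have already been fetched and decoded from the alerts endpoint above. The fields used (state, created_at, security_advisory.severity) follow the GitHub REST response shape; the oldest-first tie-break and the function name are assumptions, not something the document specifies.

```python
# Lower rank sorts first; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def pick_alert(alerts, open_drew_prs):
    """Pick the highest-severity open alert, or None if throttled or idle."""
    if open_drew_prs > 0:
        # Open-PR throttle: one remediation in flight at a time.
        return None
    open_alerts = [a for a in alerts if a.get("state") == "open"]
    if not open_alerts:
        return None
    return min(
        open_alerts,
        key=lambda a: (
            SEVERITY_RANK.get(a["security_advisory"]["severity"], len(SEVERITY_RANK)),
            a["created_at"],  # ISO 8601 UTC strings sort chronologically
        ),
    )
```

Because the function is pure and the throttle is checked first, re-running it on every poll tick is idempotent: while a DREW PR is open, each tick is a no-op.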
  6. Per-remediation budget — what are the caps?
    • Answered — affirmative Meter + report; wall-clock + attempt cap. On a metered API key, dollars are a real per-remediation cost, so DREW meters itself like BARD: the supervisor logs an up-front estimate and the realized per-MTok spend (parsed from the worker's stream-json usage) for every remediation, persists both to the audit trail, and calibrates the estimate from realized data. That's reporting, not a hard cap. The actual runaway guards stay time-based: a per-remediation wall-clock timeout enforced by docker kill (now a backstop for a hung worker — the normal stop is the worker's own DREW_STATUS exit) and a cap on shepherd fix-attempts (≤ N CI-failure → fix → push cycles). Exhausting either bails to "escalated — needs a human" with the PR left open. The open-PR throttle already pins concurrency at one. A third lever — context-window degradation — is real but uncalibrated: research shows quality drops well before the window fills (RULER finds only ~50–65% of advertised context is reliably usable for multi-hop work; Chroma Context Rot and "Context Length Alone Hurts…" show 14–85% degradation as input grows even with perfect retrieval), but there is no published agentic-coding threshold and GOPHER's "33%" was an internal guess. So DREW starts with a hard-kill at 50% of the window — generous for a bump task, so it fires only on genuine runaways — and logs peak context-% vs. outcome in the supervised phase (4) to tune that number later.
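The guards and the spend report in this answer could look roughly like the sketch below. All names are hypothetical, and the limit values and per-MTok rates are parameters because the document does not fix them; only the 50% context hard-kill comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    wall_clock_s: float                 # operator-chosen; not specified in the text
    max_fix_attempts: int               # the "N" in the text
    context_window_tokens: int          # model-dependent assumption
    context_kill_fraction: float = 0.5  # the 50% hard-kill from the text


def tripped_guard(elapsed_s, fix_attempts, peak_context_tokens, limits) -> Optional[str]:
    """Return the name of the first guard that tripped, or None."""
    if elapsed_s > limits.wall_clock_s:
        return "wall-clock timeout"
    if fix_attempts > limits.max_fix_attempts:
        return "fix-attempt cap"
    if peak_context_tokens >= limits.context_kill_fraction * limits.context_window_tokens:
        return "context-window hard-kill"
    return None


def realized_spend_usd(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out) -> float:
    """Reporting only: realized cost from token usage at per-MTok rates."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
```

Keeping the spend calculation separate from the guards mirrors the text's distinction: dollars are reported, while only time, attempts, and context can stop a run.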
  7. Repo working copy — fresh clone per remediation, or a warm cache?
    • Answered — affirmative Fresh shallow clone per remediation, destroyed with the container. Ephemerality is a security property (no cross-run state, no lingering secrets). If clone latency becomes a problem, a read-only reference cache (git clone --reference) is an optimization that preserves the property — but default to fresh. Note this is the source clone; the warm CARGO_HOME/target build cache (§2f) is a separate, read-mostly volume.
  8. Scope — which bumps does DREW take, and in what order?
    • Answered — affirmative Widen by risk class. Unattended merge authority lands on the safest class first: lockfile-only / patch-level bumps with no manifest range change and no source edits (phase 5.1), then widens to minor, then major / breaking-change fixes (5.2) as the track record earns it. Ecosystem scope is cargo + npm; GitHub-Actions/workflow bumps are excluded (they need Workflows write, §3.4). The supervised phase (4) attempts every class from the start since a human approves each merge — the live record so far spans npm and cargo bumps.
  9. Host — Mac mini or EC2 (like BARD)?
    • Answered — affirmative A single Mac mini (M4, 16 GB). EC2 was evaluated and would mirror ax-bard exactly (Tailscale + Secrets Manager + VPC egress jail + AMI workflow), but the dollar spread is small and the Mac mini wins on flat capex (~$799 one-time vs ~$256/mo for an always-on m7g.2xlarge), M4 build speed, and full ownership. EC2 Mac (mac2.metal) is ruled out — no macOS need and ~$474/mo with a 24h host minimum. Accepted trade-offs: ops/uptime are on us (a physical SPOF), and the egress jail is built Linux-side in the Docker VM rather than VPC-native. The 16 GB cargo check concern is handled by the one-at-a-time throttle + CI-as-compile-authority (§2f).
  10. Prior art — what's worth lifting from CARL & GOPHER?
    • Answered — affirmative Mined. The two closed Dependabot-bot RFCs (§2a) targeted exactly DREW's v1 job, so the overlap is near-total. Adopted: Dependabot-alert polling + severity ranking; auto-merge of green bumps; iron-proxy egress + credential-swap + MCP tool-filtering (§2e); the container-hardening recipe (read-only fs, tmpfs noexec, no docker.sock, drop-caps, env-file-then-unlink); the GitHub App minimal-permission set + signed commits (§3.4); the context-window monitor — reframed from GOPHER's 33% to a 50% hard-kill, tuned from live data (§3.6); warm cargo/sccache keyed on toolchain+lock hash (§2f); S3 transcripts + #drew-audit; and metered-API-key billing with per-run self-metering (their BudgetTracker shape), plus the shared guard/ ConcurrencyGate. Promoted (was deferred): the deterministic out-of-agent verifier (§2d) — with unattended merge there is no human reviewing each PR, so the dumb checker becomes load-bearing rather than a nicety. Dropped: CARL/GOPHER's /drew pause + auto-pause — the timers and open-PR throttle already contain a runaway. Diverged: the harness (not a raw SDK loop), the Mac mini host (not EC2), and the agent's ability to fix breaking changes (they bumped only compatible versions). Worth re-reading: GOPHER §11 "things to flag" — the cargo-cache-invalidation and transitive-advisory gotchas apply directly.
  11. Target version — the advisory's first patched version, or latest?
    • Answered — affirmative Pin to security_vulnerability.first_patched_version, the minimal bump that clears the advisory — not the newest release. Jumping to "latest" maximizes the surface of unrelated breaking changes (more call-site churn, more CI risk) and can pull in a newer release that itself carries a fresh, not-yet-flagged vulnerability. The minimal bump is the smallest reviewable diff and the most likely to pass CI untouched. If the minimal patched version is yanked or itself unbuildable, step up to the next viable release and note it in the PR body.
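The target-version rule can be sketched as below. It assumes plain numeric MAJOR.MINOR.PATCH version strings (no pre-release or build suffixes), and the function names and the `unusable` set (yanked or unbuildable releases) are hypothetical; a real implementation would use the ecosystem's own version parser.

```python
from typing import Iterable, Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' into a tuple so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def choose_target(
    first_patched: str,
    releases: Iterable[str],
    unusable: frozenset = frozenset(),
) -> Optional[Tuple[str, bool]]:
    """Return (version, stepped_up): the lowest usable release >= first_patched.

    stepped_up is True when the minimal patched version was skipped, which
    the text says should be noted in the PR body.
    """
    floor = parse(first_patched)
    candidates = sorted(
        (r for r in releases if parse(r) >= floor and r not in unusable),
        key=parse,
    )
    if not candidates:
        return None
    chosen = candidates[0]
    return chosen, chosen != first_patched
```

For example, with releases 1.2.4, 1.2.5, 1.3.0 and 2.0.0 and a first patched version of 1.2.5, the function picks 1.2.5; if 1.2.5 is marked unusable it steps up to 1.3.0 and flags the step.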
  12. Merge gate — how does auto-merge reconcile with branch protection?
    • Not yet answered Merge is whatever branch protection on main already allows — DREW gets no bypass. The worker arms gh pr merge --auto at PR-open, so the merge completes the moment every required gate passes; where a required human review is configured (today's posture), the PR parks until a human approves — which degrades cleanly to supervised operation rather than failing. The open decision has sharpened into the Adjudicator track: the original options were (a) keep a required reviewer on DREW's PRs (soft human gate, slower) or (b) checks-only protection for the drew/deps/* class (true autonomy, no review at all). The adjudicator offers (c): keep the required-review rule, let a calibrated bot satisfy it under conservative thresholds, defer to a human on any doubt (§3.16). Start with (a) — the status quo — and let the adjudicator's shadow record (A1) decide between (b) and (c).
  13. Native Dependabot PRs — own the lane, or adopt them?
    • Not yet answered GitHub's own Dependabot security-update PRs would duplicate and race DREW. Two clean options: (a) disable native security-update PRs so DREW is the sole remediator and opens its own drew/deps/* branch (simplest, and DREW's value-add — fixing breaking changes + shepherding to merge — is fully expressed); or (b) DREW adopts the existing Dependabot branch, taking over from where the native bot stops. Leaning (a) for a clean ownership boundary; revisit if the team wants to keep native PRs visible. Either way DREW reads the same alerts API, and the open-PR throttle keys on DREW-authored PRs only.
  14. Adjudicator identity — can DREW review its own PRs?
    • Answered — affirmative No — the adjudicator is a separate GitHub App (ax-adjudicator). GitHub refuses self-approval (a PR's author cannot approve it), so the ax-drew identity cannot satisfy a required review on its own PRs even if we wanted it to — the reviewer must be a second identity. The constraint is welcome: it enforces separation of duties (author bot and reviewer bot share no token, no container, no transcript), and their disagreements become auditable signal in #drew-audit rather than an internal contradiction silently resolved (§2g). Permissions for ax-adjudicator: Pull requests: write (submit reviews + comments), Contents: read, Checks/Statuses/Actions: read — no write to contents, no merge, no workflows.
  15. Adjudicator thresholds — what does "very conservative" mean concretely?
    • Not yet answered The promotion gate from A1 → A2 needs numbers, not adjectives. Candidate shape: auto-approve only when (1) the diff is in a deterministically recognizable mechanical class (for drew/deps/*: manifest/lockfile + verifier-approved call-site fixes, no new transitive deps unaccounted for); (2) the deterministic verifier passed; (3) full CI green; (4) the adjudicator's own multi-pass review produced zero findings; and (5) the shadow-phase false-pass rate is below a target (e.g. zero approvals-a-human-would-have-blocked over the most recent N ≥ 50 shadow reviews). Open: the value of N, the target rate, whether "stale calibration" (model or pipeline changed) resets the clock, and whether severity-critical bumps always defer regardless of thresholds.
  16. Adjudicator approval — does a bot review satisfy branch protection, and do we want it to?
    • Not yet answered Mechanically, a GitHub App with Pull requests: write can submit an approving review, and plain "require 1 approval" branch protection counts it — but rulesets can be configured otherwise (require Code Owners, restrict who counts as a reviewer), and the org's settings need auditing before this is assumed. The policy question is separate: even if it counts, do we want DREW's merges gated on a bot approval (option c in §3.12) versus dropping the review requirement for the bounded class (option b)? (c) preserves an independent second judgment on every merge and keeps one rule for the whole repo; (b) is more honest about where the real gate is (CI + verifier). Decide after A1 produces a false-pass record worth arguing from.
  17. CI/CD lane permissions — who can edit .github/workflows?
    • Not yet answered The trunk's security model deliberately withholds Workflows: write from the ax-drew App (§3.4) — which means the CI/CD track's optimization PRs (C2) cannot even push a branch that edits .github/workflows/*. Options: (a) a separate, narrowly-granted App (ax-drew-ci) holding Workflows: write, used only by the C2 lane, with every PR human-merged (the write is to a branch; the gate is the merge); (b) DREW drafts the workflow diff as an issue/comment artifact and a human applies and opens the PR (zero new grants, more friction); (c) scope C2 to non-workflow speedups only (cache configs, nextest adoption in justfile, test code) and leave workflow edits to humans guided by C1's telemetry. Leaning (b) to start, (a) if the lane proves out — the trunk's posture that workflow write is radioactive shouldn't be quietly reversed by a side lane.
  18. CI/CD coverage invariant — how do we prove "same coverage, faster"?
    • Not yet answered "Faster" is easy to measure; "without losing coverage" is the hard, load-bearing half. Candidate: a deterministic coverage-set diff — enumerate the executed test set (e.g. cargo nextest list / per-job test manifests) and the enabled lint set (cargo clippy's effective lint table) on main vs the optimization branch, and require the diff to be empty-or-additive as a CI check on every C2 PR. Open questions: flaky-test quarantine interaction (removing a flaky test is a coverage change and must be its own reviewed decision, never a side effect of a speedup), whether sharding changes execution order in ways that mask order-dependent tests, and whether feature-flag matrix reductions count as shrinkage (they do — the matrix is part of the set).
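The empty-or-additive check proposed for the coverage invariant reduces to a set difference. In this sketch the function names are hypothetical and the inputs are plain lists of test and lint identifiers; how those lists are enumerated on main and on the branch is left out, since the document treats that as a candidate approach rather than a settled one.

```python
def shrinkage(main_items, branch_items):
    """Items present on main but missing on the branch, sorted for stable output."""
    return sorted(set(main_items) - set(branch_items))


def coverage_invariant(main_tests, branch_tests, main_lints, branch_lints):
    """Return (ok, removed): ok is True only if nothing was dropped.

    Additions on the branch are allowed; any removal fails the check.
    """
    removed = {
        "tests": shrinkage(main_tests, branch_tests),
        "lints": shrinkage(main_lints, branch_lints),
    }
    ok = not removed["tests"] and not removed["lints"]
    return ok, removed
```

Feature-flag matrix entries can be folded into the identifiers (for instance by prefixing each test name with its matrix cell), so a reduced matrix shows up as removed items rather than slipping past the check.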