Required a 30-day trial gate on every Claude Code agent before scaling beyond the pilot squad, Apoorv Mittal

Context

Leadership wanted "AI everywhere by Q4". The pilot squad had three Claude Code agents working: a build-doctor, a PR-review assistant, and an RBAC validator. Other squads were asking for access. Two engineers had already started building agents in their own time, off-process.

The pull was real, the pressure was real, and the risk was the obvious one: shipping demos that don't survive contact with production.

The decision

Every agent had to clear a 30-day trial gate inside the pilot squad before any second squad could adopt it. The gate was three numbers:

Criterion	Target
Eval-set pass rate	≥ 90% on a frozen ground-truth set
Incident-attributable rate	< 1 incident / quarter
False-positive rate (where applicable)	< 5%

If an agent missed any of the three at day 30, it was killed: not iterated on, not parked in a "v0.5" branch, but killed, with the learnings written up as a one-pager and shared.
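As an illustration, the three-number check can be written down as a small function. This is a minimal sketch, not the tooling the pilot squad ran: the names `TrialResult`, `gate_failures`, and `passes_gate` are hypothetical, and it assumes rates are stored as fractions and that "where applicable" is modelled as an optional field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrialResult:
    """Day-30 numbers for one agent (hypothetical record shape)."""
    agent: str
    eval_pass_rate: float                        # fraction in [0, 1] on the frozen eval set
    incidents_per_quarter: float                 # incidents attributed to the agent
    false_positive_rate: Optional[float] = None  # None where the criterion does not apply


def gate_failures(result: TrialResult) -> list[str]:
    """Return the names of the criteria the agent missed (empty list means it passed)."""
    failures = []
    if result.eval_pass_rate < 0.90:             # target: >= 90%
        failures.append("eval_pass_rate")
    if result.incidents_per_quarter >= 1:        # target: < 1 incident / quarter
        failures.append("incident_rate")
    if result.false_positive_rate is not None and result.false_positive_rate >= 0.05:
        failures.append("false_positive_rate")   # target: < 5%
    return failures


def passes_gate(result: TrialResult) -> bool:
    """An agent passes only if it misses none of the criteria."""
    return not gate_failures(result)


# The 78% figure is from the writeup; the zero incident count is an assumed
# placeholder so the example isolates the eval criterion.
migration = TrialResult("migration-suggester", eval_pass_rate=0.78, incidents_per_quarter=0.0)
print(gate_failures(migration))  # ['eval_pass_rate']
```

Returning the list of missed criteria rather than a bare boolean gives the kill writeup its first line for free.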

The eval set was the most important artifact. We took 200 historical PRs across the codebase, hand-labelled the correct outcome for each, and ran every candidate review-agent against it weekly.
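The weekly run reduces to comparing an agent's outcome on each labelled PR with the hand-labelled one. The sketch below assumes the eval set is a mapping from PR identifier to expected outcome; the function name, the PR identifiers, and the outcome vocabulary (`approve`, `request_changes`) are all hypothetical. It also assumes a PR the agent produced no outcome for counts as a failure, which is a design choice rather than something the writeup specifies.

```python
def eval_pass_rate(labels: dict[str, str], predictions: dict[str, str]) -> float:
    """Fraction of labelled PRs where the agent's outcome matches the label.

    PRs with no prediction count as failures, so an agent cannot raise its
    score by skipping hard cases.
    """
    if not labels:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for pr_id, expected in labels.items()
        if predictions.get(pr_id) == expected
    )
    return passed / len(labels)


# Tiny illustrative eval set (made-up identifiers and outcomes).
labels = {
    "pr-101": "approve",
    "pr-102": "request_changes",
    "pr-103": "approve",
    "pr-104": "approve",
}
predictions = {
    "pr-101": "approve",   # matches
    "pr-102": "approve",   # wrong outcome
    "pr-103": "approve",   # matches
    # pr-104 missing: counted as a failure
}
print(eval_pass_rate(labels, predictions))  # 0.5
```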

What played out

Three agents failed the gate.

A migration-suggester hit a 78% eval pass rate. It was too aggressive: it kept proposing refactors that were technically correct but socially wrong (touching files three teams away from the PR author).

A PR-review agent had a 12% false-rejection rate on legitimate refactors. It was learning patterns that didn't generalise. We killed it; the learnings became a one-pager titled Why "looks like a refactor" is the wrong heuristic for review approval.

An on-call summariser survived eval but tripped the incident bar in week three (silent failure when the upstream API rate-limited). We added a watchdog and re-entered the gate; it passed on the second attempt.

Two agents passed cleanly and went team-wide. Both still in production.

What I'd do differently

I'd publish the gate criteria as an internal RFC on day one. We did it informally: the gate was a doc in my notes, then a slide, then a doc shared with three teams. By the time the third agent got killed, the team felt like the gate was a moving target rather than a published contract. It wasn't moving, but the perception was the same as if it had been.

The fix is cheap: post the gate before the first agent enters it. Pin the eval set in a versioned repo. Keep a public scoreboard.
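One way to make "frozen" checkable, alongside versioning the eval set in a repo, is to publish a content hash of the labels with the gate and verify it before every run. The sketch below is an assumption about how that could look, not something the program did; `eval_set_fingerprint` and `assert_frozen` are hypothetical names, and the labels are made up.

```python
import hashlib
import json


def eval_set_fingerprint(labels: dict[str, str]) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not affect the hash."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(labels: dict[str, str], pinned_fingerprint: str) -> None:
    """Refuse to run an eval against a set that differs from the published one."""
    actual = eval_set_fingerprint(labels)
    if actual != pinned_fingerprint:
        raise RuntimeError(
            f"eval set changed: expected {pinned_fingerprint[:12]}, got {actual[:12]}"
        )


# Publish the fingerprint with the gate, then check it before each weekly run.
published_labels = {"pr-101": "approve", "pr-102": "request_changes"}
pinned = eval_set_fingerprint(published_labels)
assert_frozen(published_labels, pinned)  # passes silently: nothing has changed
```

With the fingerprint printed on the scoreboard next to each score, anyone can confirm that week one and week four were measured against the same set.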

I'd also reserve a budget line for the kill writeups up front. They're the most valuable artifact of the program, and they only get written if someone has time to write them. We got two of three; the third lived as tribal knowledge for three months before I forced it into a doc.
