This post is a companion to the Claude Code agents case study. That piece walks through the architecture: how the agents are structured, what they enforce, and why they're shaped the way they are. This piece is the post-mortem on which ones survived contact with engineers, and which ones got merged into the trash inside a month.
I shipped roughly thirty Claude Code agents across production codebases in the last twelve months. About a third turned into infrastructure that people complained about when it broke. About a third did their job and faded into the background. The last third I deleted. This is what I learned sorting them.
The agents that paid off
The pattern was the same in every case: deterministic-where-possible programs that happen to call an LLM somewhere in the middle. Not "ask the model to figure it out." Tight scope, machine-readable output, a clear failure mode.
The PR description writer sticks. Engineers don't have to draft descriptions; they just push, and CI runs an agent that reads the diff, the linked ticket, and the conventional-commits style guide, then writes a description and edits it onto the PR. The diff is the truth, the ticket is the why, the style guide is the constraint. Easy. The agent does it in twenty seconds; engineers were spending five to ten minutes each.
The schema-drift monitor sticks. During a four-month parallel build of v2, v1's schema kept evolving, fields added on Tuesdays, types renamed on Fridays, nobody coordinating. The monitor introspects v1's live schema hourly, diffs against v2's rootSchema.graphql, and posts to Slack on drift. A separate PR-time variant blocks merges that introduce breaking or dangerous changes. It's a hundred lines around a single Anthropic call that reasons about which schema deltas matter. Engineers stopped having to remember; the monitor remembered.
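The deterministic half of a monitor like this can be sketched in a few lines. This is an illustration, not the monitor's actual code: it assumes both schemas have already been flattened into a mapping of "Type.field" to type string, and the names Drift and diff_schemas are hypothetical. The LLM call that decides which deltas matter is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    """Differences between the live v1 schema and the v2 schema file."""
    added: dict      # in v1 but not yet in v2: field -> v1 type
    removed: dict    # in v2 but no longer in v1: field -> v2 type
    retyped: dict    # in both, types differ: field -> (v2 type, v1 type)

    def is_empty(self) -> bool:
        return not (self.added or self.removed or self.retyped)


def diff_schemas(v1: dict, v2: dict) -> Drift:
    """Compare two flattened schemas keyed by 'Type.field' (an assumed format)."""
    added = {k: t for k, t in v1.items() if k not in v2}
    removed = {k: t for k, t in v2.items() if k not in v1}
    retyped = {
        k: (v2[k], v1[k])
        for k in v1.keys() & v2.keys()
        if v1[k] != v2[k]
    }
    return Drift(added, removed, retyped)
```

Only a non-empty Drift would need to go to the model and on to Slack; an empty one means the hourly run can exit without calling anything.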
The SQL-safety enforcer sticks. I wrote about it at length in the NL-SQL case study, but the short version is: it's a pre-merge agent that statically analyses every new SQL string in a diff and refuses to merge if the query is missing a tenant filter, or contains a write, or could fan out unboundedly. It found three real bugs in the first month. None of them were caught by review.
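To make the three checks concrete, here is a rough sketch using regex heuristics. It is not the enforcer described in the NL-SQL case study: a real implementation would want a proper SQL parser, and the tenant_id column name, the use of LIMIT as a proxy for "bounded", and the check_sql function are all assumptions made for illustration.

```python
import re

# Heuristics only: string literals or columns named like keywords can fool these.
WRITE_RE = re.compile(r"\b(insert|update|delete|drop|alter|truncate|merge)\b", re.I)
# Assumption: tenancy is enforced by an equality filter on a column called tenant_id.
TENANT_RE = re.compile(r"\bwhere\b.*\btenant_id\b\s*=", re.I | re.S)
# Assumption: an explicit numeric LIMIT is what counts as "bounded".
LIMIT_RE = re.compile(r"\blimit\s+\d+\b", re.I)


def check_sql(query: str) -> list:
    """Return a list of problems; an empty list means the query passes."""
    problems = []
    if WRITE_RE.search(query):
        problems.append("contains a write")
    if not TENANT_RE.search(query):
        problems.append("missing tenant filter")
    if not LIMIT_RE.search(query):
        problems.append("unbounded result set")
    return problems
```

For example, check_sql("SELECT id FROM orders") reports both a missing tenant filter and an unbounded result set, while the same query with "WHERE tenant_id = %s LIMIT 100" appended returns an empty list.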
What these agents share: they replaced a process, not a person. A process that humans were doing inconsistently, slowly, and with forgivable but expensive errors. Once the agent was reliable, the humans gladly handed it over.
The agents that didn't
A handful of agents I shipped were honestly not worth the maintenance.
The "review my PR for me" agent is the obvious one. I built one, shipped it, watched it leave smart-sounding-but-wrong comments on three PRs in a row, and deleted it inside a fortnight. The fundamental problem is that PR review is high-context and low-tolerance: a comment that sounds plausible but is wrong wastes more time than the comment saves, because the engineer now has to explain why the bot was wrong instead of just merging. I'd still build a narrow review agent (does this diff change a public API without a CHANGELOG entry, that kind of thing), but the "reviewer" framing was a trap.
The flake hunter I scaled back. I had an agent that ran on red CI builds, looked at the test, and decided whether it was a flake. It was right about 70% of the time, which sounds good until you realise the 30% it got wrong were the "no, this is a real failure" cases, the ones engineers most needed to know about. I now have it post a recommendation but never act, and the recommendation is right often enough to be useful as a hint.
The "explain this old code" agent I deleted. It generated walls of text that sounded confident, were sometimes accurate, and were never checked. The cure for code nobody understands is to have a senior engineer read it, write the explanation, and put it in the repo. I tried to shortcut that. I shouldn't have.
The one rule I won't break again
Every agent has an exit code, a kill switch, and an eval set.
The exit code is non-negotiable: if the agent is in CI, it has to make a decision, and that decision has to be a number. "Pass" and "fail." Maybe "skip with reason." No "the model said something."
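One way to enforce that is to force the model's reply through a parser that can only return a fixed set of verdicts. The sketch below is hypothetical: the "PASS" / "FAIL: reason" / "SKIP: reason" reply format, the Verdict enum, and the choice of exit code 2 for skips are assumptions, as is failing closed on anything unparseable.

```python
import enum


class Verdict(enum.IntEnum):
    PASS = 0
    FAIL = 1
    SKIP = 2  # assumption: the CI job is configured to treat 2 as "skipped"


def to_verdict(model_output: str) -> tuple:
    """Map the first line of a model reply to (Verdict, reason).

    Anything that is not a recognised verdict fails closed, so
    "the model said something" can never reach CI as a pass.
    """
    first_line = model_output.strip().partition("\n")[0]
    token, _, reason = first_line.partition(":")
    token, reason = token.strip().upper(), reason.strip()
    if token == "PASS":
        return Verdict.PASS, ""
    if token == "FAIL":
        return Verdict.FAIL, reason
    if token == "SKIP" and reason:  # a skip without a reason is not accepted
        return Verdict.SKIP, reason
    return Verdict.FAIL, "unparseable model output"
```

The CI wrapper would then finish with something like sys.exit(int(verdict)), so the pipeline sees a number and nothing else.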
The kill switch is the env var that turns the agent off in five seconds when it misbehaves. It's not a code change, it's not a deploy, it's a flag. Every agent gets one before I ship. I had to write this rule down in capital letters after the second time I rolled an agent back by opening a PR.
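A kill switch of this kind can be a single function that runs before the agent does anything else. This is a minimal sketch: the AGENT_<NAME>_DISABLED naming scheme and the agent_enabled helper are made up for illustration, not taken from any real setup.

```python
import os


def agent_enabled(name: str, environ=os.environ) -> bool:
    """Return False when the (hypothetical) AGENT_<NAME>_DISABLED flag is set.

    The flag lives in the environment, so turning an agent off is a
    settings change in CI rather than a code change or a deploy.
    """
    flag = environ.get(f"AGENT_{name.upper()}_DISABLED", "")
    return flag.strip().lower() in {"", "0", "false", "no", "off"}
```

An agent's entry point would check agent_enabled("pr_writer") first and exit with its "skip" code when the answer is False. Note that this sketch treats any unrecognised value as "disabled", on the assumption that someone setting the flag in a hurry wants the agent off.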
The eval set is the smaller deal day-to-day, but it's the one that saves you when you upgrade Claude. When Sonnet 4.6 came out I was able to swap models on every agent in an afternoon, because each agent had a folder of canonical inputs and expected outputs that I could re-run against the new model. The agents whose evals I'd been lazy about? I spent days finding out that two of them now hallucinated more aggressively on a quirk of the new prompt template.
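The "folder of canonical inputs and expected outputs" can be re-run with very little machinery. The sketch below assumes a layout of paired <case>.input.txt and <case>.expected.txt files and exact-match comparison, which only works because the agents emit machine-readable output. The run_evals name and the file layout are hypothetical, and agent stands in for whatever function wraps the model call.

```python
from pathlib import Path
from typing import Callable


def run_evals(case_dir: Path, agent: Callable[[str], str]) -> list:
    """Run every case in case_dir through agent; return the names of failing cases.

    Assumed layout: <case>.input.txt next to <case>.expected.txt.
    """
    failures = []
    for input_path in sorted(case_dir.glob("*.input.txt")):
        case = input_path.name[: -len(".input.txt")]
        expected = (case_dir / f"{case}.expected.txt").read_text().strip()
        actual = agent(input_path.read_text()).strip()
        if actual != expected:
            failures.append(case)
    return failures
```

Swapping models then amounts to passing a different agent callable and checking whether the returned list is empty.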
What I'm working on next
The frontier I'm exploring now is agents that maintain themselves. Most of my current agents are static: a prompt, a tool allowlist, an eval set. The next interesting layer is agents that watch their own production behavior (false positives, false negatives, latency drift) and propose updates to their own prompts when they're degrading. I've got two of these in early prototype. They'll either be the most useful thing I've built or a self-licking ice cream cone. I'll write up which one.
If you're shipping agents into production, I'd love to compare notes. The contact page is the easiest way.