Replacing manual engineering checks with Claude Code agents, Apoorv Mittal

There is a version of "AI in engineering" where you ask the model to write the feature for you, push the result, and hope for the best. This is not that version.

The version I've been shipping into production codebases looks more like a build system: small, opinionated, deterministic-where-possible programs that happen to call an LLM somewhere in the middle. They run on every commit, every PR, every CI failure, every schema migration. They have exit codes. They have logs. They have kill switches. Engineers complain when they break.

Most of them are Claude Code agents: small .claude/agents/*.md files with prompts plus tool allowlists, invoked from CI or git hooks. A few are stand-alone Node scripts that hit the Anthropic API directly. The distinction barely matters by the time the dust settles.

This is what I've learned shipping ~30 of them.

Principles

Three rules I've ended up applying without exception:

An agent has to be cheaper to write than the rule it replaces. If the agent is more code than the lint rule it replaces, write the lint rule.
An agent has to fail loud. "Probably fine" is not an acceptable exit state. Either it produces a clear pass/fail, or it doesn't run.
Every agent has a kill switch. A repo-level config flag, a CI variable, a .claude/agents/<name>.md rename. If it's costing more time than it's saving, anyone on the team can disable it without asking.

The third one is the hardest. Most failed agent rollouts I've seen violate it.
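A minimal sketch of how the second and third rules might look in a wrapper script. Every name here (AGENTS_DISABLED, AgentConfig, exitCodeFor) is hypothetical and chosen for illustration; none of it is part of Claude Code or of any particular CI system.

```typescript
// Hypothetical wrapper: all names are illustrative assumptions.
type Verdict = "pass" | "fail";

interface AgentConfig {
  // Repo-level flags, e.g. loaded from a checked-in config file.
  disabledAgents: string[];
}

// Environment passed in explicitly so the function stays easy to test.
type Env = Record<string, string | undefined>;

// Kill switch: an agent is off if the repo config lists it, or if a CI
// variable such as AGENTS_DISABLED="sql-safety,flake-hunter" names it.
function isAgentEnabled(name: string, config: AgentConfig, env: Env): boolean {
  const fromEnv = (env["AGENTS_DISABLED"] ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const inConfig = config.disabledAgents.some((d) => d === name);
  const inEnv = fromEnv.some((d) => d === name);
  return !inConfig && !inEnv;
}

// Fail loud: the only outcomes are "did not run", a clear pass, or a clear
// failure. A crash is reported as its own non-zero code, never as a pass.
function exitCodeFor(
  name: string,
  config: AgentConfig,
  env: Env,
  check: () => Verdict,
): number {
  if (!isAgentEnabled(name, config, env)) {
    return 0; // disabled: the agent does not run and does not block
  }
  try {
    return check() === "pass" ? 0 : 1;
  } catch (err) {
    return 2; // the check itself broke
  }
}
```

Keeping the switch in two places (config file and CI variable) is one way to let anyone disable an agent without a code change; a file rename works just as well.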

The taxonomy

The agents I've shipped fall into four buckets.

Build & quality

These run on every commit, PR, and pre-push. They're the cheap, fast, high-volume agents.

Pre- and post-commit Prettier and ESLint enforcement
TypeScript any-type detection (with a curated allowlist)
Dead-code scans (unused exports, unreachable branches)
End-of-turn type-check gating: the agent refuses to declare a turn finished if tsc --noEmit is red

The pattern: a deterministic script does most of the work; the agent's job is to read the failures, decide whether they're worth surfacing, and write a one-line PR comment if so.
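As a rough illustration of that split, the deterministic half for the type-check case could look like the sketch below. It assumes tsc is run with --pretty false so each diagnostic is a single line; the TscFailure shape, the function names, and the choice to pass only failures in changed files to the agent are all assumptions made for the example.

```typescript
interface TscFailure {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Deterministic half: parse plain `tsc --noEmit --pretty false` output, where
// each diagnostic looks like
//   src/a.ts(12,5): error TS2322: Some message.
function parseTscOutput(output: string): TscFailure[] {
  const pattern = /^(.+)\((\d+),(\d+)\): error (TS\d+): (.*)$/;
  const failures: TscFailure[] = [];
  for (const rawLine of output.split("\n")) {
    const match = pattern.exec(rawLine.trim());
    if (match) {
      failures.push({
        file: match[1],
        line: Number(match[2]),
        column: Number(match[3]),
        code: match[4],
        message: match[5],
      });
    }
  }
  return failures;
}

// Hand-off: build the text the agent reads. Returns null when no failure
// touches a changed file, so the LLM is not invoked at all in that case.
function buildAgentInput(
  failures: TscFailure[],
  changedFiles: string[],
): string | null {
  const relevant = failures.filter((f) =>
    changedFiles.some((c) => c === f.file),
  );
  if (relevant.length === 0) {
    return null;
  }
  return relevant
    .map((f) => f.file + ":" + f.line + " " + f.code + " " + f.message)
    .join("\n");
}
```

The LLM call itself is left out on purpose: everything above it is ordinary, testable code, and the agent only sees a short structured summary.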

PR review & delivery

These run on every PR. Higher latency tolerance, more LLM-driven.

Automated PR review (style, design-system adherence, missing tests)
Review-comment resolution (the agent posts patches that address specific reviewer comments and asks the reviewer to confirm)
PR-description generation from the diff and commit messages
Pre-push quality gates (run the same checks as CI, locally, before the push)
Hotfix and release-checklist workflows
CODEOWNERS sanity checks

Migration & investigation

These run ad-hoc, kicked off by a human. The big one-time agents.

Resolver-by-resolver porting agents (the GraphQL API rewrite leaned on one of these; see the case study)
Bulk-module scaffolders (e.g., porting twenty service classes to a new adapter pattern)
Query-extraction agents (find every Drizzle query in the codebase, group by table, surface missing tenant filters)
CI-failure diagnostics (read the failing log, find the most likely cause, post a comment)
Flake-hunter (run a test suite N times, summarise which tests were non-deterministic)
Performance baselines + bundle-size diffing on every PR
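To make the flake-hunter concrete, here is a small sketch of the summarising step only. It assumes the N runs have already been collected into one pass/fail record per run; the RunResult and FlakeReport shapes are hypothetical, and how the suite is actually executed is out of scope.

```typescript
// One entry per run: test name -> whether it passed in that run.
type RunResult = Record<string, boolean>;

interface FlakeReport {
  test: string;
  passes: number;
  failures: number;
}

// A test counts as flaky when it both passed and failed across the runs.
// Consistently failing tests are broken, not flaky, and are left out.
function findFlakyTests(runs: RunResult[]): FlakeReport[] {
  const tally: Record<string, FlakeReport> = {};
  for (const run of runs) {
    for (const test of Object.keys(run)) {
      const entry = tally[test] ?? { test: test, passes: 0, failures: 0 };
      if (run[test]) {
        entry.passes += 1;
      } else {
        entry.failures += 1;
      }
      tally[test] = entry;
    }
  }
  return Object.keys(tally)
    .map((test) => tally[test])
    .filter((r) => r.passes > 0 && r.failures > 0)
    // Most failure-prone first; ties broken by name for stable output.
    .sort((a, b) => b.failures - a.failures || a.test.localeCompare(b.test));
}
```

The agent's part would then be limited to turning this list into a readable summary.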

Architectural & data-safety guardrails

These are the ones I'm most proud of. They encode rules that would otherwise live in a wiki page and decay.

DDL-safety scanner on every schema migration (no DROP TABLE, no NOT NULL column added without a default, no missing indexes on FKs)
Multi-tenant RBAC scope validator (every new query must filter by tenant_id or be on an explicit allowlist)
Audit-trail wiring on money mutations (any function whose name or return type touches cents, amount, price, etc., must call logAuditEvent somewhere in its call tree)
SQL-safety enforcement (string-templated SQL is a build error; parameterised queries only)
Protected-component edit blocking (changes to a small set of high-blast-radius files require a second-set-of-eyes label)
Accessibility + translation-drift scanners
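As one example of how such a rule can be encoded, the sketch below covers two of the DDL-safety checks from the list above: DROP TABLE, and a NOT NULL column added without a default. It is a regex heuristic over plain SQL text, not a SQL parser, and the DdlFinding shape and rule names are made up for illustration. Splitting on semicolons is naive and would misfire on semicolons inside string literals.

```typescript
interface DdlFinding {
  rule: "drop-table" | "not-null-without-default";
  statement: string;
}

// Heuristic scan of a migration file's SQL text.
// Assumption: statements are separated by ";" and contain no ";" in literals.
function scanMigration(sqlText: string): DdlFinding[] {
  const findings: DdlFinding[] = [];
  const statements = sqlText
    .split(";")
    .map((s) => s.replace(/\s+/g, " ").trim())
    .filter((s) => s.length > 0);

  for (const statement of statements) {
    if (/\bDROP\s+TABLE\b/i.test(statement)) {
      findings.push({ rule: "drop-table", statement: statement });
    }
    const addsToExistingTable = /\bALTER\s+TABLE\b.*\bADD\b/i.test(statement);
    const isNotNull = /\bNOT\s+NULL\b/i.test(statement);
    const hasDefault = /\bDEFAULT\b/i.test(statement);
    if (addsToExistingTable && isNotNull && !hasDefault) {
      findings.push({ rule: "not-null-without-default", statement: statement });
    }
  }
  return findings;
}
```

A non-empty result would fail the build; the foreign-key index rule needs schema knowledge and is not attempted here.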

One agent in detail: the SQL-safety enforcer

A walkthrough of the smallest, dumbest, most consequential agent I've shipped.

The problem: people writing SQL by hand keep reaching for string interpolation when the parameterised-query API is only a few characters less ergonomic. We had three SQL-injection-class bugs over eighteen months. None reached production, but each one wasted reviewer cycles and shook team confidence.

The agent: runs on every PR. Walks the diff. For each new or modified file under the trigger paths (src/db/ and src/server/db/), it finds string-templated SQL, meaning anything that looks like:

const rows = await db.execute(`
  SELECT * FROM users WHERE id = ${userId}
`);

…and refuses to merge until it's switched to:

const rows = await db.execute(sql`
  SELECT * FROM users WHERE id = ${userId}
`);

The difference is the three-character sql tag prefix, which our query helper compiles to a parameterised query. The agent posts a PR comment with the offending lines, the suggested patch, and a link to the internal docs explaining why.
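For readers who have not seen a tagged template do this, here is a minimal sketch of what a sql tag might look like. It is not the helper used in these codebases, which is not shown; the ParameterisedQuery shape and the PostgreSQL-style $1, $2 placeholders are assumptions.

```typescript
// Hypothetical stand-in for a `sql` tag; the real query helper may differ.
interface ParameterisedQuery {
  text: string; // SQL with $1, $2, ... placeholders (PostgreSQL style)
  values: unknown[]; // values sent to the driver separately from the text
}

// A tag function receives the literal chunks and the interpolated values
// separately, so values never get spliced into the SQL text.
function sql(
  strings: TemplateStringsArray,
  ...values: unknown[]
): ParameterisedQuery {
  let text = strings[0];
  for (let i = 0; i < values.length; i++) {
    text += "$" + (i + 1) + strings[i + 1];
  }
  return { text: text, values: values };
}

const userId = "1; DROP TABLE users";
const query = sql`SELECT * FROM users WHERE id = ${userId}`;
// query.text   is "SELECT * FROM users WHERE id = $1"
// query.values is ["1; DROP TABLE users"]
```

With the untagged version, the same userId would have been pasted straight into the SQL string.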

The full agent definition is under thirty lines:

---
name: sql-safety
allowed-tools:
  - Bash(rg:*)
  - Bash(git diff:*)
trigger:
  - on: pull_request
    paths: ["src/db/**/*.ts", "src/server/db/**/*.ts"]
---

You are a SQL-safety reviewer. For every changed file matching the
trigger paths, find SQL strings constructed via template literals
without our `sql` tag.

Report findings as a single PR comment. Do NOT post inline review
comments, they create noise. Do NOT comment if there are zero findings.

For each finding, include:
- file path and line number
- the offending snippet
- the suggested rewrite
- a link to docs/internal/sql-safety.md

If you cannot determine whether a string literal is SQL with high
confidence, ignore it. False positives erode trust in this agent more
than missed catches.

Why it works: small surface area, narrow remit, deterministic outputs (run twice, get the same comment), and an explicit "false-positive aversion" instruction. The team complains when it breaks. That's the signal.
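The narrow remit also means much of the detection can be mechanical. The agent above searches with rg; the sketch below shows a comparable heuristic in TypeScript, useful mainly for seeing what "untagged template passed to db.execute" means in practice. The function name and SqlFinding shape are hypothetical, and the regex only handles the call shape shown in the example, not nested backticks or queries built in a separate variable.

```typescript
interface SqlFinding {
  line: number; // 1-based line where the db.execute call starts
  snippet: string;
}

// Heuristic, not a parser: flags `db.execute(` followed directly by an
// untagged backtick whose template contains a ${...} interpolation.
// A tagged call such as db.execute(sql`...`) does not match, because the
// character after "(" is "s", not a backtick.
function findUntaggedSqlTemplates(source: string): SqlFinding[] {
  const pattern = /db\.execute\(\s*`([^`]*)`/g;
  const findings: SqlFinding[] = [];
  let match: RegExpExecArray | null = pattern.exec(source);
  while (match !== null) {
    if (match[1].indexOf("${") !== -1) {
      const line = source.slice(0, match.index).split("\n").length;
      findings.push({ line: line, snippet: match[0] });
    }
    match = pattern.exec(source);
  }
  return findings;
}
```

Templates without any interpolation are deliberately ignored, in keeping with the prompt's preference for missed catches over false positives.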

Rollout

I've now run the same rollout three times, across two engineering organisations and a personal project. The pattern that's worked:

Ship one agent that nobody asked for, but that solves a real pain. (For me, twice in a row, this has been bundle-size diffing on every PR: instantly visible value, zero risk.)
Wait two weeks. People will start adding it manually to other repos.
Roll out a second agent that addresses pain people have raised in 1:1s, such as flake hunting or PR-description generation. Pair with one adopter to write the prompt.
Codify the kill switch in writing. Document how to disable any agent. Mention this every time you launch a new one.
Roll out the architectural guardrails last. These are the ones people have to live with on every PR; they need to trust the agents already.

The order matters. I've watched two attempts to roll out architectural guardrails first, before anyone had built trust in the cheap quality-of-life agents. Both times the guardrails got disabled within a month.

Outcomes

Metric	Result
~30 agents shipped	Across 3 production codebases + 2 personal
SQL-injection-class bugs (18mo before)	3
SQL-injection-class bugs (12mo after)	0
Average PR review turnaround	Down meaningfully (mid double-digit %)
`any`-type creep on the largest repo	Halted; net reduction over six months
Translation drift between locales	Detected at PR time instead of at release
Engineer-reported productivity	Up; specifically on PR description writing

What I'd do differently

I'd write evaluations earlier. A handful of these agents drifted in behavior across model upgrades, and I had to scramble to write retroactive evals. For the next batch I'm building eval sets the same week I write the agent: a small fixture repo with seeded "should pass" and "should fail" cases, run on every model bump.
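A fixture-based eval of that kind can be very small. The sketch below is one possible shape, with hypothetical names (EvalCase, runEvals) and a stub standing in for the agent; it is kept synchronous for brevity, whereas a real agent invocation would almost certainly be asynchronous.

```typescript
type Verdict = "pass" | "fail";

interface EvalCase {
  name: string;
  input: string; // e.g. a seeded file or diff from the fixture repo
  expected: Verdict;
}

interface EvalMismatch {
  name: string;
  expected: Verdict;
  actual: Verdict;
}

// Runs every seeded case through the agent and returns the disagreements.
// `agent` is a stand-in for whatever actually invokes the model.
function runEvals(
  cases: EvalCase[],
  agent: (input: string) => Verdict,
): EvalMismatch[] {
  const mismatches: EvalMismatch[] = [];
  for (const c of cases) {
    const actual = agent(c.input);
    if (actual !== c.expected) {
      mismatches.push({ name: c.name, expected: c.expected, actual: actual });
    }
  }
  return mismatches;
}

// Seeded cases: one that should pass, one that should fail.
const seededCases: EvalCase[] = [
  { name: "tagged query", input: "db.execute(sql`SELECT 1`)", expected: "pass" },
  { name: "untagged query", input: "db.execute(`SELECT ${x}`)", expected: "fail" },
];

// Stub agent for illustration only: passes anything that uses the sql tag.
const stubAgent = (input: string): Verdict =>
  input.indexOf("sql`") !== -1 ? "pass" : "fail";

const evalMismatches = runEvals(seededCases, stubAgent);
```

On a model bump, an empty mismatch list means behaviour has not drifted on the seeded cases; anything else is a named, reproducible regression.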

I'd treat agent prompts as code from day one. Code review, version history, and a CHANGELOG. The two times I've hit a regression in the wild have both been because someone tweaked an agent prompt without telling the team, and the failure mode wasn't obvious until a week later.

I'd resist the temptation to make agents conversational. The most durable agents are the dumb ones: pass/fail, no chit-chat. The conversational ones are the first to degrade and the last to be trusted.