---
title: "The review gap: your team now writes more code than it can review"
date: "2026-06-08"
excerpt: "AI didn't remove your bottleneck. It moved it to code review. The fix isn't reading faster: machines check the code, humans review the intent."
author: "Marcin Ostrowski"
---

At nerds.family, agents have been shipping code to production daily since December 2025. People assume the hard part was getting AI to write good code. It wasn't. The hard part is that somebody has to read it.

The industry data says the same thing. Ankit Jain of Aviator (a company that sells review tooling, so calibrate accordingly) [cites Faros AI data](https://www.latent.space/p/reviews-dead) from over 10,000 developers: teams with high AI adoption complete 21% more tasks and merge 98% more pull requests, while total review time grows 91%. Do the division and review time per PR stays roughly flat. That's the trap. Nobody's reviews got slower, so nothing feels broken on any single PR. The queue just doubled. CodeRabbit, another review vendor, [measured](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report) about 1.7x more flagged issues per AI-co-authored PR than per human-written one. Both sets of figures come from people selling the cure, so treat them as a ceiling. The direction matches what I see in real codebases.
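The division can be sketched in a few lines of Python. The only inputs are the two percentages quoted above; the variable names are illustrative.

```python
# Relative growth reported for teams with high AI adoption (figures quoted above).
merged_prs_growth = 0.98    # 98% more pull requests merged
review_time_growth = 0.91   # 91% more total review time

# Review time per PR relative to a baseline of 1.0.
per_pr_ratio = (1 + review_time_growth) / (1 + merged_prs_growth)

print(f"review time per PR: {per_pr_ratio:.2f}x baseline")  # prints "review time per PR: 0.96x baseline"
```

Per-PR effort barely moves while the number of PRs nearly doubles, which is why the problem shows up in the queue rather than in any individual review.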

Review was struggling before AI showed up. PRs sat for two days, approvals were really skims, and a 500-line diff got "reviewed" between two meetings. AI just multiplied the load on a process we already didn't have time for.

## Reading faster is not the fix

Avery Pennarun has [a rule of thumb](https://apenwarr.ca/log/20260316) that every layer of approval makes a process 10x slower. A fix that takes 30 minutes to write takes half a day once it enters the review queue. AI makes the writing nearly free and leaves the queue exactly where it was. His conclusion is that the only way to sustainably go faster is fewer reviews, and he immediately adds the caveat that matters: you can't just remove review stages without something to replace them. That way lies the Ford Pinto.

So the question isn't whether to keep reviewing. It's what replaces the single overloaded gate. Pointing an AI reviewer at the queue is the obvious move, and we do run one in our pipelines. It catches real issues. But its blind spots correlate with those of whatever generated the code, and an approval from a machine is responsibility deferred, not trust earned. It buys time. The shape of the problem stays.

## One gate was doing three jobs

The way most teams run it, code review is one checkpoint doing three jobs at once: hygiene (does it build, is it tested, is it safe to deploy), conventions (is this how we do things here), and judgment (is this the right change). A senior engineer reading a diff does all three in their head, under time pressure, while their own work waits.

The way out of the review gap is to unbundle them. The first two are mechanical, and machines should settle them before any human looks at anything. Judgment is not mechanical, and pretending a linter covers it is how teams get burned. No tool in your pipeline knows your billing rules.

## Machines go first

On our codebases, by the time a pull request reaches a human it has already passed the gates: linters loaded with the team's own rules, security scanning, a check that fails the build when changed code lacks test coverage, a check that blocks migrations that would lock a table at noon. The full suite runs locally before push. Ours takes 1 minute 32 seconds, and the speed is a feature, because nobody skips a gate they don't feel.
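To make the coverage gate concrete, here is a minimal sketch in Python. It is an illustration, not the actual check from the pipeline described here. It assumes the executable lines touched by the diff and the lines executed by the suite have already been extracted from a diff and a coverage report, and every name in it (`uncovered_changes`, `gate`, the sample paths) is hypothetical.

```python
def uncovered_changes(
    changed: dict[str, set[int]], covered: dict[str, set[int]]
) -> dict[str, list[int]]:
    """Map each changed file to its changed lines that no test executed."""
    gaps = {}
    for path, lines in changed.items():
        missing = sorted(lines - covered.get(path, set()))
        if missing:
            gaps[path] = missing
    return gaps


def gate(changed: dict[str, set[int]], covered: dict[str, set[int]]) -> int:
    """Return a process exit code: 0 lets the push through, 1 fails the build."""
    gaps = uncovered_changes(changed, covered)
    for path, lines in gaps.items():
        print(f"{path}: changed lines without coverage: {lines}")
    return 1 if gaps else 0


# Hypothetical inputs: lines touched by the diff, and lines the suite executed.
changed = {"app/billing.py": {10, 11, 12}, "app/users.py": {40}}
covered = {"app/billing.py": {10, 11}, "app/users.py": {38, 39, 40}}

exit_code = gate(changed, covered)
```

With these sample inputs, line 12 of `app/billing.py` is changed but never executed, so the gate returns 1. Note what it settles: that the changed code was exercised, not that any test asserted something meaningful about it.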

Conventions get the same treatment. Every "we don't do it this way here" comment you've ever left in review is a rule that should live in the repo, loaded by the agent before it writes and enforced by a hook after. A convention argument in review means you're paying a senior engineer to repeat what a config file should say. I described the whole setup in [How do you know the software is working?](/blog/how-do-you-know-the-software-is-working), and the conventions are [public on GitHub](https://github.com/marostr/superpowers-rails).

A human should never be the first reader of AI-generated code. That's the principle. But notice what the gates actually settled: hygiene and conventions. Not correctness.

## Who reviews the tests?

Here's the circularity nobody warns you about. If the agent writes the code and the agent writes the tests, the tests can encode the same misunderstanding as the code. Green suite, wrong behavior. A coverage gate proves tests exist, not that they assert anything worth asserting.

The escape is to treat acceptance criteria as the reviewed artifact. The spec says what must be true when the work is done, in checkable terms, and a human signs off on that before code exists. Then I read assertions, not implementations. Whether a test asserts the right thing is a judgment call, and judgment is the job we kept. We learned this the expensive way at Domestika, where part of the engagement was replacing tests that weren't catching real bugs with ones that did. A test suite is a claim about what matters. Somebody has to review the claim.
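A small Python sketch of the circularity, under invented assumptions: the discount function, its bug, and the acceptance criterion are all hypothetical, chosen only to show how a suite can be green and wrong at once.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    # Deliberate bug: treats the percentage as a flat amount in cents.
    return total_cents - percent


def check_executes_the_code():
    # Runs every line of apply_discount, so a coverage gate passes, but asserts nothing.
    apply_discount(10_000, 10)


def check_encodes_the_same_misunderstanding():
    # Written from the implementation, so it asserts the buggy result and stays green.
    assert apply_discount(10_000, 10) == 9_990


def check_from_acceptance_criterion():
    # Hypothetical criterion from a reviewed spec: "10% off 100.00 leaves 90.00".
    assert apply_discount(10_000, 10) == 9_000


def passes(check) -> bool:
    """Run one check function and report whether it passed."""
    try:
        check()
    except AssertionError:
        return False
    return True


for check in (
    check_executes_the_code,
    check_encodes_the_same_misunderstanding,
    check_from_acceptance_criterion,
):
    print(f"{check.__name__}: {'green' if passes(check) else 'red'}")
```

The first two checks come back green and the third comes back red. Only the third was derived from something a human signed off on, which is the argument for reading assertions against the spec rather than against the implementation.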

## Humans review the intent

At nerds.family the loop looks like this: the CPO writes a spec against the codebase, agents write the code, and I review what comes out against the spec. What I'm checking is not syntax. It's whether the change does what the spec says, and whether the spec asked for the right thing. Two people working part-time get more shipped per week than the product saw from a full-time developer before. That's our own count, on one product, and it's a throughput number, not a quality number. The honest quality story is in the gates and in the rollback guardrails, not in any single human's vigilance.

A fair objection: specs get skimmed too. Move the rubber stamp upstream and you have an upstream rubber stamp. The defenses are unglamorous. Keep scopes small enough that a spec fits on a screen, and keep a template that forces the questions a skim would skip. A spec you can't review in one sitting is two specs.

And one habit makes the whole thing compound. When a PR comes back wrong, the cheap move is to fix the PR. The better move is to fix the harness: encode the missing convention, write the missing check, patch the hole in the spec template. The mechanically checkable rejections disappear as a class. The judgment rejections don't, and that's fine. What's left in human review gets smaller and harder, which is exactly where senior attention belongs.

## What still gets read line by line

Migrations, auth, payments, anything that deletes data or touches money. Code where being wrong once is expensive gets human eyes on every line, and the AI's job there is to keep the diff small enough to read honestly. The list is short, written down, and enforced by file paths rather than vibes, the same way CODEOWNERS works. "Everything is critical" is how you got the rubber-stamp culture in the first place.
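As a rough sketch of "file paths rather than vibes", the check below flags changed files that match a written-down list of critical patterns. The patterns and file names are hypothetical, and Python's `fnmatch` syntax is not the same as CODEOWNERS syntax (here `*` also matches `/`), so treat it as an illustration of the idea rather than a drop-in rule set.

```python
from fnmatch import fnmatchcase

# Hypothetical list for illustration; keep the real one short and in the repo.
CRITICAL_PATTERNS = [
    "db/migrate/*",
    "app/controllers/sessions_controller.rb",
    "app/services/payments/*",
]


def needs_line_by_line(changed_files: list[str]) -> list[str]:
    """Return the changed files that match a critical pattern, sorted."""
    return sorted(
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in CRITICAL_PATTERNS)
    )


changed_files = [
    "app/views/home/index.html.erb",
    "db/migrate/20260601_add_index_to_orders.rb",
    "app/services/payments/refund.rb",
]
flagged = needs_line_by_line(changed_files)
print(flagged)
```

In this example the view template passes through, while the migration and the payments file are routed to a line-by-line read. A PR that touches nothing on the list never triggers that review, which keeps the list meaningful.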

There's a real cost in all this that I won't pretend away: review is also how juniors learn a codebase, and shrinking it shrinks that channel. If your team leans on review for teaching, you'll need to replace it deliberately, not discover the gap in a year.

## What I'd actually watch

Most of what I've described we run on a small product with two people, and I won't pretend a 25-person team is the same animal. The gates transfer as they are; we've installed the same guardrails on years-old codebases with existing teams. The hard part at scale is spec discipline across many authors, and whoever tells you that part is solved is selling something.

What makes this tractable is that the review gap is measurable. Watch the review queue: time from PR opened to first human action and time to merge, and keep an eye on your revert rate while both improve. If your AI rollout produced more code but less trust, this is almost certainly where it's stuck, and unlike most of an AI adoption, the payback shows up in weeks. The queue is already there. You can watch it shrink.
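For anyone who wants to start counting, here is a minimal sketch of those three numbers in Python. The `PullRequest` record and the sample data are hypothetical; in practice the timestamps would come from whatever your code host exports, and "first human action" needs a definition you pick yourself (first review comment, first approval, and so on).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    opened_at: datetime
    first_human_action_at: datetime
    merged_at: datetime
    reverted: bool


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def review_gap_metrics(prs: list[PullRequest]) -> dict[str, float]:
    """Summarize the review queue for a batch of merged PRs."""
    return {
        "median_hours_to_first_human_action": median(
            hours_between(pr.opened_at, pr.first_human_action_at) for pr in prs
        ),
        "median_hours_to_merge": median(
            hours_between(pr.opened_at, pr.merged_at) for pr in prs
        ),
        "revert_rate": sum(pr.reverted for pr in prs) / len(prs),
    }


# Hypothetical sample: three merged PRs, one of which was later reverted.
prs = [
    PullRequest(datetime(2026, 6, 1, 9), datetime(2026, 6, 1, 11), datetime(2026, 6, 1, 15), False),
    PullRequest(datetime(2026, 6, 2, 9), datetime(2026, 6, 2, 17), datetime(2026, 6, 3, 9), True),
    PullRequest(datetime(2026, 6, 3, 10), datetime(2026, 6, 3, 14), datetime(2026, 6, 3, 20), False),
]

metrics = review_gap_metrics(prs)
print(metrics)
```

Medians are used here because a single PR that sat over a weekend would distort an average. Track the revert rate alongside the two times so a faster queue that ships more breakage doesn't read as an improvement.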