Reviewing AI-Generated Code: The 3-Pass Method

March 19, 2026

AI-generated code has a specific failure mode that human-written code doesn’t: it looks correct. The formatting is clean. The variable names make sense. The structure follows patterns you’d expect. And sometimes the logic is completely wrong.

That’s why reviewing AI output is a different skill than reviewing a colleague’s pull request. You can’t trust your gut. You have to actually read it.

Read the diff, not the conversation

When I review Claude Code output, I don’t scroll back through the conversation to see what it “thought” about. I don’t read its explanation of what it did. I look at the diff.

git diff HEAD~1

The diff is the truth. The conversation is the AI’s narrative about the truth. Those are different things.

If I can’t understand what changed by reading the diff alone, either the change is too big (I scoped the task wrong) or the change is too scattered (the AI touched files it shouldn’t have). Both are problems I need to fix at the scoping level, not the review level.

This is the same approach I use across all my projects in my daily review workflow. Diff first, always.

Why review matters more with AI

When a human developer writes code, they’re constrained by what they know. A junior dev writes junior code — you can see the rough edges and know where to look.

AI writes senior-looking code with junior-level understanding. It’ll use the right design pattern but apply it to the wrong problem. It’ll write a perfectly structured function that handles an edge case that doesn’t exist while ignoring one that does.

On Scouter, I had Claude Code build an API rate limiter. The code was clean. Middleware registered correctly. Returned proper 429 responses. But it was counting requests per IP address in a setup that sits behind a load balancer — so every request had the same IP. The code looked professional. It would have rate-limited all users as a single user.
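A minimal sketch of that bug class (the names `keyFor` and `RequestLike` are illustrative, not the actual Scouter code): keying the rate limit on the socket’s remote address means every request behind a load balancer shares one bucket, while keying on the first hop of X-Forwarded-For recovers the client IP, assuming you trust the proxy to set that header.

```typescript
// Illustrative types only; not the real middleware.
interface RequestLike {
  socket: { remoteAddress: string };
  headers: Record<string, string | undefined>;
}

// Buggy: behind a load balancer, remoteAddress is always the
// balancer's IP, so all clients collapse into one rate-limit key.
function keyForBuggy(req: RequestLike): string {
  return req.socket.remoteAddress;
}

// Fixed: prefer the first entry in X-Forwarded-For, which a
// trusted load balancer sets to the original client IP.
function keyForFixed(req: RequestLike): string {
  const forwarded = req.headers["x-forwarded-for"];
  if (forwarded) {
    return forwarded.split(",")[0].trim();
  }
  return req.socket.remoteAddress;
}
```

Note that the fixed version only makes sense when the header comes from infrastructure you control; trusting X-Forwarded-For from arbitrary clients just moves the bug.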

That’s the kind of bug you only catch by reading the diff with your brain engaged.

The 3-pass review

I review every diff three times. Sounds slow. It’s faster than debugging production.

Pass 1: Structure

What files changed? Are they the right files? Did the AI touch things I didn’t ask it to touch?

src/middleware/rateLimit.ts    (expected — this is the task)
src/middleware/index.ts        (expected — needs to register the middleware)
src/config/defaults.ts         (unexpected — why is it changing config?)
test/middleware/rateLimit.test.ts  (expected — tests)

That unexpected file change is a red flag. Sometimes it’s fine — the AI needed a config value and put it in the right place. Sometimes it’s the AI “improving” something adjacent to the task. I check it before moving on.

This is why scoping tasks tightly matters so much. A well-scoped task produces a diff with no surprises in the file list.

Pass 2: Logic

Now I read the actual code changes. I’m asking:

Does the logic actually do what the task asked for?
Does it handle the edge cases that exist in this feature, not imagined ones?
Does it follow the patterns the rest of the codebase uses?

That last one is critical. AI doesn’t have an intuitive sense of your codebase’s patterns. It knows patterns from training data. If your codebase uses a specific error handling approach, the AI might use a different (equally valid) one. Now you have inconsistency.

On Triumfit, I noticed Claude Code was using try/catch blocks for async error handling in one file while every other file in the project uses .catch() chains. Both work. But consistency matters more than either individual choice.
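To make the inconsistency concrete, here are the two styles side by side. `fetchUser` is a made-up stand-in, not Triumfit code; both functions handle a rejected promise correctly, which is exactly why a reviewer has to enforce the choice rather than rely on correctness.

```typescript
// Stand-in async function that can reject.
async function fetchUser(id: number): Promise<string> {
  if (id < 0) throw new Error("invalid id");
  return `user-${id}`;
}

// Style 1: try/catch inside an async function.
async function loadWithTryCatch(id: number): Promise<string> {
  try {
    return await fetchUser(id);
  } catch {
    return "fallback";
  }
}

// Style 2: .catch() chained on the promise.
function loadWithCatchChain(id: number): Promise<string> {
  return fetchUser(id).catch(() => "fallback");
}
```

Either style is fine as a project convention. The review question is only: which one does this codebase already use?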

Pass 3: Style and cleanup

This pass is fast. 30 seconds on most diffs. But it catches the things that make a codebase feel messy over time: leftover comments, unused imports, stray debug logging, naming that drifts from the surrounding code.

Red flags in AI output

After reviewing hundreds of AI-generated diffs, these are the patterns that make me look closer:

Over-engineering. You asked for a function, it built a class with an abstract interface. You asked for a simple check, it built a validation framework. AI tends to generalize when you wanted something specific.

Features you didn’t request. “While I was adding the export button, I also added a share button and improved the toolbar layout.” No. I asked for an export button. The diff should contain an export button. Everything else is noise that needs to be reviewed, tested, and maintained.

Wrong patterns for your codebase. The AI writes idiomatic code — for some codebase. Maybe not yours. If your React app uses hooks everywhere and the AI writes a class component, that’s a flag.

Confident-sounding test code that doesn’t test anything meaningful. AI is very good at writing tests that pass. It’s less good at writing tests that would catch bugs. Look at what the tests actually assert, not just that they exist.
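Here’s a sketch of the difference, with a hypothetical `applyDiscount` function standing in for real logic. The weak test passes for almost any implementation; the meaningful one pins exact values and an edge case.

```typescript
// Hypothetical function under test.
function applyDiscount(price: number, percent: number): number {
  return price * (1 - percent / 100);
}

// Weak: only checks that something numeric came back. A broken
// implementation that returns `price` unchanged still passes.
function weakTest(): boolean {
  return typeof applyDiscount(200, 50) === "number";
}

// Meaningful: asserts the exact expected value and the
// zero-discount edge case.
function meaningfulTest(): boolean {
  return applyDiscount(200, 50) === 100 && applyDiscount(200, 0) === 200;
}
```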

Unnecessary abstraction. If the task was “add a button that calls an API endpoint,” and the diff includes a new service layer, a factory pattern, and a config file, that’s the AI optimizing for imagined future requirements instead of the actual task.
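The over-engineering pattern looks something like this sketch (all names hypothetical). The task was a single email check; the function is the whole deliverable, and the interface-plus-class wrapper is the part that should trigger a closer look.

```typescript
// What was asked for: one small function.
function isLikelyEmail(value: string): boolean {
  return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(value);
}

// What AI output often adds on top: an abstraction with exactly
// one implementation and no second use case in sight.
interface Validator<T> {
  validate(value: T): boolean;
}

class EmailValidator implements Validator<string> {
  validate(value: string): boolean {
    return isLikelyEmail(value);
  }
}
```

The wrapper isn’t wrong, it’s just unpaid-for generality: more surface area to review, test, and maintain for zero current benefit.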

Reviewing across multiple projects

When you’re reviewing diffs across 6+ projects in a morning — which is most mornings for me — review fatigue is real.

My rules:

Review the hardest diffs first. Business logic, API endpoints, data migrations — these get my fresh-brain attention. Styling changes and copy updates get reviewed later.

Don’t batch too many big diffs. If three of my morning tasks are large features, I scope the other five as small changes. Twelve large diffs in a row means sloppy reviews by diff eight.

Use the structure pass as a filter. If the file list looks clean and matches what I expected, I can move to the logic pass faster. If the file list has surprises, I slow down immediately.

Take the common mistakes seriously. Most review failures come from the same handful of patterns. Once you know what to look for, you find them faster.

The cost of skipping review

I’ve shipped bugs by accepting diffs too quickly. A Lucid export feature that crashed on entries with images — because I skimmed the diff and didn’t notice it wasn’t handling the image attachment path. A Scouter webhook handler that silently dropped events with unexpected payload shapes — because the error handling looked right at a glance.

Both bugs took longer to fix than the review would have taken to catch them.

The temptation to skip review is strong. The AI wrote clean-looking code. The tests pass. The feature works when you click through it once. But “works when you try it” and “works in production” are different standards.

The bottom line

AI writes code fast. Your job isn’t to write code anymore. Your job is to review it — and you need to be good at it.

Three passes. Structure, logic, style. Every diff. No exceptions.

The boring discipline of reading diffs is the thing that separates “I use AI to code” from “I ship reliable software with AI.” They’re not the same thing.