Why Your AI Keeps Going Off Track
February 18, 2026
You wrote a prompt. The agent wrote code. The code doesn’t do what you wanted.
This happens to everyone. It still happens to me, usually when I’m rushing. The good news: it’s almost always diagnosable, and the fix is almost always in your spec, not in the AI.
Here are the 5 reasons AI output goes off track, how to figure out which one hit you, and what to do about each.
Reason 1: Vague spec, creative interpretation
This is the most common one. By far.
You wrote: “Add a search feature to the app.”
The agent built: a full-text search system with debounced input, highlighted results, search history, and a suggestion dropdown.
You wanted: a filter input on the existing list that hides non-matching rows.
The agent wasn’t wrong. “Search feature” could mean any of those things. Without constraints, the agent picks the most complete interpretation. It’s trying to be thorough. That thoroughness becomes over-engineering when you had something simpler in mind.
How to diagnose: Read the diff and ask yourself — did the agent build something reasonable for what I asked? If yes, the problem isn’t the agent. It’s the ask.
The fix: Be specific about scope. Not “add search” but “add a text input above the prospects table in ProspectsTable.tsx that filters visible rows by name and email. Client-side filtering only. No search API. No search history. No debounce needed — the list is under 200 items.”
That’s a few extra sentences. They prevent a 200-line diff you didn’t want.
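Spelled out like that, the spec maps almost one-to-one onto code. Here’s a rough sketch of the filter logic it describes — `Prospect` and `filterProspects` are names I’m inventing for illustration, not the real types in the codebase:

```typescript
// Hypothetical shape of a row in the prospects table.
interface Prospect {
  name: string;
  email: string;
}

// Case-insensitive match against name or email. No API, no history,
// no debounce — fine for a list under a couple hundred items.
function filterProspects(rows: Prospect[], query: string): Prospect[] {
  const q = query.trim().toLowerCase();
  if (q === "") return rows; // empty input shows every row
  return rows.filter(
    (r) =>
      r.name.toLowerCase().includes(q) || r.email.toLowerCase().includes(q)
  );
}
```

Everything the spec excluded — the search API, the history, the debounce — is visibly absent, which is exactly what keeps the diff small.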
Reason 2: No file references, wrong patterns
Your codebase has conventions. The agent can discover them by reading your code, but it doesn’t always read the right code.
I hit this on Triumfit. I asked the agent to add a new workout screen. The app had a consistent pattern: each screen was a functional component in src/screens/, used the shared useTheme() hook, and wrapped content in a ScreenContainer component. The agent created the screen in src/views/ as a class component with inline styles. Technically functional. Completely inconsistent with the codebase.
How to diagnose: The output works but doesn’t match your project’s patterns. Different file locations, different component styles, different naming conventions.
The fix: Point the agent at reference files. “Create a new screen following the pattern in src/screens/WorkoutDetail.tsx. Same file structure, same imports, same use of ScreenContainer and useTheme().”
One line of reference saves an entire review cycle. I talk about this in how to write specs that AI can execute — reference files are the most efficient spec tool you have.
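To make the contrast concrete, here’s a rough sketch of the convention that one reference line encodes. It’s plain TypeScript with stand-in stubs — the real `ScreenContainer` and `useTheme()` live in the actual codebase, not here — so treat it as an illustration of the shape, not the code itself:

```typescript
// Stand-ins for the shared modules a reference file would point the agent at.
// In the real codebase these would come from src/components and src/hooks.
interface Theme {
  background: string;
  text: string;
}

function useTheme(): Theme {
  return { background: "#ffffff", text: "#111111" };
}

// The convention the reference file encodes: a functional screen that
// pulls its colors from useTheme() instead of hardcoding inline styles,
// and renders inside the shared ScreenContainer wrapper.
function WorkoutHistoryScreen(): { wrapper: string; theme: Theme } {
  const theme = useTheme();
  return { wrapper: "ScreenContainer", theme };
}
```

An agent that puts the screen in src/views/ as a class component with inline styles never read any of this. The reference line in the spec is what makes it open the right file first.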
Reason 3: Task too large, loses the thread
Claude Code handles large tasks surprisingly well. But “surprisingly well” isn’t “perfectly,” and the failure mode is specific: the agent gets the first 70% right and then starts making compromises on the last 30%.
I watched this happen on a Scouter task where I asked for a new dashboard page with a data table, filters, chart, and export functionality. The table and filters were solid. The chart used a library I didn’t want. The export had a bug in the date formatting. The agent had been working for 15 minutes and was deep in the implementation — the early decisions were good but the later ones showed fatigue (if you can call it that).
How to diagnose: The first part of the diff is clean. The later parts are sloppier. Or the agent made early decisions that constrained later work in ways it didn’t foresee.
The fix: Break the task up. That dashboard page should have been 4 tasks: table, filters, chart, export. Each one gets its own spec, its own session, its own review. The overhead of 4 specs instead of 1 is maybe 5 extra minutes. The time saved in review and correction is much more. Scoping tasks for background agents covers how to find the right task size.
Reason 4: Conflicting instructions, picks one
This one is sneaky. Your spec says two things that contradict each other, and the agent picks one without telling you.
Real example from Logline. My spec said: “Match the existing card style” and also “use the new design system tokens from src/theme/tokens.ts.” The existing cards didn’t use the new design tokens — they were from before the design system update. The agent had to choose: match the old cards or use the new tokens. It chose the tokens. The new card looked different from every other card in the app.
How to diagnose: The output follows part of your spec but ignores another part. When you re-read the spec, you realize the two parts don’t agree.
The fix: Before you blame the agent, re-read your own spec looking for contradictions. If you find yourself thinking “well, obviously I meant…” — the spec is the problem. The agent can’t read your intent, only your words. Remove the ambiguity. Pick one instruction. If you need both, explain how to reconcile them: “Use the new design tokens from tokens.ts, but match the existing card’s layout and spacing. The visual result should look consistent with existing cards even though the implementation uses the new token system.”
Reason 5: No constraints, over-engineers
The agent defaults to thoroughness. Without constraints, it adds error handling you didn’t need, loading states for instant operations, accessibility features for internal tools, and abstractions for things that will never be reused.
None of this is bad in isolation. But it makes the diff bigger, the review longer, and the code more complex than the feature warrants.
How to diagnose: The output works and does what you asked — plus a bunch of things you didn’t ask for. The diff is 3x larger than expected.
The fix: Add explicit constraints to your spec.
### Constraints
- No loading states (data is already in memory)
- No new dependencies
- No abstraction layers — this is a one-off component
- No error boundaries (the parent already handles errors)
- Keep it under 80 lines
Constraints are the most underused part of spec writing. They tell the agent what not to build, which is just as important as telling it what to build. I’ve found that 3–4 constraint lines cut unnecessary diff size by 40–50%.
The meta-lesson
Bad output is almost always bad input.
I know that’s uncomfortable. It’s easier to blame the tool than the instruction. But after hundreds of delegated tasks, the correlation is clear: tight specs produce good output. Vague specs produce creative interpretation. Every time.
The good news is that this is entirely in your control. You can’t make the AI smarter. You can make your specs better. And better specs have a 1:10 payoff — 2 minutes of spec clarity saves 20 minutes of correction.
The diagnosis workflow
When output misses the mark, I run through this checklist:
1. Read the diff. What did it actually do?
2. Re-read the spec. What did I actually ask for?
3. Find the gap. Where does the output diverge from the intent?
4. Categorize. Which of the 5 reasons caused it?
5. Fix the spec, not the code. Correct the spec and re-run. Don’t manually fix the output — you’ll hit the same problem next time if the spec isn’t fixed.
Step 5 is counterintuitive. When the output is 80% right, the temptation is to just fix the 20% by hand. Resist it. Fix the spec. Re-run the task. Now you have a spec that works, and you can reuse that pattern for similar tasks.
This is the review skill that matters most — not reading code, but reading your own specs critically. Reviewing AI-generated code covers the full review process. And if you want to write specs that get it right on the first pass, start with the common mistakes post to see the patterns that trip people up most.
The AI is consistent. Give it the same input, it produces similar output. If the output is wrong, change the input.