The demo is always a fresh repo. The agent scaffolds an app in minutes, the audience applauds, and somebody in the room decides to point the same agent at the company’s ten-year-old Rails monolith on Monday. By Friday the verdict is in: “AI can’t handle our codebase.”
I half agree. Out of the box, on a legacy app, agents flail. But the diagnosis is usually wrong about why. The model is the same one that aced the demo; what changed is the repo, which has been teaching it things that aren’t true.
The codebase is the prompt
I’ve written before that AI wakes up every session with no memory, trusting only what’s in front of it. On a legacy app, what’s in front of it is whatever slice of a decade of sediment its searches pulled into context, and every file in that slice reads as instruction. Each pattern is a vote for how things are done here. A legacy codebase votes for everything at once, and which votes the agent sees on a given task is close to a lottery.
I know what that sediment looks like from the inside. At Domestika we rescued thirteen years of Rails: react-on-rails next to classic views, two API styles on different conventions, internal libraries reinventing the framework, and dead code you couldn’t distinguish from live code without asking a human. That rescue happened before my agent era, but imagine pointing an agent at it. Which of those competing patterns does it imitate? Whichever one its search happened to surface.
That’s the mechanism behind almost every legacy-agent failure I see, and it splits into four specific lies the repo tells.
The four lies
Dead code votes too. The agent can’t know that this service object was abandoned in 2019 and the team is embarrassed by it. It looks like code, so it gets imitated. Humans carry a mental map of which directories are radioactive, and the agent gets no map.
Mixed patterns read as permission. Having three ways of doing authorization in one repo doesn’t tell the agent “we’re mid-migration, use the new one.” It tells the agent “any of these is fine.” Where a human asks a colleague, the agent imitates whatever landed in its context. Your oldest pattern has had the longest to spread, so it is usually the most common in the repo and lands in context most often. The pattern you’re trying to leave is the one most likely to be copied.
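If you want to see the votes rather than guess at them, a rough count is enough. The sketch below is a minimal illustration, not a tool we ship: it walks a directory and counts the lines matching each named pattern. The two regexes and their labels are hypothetical placeholders, so swap in whatever distinguishes the old style from the new one in your own code.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical patterns: adjust the labels and regexes to your own codebase.
PATTERNS = {
    "legacy before_action check": re.compile(r"before_action\s+:(require|check)_\w+"),
    "policy object": re.compile(r"\bauthorize[\s(]"),
}


def count_votes(root, patterns=PATTERNS, suffixes=(".rb",)):
    """Count how many lines under `root` match each named pattern."""
    votes = Counter({name: 0 for name in patterns})
    for path in Path(root).rglob("*"):
        # Only look at regular files with one of the wanted extensions.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            for name, regex in patterns.items():
                if regex.search(line):
                    votes[name] += 1
    return votes


if __name__ == "__main__":
    # Print the tally, most common pattern first.
    for name, count in count_votes("app").most_common():
        print(f"{count:6d}  {name}")
```

A line count is a crude proxy for what an agent’s search will surface, but it is usually enough to show which pattern is winning the election.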
Missing tests mean no feedback loop. Agents are wrong a lot, which is survivable when something tells them so before you do. On a well-tested codebase the suite is that feedback, and the agent iterates against it. On a legacy app with thin or dishonest coverage, the agent’s first wrong assumption sails through to a human reviewer, who now distrusts everything. The agent isn’t worse here. It’s unsupervised.
The invariants live in people. “Never touch rows in that table directly, billing recalculates them overnight” is written nowhere, so a new hire learns it at lunch and the agent learns it from an incident.
Prepping the repo
The fix follows from the diagnosis: change what the repo teaches. This is concrete work with a short list, and we’ve done it on real products rather than demos. The harness we built at nerds.family runs on an existing production codebase, and when we installed the same setup at Gyfted, on a stack we don’t usually touch, the guardrails held well enough that their CPO ships landing pages to production without a developer in the loop. The agents were the same ones everyone has. The difference was that the repo stopped lying.
Declare the current patterns. A conventions file the agent auto-loads (CLAUDE.md, AGENTS.md, or your tool’s equivalent) that says “authorization goes through policies, the old before_action checks are legacy, never add new ones” outweighs a thousand votes from old code, because it is already in the agent’s context before the agent writes a line. This is the single most important thing for making agents reliable anywhere, and on legacy apps it does double duty: it’s the map of what’s radioactive.
Delete the dead code, or quarantine it. Every line of abandoned code is a vote you’re letting it cast. The hard part is knowing what’s dead, and that identification cost is exactly why the quarantine rule exists: an explicit “do not imitate anything in legacy/” line in the conventions file buys most of the benefit while the team works up the courage to delete.
Give the agent a feedback loop before you give it tasks. On a legacy app that means characterization tests around the paths that matter, money first. Honest coverage rather than full coverage. The goal is that when the agent breaks something important, a machine says so in seconds. Verification machinery is what converts agent speed from a threat into an asset.
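A characterization test does not assert what the code should do. It records what the code does today and complains when that changes. Here is a minimal, language-agnostic sketch in Python; the same shape works in whatever test framework your app already uses. Everything in it is illustrative: `legacy_invoice_total` is a stand-in for a real money path, and the cases and the snapshot file name are made up.

```python
import json
from pathlib import Path


def legacy_invoice_total(items, coupon=None):
    """Stand-in for legacy code whose exact behaviour nobody fully remembers."""
    total = sum(qty * price_cents for qty, price_cents in items)
    if coupon == "HALF":
        total = total // 2  # rounds down; the snapshot pins this quirk too
    return total


# Hypothetical inputs covering the paths that matter: case name -> call arguments.
CASES = {
    "empty cart": ([],),
    "two items": ([(2, 500), (1, 250)],),
    "half-price coupon": ([(1, 999)], "HALF"),
}


def characterize(func, cases, snapshot_path):
    """Record func's outputs on the first run; on later runs, return any drift.

    Outputs must be JSON-serialisable. Each drift entry is a tuple of
    (case name, recorded value, observed value).
    """
    path = Path(snapshot_path)
    observed = {name: func(*args) for name, args in cases.items()}
    if not path.exists():
        # First run: pin current behaviour, quirks included.
        path.write_text(json.dumps(observed, indent=2, sort_keys=True))
        return []
    recorded = json.loads(path.read_text())
    return [
        (name, recorded.get(name), value)
        for name, value in observed.items()
        if recorded.get(name) != value
    ]


if __name__ == "__main__":
    drift = characterize(legacy_invoice_total, CASES, "invoice_snapshot.json")
    for name, before, after in drift:
        print(f"{name}: was {before}, now {after}")
    raise SystemExit(1 if drift else 0)
```

The snapshot file gets committed alongside the code. When an agent changes the rounding or drops the coupon branch, the run fails in seconds and the agent sees the failure before a reviewer does.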
Write the invariants down. The tribal knowledge that lives in lunch conversations goes into the repo as plain markdown the agent loads. The first draft takes an afternoon of asking “what would a new hire get wrong here?” and it pays back on every session forever. The unwritten rule the agent breaks next month was always going to be broken by somebody. The agent just got there first.
The honest timeline
Two warnings from doing this for real. The prep is genuinely upfront work: weeks, not hours, on a big app, and it competes with feature pressure. The way to win that argument is to point out that every item on the list makes the codebase better for the humans too, agent or no agent. This is the debt that was already slowing everyone down, with a new reason to finally pay it.
And the failure you prevent is expensive to discover late. A team that points agents at an unprepped legacy app gets the Friday verdict, writes off the whole approach, and joins the 90% reporting that AI changed nothing. The repo was the problem, and the repo was fixable.