In late 2025 I set myself a small experiment. I would build a feature end-to-end using AI agents wherever they could replace me. Schema, data model, services, UI, tests. I would only intervene when the agents were stuck or wrong.
The result was instructive in both directions. The agents wrote working code faster than I could have. They also produced a system that the team correctly told me was unmaintainable.
What separated the parts they got right from the parts they got wrong was not difficulty. It was a kind of question they could not answer.
What the agents did well
Better than expected. And not just in the ways the marketing copy promises.
Boilerplate, plumbing, routine refactors, test scaffolding, idiomatic translations between languages and frameworks — these were close to solved. The agents were faster than me and made fewer typos. For anything where the right answer was already implicit in the codebase's existing patterns, they extended those patterns competently.
They were also good at small acts of judgment. Naming a function. Choosing whether to extract a helper. Picking between two equivalent APIs. These felt like the kind of micro-decisions I had assumed only a human would make, and the agents made them at human-or-better quality, in seconds.
It is not a controversial claim anymore that current models can write production-grade code in a constrained context. I'll skip the proof.
What broke
The system worked. The team killed it anyway.
The reason was specific. Every individual file the agents produced was reasonable. Every method was tested. The names were fine. But the overall architecture — what was a service, what was a module, where the boundaries between domains lived, which abstractions deserved their own primitive and which were premature — was wrong in a way that would compound.
The agents had inferred the architecture from the easiest available signal: the structure of the code already in front of them. That signal pointed in a direction that worked for the current feature and would have made the next three features painful. The team saw it in the first review. I had not seen it in two weeks of building.
The lesson was not that the agents were bad. The lesson was that "which seams should this system have?" is a different question from "write the code for these seams," and the second is roughly solved while the first is roughly untouched.
Why naming the abstraction is hard
A system's architecture is a bet on what is going to change. A good abstraction makes the things that will change cheap to change, and accepts that things which won't change can stay tangled. A bad abstraction does the opposite — clean separations around things that never vary, and intertwined logic in the places that need to evolve.
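To make the bet concrete, here is a minimal sketch in TypeScript with invented names (TaxRule, computeTax, and the whole billing domain are hypothetical, not from the project): the seam goes where the team expects variation, and nowhere else.

```typescript
// Hypothetical billing code, not from the project. The bet: tax
// rules will change (new jurisdictions, new rates), so they get a
// seam; invoice numbering will not, so it stays inline.

interface TaxRule {
  applies(jurisdiction: string): boolean;
  taxCents(amountCents: number): number;
}

const germanVat: TaxRule = {
  applies: (jurisdiction) => jurisdiction === "DE",
  taxCents: (amountCents) => Math.round(amountCents * 0.19),
};

// The seam: adding a jurisdiction means adding a TaxRule,
// not editing this function.
function computeTax(
  amountCents: number,
  jurisdiction: string,
  rules: TaxRule[],
): number {
  const rule = rules.find((r) => r.applies(jurisdiction));
  return rule ? rule.taxCents(amountCents) : 0;
}

// No seam: invoice numbers have one format and no known pressure
// to vary. An InvoiceNumberStrategy here would be the bad bet the
// paragraph above describes.
function invoiceNumber(seq: number, year: number): string {
  return `INV-${year}-${String(seq).padStart(6, "0")}`;
}
```

The code is trivial either way; the bet lives entirely in where the interface is and where it is not.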
The bet has to be made before the code is written, on the basis of information that doesn't fully exist yet. It rests on:
- What the team thinks will change in the next year
- Which constraints from the business or regulator are firm and which will move
- Where the people who will own this code believe their attention should go
- What other systems will eventually integrate with this one
None of that is in the codebase. None of it is in the prompt. Most of it is not even written down anywhere — it lives in the heads of the people who have been working on the broader problem, and it changes month to month as they learn.
An agent given the codebase produces an answer that is locally correct and structurally a guess. The guess is sometimes right. When it is wrong, the cost shows up two quarters later when the next change becomes hard.
Where the team came in
The team's review was not "this code is bad." It was "this is the wrong shape for what's coming." Their corrections were almost never about syntax or implementation. They were about which seams to cut: this should be a service, not a module. This is two domains pretending to be one. This abstraction should not exist yet — wait until the third caller appears. This abstraction should already exist — you've duplicated it three times and called it different names.
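Two of those corrections are easy to show in miniature. A sketch with invented names (none of this is the agents' actual output): the duplication that signals an overdue abstraction, and the one-caller interface that signals a premature one.

```typescript
// Invented example, not the agents' actual output.

// Overdue: the same rule written three times under three names.
// The third caller is the signal that the abstraction has earned
// its existence.
const cleanEmail = (raw: string) => raw.trim().toLowerCase();
const normalizeContact = (raw: string) => raw.trim().toLowerCase();
const sanitizeRecipient = (raw: string) => raw.trim().toLowerCase();

// After the correction: one named primitive, three call sites.
function normalizeEmail(raw: string): string {
  return raw.trim().toLowerCase();
}

// Premature: an interface with one implementation and one caller,
// built because it looked like architecture. The correction is to
// delete it and wait for the third caller.
interface EmailNormalizationStrategy {
  normalize(raw: string): string;
}
```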
These corrections were cheap to apply once named, and almost impossible to specify in a prompt that would have produced them automatically. The team was holding context the agents did not have and could not have inferred — context about where the product was going, what other teams were building, which decisions were tentative and which were locked.
I had assumed the bottleneck was code generation. The bottleneck was abstraction selection. AI moved the bottleneck. It did not eliminate it.
What this changes about how I staff
The first-order observation is that engineers who are good at naming abstractions are now worth more, not less, than they were a year ago. The leverage of that skill went up because the work below it became cheap. I will pay more for someone who can look at a half-built system and say "this is the wrong shape" than for someone who can ship features quickly. The shipping is largely automated.
The second-order observation is messier. The classic path to becoming someone who can name abstractions is to write a lot of code and feel the pain of bad ones. If juniors stop writing the volume of code that historically built that judgment, where does the next generation of abstraction-namers come from? I do not have a clean answer. The most honest version is: I think we are about to find out, and the firms that figure it out first will have a structural advantage for a decade.
What I still don't know
- Whether agents can be made to ask the right structural questions unprompted. I see early signs they can be coached into it. I have not seen a case where the unprompted question matched what an experienced human would ask.
- How to evaluate an agent on architectural taste. The metrics that score code generation do not score abstraction. I do not know what the eval would look like.
- How small the maintainable team gets. Three engineers and a fleet of agents can plausibly do what twelve engineers used to do. Two cannot. The lower bound is somewhere in that gap and I think it is moving.
The unifying observation is that AI shifted the leverage in software engineering from typing to deciding, and the deciding part of the job was always undervalued because it was hard to see. It is no longer hard to see. It is the only part left.