In this series, I’ve explored the principles and practices being applied by the teams that are seeing modest but real improvements in software development outcomes.
More than four years after the first “AI” coding assistant, GitHub Copilot, appeared, the evidence is clear. Claims of teams achieving 2x, 5x, even 10x productivity gains simply don’t stand up to scrutiny. There’s no shortage of anecdotal evidence, but not a shred of hard data. It seems that when we measure it, the gains mysteriously disappear.
The real range, when it’s measured in terms of team outcomes like delivery lead time and release stability, is roughly 0.8x – 1.2x, with negative effects substantially more common than positive ones.
And we know why. Faster cars != faster traffic. Gains in code generation, according to the latest DORA State of AI-Assisted Software Development report, are lost to “downstream chaos” for the majority of teams.
Coding never was the bottleneck in software development, and optimising a non-bottleneck in a system with real bottlenecks just makes those bottlenecks worse.
Far from boosting team productivity, the technology is actually slowing the majority of “AI” users down, while also degrading the reliability and maintainability of their products and systems. They’re producing worse software, later.
Most of those teams won’t be aware that it’s happening, of course. They attached a code-generating firehose to their development plumbing, and while the business is asking why they’re not getting the power shower they were promised, most teams are measuring the water pressure coming out of the hose (lines of code, commits, Pull Requests) and not out of the shower (business outcomes), because those numbers look far more impressive.
The teams who are seeing improvements in lead times of 5%, 10%, 15%, without sacrificing reliability and without increasing the cost of change, are doing it the way they were always doing it:
- Working in small batches, solving one problem at a time
- Iterating rapidly, with continuous testing, code review, refactoring and integration
- Architecting highly modular designs that localise the “blast radius” of changes
- Organising around end-to-end outcomes instead of around role or technology specialisms
- Working with high autonomy, making timely decisions on the ground instead of sending them up the chain of command
When I observe teams that fall into the “high-performing” and “elite” categories of the DORA capability classifications using tools like Claude Code and Cursor, I see feedback loops being tightened. Batch sizes get even smaller, quality gates get even narrower, iterations get even faster. They keep “AI” on a very tight leash, and that by itself could well account for the improvements in outcomes.
Meanwhile, the majority of teams are doing the opposite. They’re trying to specify large amounts of work in detail up-front. They’re leaving “AI agents” to chew through long tasks that have wide impact, generating or modifying hundreds or even thousands of lines of code while developers go to the proverbial pub.
And, of course, they test and inspect too late, applying too little rigour – “Looks good to me.” They put far too much trust in the technology, relying on “rules” and “guardrails” set out in Markdown files that we know LLMs will misinterpret and ignore randomly, barely keeping one hand on the wheel.
As far as I’ve seen, no team actually winning with the technology works like that. They’re keeping both hands firmly on the wheel. They’re doing the driving. As AI luminary Andrej Karpathy put it, “agentic” solutions built on top of LLMs just don’t work reliably enough today to leave them to get on with it.
It may be many years before they do. Statistical mechanics predicts it could well be never, with the order-of-magnitude reduction in error rate needed to make them reliable enough (wrong 2% of the time instead of 20%) calculated to require 10^20 times the compute to train. To do that on similar timescales to the hyperscale models of today would require Dyson Spheres (plural) to power it.
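For a sense of where a number like that comes from, here’s a back-of-the-envelope sketch. It assumes the error rate falls as a power law in training compute with an exponent of around 0.05 – the ballpark reported in published scaling-law studies – and that exponent is my assumption, not a figure taken from any report.

```python
# Back-of-the-envelope only. Assumes error_rate ∝ compute^(-alpha), a power
# law with alpha ≈ 0.05 (an assumed value, roughly what scaling-law papers
# report). Cutting errors 10x then needs 10^(1/alpha) times the compute.
alpha = 0.05

error_reduction = 0.20 / 0.02                      # wrong 20% -> wrong 2%
compute_multiplier = error_reduction ** (1 / alpha)

print(f"{compute_multiplier:.0e}")                 # ~1e+20 x training compute
```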
Any autonomous software developer – human or machine – requires Actual Intelligence: the ability to reason, to learn, to plan and to understand. There’s no reason to believe that any technology built using deep learning alone will ever be capable of those things, regardless of how plausibly they can mimic them, and no matter how big we scale them. LLMs are almost certainly a dead end for AGI.
For this reason I’ve resisted speculating about how good the technology might become in the future, even though the entire value proposition we see coming out of the frontier labs continues to be about future capabilities. The gold is always over the next hill, it seems.
Instead, I’ve focused my experiments and my learning on present-day reality. And the present-day reality that we’ll likely have to live with for a long time is that LLMs are unreliable narrators. End of. Any approach that doesn’t embrace this fact is doomed to fail.
That’s not to say, though, that there aren’t things we can do to reduce the “hallucinations” and confabulations, and therefore the downstream chaos.
LLMs perform well – are less unreliable – when we present them with problems that are well-represented in their training data. The errors they make are usually a product of going outside of their data distribution, presenting them with inputs that are too complex, too novel or too niche.
Ask them for one thing, in a common problem domain, and chances are much higher that they’ll get it right. Ask them for 10 things, or for something in the long-tail of sparse training examples, and we’re in “hallucination” territory.
Clarifying with examples (e.g., test cases) helps to minimise the semantic ambiguity of inputs, reducing the risk of misinterpretation. This is especially helpful when the model’s working with code, because the code it was trained on is so often paired with exactly those kinds of examples. They give the LLM more to match on.
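Here’s a minimal sketch of what that can look like in practice. The function, module and rules are hypothetical, purely for illustration; the point is that an executable test pins down intent far more precisely than prose alone.

```python
# Handed to the assistant alongside the prompt "implement normalise_phone".
# The function doesn't exist yet; its name, module and rules are illustrative.
# The tests say exactly what "normalise a UK mobile number" is supposed to mean.
import pytest

from phone_utils import normalise_phone  # the code we're asking the model to write


def test_normalises_local_format_to_e164():
    assert normalise_phone("07700 900123") == "+447700900123"


def test_accepts_numbers_already_in_international_format():
    assert normalise_phone("+44 7700 900123") == "+447700900123"


def test_rejects_numbers_that_are_too_short():
    with pytest.raises(ValueError):
        normalise_phone("0770")
```

Run the tests after every change the model makes; a red bar is an unambiguous signal in a way that a prose review never is.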
Contexts need to be small and specific to the current task. How small? Research suggests that the effective usable context sizes of even the frontier LLMs are orders of magnitude smaller than advertised. Going over 1,000 tokens is likely to produce errors, but even contexts as small as 100 tokens can produce problems.
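If you want a rough sense of what a given prompt actually weighs, a tokenizer will give you a ballpark. This sketch uses OpenAI’s open-source tiktoken library purely as a proxy (different model families tokenise differently), and the file path is illustrative:

```python
# Rough token count for a prompt. tiktoken is OpenAI's open-source tokenizer,
# used here only as a proxy; other model families tokenise differently, so
# treat the result as a ballpark, not an exact figure.
import tiktoken


def approx_tokens(text: str) -> int:
    return len(tiktoken.get_encoding("cl100k_base").encode(text))


task = "Add a 'last login' timestamp to the user profile response."
standing_rules = open("docs/assistant-rules.md").read()  # illustrative path

print(approx_tokens(task))                          # the task itself
print(approx_tokens(task + "\n" + standing_rules))  # what you actually send
```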
Attention dilution, drift, “probability collapse” (play one at chess and you’ll see what I mean), and the famous “lost in the middle” effect make the odds of a model following all of the rules in your CLAUDE.md file, or all the requirements for a whole feature, vanishingly remote. They just can’t accurately pay attention to that many things.
But even if they could, trying to match on dozens of criteria simultaneously will inevitably send them out-of-distribution.
So the smart money focuses on one problem at a time and one rule at a time, working in rapid iterations, testing and inspecting after every step to ensure everything’s tickety-boo before committing the change (singular) and moving on to the next problem.
And when everything’s not tickety-boo – e.g., tests start failing – they do a hard reset and try again, perhaps breaking the task down into smaller, more in-distribution steps. Or, after the model has failed two or three times, they write the code themselves to break out of the “doom loop”.
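To make the shape of that loop concrete, here’s a sketch. The ask_assistant call is a placeholder for whichever tool you drive, the git and pytest commands are ordinary CLI calls, and the three-attempt limit is a suggestion of mine rather than a rule from any report.

```python
# Sketch of the tight loop described above: one small change, test straight
# away, commit if green, hard-reset if not, and hand over to a human after a
# few failed attempts. ask_assistant is a placeholder, not a real API.
import subprocess


def ask_assistant(task: str) -> None:
    ...  # placeholder: drive your coding assistant to make ONE small change


def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0


def tight_loop(task: str, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        ask_assistant(task)
        if tests_pass():
            subprocess.run(["git", "add", "-A"])
            subprocess.run(["git", "commit", "-m", f"Small step: {task}"])
            return True
        # Not tickety-boo: throw the attempt away and start from a clean slate.
        subprocess.run(["git", "reset", "--hard", "HEAD"])
        subprocess.run(["git", "clean", "-fd"])
    return False  # doom loop detected – write this one yourself
```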
There will be times – many times – when you’ll be writing or tweaking or fixing the code yourself. Over-relying on the tool is likely to cause your skills to atrophy, so it’s important to keep your hand in.
It will also be necessary to stay on top of the code. The risk, when code’s being created faster than we can understand it, is that a kind of “comprehension debt” will rapidly build up. When we have to edit the code ourselves, it’s going to take us significantly longer to understand it.
And, of course, it compounds the “looks good to me” problem with our own version of the Gell-Mann amnesia effect. Something I’ve heard often over the last 3 years is people saying “Well, it’s not good with <programming language they know well>, but it’s great at <programming language they barely know>”. The less we understand the output, the less we see the brown M&Ms in the bowl.
“Agentic” coding assistants are claimed to be able to break complex problems down, and plan and execute large pieces of work in smaller steps. Even if they can – and remember that LLMs don’t reason and don’t plan, they just produce plausible-looking reasoning and plausible-looking plans – that doesn’t mean we can hit “Play” and walk away to leave them to it. We still need to check the results at every step and be ready to grab the wheel when the model inevitably takes a wrong turn.
Many developers report how LLM accuracy falls off a cliff when tasked with making changes to code that lacks separation of concerns, and we know why this is, too. Changing large modules with many dependencies brings a lot more code into play, which means the model has to work with a much larger context. And we’re out-of-distribution again.
The really interesting thing is that the teams DORA found were succeeding with “AI” were already working this way. Practices like Test-Driven Development, refactoring, modular design and Continuous Integration are highly compatible with working with “AI” coding assistants. Not just compatible, in fact – essential.
But we shouldn’t be surprised, really. Software development – with or without “AI” – is inherently uncertain. Is this really what the user needs? Will this architecture scale like we want? How do I use that new library? How do I make Java do this, that or the other?
It’s one unknown after another. Successful teams don’t let that uncertainty pile up, heaping speculation and assumption on top of speculation and assumption. They turn the cards over as they’re being dealt. Small steps, rapid feedback. Adapting to reality as it emerges.
Far from “changing the game”, probabilistic “AI” coding assistants have just added a new layer of uncertainty. Same game, different dice.
Those of us who’ve been promoting and teaching these skills for decades may have the last laugh, as more and more teams discover it really is the only effective way to drink from the firehose.
Skills like Test-Driven Development, refactoring, modular design and Continuous Integration don’t come with your Claude Code plan. You can’t buy them or install them like an “AI” coding assistant. They take time to learn – lots of time. Expert guidance from an experienced practitioner can expedite things and help you avoid the many pitfalls.
If you’re looking for training and coaching in the practices that are distinguishing the high-performing teams from the rest – with or without “AI” – visit my website.