To Build A High-Performing Team, You Need To Get Inside The Bubble

Some thoughts about the persistent 1:10:100 ratio of developers who are genuinely good, competent and “meh”…

Imagine building a team of 4 devs, and you want to tip the balance towards the 1%.

The odds of a team of 4 having 2 genuinely good developers are about 1:1,700.

The odds of that team having 3 genuinely good developers are 1:252,500.
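If you want to check my arithmetic, it’s just the binomial distribution. Here’s a quick sketch in Python, taking “genuinely good” to mean a 1-in-100 developer picked at random:

import math

def odds_against(team_size, good_count, p_good=0.01):
    # Binomial probability of exactly good_count genuinely good
    # developers in a team of team_size picked at random.
    p = (math.comb(team_size, good_count)
         * p_good ** good_count
         * (1 - p_good) ** (team_size - good_count))
    return round(1 / p)

print(odds_against(4, 2))  # 1701 - about 1:1,700
print(odds_against(4, 3))  # 252525 - about 1:252,500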

The odds seem stacked against such teams existing. But that’s only if developers are selected from the pool at random. And that’s not how it usually works.

In the bubble of the 1% – which equates to about 4,500 developers in the UK, and 300,000 worldwide – people either know you, or they know someone who knows you. It’s a small-world network.

And in small-world networks, probabilities can be dramatically skewed. The odds of a randomly-selected developer being genuinely good may be 1:100, but the odds of a genuinely good developer knowing other genuinely good developers are pretty high.

So the odds of a 2/4 and even a 3/4 good team increase to the point where they’re quite probable indeed. If you find someone inside that bubble, you can much more easily build a high-performing team around them. Birds of a feather, and all that.

I occasionally get asked by founders to help them with their first developer hire. For all kinds of understandable reasons, they will often start out looking for a cheaper person because money’s tight. This typically rules out the 1% and the 10% and leaves them with “meh” options – inexperienced, or the “1 year’s experience 10 times” folks.

From many years working with software start-ups, I see this is a pivotal hire. Which band you go for – the 1, the 10 or the 100 – will likely determine the future trajectory of software development in the business: Good, Competent, or Meh.

Dev culture, once it takes root, is hard to shake. So you want that first hire to be setting a good example for future hires. Mentoring, in particular, is a large part of what good developers do, in my experience. I certainly wish that, as an entry-level developer back in the Steam Age, mentoring had been part of my first professional experience. I can’t help feeling that’s how it should work at entry level.

The code base also sets the tone. If you start out with a Big Ball of Mud, it takes much more effort to climb out of it. But also… Monkey see, monkey do. Less experienced devs will tend to imitate the style, with nobody there to tell them “This isn’t how it should be done”. (They may even be telling them “This is how it’s done!”)

But most valuable of all, a good developer will very likely know – or be able to reach – other good developers, raising your odds of building a high-performing team by orders of magnitude.

Why would a founder want a high-performing team? Their calling card is short delivery lead times and reliable releases, and the ability to sustain the pace on the same product for as long as that product’s a going concern.

They can rapidly, reliably and sustainably evolve software to meet rapidly changing needs.

Which is nice.

We Don’t Out-Deliver The Competition. We Outlearn Them.

I’ve long considered software development as a process of removing uncertainty.

The customer asks us for “Instagram, but for cats” – which could have infinite possible interpretations – and our job is to whittle those possibilities down to a single interpretation. Computers kind of insist on that.

How could this process be represented more essentially, so I can see the wood for the trees of such a complex thing?

Let’s play a game.

There are two teams, A and B, and they are both tasked with guessing a random 4-digit number. Their guesses must be submitted with numbers they have to carve on to stone tablets.

Team A guesses all 4 digits at a time. They painstakingly carve “0000” on to a tablet and submit that to learn whether it’s right or wrong. In this case, “0000” is wrong. So they painstakingly carve another 4-digit number, “0001”, which is also wrong.

If the guess is wrong, the tablet is destroyed, and they have to start all over again. Let’s say that the time to carve one digit is 1 hour. So it takes team A 4 hours to make one guess.

Team B take a different approach. They guess one digit at a time, still carving them into stone tablets. They start by guessing that the first digit is “0”, which is wrong.

When their guess is wrong, the tablet is also destroyed, and they must start a new one.

Which team – A or B – would you bet on to guess the 4-digit number first?

Worst case, team A could take 40,000 hours. Worst case for team B is 40 hours. The odds of team A guessing right in the first 40 hours are 1,000:1 against. I’d bet on team B.

Now, let’s 10x team A’s “productivity” by giving them a machine that can carve one digit in just 6 minutes. Each guess now takes them 24 minutes instead of 4 hours.

Which team would you bet on now?

The odds of team A guessing right in the first 40 hours using the 10x machine are 100:1 against. I’d still bet on team B.

We’d be mistaken to confuse “numbers guessed” with “numbers guessed correctly” as our measure of productivity here.

What’s giving team B such a massive advantage is not the speed at which they produce tablets, but the speed at which they reduce uncertainty.

When team A make their first guess, the odds of it being right are 1:10,000. On their second guess, having ruled out one 4-digit number, they are 1:9,999.

When team B make their first guess, their odds of guessing all 4 digits correctly are also 1:10,000. But on their second guess, having ruled out all 4-digit numbers beginning with “0”, they are 1:9,000.

Basically, with each guess, team B reduce the uncertainty by about 10%. Team A reduce it by a tiny fraction of that with each of their guesses. To put it another way, team B outlearns team A, even as team A out-delivers them by a factor of 10.
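Here’s a minimal simulation of the game, assuming both teams work through the candidates in ascending order and the secret number is picked uniformly at random:

import random

def team_a_hours(secret):
    # Guesses whole numbers in order - 0000, 0001, ... - at 4 hours a guess.
    return (secret + 1) * 4

def team_b_hours(secret):
    # Guesses one digit at a time, trying 0-9 in order at 1 hour a guess;
    # a digit with value d takes d + 1 guesses to confirm.
    return sum(int(d) + 1 for d in f"{secret:04d}")

secrets = [random.randrange(10000) for _ in range(100_000)]
print(sum(map(team_a_hours, secrets)) / len(secrets))  # ~20,000 hours
print(sum(map(team_b_hours, secrets)) / len(secrets))  # ~22 hours

Even with the 10x carving machine, team A’s expected time only drops to around 2,000 hours.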

The takeaway for me is that – in software development, considered as a process of reducing uncertainty – it’s the batch size and the feedback loops doing the heavy lifting.

If your team wants to build the skills they’ll need to outlearn the competition, solving one problem at a time in tight feedback loops, visit my training site for details of courses and on-the-job coaching.

Manual Refactoring: Python – Introduce Parameter Object & Move Instance Method

Two refactorings I can’t live without are Introduce Parameter Object and Move Instance Method.

I often find myself using them in a little dance I call “chunking” to introduce new classes that separate concerns.

In IntelliJ and in Rider/ReSharper, these are available as automated refactorings, which saves some time. (In Rider, Introduce Parameter Object is – for no good reason – called “Transform Parameters”.)

But when I’m working in dynamic languages – which suffer from a lack of type information – I have to do these refactorings by hand.

In half of the courses I run, I’m demonstrating in either Python or JavaScript, so this comes up a lot. I thought it might be helpful to document these manual refactorings for future reference.

In this example, I’ve been asked to change this code that generates quotes for fitted carpets so that rooms can have different shapes, meaning that there will be different ways of calculating the area of carpet required.

import math


class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up):
        area = width * length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

My solution would be to introduce a class for calculating the room’s area that knows its dimensions. (If you were just thinking “switch statement”, give yourself a wobble.)

I want to introduce a parameter to the calculate method for the room. And I want to do it in teeny, safe steps.

Step #1 – Add a new room parameter

class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up, room=None):
        area = width * length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

By giving room a default value, this code still runs and passes the tests.

Step #2 – Instantiate room in the client code (the tests) as a new class

class Room:
    pass


class CarpetQuoteTest(unittest.TestCase):
    def test_quote_for_carpet_no_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(122.50, quote.calculate(3.5, 3.5, 10.0, False, Room() ))

    def test_quote_for_carpet_with_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(130.0, quote.calculate(3.5, 3.5, 10.0, True, Room() ))

Step #3 – Pass in width and length as constructor parameters of Room

class Room:
    def __init__(self, width, length):
        pass


class CarpetQuoteTest(unittest.TestCase):
    def test_quote_for_carpet_no_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(122.50, quote.calculate(3.5, 3.5, 10.0, False, Room(3.5, 3.5) ))

    def test_quote_for_carpet_with_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(130.0, quote.calculate(3.5, 3.5, 10.0, True, Room(3.5, 3.5) ))

Step #4 – Assign width and length to fields (member variables) of Room

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

Room is now ready to be used in the calculate method.

Step #5 – Replace references to calculate’s width and length parameters with references to room’s fields

class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up, room=None):
        area = room.width * room.length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

We can now do a little cleaning up.

Step #6 – Remove unused width and length parameters from calculate (Safe Delete)

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room=None):

Step #7 – Remove redundant default value for room parameter

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):

Okay, that’s some hanging chads dealt with. Let’s look at moving the area calculation to where it now belongs.

Step #8 – Extract area calculation into a separate method

This involves cutting the calculation code and pasting it into the new method as a return value, and replacing that code with a call to the new method.

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = self.area(room)

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

    def area(self, room):
        return room.width * room.length

We can now see that the area method has very obvious Feature Envy for room.

Step #9 – Move the area method to the Room class

First, I cut the area method and paste it into Room. I then switch the target of the call to area from self to room.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width
        
    def area(self, room):
        return room.width * room.length


class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = room.area(room)

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

Then I switch the references to room.length and room.width to self.length and self.width. Remember, room and self are the same object.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

    def area(self, room):
        return self.width * self.length

The room parameter is now unused. Let’s delete it.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

    def area(self):
        return self.width * self.length


class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = room.area()

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

Now there’s no need to expose the width and length fields.

Step #10 – “Hide” width and length

Let’s rename these fields to indicate that they should not be accessed from outside Room.

class Room:
    def __init__(self, width, length):
        self._length = length
        self._width = width

    def area(self):
        return self._width * self._length

Now it’s easy to substitute different implementations of Room into CarpetQuote’s calculate method. Job done!

One final note: every code snippet here was taken after I’d seen it pass the tests. That’s 13 test runs – and 13 commits – to do this refactoring.

…In case you were wondering what I mean by “small steps”.

(Of course, in IntelliJ or Rider, it would have been a lot fewer steps. That’s the pay-off for automated refactorings, and why I’ll choose my IDE with that in mind.)

What Is A “Good Software Developer”, Anyway?

This is something that’s on my mind a lot, as a trainer and coach. For sure, it’s highly subjective, and depends very much on the context in which they’re working.

So I’m going to talk about what I’m looking for when clients ask me to find them a “good software developer”.

Top of the list, is an expert command of Rust.

I’m kidding, of course.

Top of the list, invariably, is communication skills. The best developers work to understand and be understood clearly and effectively. So much of what goes wrong in software development can be traced back to poor (or no) communication. Written, verbal, aural. They have good comprehension, they can articulate complex ideas simply to different audiences, and – most importantly – they actually communicate. I’ve known some great communicators who did everything they could to avoid having to do it.

Closely related, I’m also interested in their empathy and emotional intelligence. More on that later.

In terms of technical skills and knowledge, I look for someone who is near enough to what the team needs in terms of problem domain, programming languages, tech stacks, tools and so on. The question is, how long might it take them to get up to speed?

If they’re a C# developer with other good qualities in abundance, it’s probably worth investing a few weeks letting them wrap their head around Java. If they have no physics background, and you’re working on software for particle accelerators, it could be years before they’ve wrapped their head around it.

This is less about whether they’re a “good developer” and more about whether they’re good enough for the problem they’ll be working on. So it’s about fit – are they right for the part?

Next – and this might surprise you – is refactoring. It’s such an important skill regardless of what your development approach is. The ability to reshape code to accommodate changes without breaking it is worth its weight in gold. It also requires a fairly advanced ability to reason about code and about design.

Long-form refactoring, in particular, hints at a maturity as a programmer and software designer that’s sadly hard to find. If I hired someone who hadn’t been introduced to refactoring yet, training them would be the next step.

Having interviewed hundreds of developers, I know that good refactoring skills are a “tell”. They’re a good omen about the candidate’s technical skills more generally.

More generally, someone who works in tight feedback loops, testing and reviewing code continuously, committing often, and integrating on the trunk many times a day, will tend to generate better outcomes: shorter lead times, more reliable releases, and a lower cost of changing the software. (It never ceases to surprise me when employers say that doesn’t interest them much.)

The best developers I’ve seen have been able to stand back and see the bigger picture and how their work fits into it. As I sometimes say, they see the arrows, not just the boxes. They’re not like those actors who only read their lines in the script, or think about their scenes, and take no interest in the story or the other characters.

They’re effective operating at multiple levels, taking an interest not just in their own code, but in what other devs on the team are doing (watch them pull changes from the repo – do they read them?), in what other teams working on connected stuff are doing, and in what’s happening in UX design, product management, enterprise architecture and all the other stuff going on around them.

In this sense, they’re also pretty situationally aware.

My version of a “good developer” is focused on achieving outcomes, not on producing output. They aim to create value, not just bash out features or close tickets. And in that sense, they take an active interest in understanding the problems they’re trying to solve. They see themselves as part of a wider team that’s not just delivering software, but solving business problems.

The really good ones play a part in steering the ship towards better outcomes and greater value. They’re not just waiters taking orders. They’re helping to design the menu.

The developers I recommend to clients can wear – and often have worn – multiple hats in their careers. They’re part programmer, part tester, part architect, part product manager, part UX designer, and other things. I would say they are “T-shaped” developers.

This is especially important if we want to avoid over-specialisation on the team, where experts become potential bottlenecks. It also helps enormously when developers have walked a mile in other people’s shoes when it comes to understanding our impact on the wider process. I could argue that if more of us were “T-shaped”, fewer of us would believe that “AI” code generators are increasing productivity. We’d know there’s a lot more to it than that.

And they keenly understand that the unit of value creation is the team. They bring value holistically, considering the impact of their work on team members and team outcomes, as well as looking for ways to help others on the team add more value.

As Dan North recently pointed out, some of the most valuable software developers may not even show up under their own name in your Jira or GitHub stats. (I’ve worked with Tim Mackinnon, and I can confirm that he is an exceptionally good software developer.)

One of the ways that good software developers can add value is by mentoring other developers on their team. They may well be at a point where they bring the most value by growing more good software developers. So, I look for significant experience of that.

Again, this is my take, but no matter how technically gifted you are, if you sniff at the prospect of mentoring, I’d think twice about recommending you to clients who need and expect that.

And finally, they need to catch on quick. Software development’s a learning process, and the ability to learn is kind of critical. This isn’t just about being smart. It’s also about being open and willing to learn, to be out of your comfort zone. And to be prepared to fail. Every skill mastered involves going through a stage of being bad at it. It helps a lot if the culture of the team offers them the psychological safety to try and to fail.

Some developers bring that psychological safety with them. I’ve had candidates literally try something new – TDD, say, or an unfamiliar programming language – right there in the interview. Someone who’s prepared to have a crack at something, even in those kinds of pressured situations, has usually turned out to be a real asset to the team.

In interviews, I’m often picking up on the candidate’s level of risk aversion. Will they be willing to take the occasional leap into the unknown? Will they be prepared to question the status quo? Or will they be focusing their efforts on covering their own backsides and protecting the hierarchy?

I appreciate this is usually a learned behaviour. A developer who has experienced serious consequences for failure is less likely to be willing to take those leaps. And we can all do what we can to bring a bigger sense of adventure out of people by being supportive, and by developing systems that minimise potential consequences. (I mean, exactly how does a junior developer get access to a live database? The failure there is ours.)

The fact remains, though, that a risk-averse developer, more afraid of failing, is less likely to try, to ask “dumb” questions, and therefore less likely to learn. It stunts our growth.

This is why, when I consider the team as a whole, it’s important for developers – especially senior developers – to have enough emotional intelligence to recognise when they are the ones holding others back by mocking people when they’re trying, or punishing them when they fail. It’s something we all have to work on.

To sum up, when I’m looking for a “good developer”, I’m looking for:

  • A good communicator
  • Someone technically near enough
  • Good refactoring skills (ideally, long-form)
  • Someone who solves one problem at a time, in tight feedback loops
  • Someone who sees the bigger picture
  • Someone who is situationally aware of what’s going on around them
  • Someone who is outcome-oriented
  • Someone who can wear multiple hats
  • Someone who adds value to the team
  • Someone who catches on fast, and isn’t afraid to try new things
  • Someone with a decent amount of emotional intelligence and empathy

The Future of Software Development is Software Developers

<ShamelessPlug>

If you're looking to get your development team AI-ready, my hands-on instructor-led training in the principles and practices that enable teams to rapidly, reliably and sustainably evolve working software - with and without AI - is HALF PRICE if you book by January 31st.

</ShamelessPlug>

I’ve been a computer programmer, all told, for 43 years. That’s more than half the entire history of electronic programmable computers.

In that time, I’ve seen a lot of things change. But I’ve also seen some things stay pretty much exactly the same.

I’ve lived through several cycles of technologies that, at the time, were hailed as the “end of computer programmers”.

WYSIWYG, drag-and-drop editors like Visual Basic and Delphi were going to end the need for programmers.

Wizards and macros in Microsoft Office were going to end the need for programmers.

Executable UML was going to end the need for programmers.

No-Code and Low-Code platforms were going to end the need for programmers.

And now, Large Language Models are, I read on a daily basis, going to end the need for programmers.

These cycles are nothing new. In the 1970s and 1980s, 4GLs and 5GLs were touted as the end of programmers.

And before them, 3GLs like Fortran and COBOL.

And before them, compilers like A-0 were going to end the need for programmers who instructed computers in binary by literally punching holes in cards.

But it goes back even further, if we consider the earliest (classified) beginnings of electronic programmable computers. The first of them, Colossus, was programmed by physically rewiring it.

Perhaps the engineers who worked on that machine sneered at the people working on the first stored-program computers for not being “real programmers”.

In every cycle, the predictions have turned out to be very, very wrong. The end result hasn’t been fewer programmers, but more programs and more programmers. It’s a $1.5 trillion-a-year example of Jevons Paradox.

And here we are again, in another cycle.

“But this time it’s different, Jason!”

Yes, it certainly is. Different in scale to previous cycles. I don’t recall seeing the claims about Visual Basic or Executable UML on the covers of national newspapers. I don’t recall seeing entire economies betting on 4GLs.

And there’s another important distinction: in previous cycles, the technology worked reliably. We really could produce working software faster with VB or with Microsoft Access. This is proving not to be the case with LLMs, which – for the majority of teams – actually slow them down while making the software less reliable and less maintainable. It’s a kind of LOSE-LOSE in most cases. (Unless those teams have addressed the real bottlenecks in their development process.)

But all of this is academic. Even if the technology genuinely made a positive difference for more teams, it still wouldn’t mean that we don’t need programmers anymore.

The hard part of computer programming isn’t expressing what we want the machine to do in code. The hard part is turning human thinking – with all its wooliness and ambiguity and contradictions – into computational thinking that is logically precise and unambiguous, and that can then be expressed formally in the syntax of a programming language.

That was the hard part when programmers were punching holes in cards. It was the hard part when they were typing COBOL code. It was the hard part when they were bringing Visual Basic GUIs to life (presumably to track the killer’s IP address). And it’s the hard part when they’re prompting language models to predict plausible-looking Python.

The hard part has always been – and likely will continue to be for many years to come – knowing exactly what to ask for.

Edsger Dijkstra called it nearly 50 years ago: we will never be programming in English, or French, or Spanish. Natural languages have not evolved to be precise enough and unambiguous enough. Semantic ambiguity and language entropy will always defeat this ambition.

And while pretty much anybody can learn to think that way, not everybody’s going to enjoy it, and not everybody’s going to be good at it. The demand for people who do and people who are will always outstrip supply.

Especially if businesses stop hiring and training them for a few years, like they recently have. But these boom-and-bust cycles have also been a regular feature during my career. This one just happens to coincide with a technology hype cycle that presents a convenient excuse.

There’s no credible evidence that “AI” is replacing software developers in significant numbers. A combination of over-hiring during the pandemic, rises in borrowing costs, and a data centre gold rush that’s diverting massive funds away from headcount, are doing the heavy lifting here.

And there’s no reason to believe that “AI” is going to evolve to the point where it can do what human programmers have to do – understand, reason and learn – anytime soon. AGI seems as far away as it’s ever been, and the hard part of computer programming really does require general intelligence.

On top of all that, “AI” coding assistants are really nothing like the compilers and code generators of previous cycles. The exact same prompt is very unlikely to produce the exact same computer program. And the code that gets generated is pretty much guaranteed to have issues that a real programmer will need to be able to recognise and address.

When I write code, I’m executing it in my head. My internal model of a program isn’t just syntactic, like an LLM’s is. I’m not just matching patterns and predicting tokens to produce statistically plausible code. I actually understand the code.

Even the C-suite has noticed the correlation between grand claims about how much of a company’s code is “AI”-generated and the major outages and incidents that follow.

The folly of many people now claiming that “prompts are the new source code”, and even that entire working systems can be regenerated from the original model inputs, will be revealed to be the nonsense that it is. The problem with getting into a debate with reality is that reality always wins. (And doesn’t even realise it’s in a debate.)

So, no, “AI” isn’t the end of programmers. I’m not even sure, 1-3 years from now, that this current mania won’t have just burned itself out, as the bean counters tot up the final scores. And they always win.

To folks who say this technology isn’t going anywhere, I would remind them of just how expensive these models are to build and what massive losses they’re incurring. Yes, you could carry on using your local instance of some small model distilled from a hyper-scale model trained today. But as the years roll by, you may find not being able to move on from the programming language and library versions it was trained on a tad constraining.

For this reason, I’m skeptical that hyper-scale LLMs have a viable long-term future. They are the Apollo Moon missions of “AI”. In the end, quite probably just not worth it. Maybe we’ll get to visit them in the museums their data centres might become?

The foreseeable future of software development is one where perhaps “AI” – in a much more modest form (e.g., a Java coding assistant built atop a basic language model) – is used to generate prototypes, and maybe for inline completion on production code and those sorts of minor things.

But, when it matters, there will be a software developer at the wheel. And, if Jevons is to be believed, probably even more of us.

Employers, if I were you, I might start hiring now to beat the stampede when everyone wakes up from this fever dream.

And then maybe drop me a line if you’re interested in skilling them up in the technical practices that can dramatically shrink delivery lead times while improving reliability and reducing the cost of change, with or without “AI”. That’s a WIN-WIN-WIN.

The Age of Coding “Agents”? Or The Age of “LGTM”?

I’ve watched a lot of people using “AI” coding assistants, and noted how often they wave through large batches of code changes the model proposes, sometimes actually saying it out loud: “Looks good to me”.

After nearly 3 years of experimentation using LLMs to generate and modify code, I know beyond any shadow of a doubt that you need to thoroughly check and understand every line of code they produce. Or there may be trouble ahead. (But while there’s music… etc)

But should I be surprised that so many developers are happily waving through code in such a lackadaisical way? Is this anything new, really?

I’ve watched developers check in code they hadn’t even run. Heck, code that didn’t even compile.

I’ve watched developers copy and paste armfuls of code from sites like Stack Overflow, and not even pause to read it, let alone try to understand it or even – gasp – try to improve it.

I’ve watched developers comment out or delete tests because they were failing. I’ve watched teams take testing out of their build pipeline to get broken software into production.

We’ve been living in an age of “LGTM” for a very long time.

What’s different now is the sheer amount of code being waved through into releases, and just how easy “AI” coding assistants make it for the driver to fall asleep at the wheel.

And when we put our coding assistant into “agent” mode – or, as I call it, “firehose mode” – that’s when things can very quickly run away from us. Dozens or hundreds of changes, perhaps even happening simultaneously as parallel agents make themselves busy on multiple tasks at once.

Even if there were no issues in any of those changes – and the chances of that are extremely remote – when code’s being churned out faster than we can understand it, it creates a rapidly-growing mountain of comprehension debt.

When the time comes – or should I say, when the times come – that the coding assistant gets stuck in a “doom loop” and we have to fix problems ourselves, that debt has to be repaid with interest.

Agents have no “intelligence”. They’re old-fashioned computer programs that call LLMs when they need sophisticated pattern recognition and token prediction. LLMs don’t follow instructions or rules. Use them for just a few minutes and you’ll see them crashing through their own guardrails, doing things we’ve explicitly told them not to do, and forgetting to do things we insist that they should.
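Strip away the marketing, and an “agent” is structurally something like this. (A deliberately crude sketch – call_llm and run_tool are hypothetical stand-ins, not any real vendor’s API.)

def call_llm(context: str) -> str:
    # Hypothetical stand-in: the model predicts the next action.
    return "DONE"

def run_tool(action: str) -> str:
    # Hypothetical stand-in: run a test, apply an edit, read a file...
    return ""

def agent(task: str, max_steps: int = 20) -> str:
    # An old-fashioned program in a loop. The loop, the stopping rule
    # and the tool calls are deterministic code; only the suggestion
    # of the next action comes from the LLM's pattern-matching.
    context = task
    for _ in range(max_steps):
        action = call_llm(context)
        if action == "DONE":
            break
        context += "\n" + run_tool(action)
    return context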

The intelligence in this set-up is us. We’re the ones who can follow rules and instructions. We’re the ones who understand. We’re the ones who reason and plan. And we’re the ones who learn.

In 2025, and probably for many years to come, we are the agents. We’re the only ones qualified for the job.

My advice – based on the best available evidence and a lot of experience using these tools over the past 3 years – remains the same when you’re working on code that matters.

I recommend working one failing test at a time, one refactoring at a time, one bug at a time, and so on.

I recommend thoroughly testing after every step, and carefully reviewing the small amount of code that’s changed.

I recommend committing changes when the tests go green, and being ready to revert when they go red.

I recommend a fresh context, specific to the next step. I recommend relying on deterministic sources of truth – the code as it is (not the model’s summary of it), the actual test results, linter reports, mutation testing scores etc.

I strongly advise against letting LLMs mark their own homework or rely on their version of reality.

And forget “firehose mode” for code that matters. Keep it on a very tight leash.

What’s In A Name?

The idea of “separation of concerns” originated from a need to make it possible for programmers to reason about a piece of code without the need to understand what’s going on inside its dependencies (and its dependencies’ dependencies).

In this sense, the primary benefit of modular design is to reduce cognitive load when working with any part of the system.

But that can only happen if every reference to other parts (e.g., function calls) “says what it does on the tin”, so we can form correct expectations about its behaviour within the context we’re reasoning about.

Ideally, to understand what a dependency does, we shouldn’t need to understand how it does it.

When names are unclear, or even misleading, we form the wrong expectations about what a dependency will do, and are forced to “look inside the box” to understand it.

In the same way that this increases the context size for an LLM – and therefore the risk of errors – it increases cognitive load for programmers, with the same end result. I often see folks complaining about having to have a bunch of source files open in order to understand what one piece of code is going to do.

It might be helpful here if I put forward a definition of code comprehensibility and a rough way of measuring it.

I see comprehensibility as the likelihood that the target audience (e.g., other team members) will correctly predict what a piece of code will do in specific cases.

ratings = [4, 6, 4, 5]

average_rating = sum(ratings)/len(ratings)

What will the value of average_rating be after that assignment?

Let’s refuctor that code to make it a little less obvious.

r = [4, 6, 4, 5]

ar = s(r)/l(r)

Now you need to know what the functions s and l do. That may mean looking it up, if there’s documentation available – more cognitive load.

Or it may mean actually peeking inside their implementations – even more cognitive load.

We could ask 10 developers to predict what the result will be. If 8 of them predict correctly, we might roughly gauge the comprehensibility of this code for that sample of people – remember who the audience is – is 80%.
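As a back-of-the-envelope calculation (the sample answers here are invented):

# 8 out of these 10 made-up predictions match the actual result, 4.75
predictions = [4.75, 4.75, 4.75, 5.0, 4.75, 4.75, 19.0, 4.75, 4.75, 4.75]
actual = sum([4, 6, 4, 5]) / len([4, 6, 4, 5])
comprehensibility = sum(p == actual for p in predictions) / len(predictions)
print(f"{comprehensibility:.0%}")  # 80%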

Or we could ask an LLM ten times. Though it’s important to remember that LLMs can’t reason about behaviour. They would literally just be matching patterns. And in that sense, this is a reasonable test of how closely the code correlates with examples within the training data distribution.

So, in summary, naming is very, very important in modular software design. Names help us form expectations about behaviour, and if those expectations are correct then this means we don’t need to go beyond a signpost to understand what’s down that road.

When Should We Do Code Reviews?

“Make it work, make it right, make it fast”

This mantra is the answer to the question “When should we do code reviews?” We do them whenever we see the software working again – whenever the tests are green. (You have tests, right?)

In the TDD cycle, we start by defining what we want the software to do by writing a failing test. Then we do the simplest, quickest thing to get the tests passing.

I stress in my training courses that this is not the time to be agonising over the design. Priority #1 – MAKE IT WORK!

When all the tests are passing, then we have that luxury. We can take a step back and review the code we’ve added or changed – including the test code – and ask ourselves if there’s anything about it that’s going to make changing it later harder than it needs to be.

We consider readability, complexity, duplication, coupling and cohesion. We might even run a linter over it to check for low-level issues we might have missed.

We get the tests passing, and we do our code review. We do it when we go from red->green in the TDD cycle. We do it after every refactoring to validate the end result.

“That sounds like a lot to be doing on every green light, Jason!”

That depends how far apart your green lights are. The tighter the feedback loop, the smaller the diff, the quicker the code review. And you might be surprised at how much of it can be automated, if you’re willing to invest some time.
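For example, here’s a sketch of a little script that could run on every green light – assuming pytest and flake8, but swap in whatever your stack uses:

import subprocess
import sys

# Make it work first: if the tests aren't green, it isn't review time yet.
if subprocess.run([sys.executable, "-m", "pytest", "-q"]).returncode != 0:
    sys.exit("Tests failing - make it work before you make it right.")

# Review the diff, not the codebase: lint only the files that changed.
changed = subprocess.run(
    ["git", "diff", "--name-only", "HEAD"],
    capture_output=True, text=True,
).stdout.split()
py_files = [f for f in changed if f.endswith(".py")]
if py_files:
    subprocess.run([sys.executable, "-m", "flake8", *py_files])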

More than one pair of eyes can really help, too.

The payoff is three-fold:

1. No code review bottleneck when you want to ship

2. Catching design problems straight away means they’re often much cheaper to fix (or to back out of)

3. Reviewing code in micro-batches enables much greater attention to detail – fewer problems slip through the net. How does the saying go? “Show a developer a line of code and they’ll tell you what’s wrong with it. Show them 500, and they’ll say ‘Looks good to me’.”

The side-benefit is that performing systematic code reviews over and over will tend to bake them into your subconscious, turning you into a human linter. You’ll develop, as Keith Braithwaite once put it to me, “good taste in code”.

Do You Know Where Your Load-Bearing Code Is?

Do you know where your load-bearing code is?

90% of the time, TDD is enough to ensure that code of the everyday variety is reliable enough.

But some code really, really needs to work. I call it “load-bearing code”, and it’s rare to find a software product or system that doesn’t have any code that’s critical to its users in some way.

In my 3-day Code Craft training workshop, we go beyond Test-Driven Development to look at a couple of more advanced testing techniques that can help us make sure that code that really, really needs to work in all likelihood does.

It raises the question, how do we know which parts of our code are load-bearing, and therefore might warrant going that extra mile?

An obvious indicator is critical paths. If a feature or a usage scenario is a big deal for users and/or for the business, tracing which code lies on the execution path for it can lead us to code that may require higher assurance.

Some teams work with stakeholders to assess risk for usage scenarios, perhaps captured alongside examples that they use to drive the design (e.g., in .feature files). Then, when those tests are run, they use instrumentation (e.g., test coverage) to build a “heat map” of their code that graphically illustrates which code is cool – no big deal if this fails – and which code might be white hot – the consequences will be severe if it fails.

(It’s not as hard to build a tool like this as you might think, BTW.)
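To give you the gist, here’s a sketch in a few lines of Python. (The inputs are invented – in practice, the risk ratings would come from your stakeholders, and the per-scenario coverage from your coverage tool.)

from collections import defaultdict

# Hypothetical inputs: a 1-5 risk rating per scenario, and the lines
# of code each scenario's tests executed.
scenario_risk = {"transfer_funds": 5, "view_statement": 2}
scenario_coverage = {
    "transfer_funds": [("accounts.py", 10), ("accounts.py", 11)],
    "view_statement": [("accounts.py", 11), ("statements.py", 3)],
}

heat = defaultdict(int)
for scenario, lines in scenario_coverage.items():
    for line in lines:
        # A line's temperature is the total risk of every scenario
        # whose execution path crosses it.
        heat[line] += scenario_risk[scenario]

for line, temperature in sorted(heat.items(), key=lambda kv: -kv[1]):
    print(line, temperature)  # hottest lines first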

A less obvious indicator is dependencies. Code that’s widely reused, directly or indirectly, also presents a potentially higher risk. Static analysis tools like NDepend can calculate the “rank” of a method or a class or a package in the system (as in, the Page Rank) to show where code is widely reused.
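As a toy illustration of the idea (not NDepend’s actual algorithm – just counting transitive dependents, and assuming the hypothetical dependency graph below has no cycles):

# Hypothetical module dependency graph: module -> what it depends on.
deps = {
    "app": ["billing", "ui"],
    "ui": ["billing"],
    "billing": ["money"],
    "money": [],
}

def reaches(module, target):
    # True if module depends on target, directly or transitively.
    return target in deps[module] or any(
        reaches(d, target) for d in deps[module])

# A module's "rank" here is simply how many other modules depend on it.
rank = {m: sum(reaches(other, m) for other in deps) for m in deps}
print(rank)  # {'app': 0, 'ui': 1, 'billing': 2, 'money': 3}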

Monitoring how often code’s executed in production can produce a similar, but dynamic, picture of which code’s used most often.

These are all measures of the potential impact of failure. But what about the likelihood of failure? A function may be on a critical path, and reused widely, but if it’s just adding a list of numbers together, it’s not very likely to fail.

Complex logic, on the other hand, presents many more ways of being wrong – the more complex, the greater that risk.

Code that’s load-bearing and complex should attract our attention.

And code that’s load-bearing, complex and changing often is white hot. That should be balanced by the strength of our testing. The hotter the code, the more exhaustively and the more frequently it might need testing.

Hopefully, with a testing specialist in the team, you will have a good repertoire of software verification techniques to match against the temperature of the code – guided inspection, property-based testing, DBC, decision tables, response matrices, state transition tables, model checking, maybe even proofs of correctness when it really needs to work.
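To make one of those concrete: here’s what a property-based test for the earlier CarpetQuote example might look like, using the hypothesis library (a sketch, assuming Room and CarpetQuote are importable). Instead of a handful of hand-picked cases, it checks an invariant – rounding up should never make the quote cheaper – across hundreds of generated inputs:

from hypothesis import given, strategies as st

dims = st.floats(min_value=0.1, max_value=100.0, allow_nan=False)
prices = st.floats(min_value=0.0, max_value=1000.0, allow_nan=False)

@given(dims, dims, prices)
def test_rounding_never_undercharges(width, length, price_per_sq_m):
    room = Room(width, length)
    quote = CarpetQuote()
    rounded = quote.calculate(price_per_sq_m, True, room)
    exact = quote.calculate(price_per_sq_m, False, room)
    assert rounded >= exact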

But a good start is knowing where your hottest code actually is.

Modular Design: The Secret Sauce

In pretty much every Codemanship training course, I try to stress the fundamental importance of modular design in software development.

When we fail to separate the different concerns in our design in a way that enables us to understand, test, change and reuse them independently of the rest of the design, bad things happen. (Not “bad things can happen”. Bad things happen. End of.)

“What are those bad things, Jason?”

Bad Thing #1 – If it’s not a cleanly separated concern, in order to understand how, say, mortgage repayments are calculated, we might need to understand how to read the result from a web page using XPath, fetch account data from the database, and get the latest base interest rate from a web service. Lack of separation of concerns increases cognitive load, which increases the time taken to make changes, and increases the risk of mistakes.

Bad Thing #2 – If it’s not a cleanly separated concern, to test that the mortgage repayments are calculated correctly, we might need integration tests that hit databases and web services, or even end-to-end tests that go through the UI.

Testing’s the inner feedback loop of software development. If the inner feedback loop’s slow, the outer feedback loops will be very slow – delivery lead times go from days to weeks to months to maybe never. Slow test suites kill businesses every day. I’m not kidding.

Bad Thing #3 – If it’s not a cleanly separated concern, changing the code that calculates mortgage repayments will mean modifying files that handle other concerns. And changing the code that handles those other concerns will mean modifying the files that handle calculating mortgage repayments. This is where teams start tripping over each other’s feet.

Modules that do many things will often have many dependencies, too. We end up with big modules that are tightly-coupled to many other modules, so even tiny changes can have a wide “blast radius” – many files end up being affected. This means more to test, more to review, more to merge (with more merge conflicts to resolve – everybody’s editing the same files). These bigger batch sizes can easily overwhelm downstream bottlenecks in software delivery. Couple that with slow build & test cycles, and it’s a perfect storm.

Bad Thing #4 – Say we want to calculate mortgage repayments on our website, but we’d also like to offer that feature in our Android app. If the code that fetches the account data, gets the latest base interest rate, calculates the repayments, and renders the result in HTML are all in the same module, we have a problem. To use the radio, we need the whole Mercedes.

If we want to use it in a different car, we have to buy a new radio. What teams will often do is duplicate logic in multiple places. And when (yes, WHEN) that logic needs to change, they’ve multiplied the cost and the risk of doing that.
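For the mortgage example, the cleanly-separated version of the calculation might look something like this – a pure function with no knowledge of databases, web services or HTML, equally at home on the website or in the Android app. (A sketch, using the standard amortised loan formula.)

def monthly_repayment(principal, annual_rate, years):
    # Pure domain logic - no I/O - so it can be unit-tested in
    # milliseconds and reused by any front end.
    monthly_rate = annual_rate / 12
    months = years * 12
    if monthly_rate == 0:
        return principal / months
    return (principal * monthly_rate
            / (1 - (1 + monthly_rate) ** -months))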

It’s no coincidence teams categorised in the DORA reports as “elite” have high modularity in their designs. You don’t get fast release cycles and short lead times without it. Simple as that.

The next question is, what does that degree of modularity actually look like? It’s much more modular than most developers think.