“Our senior developers already know this stuff, Jason?”

I hear this very often from managers who’ll invest in training entry-level developers, but only entry-level.

Do they, though?

A large-scale study of developer activity in the IDE found that, of the devs who said they did TDD, only 8% were doing anything even close in reality. Most didn’t even run their tests, let alone drive their designs from them and run them continuously. Developers checking in code they haven’t even seen compile is more common than you might think.

They may well believe that they’re doing these things, of course. They learned it from someone (who learned it from someone) who learned it from, say, a YouTube tutorial made by someone who’s evidently never actually seen it being done. (I check every year – there are a LOT of those, and they get a LOT of views.)

After all this time working with so many different teams in a wide range of problem domains, I can tell you from direct experience that the practices developers claim they’re doing – TDD, refactoring, “clean code”, CI & CD, modular design and so on – usually aren’t being practiced much at all. That’s the norm, I’m afraid.

Unsurprisingly, the employer therefore sees none of the benefits: shorter lead times, more reliable releases, a more sustainable pace of delivery. The work remains mostly fighting fires and heroics around every deployment, rapidly eating up your budget on frantic motion without progress.

(And now we’re seeing that being amplified by you-know-what!)

Turns out you can’t just say you’re doing it. You have to ACTUALLY DO IT to get the benefits.

The way it often plays out is:

– You send your new hires to Jason (or someone like Jason). Jason teaches them some good habits that we’ve seen over the decades are likely to reduce delivery lead times, improve release stability and lower the cost of change.

– New hires go back to their teams, where – day-in and day-out – they see senior colleagues setting a bad example, and being rewarded for heroically putting out the fires they started.

– They may resolve to find themselves a job where they’ll get to apply what they’ve learned, and not feel pressured to just hack things out like everybody else.

– But, more commonly, they’ll just give up and go with the flow. Or, more accurately, the lack of it. Their careers take the path most-travelled, and you continue to wonder why it’s so hard to find senior developers who can do this stuff.

I would urge you to consider this when deciding who needs training and mentoring. I appreciate it’s a touchy subject with folks who claim they’re already doing these things. But there are ways you can broach it: a “refresher”, “mentoring the juniors”, and so on.

Training the senior developers too really helps to align teams, and makes the learning more “sticky” in day-to-day work. Otherwise, there’s a very real chance your junior developers will be un-taught by their senior peers.

And then, like I said, you don’t get the benefits – just the fires.

101 Uses of An Abstract Test #43: Contract Testing

As distributed systems have become more and more prevalent, I’ve seen how teams spend an increasing amount of time putting out fires caused by dependencies changing.

Team A goes to bed with a spiffy working doodad, and in the wee small hours, Team B does a release of their spiffy working thingummy that Team A just happens to rely on. Team A wakes up to a decidedly not-spiffy doodad that has mysteriously stopped working overnight.

Team B’s thingummy may well have passed all their tests before they deployed it, but their tests might not show when a change they’ve made isn’t backwards-compatible with how their clients are using it. For a system to be correct, the interactions between components need to be correct.

We can define the correctness of interactions between clients and suppliers using contracts that determine what’s expected of both parties.

The supplier promises to provide certain benefits to the client – the weather forecast for the next 7 days at their location, for example. But that promise only holds under specific circumstances – the client’s location has to be provided, and must be expressed in Decimal Degrees.

If the supplier changes the contract so that locations must now be provided in Degrees, Minutes and Seconds, the change may well pass all of the supplier’s tests, but it breaks the client, who’s now getting error messages instead of weather forecasts.

Now, the client will likely have some integration tests where the end point is real. And those are the tests that enforce expectations about interactions with that end point.

What if we abstracted those tests so that the end point could be the real deal, or a stub or a mock? The object or function that’s responsible for the interaction could be supplied to each test via, say, a factory method that’s abstract in the test base class and can be overridden in subclasses, enabling us to vary the set-up as we wish – real or pretend.

Then we can run the exact same tests with and without the real end point. If all the tests using pretend versions are passing, but the ones using the real thing suddenly start failing, that strongly suggests something’s changed at the other end. If the “unit” tests start failing too, then the problem is at our end.
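
Here’s a minimal sketch of the shape of that in Java with JUnit 5, picking up the weather forecast example from earlier. All the names – WeatherForecastService, StubWeatherForecastService and so on – are invented purely for illustration:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

// Hypothetical client-side types, just for illustration.
interface WeatherForecastService {
    // The contract: given a location in Decimal Degrees, return a 7-day forecast.
    List<String> forecastFor(double latitude, double longitude);
}

// A "pretend" supplier that honours the contract without going over the wire.
class StubWeatherForecastService implements WeatherForecastService {
    @Override
    public List<String> forecastFor(double latitude, double longitude) {
        return List.of("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun");
    }
}

// The abstract test captures the client's expectations of ANY supplier.
abstract class WeatherForecastContractTest {

    // Factory method - overridden in subclasses to plug in a real or pretend end point.
    protected abstract WeatherForecastService createService();

    @Test
    void returnsSevenDayForecastForDecimalDegreesLocation() {
        WeatherForecastService service = createService();

        // London, expressed in Decimal Degrees, as the contract requires.
        List<String> forecast = service.forecastFor(51.5074, -0.1278);

        assertEquals(7, forecast.size());
    }
}

// Fast version: runs against the stub on every commit.
class StubbedWeatherForecastTest extends WeatherForecastContractTest {
    @Override
    protected WeatherForecastService createService() {
        return new StubWeatherForecastService();
    }
}

// Integration version: exactly the same tests, but the factory method returns
// a client wired to the real end point (omitted here). If these start failing
// while the stubbed tests stay green, the contract has probably changed at the
// other end.

The supplier’s team can extend the same abstract test against their own implementation, giving both sides a shared, executable definition of the contract.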

This gives client developers a heads-up as soon as integration fires start. But the real payoff is when the team at the other end can run those tests themselves before they even think about deploying.

Refactoring – The Most Important, Least-Understood Dev Skill

At the moment, I offer 5 “off-the-shelf” training workshops focused on the core technical practices that enable rapid, reliable and sustained evolution of working software to meet changing needs.

Basically, the practices that have been shown to reduce delivery lead times, while improving release stability and reducing cost of change.

They’re mutually supporting (e.g., you can’t have continuous testing without good separation of concerns) – so ideally, your team would apply all of them in a “virtuous circle”.

But when I look at the sales history of each workshop, there’s a worrying imbalance.

* Code Craft (the flagship workshop) sells 49% of the time.

* The 2-day introduction to Test-Driven Development, aimed at less experienced developers, sells 32% of the time.

* The 1-day introduction to Unit Testing sells 9% of the time.

* The 2-day Design Principles deep-dive sells 8% of the time.

* And the 2-day Refactoring deep-dive only 2%. In fact, nobody’s booked a refactoring workshop since before the pandemic!

Refactoring, as a skill, exercises many of the “muscle groups” involved in Continuous Delivery, and is one of the most challenging to learn.

It’s also one of the most valuable. Whether you’re doing TDD or not, whether you’re continuously integrating or not, whether you’re agile or not – the ability to safely and predictably reshape code to accommodate change is gold.

Without it, you are far more likely to break Gorman’s First Law of Software Development:

Thou shalt not break shit that was working

Especially when you consider that most developers are working on hard-to-change legacy code most of the time. Refactoring is the skill for working with legacy products and systems.

I promised I wouldn’t be mentioning it this week, but I’ll just subtly hint that this problem is currently accelerating because of… well, y’know.

I routinely cite it as the second most important software development skill. (Can you guess what I believe is the first?)

It’s ironic, then, that it’s one of the rarest and one of the least in-demand, if job specs and training orders are any indication.

For sure, most developers will use the word (typically not knowing what it means), and most developers will claim they do it. But the large majority have never even seen it being done – hence the many misapprehensions about what it is.

At the very least, it would be a step-change for the profession if the average software developer could recognise the most common “code smells” and had a decent set of primitive refactorings in their repertoire to deal with them. I call this “short-form” refactoring.
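
To give a flavour of what I mean, here’s a deliberately tiny, invented example of one primitive refactoring – Extract Method – being used on a method that’s doing more than one job:

// Before: calculation and formatting tangled together in one method.
class InvoicePrinter {
    String print(double net, double taxRate) {
        double tax = net * taxRate;
        double gross = net + tax;
        return "Net: " + net + ", Tax: " + tax + ", Gross: " + gross;
    }
}

// After Extract Method: each piece of logic now has a name of its own,
// and the observable behaviour - protected by tests - hasn't changed.
class InvoicePrinterAfter {
    String print(double net, double taxRate) {
        double tax = tax(net, taxRate);
        return format(net, tax, gross(net, tax));
    }

    private double tax(double net, double taxRate) {
        return net * taxRate;
    }

    private double gross(double net, double tax) {
        return net + tax;
    }

    private String format(double net, double tax, double gross) {
        return "Net: " + net + ", Tax: " + tax + ", Gross: " + gross;
    }
}

The example itself isn’t the point – the point is that each small step is safe, reversible, and verified by running the tests in between.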

And ideally, a good percentage of us would be capable of “long-form” refactoring, so we can safely reshape architecture at a higher level. The best software architects have learned to think that way (see, e.g., Joshua Kerievsky’s excellent book Refactoring To Patterns).

If you’d like to build your team’s Refactoring Fu, visit the website for details.

(Well, a man can hope, can’t he?)

The Seven Deadly Sins of “Go Faster”

Things that will make your dev team take longer to deliver worse software:

1. Adding more people to the team
2. Making them work longer hours
3. Cutting down on work that “slows them down”, like writing automated tests
4. Maximising team utilisation and scheduling more work to be done in parallel
5. Micromanaging the details
6. Minimising real-time, synchronous communication so they’re not “interrupted”
7. Keeping them in line. No questioning the plan, and failure is not an option!

Things that will help your dev team take less time to deliver better software:

1. Keep the team small
2. Keep the team rested
3. Test more often (you’ll be needing those fast tests)
4. Solve one problem at a time, as a team
5. Trust the team to make decisions when they need to be made
6. Maximise synchronous communication to minimise waiting
7. Keep them curious, questioning and unafraid to try

“If these really work, Jason, why don’t more organisations do them?”

Well, that’s just one of life’s little mysteries, isn’t it? If healthy diets and daily exercise really worked, we’d all be doing it, right?

Cause or Correlation?

The 2025 DORA State of AI-Assisted Software Development report shows a trend where teams that were already high-performing appeared to make modest improvements in delivery lead times and release stability using “AI” coding assistants, while teams that were less than high-performing showed noticeable losses as the code-generating firehose overwhelmed the bottlenecks in their system.

The question on my mind today, after spending time this morning with a team I’ve worked with for several years, is whether this is cause or correlation.

It struck me in today’s session – several months after our last coaching sessions – that they were running their tests more often, committing more often, and being more systematic about code inspections.

And they’ve been mutation testing diffs to make sure all the code is needed, and if it is, it’s meaningfully tested.

They’re also linting locally for low-level issues like floating imports and unused declarations, whereas before they let the build handle that through SonarQube. (“Horse”, “Stable Door”, etc).

Basically, they appear to have tightened up their feedback loops.

When I complimented them on this, they said that they’d had to do it because the firehose – which management had mandated after our previous session – was pushing their delivery and quality metrics in the wrong direction.

Which raises the thought in my mind: is it the “AI” making them marginally more productive directly, or is it indirectly because the “AI” is forcing them to be more disciplined?

That, in turn, raises the question: why doesn’t it have the same positive effect on the discipline of the other teams – the low-performing and “Meh” teams?

One possible explanation is that they’re not paying attention to the bottlenecks, the delays and the cruft to anywhere near the same extent – “LGTM”.

They haven’t noticed that the car’s slowing down because they’re not looking at the dashboard.

And if they’d had an incentive to improve, they would have improved already. So why would they start now?

The Tiger In The Magic Eye Picture

Can you see it? The tiger?

If you can’t, you’re not alone. When “Magic Eye” pictures were first invented, about half of people couldn’t see the hidden image.

But the tiger’s definitely there, buried in all that noise.

People’s experiences of using Large Language Models can split the audience, too. Many see reasoning, understanding, and planning. Some see probabilistic pattern-matching and token prediction.

Like the tiger, the true nature of LLMs is hidden by noise – the dazzling complexity of human languages that makes it hard to see the wood for the trees.

When we strip away that complexity and engage models in simple, deterministic problems, the tiger appears.

Let’s take the game Rock, Paper, Scissors as an example. You may have tried this fun experiment with an LLM: let it go first in each round. It will, of course, lose every time. But the really fun part of the experiment is asking it why it keeps losing. You will basically have to tell it, in the end. But it’s fascinating to see the plausible-looking explanations it will generate.

Even more interesting, though, is when you flip it and insist that you go first in every round. The model, if it really understands and really reasons about the game, should win every time – a clean sweep.

But it doesn’t. For sure, for the first 2-3 rounds, it wins. But then something happens, and it starts to lose. The “something that happens” is context.

Rock, Paper, Scissors is a context-free game. It’s not like chess, where every move depends on the history of previous moves. Every round is a clean slate. For a human player.

But for an LLM, every interaction is influenced by the patterns in previous interactions – inputs and outputs – in the conversation.

Chess – another simple, deterministic problem domain where the tiger finds it hard to stay hidden – illustrates what’s happening beautifully.

If you play an LLM at chess, you will likely initially be amazed that it’s playing at all. Each of its moves will be legal, reasonable and – seemingly – reasoned. I was certainly fooled in the first half of my first game with GPT-4.

But as the game goes on, and the sequence of moves grows longer, we see how it’s really doing it. LLMs do not play chess in the way we do, or chess programs do. They do not know where the pieces are on the board. They do not understand the rules. They cannot tell a good move from a bad move. And they most certainly don’t plan ahead in the way a chess player has to, evaluating the upsides or downsides of a move by looking into the future of possible subsequent moves.

The way LLMs play chess is to match the pattern in the sequence of moves so far against what is likely to be a huge corpus of chess transcripts in its training data, and predict what move is most likely to come next. Could be a winning move. Could be a losing move. It has no way of knowing. It does not understand.

As the game goes on, and the sequence gets longer and longer, the number of training examples that match the context gets smaller and smaller – matches get less and less probable.

When an LLM has to decide what token to predict next, the uncertainty (the entropy) in the decision determines how accurate the prediction is likely to be.

Think of the Ask The Audience lifeline in “Who Wants To Be A Millionaire?”. If the question is “What is the capital city of France?”, we might see a distribution of audience votes that strongly favours Paris (e.g., 89%, 2%, 3%, 6%). A high-confidence prediction can be made.

But if the question is “What microscopic mechanism gives rise to superconductivity in semiconductors?”, odds are most of the audience will have no clue, and there’ll be no clear choice between the four possible answers (e.g., 24%, 26%, 23%, 27%). This would be a low-confidence prediction. It would be a guess. The question pushes the audience outside of their training (quite literally).
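
To put rough numbers on that intuition – my own back-of-the-envelope arithmetic, using Shannon entropy, H = −Σ p·log₂(p) – the two distributions look like this:

Paris question: −(0.89·log₂0.89 + 0.02·log₂0.02 + 0.03·log₂0.03 + 0.06·log₂0.06) ≈ 0.66 bits of uncertainty.

Superconductivity question: −(0.24·log₂0.24 + 0.26·log₂0.26 + 0.23·log₂0.23 + 0.27·log₂0.27) ≈ 2.0 bits – essentially the maximum possible for four options. In other words, a pure guess.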

When LLMs guess wrong, we say they “hallucinate”, and as the probabilities flatten out, they do that more and more.

The first move in a game of chess is going to match many, many examples in the training data. The second move will constrain the model to a much smaller sample of sequences. The third move will constrain it by another order of magnitude.

With each new move, the probabilities decay exponentially, and model accuracy deteriorates sharply.

And while some new research suggests that this effect might only apply to 10-15% of the tokens in the context (the “key” semantics), the net result is still that model performance decays as context grows.

In Rock, Paper, Scissors – where every round is a clean slate – we see the tiger. If the LLM really understood the game, and was genuinely reasoning about it, it would just keep winning if the human always went first.

But it doesn’t, because it’s literally just matching patterns and predicting next tokens. The emperor has no mind.

Once seen, the tiger can’t be unseen. I’ve become aware of it in every interaction with an LLM. It plays chess in exactly the same way it writes poems, tells jokes and – of course – writes code.

A computer program, when you think about it, is a bit like a game of chess. Every “move” – every token in the program – can only be followed by another token that satisfies the rules of the programming language. e.g., in Java, “public” can’t be followed by “private”.

With every additional token, the search space for valid code is constrained by an order of magnitude, and matches become an order of magnitude less probable. And we’re off into “hallucination” territory.

Now, here’s the really interesting thing. When I played GPT-5 at Rock, Paper, Scissors – with me going first every time – and every round was in a fresh context, it got a perfect score. It won every round.

Many of us have observed over the last few years how LLM code generation works better – is more accurate – when contexts are small (typically in the order of 100-1000 tokens). Many of us have learned to reset the context every few interactions – in my case, typically for every interaction with the model.

Each context is constructed specifically for the next task. Only the code it needs to know about. Only the description of the task. Only the examples needed to clarify that particular interaction (e.g., a test case, or an example of the refactoring I want it to do). Don’t let it build on top of its history. Give it a picture of the world as it is now. Show it the board.
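
As a rough sketch of what that habit looks like – the names and structure here are entirely my own invention, and the actual call to the model is left out – each prompt is assembled from scratch for the task at hand:

// Hypothetical helper: build a fresh, minimal context for each interaction,
// rather than letting the conversation history accumulate.
class TaskPromptBuilder {

    String buildPrompt(String relevantCode, String taskDescription, String example) {
        return "Here is the current code:\n" + relevantCode + "\n\n"  // only the code this task touches
             + "Task: " + taskDescription + "\n\n"                    // only this task's description
             + "Example:\n" + example + "\n";                         // e.g. a failing test, or a before/after of the refactoring
    }
}

Nothing clever is happening there, which is the point. The model gets the board as it is now, not the whole history of the game.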

We also need to carefully manage how much information we give it for each interaction. If it needs to see thousands of lines of code, we’re immediately taking it out of its training data distribution.

Separation of concerns in our designs becomes absolutely essential. The less cohesive and more tightly coupled our modules are, the bigger the “blast radius” of each change, and the more tokens have to be brought into play for each interaction.

Of course, separation of concerns has always been essential. But if it’s going to take a leaky chatbot for teams to actually start addressing it, then so be it.

(I just so happen to train developers in modular software design, by the way.)

One final thing the Rock, Paper, Scissors experiment teaches us: when I asked GPT-5 to explain why it didn’t win every round, it generated some plausible-looking explanations. But all of them were wrong.

This is the kind of “reasoning” that “reasoning models” do. They match patterns in the context to patterns in their training data, and predict what tokens are most likely to come next.

It’s not actual reasoning; it’s a statistical approximation of the most plausible explanations for why it did what it did. The real reason is that those were the most probable responses given its training.

When Evaluating Software Development Advice, Consider The Scale It’s Been Tested On

Software development becomes a distinctly different game at different scales.

What might be fine for a proof of concept with maybe just a few hundred lines of code is likely to bring the whole house tumbling down at tens or hundreds of thousands of LOC.

When evaluating advice about approaches to development, be careful to find out at what scale it’s been tested.

For example, products like JUnit and iPlayer demonstrate Test-Driven Development at an appreciable scale, and over many years.

I know that to be true for various versions of iPlayer, because I trained and coached quite a few of the devs back in the day. And you’ll find timestamps on the JUnit commits that demonstrate longevity.

As a result, I have high confidence in TDD on a wide range of problem types. Some might think it’s overkill for a proof of concept, but over more than 25 years, it’s proven itself valuable at larger scales in terms of its impact on delivery lead times, product reliability and sustaining the pace of development.

I think this is probably especially important with claims made about “AI”-assisted coding, because there’s a lot of advice out there from people who only seem to have used it on relatively small, single-person projects, and very few examples (yet) of substantial code bases, maintained by teams, that have stood the test of time.

In this sense, “AI” coding assistants – and the techniques we’re discovering tend to produce better results with them – are like a new drug that’s still very much in testing, and the long-term side-effects may not show for years to come. (Though we are definitely seeing some short-term side-effects!)

I’m trying as much as possible to take a measured, evidence-based approach to using the technology. Wherever possible, I try to see it applied at appreciable scales that are more representative of the code we’re typically working with, and build on studies that are more towards the credible end of the spectrum. But there are few large-scale – and no long-term – studies to guide us here.

“AI” could turn out to be our aspirin, or it could turn out to be our cocaine – something we’ll have to spend countless billions dealing with the downsides of in the future.

But when we’re evaluating advice about using “AI” coding assistants, we should consider the scale at which it’s been tested. And we should consider if the person offering the advice even looked for side-effects, or recognised them when they saw them. Maybe their bar is set differently to ours.

Myth: “AI-Generated Code Doesn’t Need To Be Easy To Understand”

You may see people online claiming that “AI”-generated code doesn’t need to be easy for humans to understand, because humans won’t need to.

Bah humbug!

It’s quite clear that the factors that make code easier for us to wrap our heads around also make LLM performance on it better (less unreliable).

LLMs struggle with inconsistent, unclear naming because that hampers pattern matching.

LLMs struggle with complexity because they were trained on so little of it (though they’re very capable of generating it, ironically).

LLMs struggle when concerns aren’t cleanly separated, because that means more source code has to be brought into the context – and they don’t like that!

Think of the context in each interaction with an LLM as being analogous to cognitive load. We strive to write code in a way that reduces cognitive load for the reader.

Far from being less important when using LLMs for code generation, modification or summarising/documenting, readability is twice as important! LLMs aren’t as smart as we are.

But putting all of that aside, anyone who’s tried using “AI” coding assistants for any significant length of time and on anything substantial will know that you will be spotting and fixing problems yourself. These tools are nowhere near being reliable enough that you can just leave them to it. They’re not compilers.

You can file “AI-generated code doesn’t need to be easy to understand” under “Advice about running marathons from someone who ran the 400 m once”.

Is Software The UFOlogy of Engineering Disciplines?

One area where software development lags far behind other technical design disciplines like electronic and mechanical engineering is in standards of evidence.

To illustrate what I mean, I want to talk about the July 2023 congressional hearings on Unidentified Anomalous Phenomena (“UAPs”).

Former military and intelligence personnel gave testimony under oath about encounters with UAP, and some sensational claims were made by David Grusch – who had worked with the UAP Task Force at the Department of Defense – about captured “non-human” aircraft, materials and “biologics” being held by private defence contractors.

Some UFO researchers hold up the testimonies of these very credible witnesses as proof that we are being visited by at least one civilisation that’s technologically far in advance of ours.

The scientists working in the DoD’s All-domain Anomaly Resolution Office (AARO), and in NASA’s UAP working group, disagree with that interpretation.

Witness testimony – even given under oath – is merely evidence that somebody said something. And maybe they really believe what they say. But that doesn’t make it real.

Since that congressional hearing more than 2 years ago, no hard evidence has entered the public or scientific domain that supports Grusch’s claims.

The NASA working group complained that the military were less than forthcoming with good data that it’s believed they may be holding (they admit as much on the record). But, again, that’s not in itself evidence of “non-human” visitation and alien vehicle reverse-engineering projects. It’s evidence that the military and their contractors are keeping secrets. Who knew, right?

And, yes, there are videos – confirmed by the military to be genuine – showing anomalous objects recorded by Air Force and Navy personnel during routine operations off the coast of the United States and in combat zones around the world.

But those videos, taken by themselves, show nothing particularly sensational. Accounts of “instant acceleration” and other impossible manoeuvres accompany these videos, but are not captured in them.

And that has been the general nature of UFO/UAP evidence going back to the 1940s. When the anecdotal noise is filtered out, there’s very little left in credible, meaningfully testable evidence to support the extraterrestrial (or extra-dimensional, or time-traveller, or hollow-earth-dweller or Atlantean or Lunar Nazi) hypothesis.

What hard evidence does exist points to one or more genuinely unknown physical phenomena. But that doesn’t mean aliens. That just means ¯\_(ツ)_/¯

More than 20 years ago, I corresponded with famous UFO researcher Stanton T. Friedman. His central claim was that “the evidence is overwhelming that some UFOs are extraterrestrial spacecraft”. He was kind enough – at his own expense – to mail me a thick folder of this “overwhelming evidence”, which included reports written after official government studies in the US, France and other countries. (The UK MOD’s 1990s study, Project Condign, was declassified a couple of years later, adding to the corpus of scientific studies.)

All of these studies, if you read beyond the executive summary, come to a similar conclusion: UFOs are real, and we don’t know what they are.

They usually also conclude that further scientific study is warranted. But that’s rarely followed through, because UFOs are the “third rail” of a scientific career – unless you’re safely tenured, like Avi Loeb or Michio Kaku, most scientists daren’t touch the subject.

Anyway, back to Mr Friedman. Stanton Friedman was a scientist – a nuclear physicist (a real one!). He would often wear these credentials as evidence that his approach to the study of UFOs was rigorous in the same sense that his work on, say, nuclear space propulsion was rigorous.

But that was simply not the case. Friedman, like most UFOlogists – not all, mind – approached the subject like an investigative journalist. He didn’t look for physical evidence. He looked for documentation to support his theories, and his “rigour” manifested in attempts to authenticate these documents.

Even if the MAJESTIC documents are from genuine top secret classified government files (and that’s very much disputed still to this day), a document is only evidence that somebody wrote something down.

So I was not overwhelmed by the evidence Stanton sent me. Intrigued? Definitely. Open-and-shut case? Definitely not.

I agree with many of the official government studies: UFOs warrant serious scientific investigation. But, curiously, many UFOlogists – including Stanton Friedman – disagreed.

I had been following the work of an electronic engineer called Scot Stride, at the time working at NASA’s Jet Propulsion Laboratory, who was proposing multi-modal instrumented searches of the sky to collect more and better data on these phenomena.

He called it “SETV” – the Search for Extraterrestrial Visitation. Not to be confused with SETA – the Search for Extraterrestrial Artefacts. SETV’s null hypothesis – that no UFOs are extraterrestrial technology – was concerned with contemporary visitation.

Now, to me, a not-so-long-ago-at-the-time physics bod, SETV sounded like a good idea. The challenge in understanding the nature of UFOs has always been the amount and the quality of the data – too much noise, very little signal.

An object tracked by multiple sensors, from multiple locations, could provide far clearer data on the reality (as in, is this object real and not, say, a sensor blip?), the size, the distance, and therefore the speed or acceleration of objects in the sky.

But Stanton poured cold water on the idea of instrumented searches. UFOs, he told me, cannot be studied scientifically. Which I thought was a little odd, given his physics credentials – far superior to mine – and that he was kind of using them to shore up the credibility of his work. He was the “flying saucer physicist”.

SETV, as far as I know, never got off the ground – perhaps due to lack of funding. (A similar initiative called UFODATA also appears to have stalled. I hold out some hope Avi Loeb may help to divert some research money into other instrumented sky searches.)

But, as of today, the state of the art in UFO/UAP evidence is lots of noise and very little signal.

I’ve met similar hostility from folks who, in one breath, claim that software engineering is “scientific” – because data – but row back on that when I suggest we might need better data: more signal, less noise.

Most empirical studies into our discipline are small, attempting to extract meaningful trends from statistically insignificant sample sizes. This leaves them wide open to statistical noise.

Many studies are, like the congressional UAP hearings, building on reported – rather than directly observed – data. If a development team tells you that switching from white bread to wholemeal reduced their bug counts, that’s anecdote, not hard evidence.

Some folks say that software engineering is scientific because it’s grounded in scientific principles – many would argue that engineering is the “appliance of science”.

But what are the scientific principles software engineering is founded on? We might argue that discrete mathematics – set theory, logic, graph theory etc – is a science. And so there’s perhaps some merit, when these theories are applied (e.g., in program verification), in saying that we’re applying science.

But we’re not testing the theories. They are taken as a given – as logically proven. And by “logically proven”, I mean logically consistent with all the connected theories. A scientist might argue that proofs aren’t science.

To quote Donald Knuth:

“Beware of bugs in the above code; I have only proved it correct, not tried it.”

Or, in the words of the Second Doctor: “Logic, my dear Zoe, merely enables one to be wrong with authority”.

Just because axioms are logically consistent, that doesn’t mean they’re true. To establish truth, we must defer to reality. We must test them in the real world. In this sense, mathematics is not science. It’s applied philosophy.

And that brings me to the third of the ways in which I and others part company. In order to meaningfully test a hypothesis, we must be able to know with high confidence when the data contradicts it. If software engineering is to be truly scientific, our hypotheses need to be refutable.

As computer programmers, we know the challenge of expressing ideas in a way that can’t be misinterpreted. It’s a large part of our work, and the main reason why computer programming remains a minority pursuit. It is hard.

But just like the cognitive dissonance of the anti-science “flying saucer physicist”, many of us hold the contradictory belief that hypotheses about our field of work need not be expressed in any refutable form – they need not be meaningfully testable – even though that’s kind of what we do for a living.

When we combine woolly and untestable claims with small, noisy data sets consisting mostly of anecdotes, software engineering as a discipline falls well within the territory of UFOlogy.

Now, not everybody subscribes to the idea that for a study to be scientific, its hypotheses must be refutable. As physics undergraduates, we had refutability drummed into us from the start (e.g., Wolfgang Pauli’s “not even wrong” jab at ambiguous claims), and it causes friction with other fields of study that describe themselves as “scientific”, but that lack refutability.

But whether we agree on the definition of “scientific” is not really the important thing here. What matters is where low-signal, largely anecdotal, non-refutable experiments have led us in our understanding of not just what works and what doesn’t in specific situations, but why.

A lot of what we think we know about creating and adapting software is built on the equivalent of UFO reports. Let me give you an example of what can happen when research cuts through that noise.

In their study of developer testing in Java IDEs, researchers discovered that, of the participants who claimed they did Test-Driven Development, analysis of their real IDE activity showed that only about 8% actually did.

The implication here is that 92% of what we think we know about TDD and its outcomes is, in reality, based on developers doing something else. Many other studies – usually on much smaller scales – lead me to question whether participants were really doing TDD. And, indeed, whether the authors of the studies could even tell if they weren’t.

The upshot of all this is that when skeptics demand “proof” of the benefits of TDD, even someone like me, with 26 years’ experience of doing and teaching it, has to resort to “you’ll just have to take my word for it”. Like the UFO witnesses who “know what they saw”, I know there are real benefits. I just don’t have the hard data to back it up. For every study that finds there are, there’s another one that concludes ¯\_(ツ)_/¯

I could survey developers who’ve been doing TDD for, say, more than a year, to ask if they believe there are real benefits. I could ask them if they’d ever consider going back to test-after. (I already have a pretty good idea what the response would be.)

But this is shaky ground. The majority of developers using “AI” coding assistants, for example, believe they’re being more productive. But data on delivery lead times and release stability paints the opposite picture in the majority of cases.

As a teacher and a mentor, the lack of genuine signal in the software engineering body of knowledge makes my job a lot harder.

I have to resort to my powers of persuasion, and I have to rely on people’s willingness to at the very least suspend their disbelief. I did not need to be persuaded that force = mass x acceleration, because the evidence is so compelling.

It also leaves our profession vulnerable to spurious claims that aren’t backed up by credible evidence, but can’t easily be disproved. I might argue that a whole bunch of people’s pensions might be about to be wiped out by a spurious claim about the impact of a particular technology on software teams. Our industry’s very much the rabbit that the GenAI folks are banking on other industries chasing. If programmers don’t get much benefit, what chance lawyers or doctors or teachers?

I appreciate that the complex socio-technical nature of software development presents many challenges to a rigorously scientific approach to gaining useful insights – to learning to predict the effects of pulling certain levers so that we can more confidently engineer the outcomes we want. And I accept that there will always be aspects that remain beyond the scientific method.

However, it feels to me like we’re not even really trying. And we’re so good at making excuses for why we can’t do better.

If there’s one thing we’re not short of as a discipline, it’s hard data. Our activities – like the actions we perform in our IDEs, the code itself, the version histories in our repos, the outputs of builds, the results of testing and linting, the telemetry from production systems – are radiating a rich and long tail of hard data; data about things that actually happened, and not just what we believe or claim happened.

If we were comets, you’d want to fly your probe through that tail.

Again, there are many challenges and problems to be solved, not least of which is the ad hoc, proprietary, non-standardised nature of all that data.

In that sense, we are arguably one of the least mature of the engineering disciplines. My Dad’s architectural CAD system can tell you what order a house has to be built in (you have to do that these days to get planning permission) and can even generate orders for building materials with specific suppliers.

Our tooling workflows are still mostly held together with twigs and tape. And that’s chiefly because we so seldom consider the whole picture when we design development tools – a random landscape of point solutions that don’t play nice with each other.

We lack the data interchange standards of more mature disciplines. And that could well be because we also lack the underlying rigour – including rigour around terminology. How do we standardise things that go by many different names?

But that is a solvable problem. If building design and electronic engineering and mechanical engineering can do it, so can we. Heck, we did it for them! (We suffer from a condition I call “builder’s houses”.)

And if this reads like a bit of a manifesto, so be it. I’m well aware that I’m in a minority who feel this way about software engineering. But if you’re out there thinking along similar lines, maybe drop me a line?

If I Was Your Head of Engineering…

If I was your head of engineering, I’d align software and systems development with business outcomes, and I’d organise the teams around delivering those outcomes, with diverse skills and outlooks being brought to bear.

Technical specialisms and technology areas like requirements analysis, UX design, testing, architecture, InfoSec, and operations would organise into communities of practice, sharing their experiences and expertise, and coordinating on improvements, outside and across teams and across specialisms.

I’d focus on long-lived teams, and treat them as the real product.

I’d bring developers closer to the business, and the business closer to the developers, making everyone – business and technical – active participants in defining and executing technology strategy in pursuit of meaningful aims. Let the dog see the rabbit.

I’d empower teams to make decisions when they need to be made whenever possible, and democratise the decision-making process so they don’t need to keep referring things back up to me. I’d work hard to be an enabler instead of a bottleneck.

I’d invest heavily in knowledge and skills, process improvement, and automation to remove friction from software delivery and reduce lead times and accelerate business feedback.

I’d encourage developers to branch out, try new things, wear different hats, and gradually become more adaptable and confident generalising specialists.

I’d foster psychological safety so people aren’t afraid to try, to experiment, to question, to fail and therefore to learn.

I’d create technical career tracks that go all the way up to my equivalent, while remaining hands-on, to keep the best people doing what they do best, and setting a constructive example for less experienced team members that most dev teams sorely lack.

I’d actively search for and nurture talent out of schools and colleges and coding clubs, and invest in a healthy on-ramp, with properly-funded and resourced long-term apprenticeships and outreach to a diverse pool of potentially great developers.

I’d encourage – in words and deeds – the engineering organisation to incrementally keep raising the bar and to build capabilities that most dev orgs daren’t even dream of.

I’d encourage developers to actively participate in communities where they can share and learn (and recruit, when it comes to that), and to take a wider view of their work and how it impacts on the world.

Oh, and I’d pay developers what they’re really worth to the business.

Which is why I’m not your head of engineering.