The Test Pyramid – The Key To True Agility

On the Codemanship TDD course, before we discuss Continuous Delivery and how essential it is to achieving real agility, we talk about the Test Pyramid.

It has various interpretations, in terms of exactly how many layers and exactly what kinds of testing each layer is made of (unit, integration, service, controller, component, UI etc), but the overall sentiment is straightforward:

The longer tests take to run, the fewer of those kinds of tests you should aim to have

[Diagram: the test pyramid]

The idea is that the tests we run most often need to be as fast as possible (otherwise we run them less often). These are typically described as “unit tests”, but that means different things to different people, so I’ll qualify: tests that do not involve any external dependencies. They don’t read from or write to databases, they don’t read or write files, they don’t connect with web services, and so on. Everything that happens in these tests happens inside the same memory address space. Call them In-Process Tests, if you like.

Tests that necessarily check our code works with external dependencies have to cross process boundaries when they’re executed. As our In-Process tests have already checked the logic of our code, these Cross-Process Tests check that our code – the client – and the external code – the suppliers – obey the contracts of their interactions. I call these “integration tests”, but some folk have a different definition of integration test. So, again, I qualify it as: tests that involve external dependencies.

These typically take considerably longer to execute than “unit tests”, and we should aim to have proportionally fewer of them and to run them proportionally less often. We might have thousands of unit tests, and maybe hundreds of integration tests.

If the unit tests cover the majority of our code – say, 90% of it – and maybe 10% of our code has direct external dependencies that have to be tested, then on average we’ll make about 9 changes that need unit testing for every 1 change that needs integration testing. In other words, we’d need to run our unit tests 9x as often as our integration tests – which balances out nicely if each integration test is about 9 times slower than a unit test.

At the top of our test pyramid are the slowest tests of all. Typically these are tests that exercise the entire system stack, through the user interface (or API) all the way down to the external dependencies. These tests check that it all works when we plug everything together and deploy it into a specific environment. If we’ve already tested the logic of our code with unit tests, and tested the interactions with external suppliers, what’s left to test?

Some developers mistakenly believe that these system-level tests are for checking the logic of the user experience – user “journeys”, if you like. There are usually a lot of user journeys, so we’d end up with a lot of these very slow-running tests and an upside-down pyramid. The trick here is to make the logic of the user experience unit-testable. View models are a simple architectural pattern for logically representing what users see and what users do at that level. At the highest level they may be looking at an HTML table and clicking a button to submit a form, but at the logical level, maybe they’re looking at a movie and renting it.

A view model can help us encapsulate the logic of user experience in a way that can be tested quickly, pushing most of our UI/UX tests down to the base of the pyramid where they belong. What’s left – the code that must directly reference physical UI elements like HTML tables and buttons – can be wafer thin. At that level, all we’re testing is that views are rendered correctly and that user actions trigger the correct internal logic (which can easily be done using mock objects). These are integration tests, and belong in the middle layer of our pyramid, not the top.
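To make that concrete, here’s a minimal sketch in JavaScript – with invented names like MovieViewModel and a stubbed rentals collaborator – of how the logic of “looking at a movie and renting it” might live in a view model that can be unit tested entirely in-process, with no browser, HTTP or database involved:

// A hypothetical view model capturing the logic of the user experience,
// with no reference to HTML tables, buttons or any other physical UI elements.
class MovieViewModel {
  constructor(movie, rentals) {
    this.movie = movie;       // e.g. { title: 'Jaws', available: true }
    this.rentals = rentals;   // collaborator that actually performs the rental
    this.message = '';
  }

  get canRent() {
    return this.movie.available;
  }

  rent() {
    if (!this.canRent) {
      this.message = 'This movie is not available';
      return;
    }
    this.rentals.rent(this.movie);
    this.message = 'Enjoy the movie!';
  }
}

// An In-Process Mocha test: the rental service is swapped for a stub.
const assert = require('assert');

describe('MovieViewModel', () => {
  it('rents an available movie', () => {
    const rented = [];
    const stubRentals = { rent: (movie) => rented.push(movie) };
    const viewModel = new MovieViewModel({ title: 'Jaws', available: true }, stubRentals);

    viewModel.rent();

    assert.equal(rented.length, 1);
    assert.equal(viewModel.message, 'Enjoy the movie!');
  });
});

The thin view code that binds this logic to real buttons and tables is then tested separately, in the middle layer of the pyramid.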

Another classic error is to check core logic through the GUI. For example, checking that insurance premiums are calculated correctly by looking at what number is rendered on that web page. Some module somewhere does that calculation. That should be unit-testable.

So, if they’re not testing user journeys, and they’re not testing core logic, what do our system tests test? What’s left?

Well, have you ever found yourself saying “It worked on my machine”? The saying goes “There’s many a slip ‘twixt cup and lip.” Just because all the pieces work, and just because they all play nicely together, there’s no guarantee that when we deploy the whole system into, say, our EC2 instances, nothing will be different from the environments we tested it in. I’ve seen roll-outs go wrong because the servers handled dates differently, or had the wrong locale, or a different file system, or security restrictions that weren’t in place on dev machines.

The last piece of the jigsaw is the system configuration, where our code meets the real production environment – or a simulation of it – and we find out if it really works, as a whole, where it’s intended to work.

We may need dozens of those kinds of tests, and perhaps only need to run them on, say, every CI build by deploying the outputs to a staging environment that mirrors the production environment (and only if all our unit and integration tests pass first, of course.) These are our “good to go?” tests.

The shape of our test pyramid is critical to achieving feedback loops that are fast enough to allow us to sustain the pace of development. Ideally, after we make any change, we should want to get feedback straight away about the impact of that change. If 90% of our code can be re-tested in under 30 seconds, we can re-test 90% of our changes many times an hour and be alerted within 30 seconds if we broke something. If it takes an hour to re-test our code, then we have a problem.

Continuous Delivery means that our code is always shippable. That means it must always be working, or as close to always as we can manage. If re-testing takes an hour, we’re an hour away from finding out whether the changes we made broke the code. It means we’re an hour away from knowing if our code is shippable. And, after an hour’s worth of changes without re-testing, chances are high that it is broken and we just don’t know it yet.

An upside-down test pyramid puts Continuous Delivery out of your reach. Your confidence that the code’s shippable at any point in time will be low. And the odds that it’s not shippable will be high.

The impact of slow-running test suites on development is profound. I’ve found many times that when a team invested in speeding up their tests, many other problems magically disappeared. Slow tests – which mean slow builds, which mean slow release cycles – are like a development team’s slow metabolism. Many health problems can be caused by a slow metabolism. It really is that fundamental.

Slow tests are pennies to the pound of the wider feedback loops of release cycles. You’d be surprised how much of your release cycle is, at the lowest level, made up of re-testing cycles. The outer feedback loops of delivery are made of the inner feedback loops of testing. Fast-running automated tests – as an enabler of fast release cycles and sustained innovation – are therefore highly desirable.

A right-way-up test pyramid doesn’t happen by accident, though, and it doesn’t come for free. Many organisations, sadly, aren’t prepared to make that investment, and limp on with upside-down pyramids and slow test feedback until the going gets too tough to continue.

As well as writing automated tests, there’s also an investment needed in your software’s architecture. In particular, the way teams apply basic design principles tends to determine the shape of their test pyramid.

I see a lot of duplicated code that contains duplicated external dependencies, for example. It’s not uncommon to find systems with multiple modules that connect to the same database, or that connect to the same web service. If those connections happened in one place only, that part of the code could be integration tested just once. D.R.Y. helps us achieve a right-way-up pyramid.

I see a lot of code where a module or function that does a business calculation also connects to an external dependency, or where a GUI module also contains business logic, so that the only way to test that core logic is with an integration test. Single Responsibility helps us achieve a right-way-up pyramid.

I see a lot of code where a module in one web service interacts with multiple features of another web service – Feature Envy, but on a larger scale – so there are multiple points of integration that require testing. Encapsulation helps us achieve a right-way-up pyramid.

I see a lot of code where a module containing core logic references an external dependency, like a database connection, directly by its implementation, instead of through an abstraction that could be easily swapped by dependency injection. Dependency Inversion helps us achieve a right-way-up pyramid.
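As a rough sketch of that last point – with invented names, and not intended as the definitive way to do it – compare a premium calculation that is welded to a real database with one that is handed an abstraction it can be tested through:

// Hard to unit test: the core logic constructs its own database connection,
// so every test of the calculation becomes an integration test.
// const premium = (driver) => {
//   const db = new RiskDatabaseConnection('prod-risk-db');   // hypothetical
//   return driver.baseRate * db.riskFactorFor(driver.age);
// };

// Dependency Inversion: the calculation depends on an abstraction, so any
// object with a riskFactorFor() method will do - including a stub in a
// fast-running In-Process test. The real database adapter gets its own,
// much smaller set of integration tests.
const premium = (driver, riskProfile) =>
  driver.baseRate * riskProfile.riskFactorFor(driver.age);

const assert = require('assert');
const stubRiskProfile = { riskFactorFor: () => 1.5 };
assert.equal(premium({ baseRate: 200, age: 30 }, stubRiskProfile), 300);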

Achieving a design with less duplication, where modules do one job, where components and services know as little as possible about each other, and where external dependencies can be easily stubbed or mocked by dependency injection, is essential if you want your test pyramid to be the right way up. But code doesn’t get that way by accident. There’s significant ongoing effort required to keep the code clean by refactoring. And that gets easier the faster your tests run. Chicken, meet egg.

If we’re lucky enough to be starting from scratch, the best way we know of to ensure a right-way-up test pyramid is to write the tests first. This compels us to design our code in such a way that it’s inherently unit-testable. I’ve yet to come across a team genuinely doing Continuous Delivery who wasn’t doing some kind of TDD.

If you’re working on legacy code, where maybe you’re relying on browser-based tests, or might have no automated tests at all, there’s usually a mountain to climb to get a test pyramid that’s the right way up. You need to write fast-running tests, but you will probably need to refactor the code to make that possible. Egg, meet chicken.

Like all mountains, though, it can be climbed. One small, careful step at a time. Michael Feathers’ book Working Effectively With Legacy Code describes a process for making changes safely to code that lacks fast-running automated tests. It goes something like this:

  • Identify what code you need to change
  • Identify where around that code you’d want unit tests to make the change safely
  • Break any dependencies in that code that are getting in the way of unit testing (a sketch of one common technique follows this list)
  • Write the unit tests
  • Make the change
  • While you’re there, make other improvements that will help the next developer who needs to change that code (the “boy scout rule” – leave the camp site tidier than you found it)
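One of the most common dependency-breaking moves – sketched here with made-up names purely as an illustration – is to parameterise the troublesome collaborator so a test can substitute a fake:

// Before: impossible to unit test without sending real emails.
// const invoiceTotal = (orders) => {
//   const total = orders.reduce((sum, order) => sum + order.amount, 0);
//   new SmtpClient('smtp.internal').send('accounts@example.com', `Total: ${total}`);
//   return total;
// };

// After: the email dependency is injected, so the calculation can be
// covered by fast-running unit tests using a recording fake.
const invoiceTotal = (orders, notify) => {
  const total = orders.reduce((sum, order) => sum + order.amount, 0);
  notify(`Total: ${total}`);
  return total;
};

const assert = require('assert');
const sent = [];
assert.equal(invoiceTotal([{ amount: 10 }, { amount: 15 }], (msg) => sent.push(msg)), 25);
assert.deepEqual(sent, ['Total: 25']);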

Change after change, made safely in this way, will – over time – build up a suite of fast-running unit tests that will make future changes easier. I’ve worked on legacy code bases that went from upside-down test pyramids of mostly GUI-based system tests, that took hours or even days to run, to right-side-up pyramids where most of the code could be tested in under a minute. The impact on the cost and the speed of delivery is always staggering. It can be done.

But be patient. A code base might take a year or two to turn around, and at first the going will be tough. I find I have to be super-disciplined in those early stages. I manually re-test as I refactor, and resist the temptation to make a whole bunch of changes at a time before I re-test. Slow and steady, adding value and clearing paths for future changes at the same time.

Iterating Is The Ultimate Requirements Discipline

The title of this blog post is something I’ve been trying to teach teams for many years now. As someone who very much drank the analysis and design Kool Aid of the 1990s, I learned through personal experience on dozens of projects – and from observing hundreds more from a safe distance – that time spent agonising over the system spec is largely time wasted.

A requirements specification is, at best, guesswork. It’s our starter for ten. When that spec – if the team builds what’s been requested, of course – meets the real world, all bets are usually off. This is why teams need more throws of the dice – as many as possible, really – to get it right. Most of the value in our code is added after that first production release, if we can incorporate our users’ feedback.

Probably the best way to illustrate this effect is with some code. Take a look at this simple algorithm for calculating square roots.

public static double sqrt(double number) {
    if(number == 0) return 0;
    double t;

    // start with a rough first guess: half the input
    double squareRoot = number / 2;

    // Newton-Raphson iteration: repeatedly average the guess with number/guess
    // until successive guesses stop changing
    do {
        t = squareRoot;
        squareRoot = (t + (number / t)) / 2;
    } while ((t - squareRoot) != 0);

    return squareRoot;
}

When I mutation test this, I get a coverage report that says one line of code in this static method isn’t being tested.

[Screenshot: PIT mutation testing coverage report]

The mutation testing tool turned number / 2 into number * 2, and all the tests still passed. But it turns out that number * 2 works just as well as the initial input for this iterative algorithm. Indeed, number * number works, and number * 10000000 works, too. It just takes an extra few loops to converge on the correct answer.

It’s in the nature of convergent iterative processes that the initial input matters far less than the iterations. More frequent iterations will find a working solution sooner than any amount of up-front analysis and design.

This is why I encourage teams to focus on getting working software in front of end users sooner, and on iterating that solution faster. Even if your first release is way off the mark, you converge on something better soon enough. And if you don’t, you know the medicine’s not working sooner and waste a lot less time and money barking up the wrong mixed metaphor.

What I try to impress on teams and managers is that building it right is far from a ‘nice-to-have’. The technical discipline required to rapidly iterate working software and to sustain the pace of releases is absolutely essential to building the right thing, and it just happens to be the same technical discipline that produces reliable, maintainable software. That’s a win-win.

Iterating is the ultimate requirements discipline.

 

How Agile Works

After 18 years of talk and hype about Agile, I find that it’s easy to lose sight of what Agile means in essence, and – importantly – how it works.

I see it as an inescapable reality of software development – or any sufficiently complex endeavour – that we shouldn’t expect to get it right first time. The odds of our first solution being the best solution are vanishingly small – the proverbial “hole in one”.

So we should expect to take multiple passes at a solution, learning with each iteration of the design what works and what doesn’t, and progressively getting it less wrong.

If Agile is an algorithm, then it’s a search algorithm. It searches an effectively infinite solution space for a design that best fits our problem. The name of this search algorithm is evolution.

Starting with the simplest input, it tests that design against one or more fitness functions. The results of this test are fed back into the next iteration of the design. And around and around we go, adding a little, changing a little, and testing again and again.

In nature, evolution takes tiny steps forward. If a viable organism produces offspring that are too different from itself, chances are that the next generation will be non-viable. Evolution doesn’t take big, risky leaps. Instead, it edges forward one tiny, low-risk change at a time.

The Agile design process doesn’t make 100 changes to a solution and then test for fitness. It makes one or two changes, and sees how they work out before making more.

The speed of this search algorithm depends on three things:

  • The frequency of iterations
  • The amount of change in each iteration
  • The quality of feedback into the next iteration

If releases of working software are too far apart, we learn too slowly about what works and what doesn’t.

If we change too much in each release, we increase the risk of making the solution non-viable. We also take on a much higher risk and cost if a release has to be rolled back, as we lose a tonne of changes. It’s in the nature of software that it works as a connected whole. It’s easy to roll back 1 of 1 changes. It’s very hard to roll back 1 of 100 changes.

The lessons we learn with each release will depend on how it was tested. We find that feedback gathered from real end users using the software for real is usually the most valuable feedback. Everything else is just guesswork until our code meets the real world.

“Agile” teams who do weekly show-and-tells, but release working software into production less frequently, are missing out on the best feedback. Our code’s just a hypothesis until real people try to use it for real.

This is why our working relationship with our customer is so important – critical, in fact. Far too many teams who call themselves “Agile” don’t get to engage with the customer and end users directly, and the quality of the feedback suffers when we’re only hearing someone’s interpretation of what their feedback was. It works best when the people writing the code get to see and hear first-hand from the people using it.

For me, it’s not Agile if it doesn’t fully embrace those fundamental principles, because they’re the engine that makes it work. Agile teams do small, frequent releases of working software to real customers and end users who they work with directly.

To achieve this, there are some technical considerations. If it takes a long time to check that the software’s fit for release, then you will release less often. If it takes a long time to build and deploy the software, then you’ll release less often. If the changes get harder and harder to make, then you’ll release less often.

And even after we’ve solved the problem, the world doesn’t stand still. The most common effect of releasing software into the world is that – if the software gets used – the world changes. Typically, it changes in ways we weren’t expecting. Western democracies are still struggling with the impact of social media, for example. But on a smaller scale, releasing software into any environment can have unintended consequences.

It’s not enough to get it right once. We have to keep learning and keep changing the software, normally for its entire operational lifetime (which, on average, is about 8 years). So we have to be able to sustain the pace of releases pretty much indefinitely.

All this comes with a bunch of technical challenges that have to be met in order to achieve small, frequent releases at a sustainable pace. Most “Agile” teams fail to master these technical disciplines, and their employers resist making the investment in skills, time and tools required to build a “delivery engine” that’s up to the job.

Most “Agile” teams don’t have the direct working relationship with the people using their software required to gain the most useful feedback.

To put it more bluntly, most “Agile” teams aren’t really Agile at all. They mistake Jira and Jenkins and stand-up meetings and backlogs and burn-down charts for agility. None of those things are, in and of themselves, Agile.

Question is: are you?

The 2 Most Critical Feedback Loops in Software Development

When I’m explaining the inner and outer feedback loops of Test-Driven Development – the “wheels within wheels”, if you like – I make the point that the two most important feedback loops are the outermost and the innermost.

[Diagram: the nested feedback loops of software development]

The outermost because the most important question of all is “Did we solve the problem?” The innermost because the answer is usually “No”, so we have to go round again. This means that the code we delivered will need to change, which raises the second most important question: “Did we break the code?”

The sooner we can deliver something so we can answer “Did we solve the problem?”, the sooner we can feedback the lessons learned on the next go round. The sooner we can re-test the code, the sooner we can know if our changes broke it, and the sooner we can fix it ready for the next release.

I realised nearly two decades ago that everything in between – requirements analysis, customer tests, software design, etc etc – is, at best, guesswork. A far more effective way of building the right thing is to build something, get folk to use it, and feedback what needs to change in the next iteration. Fast iterations accelerate this learning process. This is why I firmly believe these days that fast iterations – with all that entails – is the true key to building the right thing.

Continuous Delivery – done right, with meaningful customer feedback drawn from real use in the real world (or as close as we dare bring our evolving software to the real world) – is the ultimate requirements discipline.

Fast-running automated tests that provide good assurance that our code’s always working are essential to this. How long it takes to build, test and deploy our software will determine the likely length of those outer feedback loops. Typically, the lion’s share of that build time is regression testing.

About a decade ago, many teams told me “We don’t need unit tests because we have integration tests”, or “We have <insert name of trendy new BDD tool here> tests”. Then, a few years later, their managers were crying “Help! Our tests take 4 hours to run!” A 4-hour build-and-test cycle creates a serious bottleneck, leading to code that’s almost continuously broken without teams knowing. In other words, not shippable.

Turn a 4-hour build-and-test cycle into a 40-second build-and-test cycle, and a lot of problems magically disappear. You might be surprised how many other bottlenecks in software development have slow-running tests as their underlying cause – analysis paralysis, for example. That’s usually a symptom of high stakes in getting it wrong, and that’s usually a symptom of infrequent releases. “We better deliver the right thing this time, because the next go round could be 6 months later.” (Those among us old enough to remember might recall just how much more care we had to take over our code because of how long it took to compile. It’s a similar effect, but on a much larger scale with much higher stakes than a syntax error.)

Where developers usually get involved in this process – user stories and backlogs – is somewhere short of where they need to be involved. User stories – and prioritised queues of user stories – are just guesses at what an analyst or customer or product owner believes might solve the problem. To obsess over them is to completely overestimate their value. The best teams don’t guess their way to solving a problem; they learn their way.

Like pennies to the pound, the outer feedback loop of “Does it actually work in the real world?” is made up of all the inner feedback loops, and especially the innermost loop of regression testing after code is changed.

Teams who invest in fast-running automated regression tests have a tendency to out-learn teams who don’t, and their products have a tendency to outlive the competition.


How to Beat Evil FizzBuzz

On the last day of the 3-day Codemanship TDD training workshop, participants are asked to work as a team to solve what would – for an individual developer – be a very simple exercise.

The FizzBuzz TDD kata is well known, and a staple in many coding interviews these days. Write a program that outputs the numbers 1…100 as a single comma-delimited string. Any numbers that are divisible by 3, replace with ‘Fizz’. Any numbers that are divisible by 5, replace with ‘Buzz’. And any numbers that are divisible by 3 and 5, replace with ‘FizzBuzz’. Simples.

An individual can usually complete this in less than half an hour. But what if we make it evil?

We make it evil by splitting the problem up into five parts, and assigning each part to a pair or individual in the group, who can only work on the code for their part:

  • Generate a list of integers from 1 to 100
  • Replace integers divisible by 3 with ‘Fizz’
  • Replace integers divisible by 5 with ‘Buzz’
  • Replace integers divisible by 3 and 5 with ‘FizzBuzz’
  • Output the resulting list as a comma-delimited string

Working as a single team to produce a single program that passes my customer test – seeing the final string with all the numbers, Fizzes, Buzzes and FizzBuzzes in the right places produced by their program run on my computer – the group has to coordinate closely to produce a working solution. They have one hour, and no more check ins are allowed after their time’s up. They must demonstrate whatever they’ve got in the master branch of their GitHub repository at the end of 60 minutes.

This is – on the surface of it – an exercise in Continuous Integration. They need to create a shared repository, and each work on their own copy, pushing directly to the master branch. (This is often referred to as trunk-based development.) They must set up a CI server that runs a build – including automated tests – whenever changes are pushed.

Very importantly, once the CI server is up and running, and they’ve got their first green build, the build must never go red again. (Typically it takes a few tries to get a build up and running, so they often start red.)

Beyond those rules:

  • Produce a single program that passes the customer’s test on the customer’s machine
  • Only write code for the part they’ve been assigned
  • Push directly to master on a single GitHub repository – no branching
  • CI must run a full build – including tests – on every push
  • Must not break the build once it’s gone green for the first time
  • Last push must happen before the end of the hour

They can do whatever they need to. It’s their choice of programming language, application type (console, web app, desktop app etc) and so on. They choose which CI solution to use.

90% of groups who attempt Evil FizzBuzz fail to complete it within the hour. The three most common reasons they fail are:

  1. Too long shaving yaks – many groups don’t get their CI up and running until about 30-40 minutes in. In some cases, they never get it up and running.
  2. Lack of a bigger picture – many groups fail to establish a shared vision for how their program will work, and – importantly – how the pieces will fit together
  3. Integrating too late – with cloud-based CI, the whole process of checking your code in can take 2-3 minutes minimum. Multiply that by 5, and groups often discover that everyone deciding to push their changes with just five minutes to go means their ship has sailed without them.

On the first point, it’s important to have a game plan and to keep things simple. I can illustrate using a Node and JavaScript example.

First, one of the pairs needs to create a skeleton Node project, with a dummy test for the build server to run. We need to get our delivery pipeline up and running quickly, before anyone even thinks about writing any solution code.

[Screenshot: the skeleton Node project]

This is just an empty Node project, with a single dummy Mocha unit test. Make sure the test passes, then create a GitHub repository and push this skeleton project to it.
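If you’re playing along at home, the dummy test can be as trivial as this (the file name and description are just my choices – anything Mocha will discover and run is fine):

// test/dummy_test.js - a placeholder so the CI build has a test suite to run
const assert = require('assert');

describe('delivery pipeline', () => {
  it('runs the tests', () => {
    assert.ok(true);
  });
});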

[Screenshot: the initial commit pushed to GitHub]

Now, let’s set up a CI server. I’m going to use circleci.com. Logging in with my GitHub account, I can easily see and add a build project for my new evil_fizzbuzz repository.

[Screenshot: adding the evil_fizzbuzz project in CircleCI]

It helps enormously to go with the popular conventions for your project. I’m using Node, which is widely supported, Mocha for tests which are named and located where – by default – the build tool would expect to find them, and it’s all very Yarn-friendly. Well, maybe. We’ll see. I add a .circleci/config.yml file to my project and paste in the default settings recommended for my project by CircleCI.

[Screenshot: the .circleci/config.yml file]

Then I push this new file to master, and instruct CircleCI to start a build. This first build fails. They usually do. Looking at the output, the part of the workflow where it fell over has the error message:

The engine "node" is incompatible with this module. Expected version "6.* || 8.* || >= 10.*"

I’m not proud. Don’t sit there trying to figure things like this out. Just Google the error message and see if anyone has a fix for it. Turns out it’s common, and there’s a simple fix you can do in the config.yml file. So I fix it, push that change, and wait for a second build.

[Screenshot: a green build]

The build succeeds, but I need to make sure the test was actually run before we can continue.

[Screenshot: build output confirming the dummy test ran]

Looks like we’re in business. Time to start working on our solution.

Next, you’ll need to invite all your team mates to contribute to your GitHub project. This is where team skills help: someone needs to get all the necessary user IDs, make sure everyone is aware that invites are being sent out, and ensure everyone accepts their invite ASAP. Coordination!

While this is going on, someone should be thinking about how the finished program will be demonstrated on the customer’s laptop. Do they have a compatible version of Node.js installed already? And how will they resolve dependencies – in this case, Mocha?

Effective software design begins and ends with the user experience. The pair responsible for the final output should take care of this, I think.

Time to complete our end-to-end “Hello, world!” so our delivery pipeline joins all the dots.

The output pair add a JavaScript file that will act as the entry point for the program, and have it write “Hello, world!” to the console.

[Screenshot: the “Hello, world!” entry point]

After checking program.js works on the local command line, push it to master.

We establish that our customer – me, in this case – happens to have Git and Node.js installed, so possibly the simplest way to demonstrate the program running on my computer might be to clone the files from master into a local folder, run npm install to resolve the Mocha dependency, and then we can just run node program.js in our customer demo. (We can tidy that up later if need be, but it will pass the test.)

rmdir teamjason /s /q
mkdir teamjason
cd teamjason
git clone https://github.com/jasongorman/evil_fizzbuzz.git
cd evil_fizzbuzz
npm install

We test that it works on the customer’s laptop, and now we’re finally ready to start implementing our FizzBuzz solution.

Phew. Yaks shaved.

But where to start?

This is the second place a lot of teams go wrong. They split off into their own pairs, clone the GitHub repository, and start working on their part of the solution straight away with no overall understanding of how it will all fit together to solve the problem.

This is where mob programming can help. Before splitting off, get everyone around one computer (there’s always a projector or huge TV in the room they can use). The pair responsible for writing the final output write the code (which satisfies the rules), while the rest of the group give input on the top-level design. In simpler terms, the team works outside-in, to identify what parts will be needed and see how their part fits in.

In my illustration, I’m thinking maybe a bit of functional composition might be the way to go.


// program.js
const output = generateList().map(
  (number) => fizz(buzz(fizzBuzz(number))));

console.log(output.toString());


This is the only code the pair who are responsible for outputting the result are allowed to write, according to the rules of Evil FizzBuzz. But the functions used here don’t exist, so we can’t push this to master without breaking the build.

Here’s where we get creative. Each of the other four pairs takes their turn at the keyboard to declare their function – just an empty one for now.


// program.js
const generateList = () => {
  return [];
};

const fizz = (number) => {
};

const buzz = (number) => {
};

const fizzBuzz = (number) => {
};

const output = generateList().map(
  (number) => fizz(buzz(fizzBuzz(number))));

console.log(output.toString());


We can run this and see that it is well-formed, and produces an empty output, as we’d expect at this point. Let’s push it to master.

It’s vital for everyone to keep one eye on the build status, as it’s a signal – a pulse, if you like – every developer on a team needs to be aware of. This build succeeds.

[Screenshot: the CircleCI build history]

So, we have an end-to-end delivery pipeline, and we have a high-level design, so everyone can see how their part fits into the end solution.

This can be where pairs split off to implement their part. Now is the time to make clones, and here’s where the CI skills come into play.

Let’s say one pair is working on the Fizz part. They take a clone of master, and – because it is a TDD course, after all – write and pass their first Mocha test.


// fizz_test.js
const assert = require('assert');
const fizz = require('../program.js').fizz;

describe('Fizz', () => {
  it('1 is unchanged', () => {
    assert.equal(fizz(1), '1');
  });
});


On a green light, it’s time maybe for a bit of refactoring. The pair decide to pull the fizz function into its own file, to keep what they’re doing more separate from everyone else.
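At this point the extracted module might look something like this – a guess at its shape, since only the test is shown – with just enough behaviour to keep the “1 is unchanged” test green (the actual Fizz substitution gets driven in by the next test cases):

// fizz.js - extracted from program.js, which re-exports it so the test's
// existing require('../program.js').fizz still works (an assumption here)
const fizz = (number) => number.toString();

module.exports = { fizz };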

Having refactored the structure of the solution a little, they feel this might be a good time to share those changes with the rest of the team. This helps avoid the third mistake teams make – integrating too late, with too many potentially conflicting changes. (Many Evil FizzBuzz attempts end with about 15 minutes of merge hell.) Typically that ends with someone breaking the build and the team being disqualified.

But before pushing to master, they run all of the tests, just to be sure.

[Screenshot: the Fizz tests passing locally]

With all tests passing, it should be safe to push. Then they wait for a green build before moving on to the next test case.

[Screenshot: a build in progress]

While builds are in progress, other members of the team must be mindful that it’s not safe to push their changes until the whole process has completed successfully. They must also ensure they don’t pull changes that break the build, so everyone should be keeping one eye on the build status.

Phew. It’s green.

When you see someone else’s build succeed, that would be a good time to consider pulling any changes that have been made, and running all of the tests locally. Keeping in step with master, when working in such close proximity code-wise, is very important.

Each pair continues in this vein: pass a test, maybe do some refactoring, check in those changes, wait for a green build, pull any changes other pairs have made when you see their builds go green, and keep running those tests!

It’s also a very good idea to keep revisiting the customer test to see what visible progress is being made, and to spot any integration problems as early as possible. Does the high-level design actually work? Is each function playing its part?

Let’s pay another visit to the team after some real progress has been made. When we run the customer test program, what output do we get now?

[Screenshot: program output with every third number replaced by Fizz]

Okay, it looks like we’re getting somewhere now. The list of 100 numbers is being generated, and every third number is Fizz. Work is in progress on Buzz and FizzBuzz. If we were 45 minutes in at this point, we’d be in with a shot at beating Evil FizzBuzz.

Very quickly, the other two pieces of our jigsaw slot into place. First, the Buzzes…

[Screenshot: program output with the Buzzes in place]

And finally the FizzBuzzes.

[Screenshot: program output with the complete FizzBuzz sequence]

At this point, we’re pretty much ready for our real customer test. We shaved the yaks, we established an overall design, we test-drove the individual parts and are good to go.

So this is how – in my experience – you beat Evil FizzBuzz.

  1. Shave those yaks first! You need to pull together a complete delivery pipeline that includes getting it on to the customer’s machine and ready to demo, as soon as you can. The key is to keep things simple and to stick to standards and conventions for the technology you’ve chosen. It helps enormously, of course, if you have a good amount of experience with these tools. If you don’t, I recommend working on that before attempting Evil FizzBuzz. “DevOps” is all the rage, but surprisingly few developers actually get much practice at it. Very importantly, if your delivery pipeline isn’t up and running, the whole delivery machine is blocked. Unshaved yaks are everybody’s problem. Don’t have one pair “doing the build” while the rest of you go away and work on code. How’s your code going to get into the finished solution and on to the customer’s machine?
  2. Get the bigger picture and keep it in sight the whole time. Whether it’s through mob programming, sketching on a whiteboard or whatever – involve the whole team and nail that bird’s-eye view before you split off. And, crucially, keep revisiting your final customer test. Lack of visibility of the end product is something teams working on real products and projects cite as a major barrier to getting the right thing done. Invisible progress often turns out to be no progress at all. As ‘details people’, we tend to be bad at the bigger picture. Work on getting better at it.
  3. Integrate early and often. You might only have 3 unit tests to pass for your part in a one-hour exercise, but that’s 3 opportunities to test and share your changes with the rest of the team. And the other side of that coin – pull whenever you see someone else’s build succeed, and test their changes on your desktop straight away. 5 pairs trying to merge a bunch of changes in the last 15 minutes often becomes a train wreck. Frequent, small merges work much better on average.


Code Craft’s Value Proposition: More Throws Of The Dice

Evolutionary design is a term that’s used often, not just in software development. Evolution is a way of solving complex problems, typically with necessarily complex solutions (solutions that have many interconnected/interacting parts).

But that complexity doesn’t arise in a single step. Evolved designs start very simple, and then become complex over many, many iterations. Importantly, each iteration of the design is tested for its “fitness” – does it work in the environment in which it operates? Iterations that don’t work are rejected, iterations that work best are selected, and become the input to the next iteration.

We can think of evolution as being a search algorithm. It searches the space of all possible solutions for the one that is the best fit to the problem(s) the design has to solve.

It’s explained best perhaps in Richard Dawkins’ book The Blind Watchmaker. Dawkins wrote a computer simulation of a natural process of evolution, where 9 “genes” generated what he called “biomorphs”. The program would generate a family of biomorphs – 9 at a time – with a parent biomorph at the centre surrounded by 8 children whose “DNA” differed from the parent by a single gene. Selecting one of the children made it the parent of a new generation of biomorphs, with 8 children of their own.

[Image: Biomorphs generated by the evolutionary simulation at http://www.emergentmind.com/biomorphs]

You can find a recreation and more detailed explanation of the simulation here.

The 9 genes of the biomorphs define a universe of 118 billion possible unique designs. The evolutionary process is a walk through that universe, moving just one space in any direction – because just one gene is changing with each generation – with each iteration. From simple beginnings, complex forms can quickly arise.

A brute force search might enumerate all possible solutions, test each one for fitness, and select the best out of that entire universe of designs. With Dawkins’ biomorphs, this would mean testing 118 billion designs to find the best. And the odds of selecting the best design at random are 1:118,000,000,000. There may, of course, be many viable designs in the universe of all possible solutions. But the chances of finding one of them with a single random selection – a guess – are still very small.

For a living organism, which has many orders of magnitude more elements in its genetic code and therefore an effectively infinite solution space to search, brute force simply isn’t viable. And the chances of landing on a viable genetic code in a single step are effectively zero. Evolution solves problems not by brute force or by astronomically improbable chance, but by small, perfectly probable steps.

If we think of the genes as a language, then it’s not a huge leap conceptually to think of a programming language in the same way. A programming language defines the universe of all possible programs that could be written in that language. Again, the chances of landing on a viable working solution to a complex problem in a single step are effectively zero. This is why Big Design Up-Front doesn’t work very well – arguably at all – as a solution search algorithm. There is almost always a need to iterate the design.

Natural evolution has three key components that make it work as a search algorithm:

  • Reproduction – the creation of a new generation that has a virtually identical genetic code
  • Mutation – tiny variances in the genetic code with each new generation that make it different in some way to the parent (e.g., taller, faster, better vision)
  • Selection – a mechanism for selecting the best solutions based on some “fitness” function against which each new generation can be tested

The mutations from one generation to the next are necessarily small. A fitness function describes a fitness landscape that can be projected onto our theoretical solution space of all possible programs written in a language. Programs that differ in small ways are more likely to have very similar fitness than programs that are very different. Make one change to a working solution and, chances are, you’ve still got a working solution. Make 100 changes, and the risk of breaking things is much higher.
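A toy sketch makes those mechanics concrete. Everything here – the fitness function, the size of a mutation, the number of generations – is invented for illustration: start with a simple candidate, make one small random change per generation, and keep it only if fitness doesn’t get worse.

// Evolving a list of numbers towards a target sum, one small mutation at a time
const fitness = (candidate, target) =>
  -Math.abs(candidate.reduce((sum, n) => sum + n, 0) - target);

const mutate = (candidate) => {
  const copy = [...candidate];
  const gene = Math.floor(Math.random() * copy.length);
  copy[gene] += Math.random() < 0.5 ? 1 : -1;   // a tiny step, not a big leap
  return copy;
};

const evolve = (candidate, target, generations) => {
  for (let i = 0; i < generations; i++) {
    const child = mutate(candidate);                         // reproduction + mutation
    if (fitness(child, target) >= fitness(candidate, target)) {
      candidate = child;                                     // selection
    }
  }
  return candidate;
};

console.log(evolve([0, 0, 0], 42, 1000));   // converges on numbers summing to 42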

Evolutionary design works best when each iteration is almost identical to that last, with only one or two small changes. Teams practicing Continuous Delivery with a One-Feature-Per-Release policy, therefore, tend to arrive at better solutions than teams who schedule many changes in each release.

And within each release, there’s much more scope to test even smaller changes – micro-changes of the kind enacted in, say, refactoring, or in the micro-iterations of Test-Driven Development.

Which brings me neatly to the third component of evolutionary design: selection. In nature, the Big Bad World selects which genetic codes thrive and which are marked out for extinction. In software, we have other mechanisms.

Firstly, there’s our own version of the Big Bad World. This is the operating environment of the solution. A Point Of Sale system is ultimately selected or rejected through real use in real shops. An image manipulation program is selected or rejected by photographers and graphic designers (and computer programmers writing blog posts).

Real-world feedback from real-world use should never be underestimated as a form of testing. It’s the most valuable, most revealing, and most real form of testing.

Evolutionary design works better when we test our software in the real world more frequently. One production release a year is way too little feedback, way too late. One production release a week is far better.

Once we’ve established that the software is fit for purpose through customer testing – ideally in the real world – there are other kinds of testing we can do to help ensure the software stays working as we change it. A test suite can be thought of as a codified set of fitness functions for our solution.

One implication of the evolutionary design process is that, on average, more iterations will produce better solutions. And this means that faster iterations tend to arrive at a working solution sooner. Species with long life cycles – e.g., humans or elephants – evolve much more slowly than species with short life cycles like fruit flies and bacteria. (Indeed, bacteria evolve so fast that it’s been observed happening in the lab.) This is why health organisations have to guard against new viruses every year, but nobody’s worried about new kinds of shark suddenly emerging.

For this reason, anything in our development process that slows down the iterations impedes our search for a working solution. One key factor in this is how long it takes to build and re-test the software as we make changes to it. Teams whose build + test process takes seconds tend to arrive at better solutions sooner than teams whose builds take hours.

More generally, the faster and more frictionless the delivery pipeline of a development team, the faster they can iterate and the sooner a viable solution evolves. Some teams invest heavily in Continuous Delivery, and get changes from a programmer’s mind into production in minutes. Many teams under-invest, and changes can take weeks or months to reach the real world where the most useful feedback is to be had.

Other factors that create delivery friction include the maintainability of the code itself. Although a system may be complex, it can still be built from simple, single-purpose, modular parts that can be changed much faster and more cheaply than complex spaghetti code.

And while many BDUF teams focus on “getting it right first time”, the reality we observe is that the odds of getting it right first time are vanishingly small, no matter how hard we try. I’ll take more iterations over a more detailed requirements specification any day.

When people exclaim of code craft “What’s the point of building it right if we’re building the wrong thing?”, they fail to grasp the real purpose of the technical practices that underpin Continuous Delivery like unit testing, TDD, refactoring and Continuous Integration. We do these things precisely because we want to increase the chances of building the right thing. The real requirements analysis happens when we observe how users get on with our solutions in the real world, and feed back those lessons into a new iteration. The sooner we get our code out there, the sooner we can get that feedback. The faster we can iterate solutions, the sooner a viable solution can evolve. The longer we can sustain the iterations, the more throws of the dice we can give the customer.

That, ultimately, is the promise of good code craft: more throws of the dice.

 

Code Craft is Seat Belts for Programmers

Every so often we all get a good laugh when some unfortunate new hire or intern at a major tech company accidentally “deletes Google” on their first day. It’s easy to snigger (because, of course, none of us has ever messed up like that).

The fact is, though, that pointing and laughing when tech professionals make mistakes doesn’t stop mistakes getting made. It can also breed a toxic work culture, where people learn to avoid mistakes by not taking risks. Not taking risks is anathema to innovation, where – by definition – we’re trying stuff we’ve never done before. Want to stifle innovation where you work? Pointing and laughing is a great way to get there.

One of the things I like most about code craft is how it can promote a culture of safety to try new things and take risks.

A suite of good, fast-running unit tests, for example, makes it easier to spot our boo-boos sooner, so we can un-boo-boo them quickly and without attracting attention.

Continuous Integration offers a level of un-doability that makes it easier and safer to experiment, safe in the knowledge that if we mess it up, we can get back to the last version that worked with a simple hard reset.

The micro-cycles of refactoring mean we never stray far from the path of working code. Combine that with fast-running tests and frequent commits, and ambitious and improbable re-architecting of – say – legacy code becomes a sequence of mundane, undo-able and safe micro-rewrites.

And I can’t help feeling – when I see some poor sod getting Twitter Heat for screwing up a system in production – that the real fault lies with the deficiency in their delivery pipeline that allowed it to happen. The organisation messed up.

Software development’s a learning process. Think about when young children – or people of any age – first learn to use a computer. The fear of “breaking it” often discourages them from trying new things, and this hampers their learning process. Never underestimate just how much great innovation happens when someone says “I wonder what happens if I do this…” Remove that fear by fostering a culture of “what if…?” shielded by systems that forgive.

Code craft is seat belts for programmers.

Code Craft is More Throws Of The Dice

On the occasions founders ask me about the business case for code craft practices like unit testing, Continuous Integration and refactoring, we get to a crunch moment: will this guarantee success for my business?

Honestly? No. Nobody can guarantee that.

Unit testing can’t guarantee that. Test-Driven Development can’t guarantee that. Refactoring can’t guarantee it. Automated builds can’t guarantee it. Microservices can’t. The Cloud can’t. Event sourcing can’t. NoSQL can’t. Lean can’t. Scrum can’t. Kanban can’t. Agile can’t. Nothing can.

And that is the whole point of code craft. In the crap game of technology, every release of our product or system is almost certainly not a winning throw of the dice. You’re going to need to throw again. And again. And again. And again.

What code craft offers is more throws of the dice. It’s a very simple value proposition. Releasing working software sooner, more often and for longer improves your chances of hitting the jackpot. More so than any other discipline in software development.

Codemanship Twitter Code Craft Quiz – Answers

Yesterday evening – for fun and larks – I posted 20 quiz questions about code craft as Twitter polls. It’s been fun watching the percentages for each answer emerge, but now it’s time to reveal my answers so you can see how yours compare.

The correct answer is Always Shippable. The goal of CD is to empower our customer to release our software whenever they choose, without having to go through a long testing and release process. Many of the principles and practices of code craft – e.g., unit testing and TDD – contribute to that goal.

Evidently, a lot of folk get Continuous Delivery confused with Continuous Deployment, and that’s understandable because the name kind of implies something similar. Perhaps we should have called it “Continuously Shippable”?

The correct answer is Comment Block. There’s no such refactoring. If you want to remove code, do a Safe Delete (delete code, but only if no other code references it). If you want to keep old code, use version control.

The correct answer is Refactoring. They were separate disciplines in the original description of Extreme Programming practices, but folk quickly realised that refactoring needed to be an explicit step in the TFD process.

The correct answer is Tell, Don’t Ask. The goal of Tell, Don’t Ask is to better encapsulate – hide the data of – modules so that they know less about each other.

The correct answer is Feature Envy. Feature Envy is when a method of one class references the features of another class – typically the data – more than its own. It’s “Ask, Don’t Tell”.
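A small, made-up example of what that smell looks like, and the Tell, Don’t Ask refactoring that removes it:

// Feature Envy: the charging logic asks the account for its data
// and does the account's thinking for it.
// const overdraftCharge = (account) =>
//   account.getBalance() < 0 ? Math.abs(account.getBalance()) * account.getRate() : 0;

// Tell, Don't Ask: the behaviour moves to where the data lives,
// and the data can stay hidden inside the class.
class Account {
  constructor(balance, rate) {
    this.balance = balance;
    this.rate = rate;
  }

  overdraftCharge() {
    return this.balance < 0 ? Math.abs(this.balance) * this.rate : 0;
  }
}

const assert = require('assert');
assert.equal(new Account(-100, 0.1).overdraftCharge(), 10);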

The best answer is Examples. Yes, it is true that BDD uses executable specifications, but what makes those specifications executable? The thing that makes them executable is the thing that makes them precise and unambiguous – Examples! BDD, TDD and ATDD are all examples of Specification By Example.

The correct answer is the Facade pattern.

The correct answer is Property-Based Testing. This is sometimes more descriptively called “Generative Testing”, because we write code to generate potentially very large sets of test inputs automatically (e.g., random numbers, combinations of inputs, etc). It has a similar aim to Exploratory Testing, but isn’t manual like ET, and therefore can scale to mind-boggling numbers of test cases with minimal extra code, and run far, far faster.
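Real property-based testing libraries generate and shrink inputs for you, but a hand-rolled loop shows the principle well enough – generate lots of inputs, and assert a property that must hold for all of them:

// Property: for any non-negative number, squaring its square root
// should get us back (approximately) to where we started.
const assert = require('assert');

describe('Math.sqrt', () => {
  it('is the inverse of squaring, for any non-negative input', () => {
    for (let i = 0; i < 1000; i++) {
      const input = Math.random() * 1e6;   // generated test case
      const result = Math.sqrt(input);
      assert.ok(Math.abs(result * result - input) < 1e-6,
        `property failed for generated input ${input}`);
    }
  });
});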

The correct answer is Automated Testing. If it takes you 5 hours to manually re-test your software, you can only check in safely every 5 hours at the earliest. Which doesn’t sound very “continuous” to me. Good to see that message getting through.

The best answer is Stubs and Mocks. The challenge in testing multithreaded logic is that thread scheduling – e.g., by the OS or a VM – is usually beyond our control, so we can’t guarantee how operations in separate threads will be interleaved. This can lead to unpredictable test results that are difficult to reproduce – “Heisenbugs” and “flickering builds”. One simple way to reduce this effect is to test as much “multithreaded” logic as possible in a single thread. Test Doubles can be used to pretend to be the other end of a multithreaded conversation. For example, we can use mock objects to test that callbacks were invoked as expected, or we can use stubs that provide synchronous implementations of asynchronous methods. The goal is to get as much of the logic as possible into places where it can be tested synchronously. This is compatible with a goal of good multithreaded code design – which is to have as little of it as possible.
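JavaScript keeps its concurrency behind callbacks and promises rather than explicit threads, but the same trick applies – substitute a Test Double that completes the conversation synchronously, so the logic can be tested deterministically. A rough sketch, with invented names:

// The production channel would deliver asynchronously; the ordering logic doesn't care.
const placeOrder = (order, channel, onConfirmed) => {
  channel.send({ type: 'ORDER', payload: order }, (reply) =>
    onConfirmed(reply.status === 'OK'));
};

const assert = require('assert');

describe('placeOrder', () => {
  it('confirms the order when the channel replies OK', () => {
    // Stub channel: invokes the callback synchronously - no real concurrency involved.
    const stubChannel = { send: (message, reply) => reply({ status: 'OK' }) };
    let confirmed = false;

    placeOrder({ id: 1 }, stubChannel, (ok) => { confirmed = ok; });

    assert.equal(confirmed, true);
  });
});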

The correct answer is Tell, Don’t Ask. I was very surprised by how few people got this. Tell, Don’t Ask is about designing more cohesive classes in order to reduce class coupling. The underlying goal of Common Closure – things that change together belong together – and Common Reuse – things that are reused together belong together – is more cohesive packages, in order to reduce package coupling. They share the goal of improving encapsulation. IMO, package design principles have been historically explained poorly, and this may go some way to explaining why a lot of developers struggle to grok them. In practice, they’re the exact same principles at the class/module and package level. The way I try to explain them attempts to be consistent at every level of code organisation.

The correct answer is 3. This is about the Rule of Three. We wait to see three examples of code duplication before we refactor it into a single generalisation or abstraction. The rule of thumb describes a simple way to balance the risks of refactoring too early, before we’ve seen enough examples to form a good abstraction (the number one cause of “leaky abstractions”), and refactoring too late, when we have more duplication to deal with.
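As a trivial, made-up illustration: once the third copy of the same shape of check turns up, the generalisation is usually obvious enough to extract safely.

// Third occurrence of the same duplicated shape...
// if (email === '') throw new Error('email is required');
// if (name === '') throw new Error('name is required');
// if (postcode === '') throw new Error('postcode is required');

// ...is the signal to refactor it into a single generalisation.
const required = (value, field) => {
  if (value === '') throw new Error(`${field} is required`);
  return value;
};

required('jason@example.com', 'email');
required('Jason', 'name');
required('SW1A 1AA', 'postcode');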

The best answer is Identify Change Points. In his book, Working Effectively With Legacy Code, Michael Feathers describes a process for safely making changes – i.e., with the benefit of fast-running automated tests (“unit tests”) – to legacy software. There are two reasons why I wouldn’t start by writing a system test:

  1. How do I know what system tests I’ll need without identifying which features lie on the critical path of the code I’m going to change? Do I write system tests for all of it?
  2. How long do I want to live with those system tests? Is it worth writing them just to have them for as long as it takes to introduce unit tests? My goal is to get fast-running tests for the logic in place ASAP.

If I’m refactoring code that has few or no automated tests, a Golden Master – a test that uses an example output (e.g., a web page) to compare against any potentially broken output – can be a relatively quick way of establishing basic safety. But, again, how do I know what output(s) to use without identifying which features would need to be retested for the change I’m planning to make? And a Golden Master test would effectively be another slow-running system test, which I probably wouldn’t want to live with for long enough to justify writing one in the first place.

After we’ve identified what parts of the code need to change, our goal should be to get fast-running tests around those parts of the code. While we break any dependencies that are getting in our way, I will usually re-test the software manually. Gasp! The point being, I’m not manually testing it for very long before I can add unit tests. It might take me a morning. Is it worth automating system tests that you’re not going to want to rely on going forward, just for a morning?

Having said all that, if I was the only developer on my team writing unit tests on a legacy system, I’d introduce a Golden Master into the build pipeline to protect against obvious regressions. But not on a per change basis. I’d do that before even thinking about changes.
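For anyone unfamiliar with the technique, the gist of a Golden Master is just this – capture a known-good output once, approve it, and diff every subsequent run against it. A hedged sketch, with a hypothetical report script and approved file:

// golden_master_test.js - compares today's output against a saved known-good copy
const assert = require('assert');
const fs = require('fs');
const { execSync } = require('child_process');

describe('golden master', () => {
  it('produces the same report as the approved version', () => {
    // Hypothetical: the legacy app renders a report we can capture as text
    const actual = execSync('node legacy_report.js --customer 123').toString();
    const approved = fs.readFileSync('approved/report_123.txt', 'utf8');

    assert.equal(actual, approved);
  });
});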

The best answer is Check In. I would have hoped that wouldn’t need explaining! A big part of the discipline of Continuous Integration is to try to ensure that the code you have in VCS – the code that is, in theory, always shippable – is never broken. When it is broken – for whatever reason – any changes you push on to it risk being lost if the code has to be reverted. Plus, there’s no way of knowing if your build succeeded. Don’t push on to broken code.

The correct answer is C++. If I change a C++ interface, even clients that aren’t using the methods I changed have to be recompiled, re-tested and re-deployed. C++ clients bind to the whole interface at compile time. In dynamic languages, this generally isn’t the case. Ruby, Python and JavaScript clients bind at runtime, and only to the methods they use. Indeed, the object doesn’t even have to have a specific type, just as long as it implements compatible methods. Much of S.O.L.I.D. is language-dependent in this way.

The correct answer is See The Test Fail. More specifically, see the test assertion fail. So you know, going forward, it’s a good test that you can rely on to fail when the result is wrong. Test your tests.

The best answer is When The Tests Pass. Refactoring was added as an explicit step in the TDD micro-cycle. But refactor what, exactly? I encourage developers to do a little review on code they’ve added or changed whenever they get to a green light:

  • Is it easy to understand?
  • Is there duplication I should remove?
  • Is it as simple as I can make it?
  • Does each part do one thing?
  • Is there Feature Envy between modules?
  • Are modules exposed to things they’re not using?
  • Are module dependencies easily swappable?

I find from experience and from client studies that code reviews on a less frequent basis tend to be too little, too late. TDD and refactoring and CI/CD are practices specifically aimed at breaking work down into the smallest chunks, so we can get very frequent feedback, and bring more focus to each design decision.

And when we’re programming in pairs, the thinking is that code review is continuous. It’s one of the main reasons we do it.

When we chunk code reviews into pull requests – or even larger batches of design decisions – we tend to miss a whole bunch of things. This is borne out by the resulting quality of the code.

I also see how, for many teams, pull requests become a significant bottleneck, which is usually the consequence of batching feedback. The whole point of Extreme Programming is to turn all the good dials up to 11. PR code reviews set the dial at about 5-6.

If you still feel your merge process needs that last line of defence, consider investing in automating code quality testing in your build pipeline instead.

It’s a hot take for PR fans, I know! You may now start throwing the furniture around.

The best answer is Refactoring. This has been a painful lesson for many, many developers. When we open up discussions about refactoring with people who manage our time, the risk is that we’re inviting them to say “no” to it. And, nine times out of ten, they will. Which is why 9 out of 10 code bases end up too rigid and brittle to accommodate change, and the pace of innovation slows to a very expensive crawl.

Refactoring is an absolutely essential part of code craft. We should be doing it continuously. It’s part of how we write code. End of discussion.

The correct answer is Liskov Substitution. The LSP states that we should be able to substitute an instance of any class with an instance of any of its subclasses. (In modern parlance, we might use the word “type” instead of “class”.) This is all about contracts. If I define an interface for, say, a device driver to be used with my operating system, there are certain rules all device drivers need to obey to function correctly in my OS. I could write a suite of contract tests – tests that are written against that interface, with the actual implementation under test deferred/injected – so that anyone implementing a device driver can assure themselves it will obey the device driver contract. Indeed, this is exactly what Microsoft did for their device driver interfaces.
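Here’s a sketch of what a contract test can look like in JavaScript – the suite is written once against the interface, and each implementation is injected via a factory (all the names are invented for illustration):

// device_driver_contract.js - reusable contract tests for any device driver
const assert = require('assert');

const itBehavesLikeADeviceDriver = (createDriver) => {
  it('is not ready before initialisation', () => {
    assert.equal(createDriver().isReady(), false);
  });

  it('becomes ready once initialised', () => {
    const driver = createDriver();
    driver.initialise();
    assert.equal(driver.isReady(), true);
  });
};

// Each implementer runs the same contract against their own driver
describe('PrinterDriver honours the device driver contract', () => {
  class PrinterDriver {
    constructor() { this.ready = false; }
    initialise() { this.ready = true; }
    isReady() { return this.ready; }
  }

  itBehavesLikeADeviceDriver(() => new PrinterDriver());
});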

The best answer is True. Now, this is going to take some explaining…

Firstly, if we include Specification By Example in code craft – which I do – then a good chunk of it is about pinning down what the customer wants. It may not necessarily turn out to be what the customer needs, though. Which is what the rest of code craft is about.

The traditional view of requirements engineering is that we try to specify up-front what the customer needs and then deliver that. We learned that this doesn’t really work almost as soon as people started programming computers.

Our first pass at a solution will almost always be – to some extent – wrong. So we take another pass and get it less wrong. And another. And another. Until the solution is good enough for our customer’s real needs.

In building the right thing, feedback cycles matter more than up-front guesses. The faster we can iterate our design, the sooner we can arrive at a workable solution. Fast customer feedback cycles are enabled by code craft. The whole point of code craft is to help us learn our way to the right solution.

Acting on customer feedback means we’ll be changing the code. If the code is difficult to change, then we can’t sustain the pace of learning. The wrong design gets baked in to code that’s too rigid and brittle to evolve into the right design.

And software can have an operational lifespan that far outlasts the original needs of the customer. Legacy code is a very real and very damaging limiting factor on tens of thousands of businesses. Marketing would love to be able to offer their customers the spiffy new widget the competition just rolled out, but if it’s going to cost millions and take years, it’s not an option.

So, in a very real and direct sense, code craft is all about building the right thing by building it right.