On the Codemanship TDD course, before we discuss Continuous Delivery and how essential it is to achieving real agility, we talk about the Test Pyramid.
It has various interpretations, in terms of the exactly how many layers and exactly what kinds of testing each layer is made of (unit, integration, service, controller, component, UI etc), but the overall sentiment is straightforward:
The longer tests take to run, the fewer of those kinds of tests you should aim to have
The idea is that the tests we run most often need to be as fast as possible (otherwise we run them less often). These are typically described as “unit tests”, but that means different things to different people, so I’ll qualify: tests that do not involve any external dependencies. They don’t read from or write to databases, they don’t read or write files, they don’t connect with web services, and so on. Everything that happens in these tests happens inside the same memory address space. Call them In-Process Tests, if you like.
Tests that necessarily check our code works with external dependencies have to cross process boundaries when they’re executed. As our In-Process tests have already checked the logic of our code, these Cross-Process Tests check that our code – the client – and the external code – the suppliers – obey the contracts of their interactions. I call these “integration tests”, but some folk have a different definition of integration test. So, again, I qualify it as: tests that involve external dependencies.
These typically take considerably longer to execute than “unit tests”, and we should aim to have proportionally fewer of them and to run them proportionally less often. We might have thousands of unit tests, and maybe hundreds of integration tests.
If the unit tests cover the majority of our code – say, 90% of it – and maybe 10% of our code has direct external dependencies that have to be tested, on average we’ll make about 9 changes that need unit testing compared to 1 change that needs integration testing. In other words, we’d need to run our unit tests 9x as often as our integration tests, which is a good thing if each integration test is about 9 times slower than a unit test.
At the top of our test pyramid are the slowest tests of all. Typically these are tests that exercise the entire system stack, through the user interface (or API) all the way down to the external dependencies. These tests check that it all works when we plug everything together and deploy it into a specific environment. If we’ve already tested the logic of our code with unit tests, and tested the interactions with external suppliers, what’s left to test?
Some developers mistakenly believe that these system-levels tests are for checking the logic of the user experience – user “journeys”, if you like. This is a mistake. There are usually a lot of user journeys, so we’d end up with a lot of these very slow-running tests and an upside-down pyramid. The trick here is to make the logic of the user experience unit-testable. View models are a simple architectural pattern for logically representing what users see and what users do at that level. At the highest level they may be looking at an HTML table and clicking a button to submit a form, but at the logical level, maybe they’re looking at a movie and renting it.
A view model can help us encapsulate the logic of user experience in a way that can be tested quickly, pushing most of our UI/UX tests down to the base of the pyramid where they belong. What’s left – the code that must directly reference physical UI elements like HTML tables and buttons – can be wafer thin. At that level, all we’re testing is that views are rendered correctly and that user actions trigger the correct internal logic (which can easily be done using mock objects). These are integration tests, and belong in the middle layer of our pyramid, not the top.
Another classic error is to check core logic through the GUI. For example, checking that insurance premiums are calculated correctly by looking at what number is rendered on that web page. Some module somewhere does that calculation. That should be unit-testable.
So, if they’re not testing user journeys, and they’re not testing core logic, what do our system tests test? What’s left?
Well, have you ever found yourself saying “It worked on my machine”? The saying goes “There’s many a slip ‘twixt cup and lip.” Just because all the pieces work, and just because they all play nicely together, it’s not guaranteed that when we deploy the whole system into, say, our EC2 instances, that nothing could be different to the environments we tested it in. I’ve seen roll-outs go wrong because the servers handled dates different, or had the wrong locale, or a different file system, or security restrictions that weren’t in place on dev machines.
The last piece of the jigsaw is the system configuration, where our code meets the real production environment – or a simulation of it – and we find out if really works where it’s intended to work as a whole.
We may need dozens of those kinds of tests, and perhaps only need to run them on, say, every CI build by deploying the outputs to a staging environment that mirrors the production environment (and only if all our unit and integration tests pass first, of course.) These are our “good to go?” tests.
The shape of our test pyramid is critical to achieving feedback loops that are fast enough to allow us to sustain the pace of development. Ideally, after we make any change, we should want to get feedback straight away about the impact of that change. If 90% of our code can be re-tested in under 30 seconds, we can re-test 90% of our changes many times an hour and be alerted within 30 seconds if we broke something. If it takes an hour to re-test our code, then we have a problem.
Continuous Delivery means that our code is always shippable. That means it must always be working, or as near as possible always. If re-testing takes an hour, that means that we’re an hour away from finding out if changes we made broke the code. It means we’re an hour away from knowing if our code is shippable. And, after an hour’s-worth of changes without re-testing, chances are high that it is broken and we just don’t know it yet.
An upside-down test pyramid puts Continuous Delivery out of your reach. Your confidence that the code’s shippable at any point in time will be low. And the odds that it’s not shippable will be high.
The impact of slow-running test suites on development is profound. I’ve found many times that when a team invested in speeding up their tests, many other problems magically disappeared. Slow tests – which means slow builds, which means slow release cycles – is like a development team’s metabolism. Many health problems can be caused by a slow metabolism. It really is that fundamental.
Slow tests are pennies to the pound of the wider feedback loops of release cycles. You’d be surprised how much of your release cycles are, at the lowest level, made up of re-testing cycles. The outer feedback loops of delivery are made of the inner feedback loops of testing. Fast-running automated tests – as an enabler of fast release cycles and sustained innovation – are therefore highly desirable
A right-way-up test pyramid doesn’t happen by accident, and doesn’t come at no cost, though. Many organisations, sadly, aren’t prepared to make that investment, and limp on with upside-down pyramids and slow test feedback until the going gets too tough to continue.
As well as writing automated tests, there’s also an investment needed in your software’s architecture. In particular, the way teams apply basic design principles tends to determine the shape of their test pyramid.
I see a lot of duplicated code that contains duplicated external dependencies, for example. It’s not uncommon to find systems with multiple modules that connect to the same database, or that connect to the same web service. If those connections happened in one place only, that part of the code could be integration tested just once. D.R.Y. helps us achieve a right-way-up pyramid.
I see a lot of code where a module or function that does a business calculation also connects to an external dependency, or where a GUI module also contains business logic, so that the only way to test that core logic is with an integration test. Single Responsibility helps us achieve a right-way-up pyramid.
I see a lot of code where a module in one web service interacts with multiple features of another web service – Feature Envy, but on a larger scale – so there are multiple points of integration that require testing. Encapsulation helps us achieve a right-way-up pyramid.
I see a lot of code where a module containing core logic references an external dependency, like a database connection, directly by its implementation, instead of through an abstraction that could be easily swapped by dependency injection. Dependency Inversion helps us achieve a right-way-up pyramid.
Achieving a design with less duplication, where modules do one job, where components and services know as little as possible about each other, and where external dependencies can be easily stubbed or mocked by dependency injection, is essential if you want your test pyramid to be the right way up. But code doesn’t get that way by accident. There’s significant ongoing effort required to keep the code clean by refactoring. And that gets easier the faster your tests run. Chicken, meet egg.
If we’re lucky enough to be starting from scratch, the best way we know of to ensure a right-way-up test pyramid is to write the tests first. This compels us to design our code in such a way that it’s inherently unit-testable. I’ve yet to come across a team genuinely doing Continuous Delivery who wasn’t doing some kind of TDD.
If you’re working on legacy code, where maybe you’re relying on browser-based tests, or might have no automated tests at all, there’s usually a mountain to climb to get a test pyramid that’s the right way up. You need to write fast-running tests, but you will probably need to refactor the code to make that possible. Egg, meet chicken.
Like all mountains, though, it can be climbed. One small, careful step at a time. Michael Feather’s book Working Effectively With Legacy Code describes a process for making changes safely to code that lacks fast-running automated tests. It goes something like this:
- Identify what code you need to change
- Identify where around that code you’d want unit tests to make the change safely
- Break any dependencies in that code getting in the way of unit testing
- Write the unit tests
- Make the change
- While you’re there, make other improvements that will help the next developer who needs to change that code (the “boy scout rule” – leave the camp site tidier than you found it)
Change after change, made safely in this way, will – over time – build up a suite of fast-running unit tests that will make future changes easier. I’ve worked on legacy code bases that went from upside-down test pyramids of mostly GUI-based system tests, that took hours or even days to run, to right-side-up pyramids where most of the code could be tested in under a minute. The impact on the cost and the speed of delivery is always staggering. It can be done.
But be patient. A code base might take a year or two to turn around, and at first the going will be tough. I find I have to be super-disciplined in those early stages. I manually re-test as I refactor, and resist the temptation to make a whole bunch of changes at a time before I re-test. Slow and steady, adding value and clearing paths for future changes at the same time.