Fast-Running Tests Are Key To Agility. But How Fast Is ‘Fast’?

I’ve written before about how vital it is to be able to re-test our software quickly, so we can ensure it’s always shippable after every change we make to the code.

Achieving a fast-running test suite requires us to engineer our tests so that the vast majority run as quickly as possible, and that means most of our tests don’t involve any external dependencies like databases or web services.

If we visualise our test suites as a pyramid, the base of the pyramid – the bulk of the tests – should be these in-memory tests (let’s call them ‘unit tests’ for the sake of argument). The tip of the pyramid – the slowest running tests – would typically be end-to-end or system tests.

But one person’s “fast” is often another person’s “slow”. It’s kind of ambiguous as to what I and others mean when we say “your tests should run fast”. For a team relying on end-to-end tests that can take many seconds to run, 100ms sounds really fast. For a team relying on unit tests that take 1 or 2 milliseconds, 100ms sounds really slow.

A Twitter follower asked me how long a suite of 5,000 tests should take to run. If the test suite's organised into an ideal pyramid, then – and, of course, these are very rough numbers based on my own experience – it might look something like this:

  • The top of the pyramid would be end-to-end tests. Let’s say each of those takes 1 second. You should aim to have about 1% of your tests be end-to-end tests. So, 50 tests = 50s.
  • The middle of the pyramid would be integration and contract tests that check interactions with external dependencies. Maybe they each take about 100ms to run. You should aim to have less than 10% of those kinds of tests, so about 500 tests = 50s.
  • The base of the pyramid should be the remaining 4450 unit tests, each running in roughly 1-10ms. Let’s take an average of 5ms. 4450 unit tests = 22s.

You’d be in a good place if the entire suite could run in about 2 minutes.

Of course, these are ideal numbers. But it’s the ballpark we’re interested in. System tests run in seconds. Integration tests run in 100s of milliseconds. Unit tests run in milliseconds.

It’s also worth bearing in mind you wouldn’t need to run all of the tests all of the time. Integration and contract tests, for example, only need to be run when you’ve changed integration code. If that’s 10% of the code, then we might need to run them 10% of the time. End-to-end tests might be run even less frequently (e.g., in CI).

Now, what if your pyramid was actually a diamond shape, with the bulk of your tests hitting external processes? Then your test suite would take about 8 minutes to run, and you’d have to run those integration tests 90% of the time. Most teams would find that a serious bottleneck.

And if your pyramid was upside-down, with most tests being end-to-end, then you're looking at 75 minutes for each test run, 90% of the time. I've seen those kinds of test execution times literally kill businesses by crippling their ability to evolve their products and systems.
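
Again, these are my rough illustrative numbers, but the arithmetic holds: a diamond-shaped suite with ~4,450 tests hitting external processes at 100ms each is about 7.5 minutes of integration testing, plus the end-to-end tests – roughly 8 minutes per run. An upside-down pyramid with ~4,450 end-to-end tests at 1s each is about 74 minutes – roughly 75 minutes per run.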

What Can We Learn From The Movie Industry About Testing Feedback Loops?

In any complex creative endeavour – and, yes, software development is a creative endeavour – feedback is essential to getting it right (or, at least, less wrong).

The best development approaches tend to be built around feedback loops, and the last few decades of innovation in development practices and processes have largely focused on shrinking those feedback loops so we can learn our way to Better faster.

When we test our software, that’s a feedback loop, for example. Although far less common these days, there are still teams out there doing it manually. Their testing feedback loops can last days or even weeks. Many teams, though, write fast-running automated tests, and can test the bulk of their code in minutes or even seconds.

What difference does it make if your tests take days instead of seconds?

To illustrate, I’m going to draw a parallel with movie production. Up until the late 1960s, feedback loops in movie making were at best daily. Footage shot on film during the day were processed by a lab and then watched by directors, producers, editors and so on at the end of the day. Hence the movie industry term “dailies”. If a shot didn’t come out right – maybe the performance didn’t fit into the context of that scene with a previous scene (the classic “boom microphone in shot” or “character just ran 6 miles but is mysteriously not out of breath” spring to mind) – chances are the production team wouldn’t know until they saw the footage later.

That could well mean going back and reshooting some scenes. That means calling back the actors and the crew, and potentially remounting the whole thing if the sets have already been pulled down. Expensive. Sometimes prohibitively expensive, which is why lower-budget productions had little choice but to keep those shots in their theatrical releases.

In the 1960s, comedy directors like Jerry Lewis and Blake Edwards pioneered the use of Video assist. These were systems that enabled the same footage to be recorded simultaneously on film and on videotape, so actors and directors could watch takes back as soon as they’d been captured, and correct mistakes there and then when the actors, crew, sets and so on were all still there. Way, way cheaper than remounting.

The speed of testing feedback in software development has a similar impact. If I make a change that breaks the code, and my code is tested overnight, say, then I probably won’t know it’s broken until the next day (or the next week, or the next month, or the next year when a user reports the bug).

But I’ve already moved on. The sets have been dismantled, so to speak. To fix a bug long after the fact requires the equivalent of remounting a shoot in movies. Time has to be scheduled, the developer has to wrap their head around that code again, and the bug fix has to go through the whole testing and release process again. Far more expensive. Often orders of magnitude more expensive. Sometimes prohibitively expensive, which is why many teams ship software they know has bugs, but they just don’t have budget to fix them (or, at least, they believe they’re not worth fixing.)

If my code is tested in seconds, that’s like having Video assist. I can make one change and run the tests. If I broke the code, I’ll know there and then, and can easily fix it while I’m still in the zone.

Just as Video assist helps directors make better movies for less money, fast-running automated tests can help us deliver more reliable software with less effort. This is a measurable effect (indeed, it has been measured), so we know it works.

Stuck In Service-Oriented Hell? You Need Contract Tests

As our system architectures get increasingly distributed, many teams experience the pain of writing code that consumes services through APIs that are changing.

Typically, we don’t know that a non-backwards-compatible change to an external dependency has happened until our own code suddenly stops working. I see many organisations spending a lot of time fixing those problems. I shudder to think how much time and money, as an industry, we’re wasting on it.

Ideally, developers of APIs wouldn’t make changes that break contracts with client code. But this is not an ideal world.

What would be very useful is an early warning system that flags up the fact that a change we’ve made is going to break client code before we release it. As a general rule with bugs, the sooner we know, the cheaper they are to fix.

Contract tests are a good tool for getting that early warning. A contract test is a kind of integration test that focuses specifically on the interaction between our code and an external dependency. When they fail, they pinpoint the source of the problem: namely that something has likely changed at the other end.

There are different ways of writing contract tests, but one of my favourites is to use Abstract Tests. Take this example from my Guitar Shack code:

package com.guitarshack;
import com.guitarshack.net.Network;
import com.guitarshack.net.RESTClient;
import com.guitarshack.net.RequestBuilder;
import com.guitarshack.sales.SalesData;
import org.junit.Test;
import java.util.Calendar;
import static org.junit.Assert.assertTrue;
/*
This Abstract Test allows us to create two versions of the set-up, one with
stubbed JSON and one that actually connects to the web service, effectively pinpointing
whether an error has occurred because of a change to our code or a change to the external dependency
*/
public abstract class SalesDataTestBase {

    @Test
    public void fetchesSalesData(){
        SalesData salesData = new SalesData(new RESTClient(getNetwork(), new RequestBuilder()), () -> {
            Calendar calendar = Calendar.getInstance();
            calendar.set(2019, Calendar.JANUARY, 31);
            return calendar.getTime();
        });
        int total = salesData.getTotal(811);
        assertTrue(total > 0);
    }

    protected abstract Network getNetwork();
}

The getNetwork() method is left abstract so it can be overridden in subclasses.

One implementation uses a stub implementation of the Network interface that returns hardcoded JSON, so I can unit test most of my SalesData class.

package com.guitarshack.unittests;
import com.guitarshack.net.Network;
import com.guitarshack.SalesDataTestBase;
public class SalesDataUnitTest extends SalesDataTestBase {

    @Override
    protected Network getNetwork() {
        return request -> "{\"productID\":811,\"startDate\":\"7/17/2019\",\"endDate\":\"7/27/2019\",\"total\":31}";
    }
}

Another implementation returns the real implementation of Network, called Web, and connects to a real web service hosted as an AWS Lambda.

package com.guitarshack.contracttests;
import com.guitarshack.*;
import com.guitarshack.net.Network;
import com.guitarshack.net.Web;
/*
If this test fails when the SalesDataUnitTest is still passing, this indicates a change
in the external API
*/
public class SalesDataContractTest extends SalesDataTestBase {

    @Override
    protected Network getNetwork() {
        return new Web();
    }
}

If the contract test suddenly starts failing while the unit test is still passing, that’s a big hint that the problem is at the other end.

To illustrate, I changed the output field ‘total’ to ‘salesTotal’ in the JSON output by the web service. See what happens when I run my tests.

The contract test fails (as well as a larger integration test, which wouldn’t pinpoint the source of the problem as effectively), while the unit test version is still passing.

When I change ‘salesTotal’ back to ‘total’, all the tests pass again.

This is very handy for me as a client code developer writing code that consumes the sales data API. But it would be even more useful for the developer of that API to be able to run my contract tests, perhaps before a release, or as an overnight job, so they could get early warning that their changes have broken the contract.

For teams who are able to access each other’s code (e.g., on GitHub), that’s quite straightforward. I could rig up my Maven project to enable developers to build and run just those tests, for example. Notice that my unit and contract tests are in different packages to make that easy.

For teams who can’t access each other’s repos, it may take a little more ingenuity. But we’re probably used to seeing our code built and tested on other machines (e.g., on cloud CI servers) and getting the results back over the web. It’s not rocket science to offer Contract Testing as a Service. You could then give the API developers exclusive – possibly secured – access to your contract test builds over HTTP.

I’ve seen contract testing – done well – save organisations a lot of blood, sweat and tears. At the very least, it can defend API developers from breaking the First Law of Software Development:

Thou shalt not break shit that was working

Jason Gorman

Readable Parameterized Tests

Parameterized tests (sometimes called “data-driven tests”) can be a useful technique for removing duplication from test code, as well as potentially buying teams much greater test assurance with surprisingly little extra code.

But they can come at the price of readability. So if we’re going to use them, we need to invest some care in making sure it’s easy to understand what the parameter data means, and to ensure that the messages we get when tests fail are meaningful.

Some testing frameworks make it harder than others, but I’m going to illustrate using some mocha tests in JavaScript.

Consider this test code for a Mars Rover:

it("turns right from N to E", () => {
let rover = {facing: "N"};
rover = go(rover, "R");
assert.equal(rover.facing, "E");
})
it("turns right from E to S", () => {
let rover = {facing: "E"};
rover = go(rover, "R");
assert.equal(rover.facing, "S");
})
it("turns right from S to W", () => {
let rover = {facing: "S"};
rover = go(rover, "R");
assert.equal(rover.facing, "W");
})
it("turns right from W to N", () => {
let rover = {facing: "W"};
rover = go(rover, "R");
assert.equal(rover.facing, "N");
})

These four tests are different examples of the same behaviour, and there’s a lot of duplication (I should know – I copied and pasted them myself!)

We can consolidate them into a single parameterised test:

[{input: "N", expected: "E"}, {input: "E", expected: "S"}, {input: "S", expected: "W"},
{input: "W", expected: "N"}].forEach(
function (testCase) {
it("turns right", () => {
let rover = {facing: testCase.input};
rover = go(rover, "R");
assert.equal(rover.facing, testCase.expected);
})
})

While we’ve removed a fair amount of duplicate test code, arguably this single parameterized test is harder to follow – both at read-time, and at run-time.

Let’s start with the parameter names. Can we make it more obvious what roles these data items play in the test, instead of just using generic names like “input” and “expected”?

[{startsFacing: "N", endsFacing: "E"}, {startsFacing: "E", endsFacing: "S"}, {startsFacing: "S", endsFacing: "W"},
{startsFacing: "W", endsFacing: "N"}].forEach(
function (testCase) {
it("turns right", () => {
let rover = {facing: testCase.startsFacing};
rover = go(rover, "R");
assert.equal(rover.facing, testCase.endsFacing);
})
})

And how about we format the list of test cases so they’re easier to distinguish?

[
    {startsFacing: "N", endsFacing: "E"},
    {startsFacing: "E", endsFacing: "S"},
    {startsFacing: "S", endsFacing: "W"},
    {startsFacing: "W", endsFacing: "N"}
].forEach(
    function (testCase) {
        it("turns right", () => {
            let rover = {facing: testCase.startsFacing};
            rover = go(rover, "R");
            assert.equal(rover.facing, testCase.endsFacing);
        })
    })

And how about we declutter the body of the test a little by destructuring the testCase object?

[
    {startsFacing: "N", endsFacing: "E"},
    {startsFacing: "E", endsFacing: "S"},
    {startsFacing: "S", endsFacing: "W"},
    {startsFacing: "W", endsFacing: "N"}
].forEach(
    function ({startsFacing, endsFacing}) {
        it("turns right", () => {
            let rover = {facing: startsFacing};
            rover = go(rover, "R");
            assert.equal(rover.facing, endsFacing);
        })
    })

Okay, hopefully this is much easier to follow. But what happens when we run these tests?

It’s not at all clear which test case is which. So let’s embed some identifying data inside the test name.

[
    {startsFacing: "N", endsFacing: "E"},
    {startsFacing: "E", endsFacing: "S"},
    {startsFacing: "S", endsFacing: "W"},
    {startsFacing: "W", endsFacing: "N"}
].forEach(
    function ({startsFacing, endsFacing}) {
        it(`turns right from ${startsFacing} to ${endsFacing}`, () => {
            let rover = {facing: startsFacing};
            rover = go(rover, "R");
            assert.equal(rover.facing, endsFacing);
        })
    })

Now when we run the tests, we can easily identify which test case is which.

With a bit of extra care, it’s possible with most unit testing tools – not all, sadly – to have our cake and eat it with readable parameterized tests.

Slow Tests Kill Businesses

I’m always surprised at how few organisations track some pretty fundamental stats about software development, because if they did then they might notice what’s been killing their business.

It’s a picture I’ve seen many, many times; a software product or system is created, and it goes live. But it has bugs. Many bugs. So, a bigger chunk of the available development time is used up fixing bugs for the second release. Which has even more bugs. Many, many bugs. So an even bigger chunk of the time is used to fix bugs for the third release.

It looks a little like this:

Over the lifetime of the software, the proportion of development time devoted to bug fixing increases until that’s pretty much all the developers are doing. There’s precious little time left for new features.

Naturally, if you can only spare 10% of available dev time for new features, you’re going to need 10 times as many developers. Right? This trend is almost always accompanied by rapid growth of the team.

So the 90% of dev time you’re spending on bug fixing is actually 90% of the time of a team that’s 10x as large – 900% of the cost of your first release, just fixing bugs.

So, in real terms, every new feature in the eighth release ends up costing 10x what it would have cost in the first. For most businesses, this rules out change – unless they’re super, super successful (i.e., lucky). It’s just too damned expensive.
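
To spell that arithmetic out, using these same illustrative numbers: if the first release was delivered by a team of N developers spending nearly all their time on features, then by the eighth release the team is 10N developers spending 90% of their time on bugs. Bug fixing alone now consumes 0.9 × 10N = 9N developers’ worth of effort – roughly 900% of the first release’s total cost – while feature work gets 0.1 × 10N = N developers’ worth: the same feature capacity as release one, but delivered by a team costing ten times as much.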

And when you can’t change your software and your systems, you can’t change the way you do business at scale. Your business model gets baked in – petrified, if you like. And all you can do is throw an ever-dwindling pot of money at development just to stand still, while you watch your competitors glide past you with innovations you’ll never be able to offer your customers.

What happens to a business like that? Well, they’re no longer in business. Customers defected in greater and greater numbers to competitor products, frustrated by the flakiness of the product and tired of being fobbed off with promises about upgrades and hotly requested features and fixes that never arrived.

Now, this effect is entirely predictable. We’ve known about it for many decades, and we’ve known the causal mechanism, too.

Source: IBM System Science Institute

The longer a bug goes undetected, the more it costs to fix – and that cost grows exponentially. In terms of process, the sooner we test new or changed code, the cheaper the fix is. This effect is so marked that teams actually find that if they speed up testing feedback loops – testing earlier and more often – they deliver working software faster.

This is very simply because they save more time downstream on bug fixes than they invest in earlier and more frequent testing.

The data used in the first two graphs was taken from a team that took more than 24 hours to build and test their code.

Here’s the same stats from a team who could build and test their code in less than 2 minutes (I’ve converted from releases to quarters to roughly match the 12-24 week release cycles of the first team – this second team was actually releasing every week):

This team has nearly doubled in size over the two years, which might sound bad – but it’s more of a rosy picture than the first team, whose costs spiraled to more than 1000% of their first release, most of which was being spent fixing bugs and effectively going round and round in circles chasing their own tails while their customers defected in droves.

I’ve seen this effect repeated in business after business – of all shapes and sizes: software companies, banks, retail chains, law firms, broadcasters, you name it. I’ve watched $billion businesses – some more than a century old – brought down by their inability to change their software and their business-critical systems.

And every time I got down to the root cause, there they were – slow tests.

Every. Single. Time.

Is Your Agile Transformation Just ‘Agility Theatre’?

I’ve talked before about what I consider to be the two most important feedback loops in software development.

When I explain the feedback loops – the “gears” – of Test-Driven Development, I go to great pains to highlight which of those gears matter most, in terms of affecting our odds of success.

(Figure: the gears of TDD)

Customer or business goals drive the whole machine of delivery – or at least, they should. We are not done because we passed some acceptance tests, or because a feature is in production. We’re only done when we’ve solved the customer’s problem.

That’s very likely going to require more than one go-around. Which is why the second most important feedback loop is the one that establishes if we’re good to go for the next release.

The ability to establish quickly and effectively if the changes we made to the software have broken it is critical to our ability to release it. Teams who rely on manual regression testing can take weeks to establish this, and their release cycles are inevitably very slow. Teams who rely mostly on automated system and integration tests have faster release cycles, but still usually far too slow for them to claim to be “agile”. Teams who can re-test most of the code in under a minute are able to release as often as the customer wants – many times a day, if need be.

The speed of regression testing – of establishing if our software still works – dictates whether our release cycles span months, weeks, or hours. It determines the metabolism of our delivery cycle and ultimately how many throws of the dice we get at solving the customer’s problem.

It’s as simple as that: faster tests = more throws of the dice.

If the essence of agility is responding to change, then I conclude that fast-running automated tests lie at the heart of that.

What’s odd is how so many “Agile transformations” seem to focus on everything but that. User stories don’t make you responsive to change. Daily stand-ups don’t make you responsive to change. Burn-down charts don’t make you responsive to change. Kanban boards don’t make you responsive to change. Pair programming doesn’t make you responsive to change.

It’s all just Agility Theatre if you’re not addressing the two must fundamental feedback loops, which the majority of organisations simply don’t. Their definition of done is “It’s in production”, as they work their way through a list of features instead of trying to solve a real business problem. And they all too often under-invest in the skills and the time needed to wrap software in good fast-running tests, seeing that as less important than the index cards and the Post-It notes and the Jira tickets.

I talk often with managers tasked with “Agilifying” legacy IT (e.g., mainframe COBOL systems). This means speeding up feedback cycles, which means speeding up delivery cycles, which means speeding up build pipelines, which – 99.9% of the time – means speeding up testing.

After version control, it’s #2 on my list of How To Be More Agile. And, very importantly, it works. But then, we shouldn’t be surprised that it does. Maths and nature teach us that it should. How fast do bacteria or fruit flies evolve – with very rapid “release cycles” of new generations – vs elephants or whales, whose evolutionary feedback cycles take decades?

There are two kinds of Agile consultant: those who’ll teach you Agility Theatre, and those who’ll shrink your feedback cycles. Non-programmers can’t help you with the latter, because the speed of the delivery cycle is largely determined by test execution time. Speeding up tests requires programming, as well as knowledge and experience of designing software for testability.

70% of Agile coaches are non-programmers. A further 20% are ex-programmers who haven’t touched code for over a decade. (According to the hundreds of CVs I’ve seen.) That suggests that 90% of Agile coaches are teaching Agility Theatre, and maybe 10% are actually helping teams speed up their feedback cycles in any practical sense.

It also strongly suggests that most Agile transformations have a major imbalance; investing heavily in the theatre, but little if anything in speeding up delivery cycles.

Changing Legacy Code Safely

One of the topics we cover on the Codemanship TDD course is one that developers raise often: how can we write fast-running unit tests for code that’s not easily unit-testable? Most developers are working on legacy code – code that’s difficult and risky to change – most of the time. So it’s odd there’s only one book about it.

I highly recommend Michael Feathers’ book for any developer working in any technology, applying any approach to development. On the TDD course, I summarise what we mean by “legacy code” – code that doesn’t have fast-running automated tests, making it risky to change – and briefly demonstrate Michael’s process for changing legacy code safely.

The example I use is a simple Python program for pricing movie rentals based on their IMDB ratings. Average-rated movie rentals cost £3.95. High-rated movies cost an extra pound, low-rated movies cost a pound less.

My program has no automated tests, so I’ve been testing it manually using the command line.

Suppose the business asked us to change the pricing logic; how could we do this safely if we lack automated tests to guard against breaking the code?

Michael’s process goes like this:

  • Identify what code you will need to change
  • Identify where around that code you’d want unit tests
  • Break any dependencies that are stopping you from writing unit tests
  • Write the unit tests you’d want to satisfy yourself that the changes you’ll make didn’t break the code
  • Make the change
  • While you’re there, refactor to improve the code that’s now covered by unit tests to make life easier for the next person who changes it (which could be you)

My Python program has a class called Pricer which we’ll need to change to update the pricing logic.

class Pricer(object):
    def price(self, imdbID):
        response = requests.get("http://www.omdbapi.com/?i=" + imdbID + "&apikey=6487ec62")
        json = response.json()
        title = json["Title"]
        rating = float(json["imdbRating"])
        price = 3.95
        if rating > 7:
            price += 1.0
        if rating < 4:
            price -= 1.0
        return Video(imdbID, title, rating, price)


I’ve been testing this logic one level above by testing the Rental class that uses Pricer.

class Rental(object):
    def __init__(self, customer, imdbID):
        self.customer = customer
        self.video = Pricer().price(imdbID)

    def __str__(self):
        return "Video Rental – customer: " + self.customer \
               + ". Video => title: " + self.video.title \
               + ", price: £" + str(self.video.price)


My script that I’ve been manually testing with allows me to create Rental objects and write their data to the command line for different movies using their IMDB ID’s.

def main():
    customer = sys.argv[1]
    imdbID = sys.argv[2]
    rental = Rental(customer, imdbID)
    print(rental)

if __name__ == "__main__": main()

#
# Example movies:
#
# tt0096754 – high rated
# tt0060666 – low rated
# tt0317303 – medium rated


I use three example movies – one with a high rating, one low-rated and one medium-rated – to test the code. For example, the output for the high-rated movie looks like this.

C:\Users\User\Desktop\tdd 2.0\python_legacy>python program.py jgorman tt0096754
Video Rental – customer: jgorman. Video => title: The Abyss, price: £4.95

I’d like to reproduce these manual tests as unit tests, so I’ll be writing unittest tests for the Rental class for each kind of movie.

But before I can do that, there’s an external dependency we have to deal with. The Pricer class connects directly to the OMDB API that provides movie information. I want to stub that so I can provide test IMDB ratings without connecting.

Here’s where we have to get disciplined. I want to refactor the code to make it unit-testable, but it’s risky to do that because… there’s no unit tests! Opinions differ on approach, but personally – learned through bitter experience – I’ve found that it’s still important to re-test the code after every refactoring, manually if need be. It will seem like a drag, but we all tend to overlook how much time we waste downstream fixing avoidable bugs. It will seem slower to manually re-test, but it’s often actually faster in the final reckoning.

Okay, let’s do a refactoring. First, let’s get that external dependency in its own method.

class Pricer(object):
    def price(self, imdbID):
        rating, title = self.fetch_video_info(imdbID)
        price = 3.95
        if rating > 7:
            price += 1.0
        if rating < 4:
            price -= 1.0
        return Video(imdbID, title, rating, price)

    def fetch_video_info(self, imdbID):
        response = requests.get("http://www.omdbapi.com/?i=" + imdbID + "&apikey=6487ec62")
        json = response.json()
        title = json["Title"]
        rating = float(json["imdbRating"])
        return rating, title


I re-run my manual tests. Still passing. So far, so good.

Next, let’s move that new method into its own class.

class Pricer(object):
    def price(self, imdbID):
        rating, title = VideoInfo().fetch_video_info(imdbID)
        price = 3.95
        if rating > 7:
            price += 1.0
        if rating < 4:
            price -= 1.0
        return Video(imdbID, title, rating, price)


class VideoInfo:
    def fetch_video_info(self, imdbID):
        response = requests.get("http://www.omdbapi.com/?i=" + imdbID + "&apikey=6487ec62")
        json = response.json()
        title = json["Title"]
        rating = float(json["imdbRating"])
        return rating, title


And re-test. All passing.

To make the dependency on VideoInfo swappable, the instance needs to be injected into the constructor of Pricer from Rental.

class Pricer(object):
    def __init__(self, videoInfo):
        self.videoInfo = videoInfo

    def price(self, imdbID):
        rating, title = self.videoInfo.fetch_video_info(imdbID)
        price = 3.95
        if rating > 7:
            price += 1.0
        if rating < 4:
            price -= 1.0
        return Video(imdbID, title, rating, price)


class Rental(object):
    def __init__(self, customer, imdbID):
        self.customer = customer
        self.video = Pricer(VideoInfo()).price(imdbID)

    def __str__(self):
        return "Video Rental – customer: " + self.customer \
               + ". Video => title: " + self.video.title \
               + ", price: £" + str(self.video.price)


And re-test. All passing.

Next, we need to inject the Pricer into Rental, so we can stub VideoInfo in our planned unit tests.

class Rental(object):
    def __init__(self, customer, imdbID, pricer):
        self.customer = customer
        self.video = pricer.price(imdbID)

    def __str__(self):
        return "Video Rental – customer: " + self.customer \
               + ". Video => title: " + self.video.title \
               + ", price: £" + str(self.video.price)


And re-test. All passing.

Now we can write unit tests to replicate our command line tests.

class VideoInfoStub(VideoInfo):
    def __init__(self, rating):
        self.rating = rating
        self.title = 'XXXXX'

    def fetch_video_info(self, imdbID):
        return self.rating, self.title


class RentalTest(unittest.TestCase):
    def test_movie_title_retrieved(self):
        rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(7.1)))
        self.assertEqual('XXXXX', rental.video.title)

    def test_customer_id_recorded(self):
        rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(7.1)))
        self.assertEqual('jgorman', rental.customer)

    def test_high_rated_movie_price(self):
        rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(7.1)))
        self.assertEqual(4.95, rental.video.price)

    def test_low_rated_movie_price(self):
        rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(3.9)))
        self.assertEqual(2.95, rental.video.price)

    def test_medium_rated_movie_price(self):
        rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(6.9)))
        self.assertEqual(3.95, rental.video.price)


These unit tests reproduce all the checks I was doing visually at the command line, but they run in a fraction of a second. The going gets much easier from here.

Now I can make the change to the pricing logic the business requested.

(Figure: the user story describing the requested pricing change)

We can tackle this in a test-driven way now. Let’s update the relevant unit test so that it now fails.

def test_high_rated_movie_price(self):
    rental = Rental('jgorman', 'tt9999', Pricer(VideoInfoStub(7.1)))
    self.assertEqual(5.95, rental.video.price)


Now let’s make it pass.

def price(self, imdbID):
    rating, title = self.videoInfo.fetch_video_info(imdbID)
    price = 3.95
    if rating > 7:
        price += 2.0
    if rating < 4:
        price -= 1.0
    return Video(imdbID, title, rating, price)


(And, yes – obviously in a real product, the change would likely be more complex than this.)

Okay, so we’ve made the change, and we can be confident we haven’t broken the software. We’ve also added some test coverage and dealt with a problematic dependency in our architecture. If we wanted to get movie ratings from somewhere else (e.g., Rotten Tomatoes), or even aggregate sources, it would be quite straightforward now that we’ve cleanly separated that concern from our business logic.

One last thing while we’re here: there are a couple of things in this code that have been bugging me. Firstly, we’ve been mixing our terminology: the customer says “movie”, but our code says “video”. Let’s make our code speak the customer’s language.

class Pricer(object):
    def __init__(self, movieInfo):
        self.movieInfo = movieInfo

    def price(self, imdbID):
        rating, title = self.movieInfo.fetch_movie_info(imdbID)
        price = 3.95
        if rating > 7:
            price += 2.0
        if rating < 4:
            price -= 1.0
        return Movie(imdbID, title, rating, price)


Secondly, I’m not happy with clients accessing objects’ fields directly. Let’s encapsulate.

class Rental(object):
    def __init__(self, customer, imdbID, pricer):
        self._customer = customer
        self._movie = pricer.price(imdbID)

    def __str__(self):
        return "Movie Rental – customer: " + self._customer \
               + ". Movie => title: " + self._movie.get_title() \
               + ", price: £" + str(self._movie.get_price())

    def get_customer(self):
        return self._customer

    def get_movie_price(self):
        return self._movie.get_price()

    def get_movie_title(self):
        return self._movie.get_title()


With our added unit tests, these extra refactorings were much easier to do, and hopefully that means that changing this code in the future will be much easier, too.

Over time, one change at a time, the unit test coverage will build up and the code will get easier to change. Applying this process over weeks, months and years, I’ve seen some horrifically rigid and brittle software products – so expensive and risky to change that the business had stopped asking – be rehabilitated and become going concerns again.

By focusing our efforts on changes our customer wants, we’re less likely to run into a situation where writing unit tests and refactoring gets vetoed by our managers. The results of highly visible “refactoring sprints”, or even long refactoring phases – I’ve known clients freeze requirements for up to a year to “refactor” legacy code – are typically disappointing, and run the risk of making refactoring and adding unit tests forbidden by disgruntled bosses.

One final piece of advice: never, ever discuss this process with non-technical stakeholders. If you’re asked to break down an estimate to change legacy code, resist. My experience has been that it often doesn’t take any longer to make the change safely, and the longer-term benefits are obvious. Don’t give your manager or your customer the opportunity to shoot themselves in the foot by offering up unit tests and refactoring as a line item. Chances are, they’ll say “no, thanks”. And that’s in nobody’s interests.

Adventures In Multi-Threading

I’ve been spending my early mornings buried in Java threading recently. Although we talk often of concurrency and “thread safety” in this line of work, there’s surprisingly little actual multi-threaded code being written. Normally, when developers talk about multi-threading, we’re referring to how we write code to handle asynchronous operations in other people’s code (e.g., promises in JavaScript).

My advice to developers has always been to avoid writing multi-threaded code wherever possible. Concurrency is notoriously difficult to get right, and the safest multi-threaded code is single-threaded.

I’ve been eating my own dog food on that, and it occurred to me a couple of weeks back that I’ve written very little multi-threaded code myself in recent years.

But there is still some multi-threaded code being written in languages like Java, C# and Python for high-performance solutions that are targeted at multi-CPU platforms. And over the last few months I’ve been helping a client with just such a solution for scaling up property-based tests to run on multi-core Cloud platforms.

One of the issues we faced was how to test our multi-threaded code.

There’s a practical issue of executing multiple threads in a single-threaded unit test – particularly synchronizing so that we can assert an outcome after all threads have completed their work.

And also, thread scheduling is out of our control and – on Windows and similar platforms – unpredictable and non-repeatable. A race condition or a deadlock might not show up every time we run a test.

Over the last couple of weeks, I’ve been playing with a rough prototype to try and answer these questions. It uses a simple producer-consumer example – loading parcels into a loading bay and then taking them off the loading bay and loading them into a truck – to illustrate the challenges of both safe multi-threading and multi-threaded testing.

When I test multi-threaded code, I’m interested in two properties:

  • Safety – what should always be true while the code is executing?
  • Liveness – what should eventually be achieved?

To test safety, an assertion needs to be checked throughout execution. To test liveness, an assertion needs to be checked after execution.

After writing code to do this, I refactored the useful parts into custom assertion methods, always() and eventually().

always() takes a list of Runnables (Java’s equivalent of functions that accept no parameters and have no return value) that will concurrently perform the work we want to test. It will submit each Runnable to a fixed thread pool a specified number of times (thread count) and then wait for all the threads in the pool to terminate.

On a single separate thread, a boolean function (in Java, Supplier<Boolean>) is evaluated multiple times throughout execution of the threads under test. This terminates after the worker threads have terminated or timed out. If, at any point in execution, the assertion evaluates to false, the test will fail.
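
To make that description more concrete, here's a rough sketch of how an always() helper along those lines might be implemented. To be clear, this is my own illustration of the idea – not the actual code from the syncloop repo linked at the end of this post – and the class and parameter names are assumptions:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;
import static org.junit.Assert.assertFalse;

class ConcurrentAssertions {

    // Submits each action to a fixed thread pool threadCount times, then repeatedly
    // evaluates the safety property on a separate thread while the workers run.
    static void always(List<Runnable> actions, Supplier<Boolean> safetyProperty,
                       int threadCount, long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(actions.size() * threadCount);
        for (Runnable action : actions) {
            for (int i = 0; i < threadCount; i++) {
                pool.submit(action);
            }
        }
        pool.shutdown();

        AtomicBoolean workersDone = new AtomicBoolean(false);
        AtomicBoolean violated = new AtomicBoolean(false);
        Thread checker = new Thread(() -> {
            // Keep checking the property until the worker threads finish or time out
            while (!workersDone.get()) {
                if (!safetyProperty.get()) {
                    violated.set(true);
                    return;
                }
            }
        });
        checker.start();

        pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        pool.shutdownNow();
        workersDone.set(true);
        checker.join();

        assertFalse("Safety property was violated during execution", violated.get());
    }
}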

In use, it looks like this:

@Test
@Parameters(method = "iterations")
public void bayIsNeverOverloaded(int n) {
    always(Arrays.asList(bayLoader, truckLoader),
        () -> bay.getParcelCount() <= 50, 2, 1000);
}


bayLoader and truckLoader are objects that implement the Runnable interface. They will be submitted to the thread pool 2x each (because we’ve specified a thread count of 2 as our third parameter), so there will be 4 worker threads in total, accessing the same data defined in our set-up.

@Before
public void setUp() {
    List<Parcel> parcels = new ArrayList<>();
    for (int i = 1; i <= 1000; i++) {
        parcels.add(new Parcel());
    }
    bay = new LoadingBay(50);
    truck = new Truck(parcels.size());
    bayLoader = new BayLoader(bay, parcels);
    truckLoader = new TruckLoader(bay, truck);
}


The bayLoader threads will load parcels on to the loading bay, which holds a maximum of 50 parcels, until all the parcels have been loaded.

The truckLoader threads will unload parcels from the loading bay and load them on to the truck, until the entire manifest of parcels has been loaded.

A safety property of this concurrent logic is that there should never be more than 50 parcels in the loading bay at any time, and that’s what our always assertion checks multiple times during execution:

() -> bay.getParcelCount() <= 50

When I run this test once, it passes. Running it multiple times, it still passes. But just because a test’s passing, that doesn’t mean our code really works. Let’s deliberately introduce an error into our test assertion to make sure it fails.

() -> bay.getParcelCount() <= 49

The first time I run this, the test fails. And the second and third times. But on the fourth run, the test passes. This is the thread determinism problem; we have no control over when our assertion is checked during execution. Sometimes it catches a safety error. Sometimes the error slips through the gaps and the test misses it.

The good news is that if it catches an error just once, that proves we have an error in our concurrent logic. Of course, if we catch no errors, that doesn’t prove they’re not there. (Absence of evidence isn’t evidence of absence.)

What if we run the test 100 times? Rather than sit there clicking the “run” button over and over, I can rig this test up as a JUnitParams parameterised test and feed it 100 test cases. (If you don’t have a parameterised testing feature, you can just loop 100 times).

@Test
@Parameters(method = "iterations")
public void bayIsNeverOverloaded(int n) {
    always(Arrays.asList(bayLoader, truckLoader),
        () -> bay.getParcelCount() <= 49, 2, 1000);
}

private Object[] iterations() {
    return IntStream
        .range(0, 100)
        .mapToObj((x) -> x)
        .toArray();
}


When I run this, it fails 91/100 times. Changing the assertion back, it passes 100/100. So I can have 100% confidence the code satisfies this safety property? Not so fast. 100 test runs leaves plenty of gaps. Maybe I can be 99% confident with 100 test runs. How about we do 1000 test runs? Again, they all pass. So that gives me maybe 99.9% confidence. 10,000 could give me 99.99% confidence. And so on.

Thankfully, after a little performance engineering, 10,000 tests run in less than 30 seconds. All green.

The eventually() assertion method works along similar lines, except that it only evaluates its assertion once at the end (and therefore runs significantly faster):

@Test
@Parameters(method = "iterations")
public void loadsTruck(int n) {
    eventually(Arrays.asList(bayLoader, truckLoader),
        () -> bay.isEmpty() && truck.isLoaded(), 2, 1000);
}


If my code encounters a deadlock, the worker threads will time out after 1000 milliseconds. If a race condition occurs and our data becomes corrupted, the assertion will fail. Running this 10,000 times shows all the tests are green. I’m 99.99% confident my concurrent logic works.

Finally, speaking of deadlocks and race conditions, how might we avoid those?

A race condition can occur when two or more threads attempt to access the same data at the same time. In particular, we run the risk of a pre-condition paradox when bay loaders attempt to load parcels on to the loading bay, and truck loaders attempt to unload parcels from the bay.

The bay loader can only load a parcel if the bay is not full. A truck loader can only unload a parcel if the bay is not empty.

class LoadingBay {

    private final int capacity;
    private List<Parcel> parcels = new ArrayList<>();

    LoadingBay(int capacity){
        this.capacity = capacity;
    }

    void load(Parcel parcel){
        if(!isFull()){
            parcels.add(parcel);
        }
    }

    private Boolean isFull(){
        return parcels.size() == capacity;
    }

    Parcel unload(){
        if(!isEmpty()){
            return parcels.remove(parcels.size() - 1);
        }
        return null;
    }

    public boolean isEmpty() {
        return parcels.isEmpty();
    }

    public int getParcelCount() {
        return parcels.size();
    }
}


When I run my tests with this implementation of LoadingBay, 12% of them fail their liveness and safety checks, because there’s a non-zero possibility of, say, one bay loader checking that the bay isn’t full and another bay loader loading the 50th parcel in between that check and the load. Similarly, a truck loader might check that the bay isn’t empty, but before it unloads the last parcel another truck loader thread takes it.

To avoid this situation, we need to ensure that pre-condition checks and actions are executed in a single, atomic sequence with no chance of other threads interfering.

void load(Parcel parcel){
    synchronized (this) {
        if (!isFull()) {
            parcels.add(parcel);
        }
    }
}


Parcel unload(){
    synchronized (this) {
        if (!isEmpty()) {
            return parcels.remove(parcels.size() - 1);
        }
        return null;
    }
}


When I test this implementation, tests still fail. The problem is that some parcels aren’t getting loaded on to the bay (though the bay loader thinks they have been), and some parcels aren’t getting unloaded, either. Our truck loader may be putting null parcels on the truck.

When loading, the bay must not be full. When unloading, it must not be empty. So our worker threads need to wait until their pre-conditions are satisfied. Now, Java threading gives us wait() methods, but they only wait for a specified amount of time. We need to wait until a condition becomes true.

void load(Parcel parcel){
    while(true) {
        synchronized (this) {
            if (!isFull()) {
                parcels.add(parcel);
                break;
            }
        }
    }
}


Parcel unload(){
    while(true){
        synchronized (this) {
            if(!isEmpty()) {
                return parcels.remove(parcels.size() - 1);
            }
        }
    }
}


This passes all 10,000 safety and liveness test runs, so I have 99.99% confidence we don’t have a race condition. But…

What happens when all the parcels have been loaded on to the truck? There’s a risk of deadlock if the bay remains permanently empty.

So we also need a way to stop the loading and unloading process once all the manifest has been loaded.

I’ve dealt with this in a similar way to waiting for pre-conditions to be satisfied, except this time we repeat loading and unloading until the parcels are all on the truck.

class BayLoader implements Runnable {

    private final LoadingBay bay;
    private final List<Parcel> parcels;

    BayLoader(LoadingBay bay, List<Parcel> parcels) {
        this.bay = bay;
        this.parcels = parcels;
    }

    private void loadAll() {
        while(true){
            synchronized (this){
                if(parcels.isEmpty()) {
                    break;
                } else {
                    bay.load(parcels.get(parcels.size() - 1));
                    parcels.remove(parcels.size() - 1);
                }
            }
        }
    }

    @Override
    public void run() {
        loadAll();
    }
}


public class TruckLoader implements Runnable {

    private final LoadingBay bay;
    private final Truck truck;

    TruckLoader(LoadingBay bay, Truck truck) {
        this.bay = bay;
        this.truck = truck;
    }

    void loadTruck() {
        while(true){
            synchronized (this){
                if(truck.isLoaded()) {
                    break;
                } else {
                    truck.load(bay.unload());
                }
            }
        }
    }

    @Override
    public void run() {
        loadTruck();
    }
}


You may have already spotted the patterns in these two forms of loops:

  • Execute this action when this condition is true
  • Execute this action until this condition is true

Let’s refactor to encapsulate those nasty while loops.

void load(Parcel parcel){
    execute(() -> parcels.add(parcel))
        .when(() -> !isFull(), this);
}

Parcel unload(){
    return execute(() -> parcels.remove(parcels.size() - 1))
        .when(() -> !parcels.isEmpty(), this);
}


private void loadAll() {
    execute(() -> {
        bay.load(parcels.get(parcels.size() - 1));
        parcels.remove(parcels.size() - 1);
    })
    .until(parcels::isEmpty, this);
}


void loadTruck() {
    execute(() -> truck.load(bay.unload()))
        .until(truck::isLoaded, this);
}


There. That looks a lot better, doesn’t it? All nice and functional.
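
In case you’re wondering what’s behind execute(), when() and until(): here’s a simplified sketch of how a helper like that could work. It’s my own illustration – it only handles actions that return a value – and the real implementation lives in the repo linked below:

import java.util.function.Supplier;

class Execute<T> {

    private final Supplier<T> action;

    private Execute(Supplier<T> action) {
        this.action = action;
    }

    static <T> Execute<T> execute(Supplier<T> action) {
        return new Execute<>(action);
    }

    // Spin until the pre-condition holds, then perform the action. The check and
    // the action happen inside the same synchronized block, so they're atomic.
    T when(Supplier<Boolean> precondition, Object lock) {
        while (true) {
            synchronized (lock) {
                if (precondition.get()) {
                    return action.get();
                }
            }
        }
    }

    // Repeat the action atomically until the stop condition becomes true.
    void until(Supplier<Boolean> stopCondition, Object lock) {
        while (true) {
            synchronized (lock) {
                if (stopCondition.get()) {
                    return;
                }
                action.get();
            }
        }
    }
}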

I tend to find conditional synchronisation easier to wrap my head around than all the wait() and notify() and callbacks malarkey, and my experience so far with this approach suggests it helps me produce more reliable multi-threaded code.

My explorations continue, but I thought there might be folk out there who’d find it useful to see where I’ve got so far with this.

You can see the current source code at https://github.com/jasongorman/syncloop (it’s just a proof of concept, so provided with no warranty or support, of course.)

 

 

The Test Pyramid – The Key To True Agility

On the Codemanship TDD course, before we discuss Continuous Delivery and how essential it is to achieving real agility, we talk about the Test Pyramid.

It has various interpretations, in terms of exactly how many layers it has and exactly what kinds of testing each layer is made of (unit, integration, service, controller, component, UI etc), but the overall sentiment is straightforward:

The longer tests take to run, the fewer of those kinds of tests you should aim to have

(Figure: the test pyramid)

The idea is that the tests we run most often need to be as fast as possible (otherwise we run them less often). These are typically described as “unit tests”, but that means different things to different people, so I’ll qualify: tests that do not involve any external dependencies. They don’t read from or write to databases, they don’t read or write files, they don’t connect with web services, and so on. Everything that happens in these tests happens inside the same memory address space. Call them In-Process Tests, if you like.

Tests that necessarily check our code works with external dependencies have to cross process boundaries when they’re executed. As our In-Process tests have already checked the logic of our code, these Cross-Process Tests check that our code – the client – and the external code – the suppliers – obey the contracts of their interactions. I call these “integration tests”, but some folk have a different definition of integration test. So, again, I qualify it as: tests that involve external dependencies.

These typically take considerably longer to execute than “unit tests”, and we should aim to have proportionally fewer of them and to run them proportionally less often. We might have thousands of unit tests, and maybe hundreds of integration tests.

If the unit tests cover the majority of our code – say, 90% of it – and maybe 10% of our code has direct external dependencies that have to be tested, on average we’ll make about 9 changes that need unit testing compared to 1 change that needs integration testing. In other words, we’d need to run our unit tests 9x as often as our integration tests, which is a good thing if each integration test is about 9 times slower than a unit test.

At the top of our test pyramid are the slowest tests of all. Typically these are tests that exercise the entire system stack, through the user interface (or API) all the way down to the external dependencies. These tests check that it all works when we plug everything together and deploy it into a specific environment. If we’ve already tested the logic of our code with unit tests, and tested the interactions with external suppliers, what’s left to test?

Some developers mistakenly believe that these system-level tests are for checking the logic of the user experience – user “journeys”, if you like. This is a mistake. There are usually a lot of user journeys, so we’d end up with a lot of these very slow-running tests and an upside-down pyramid. The trick here is to make the logic of the user experience unit-testable. View models are a simple architectural pattern for logically representing what users see and what users do at that level. At the highest level they may be looking at an HTML table and clicking a button to submit a form, but at the logical level, maybe they’re looking at a movie and renting it.

A view model can help us encapsulate the logic of user experience in a way that can be tested quickly, pushing most of our UI/UX tests down to the base of the pyramid where they belong. What’s left – the code that must directly reference physical UI elements like HTML tables and buttons – can be wafer thin. At that level, all we’re testing is that views are rendered correctly and that user actions trigger the correct internal logic (which can easily be done using mock objects). These are integration tests, and belong in the middle layer of our pyramid, not the top.
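
As a rough illustration of the idea – the names here are hypothetical, not taken from any particular codebase – the logical behaviour of renting a movie can be captured in a view model and unit-tested without touching any HTML:

import java.util.ArrayList;
import java.util.List;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class MovieListViewModelTest {

    // Logical representation of what the user sees and does - no HTML, no buttons
    static class MovieListViewModel {
        private final List<String> rentedTitles = new ArrayList<>();

        // The logical equivalent of clicking the "Rent" button in the UI
        void rent(String title) {
            rentedTitles.add(title);
        }

        List<String> rentedTitles() {
            return rentedTitles;
        }
    }

    @Test
    public void rentingAMovieAddsItToTheCustomersRentals() {
        MovieListViewModel viewModel = new MovieListViewModel();
        viewModel.rent("The Abyss");
        assertTrue(viewModel.rentedTitles().contains("The Abyss"));
    }
}

Tests like this live at the base of the pyramid; the thin view code that binds the view model to actual HTML elements is what gets covered by the far fewer integration tests in the middle layer.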

Another classic error is to check core logic through the GUI. For example, checking that insurance premiums are calculated correctly by looking at what number is rendered on that web page. Some module somewhere does that calculation. That should be unit-testable.

So, if they’re not testing user journeys, and they’re not testing core logic, what do our system tests test? What’s left?

Well, have you ever found yourself saying “It worked on my machine”? The saying goes “There’s many a slip ‘twixt cup and lip.” Just because all the pieces work, and just because they all play nicely together, it’s not guaranteed that when we deploy the whole system into, say, our EC2 instances, nothing will be different from the environments we tested it in. I’ve seen roll-outs go wrong because the servers handled dates differently, or had the wrong locale, or a different file system, or security restrictions that weren’t in place on dev machines.

The last piece of the jigsaw is the system configuration, where our code meets the real production environment – or a simulation of it – and we find out if it really works, as a whole, where it’s intended to work.

We may need dozens of those kinds of tests, and perhaps only need to run them on, say, every CI build by deploying the outputs to a staging environment that mirrors the production environment (and only if all our unit and integration tests pass first, of course.) These are our “good to go?” tests.

The shape of our test pyramid is critical to achieving feedback loops that are fast enough to allow us to sustain the pace of development. Ideally, after we make any change, we should want to get feedback straight away about the impact of that change. If 90% of our code can be re-tested in under 30 seconds, we can re-test 90% of our changes many times an hour and be alerted within 30 seconds if we broke something. If it takes an hour to re-test our code, then we have a problem.

Continuous Delivery means that our code is always shippable. That means it must always be working, or as near as possible always. If re-testing takes an hour, that means that we’re an hour away from finding out if changes we made broke the code. It means we’re an hour away from knowing if our code is shippable. And, after an hour’s worth of changes without re-testing, chances are high that it is broken and we just don’t know it yet.

An upside-down test pyramid puts Continuous Delivery out of your reach. Your confidence that the code’s shippable at any point in time will be low. And the odds that it’s not shippable will be high.

The impact of slow-running test suites on development is profound. I’ve found many times that when a team invested in speeding up their tests, many other problems magically disappeared. Slow tests – which mean slow builds, which mean slow release cycles – are like a slow metabolism for a development team. Many health problems can be caused by a slow metabolism. It really is that fundamental.

Slow tests are pennies to the pound of the wider feedback loops of release cycles. You’d be surprised how much of your release cycles are, at the lowest level, made up of re-testing cycles. The outer feedback loops of delivery are made of the inner feedback loops of testing. Fast-running automated tests – as an enabler of fast release cycles and sustained innovation – are therefore highly desirable.

A right-way-up test pyramid doesn’t happen by accident, and doesn’t come at no cost, though. Many organisations, sadly, aren’t prepared to make that investment, and limp on with upside-down pyramids and slow test feedback until the going gets too tough to continue.

As well as writing automated tests, there’s also an investment needed in your software’s architecture. In particular, the way teams apply basic design principles tends to determine the shape of their test pyramid.

I see a lot of duplicated code that contains duplicated external dependencies, for example. It’s not uncommon to find systems with multiple modules that connect to the same database, or that connect to the same web service. If those connections happened in one place only, that part of the code could be integration tested just once. D.R.Y. helps us achieve a right-way-up pyramid.

I see a lot of code where a module or function that does a business calculation also connects to an external dependency, or where a GUI module also contains business logic, so that the only way to test that core logic is with an integration test. Single Responsibility helps us achieve a right-way-up pyramid.
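As an illustration – a sketch with made-up names, not a prescription – the fix is usually to pull the calculation out into its own module that knows nothing about databases or GUIs, so it can be tested entirely in memory:

    // DiscountCalculator.java - hypothetical example: pure business logic,
    // no database, no web service, no GUI. Tests for this run in milliseconds.
    import java.util.List;

    public class DiscountCalculator {
        public double discountedTotal(List<Double> lineItems, double discountRate) {
            double total = lineItems.stream()
                                    .mapToDouble(Double::doubleValue)
                                    .sum();
            return total * (1.0 - discountRate);
        }
    }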

I see a lot of code where a module in one web service interacts with multiple features of another web service – Feature Envy, but on a larger scale – so there are multiple points of integration that require testing. Encapsulation helps us achieve a right-way-up pyramid.

I see a lot of code where a module containing core logic references an external dependency, like a database connection, directly by its implementation, instead of through an abstraction that could be easily swapped by dependency injection. Dependency Inversion helps us achieve a right-way-up pyramid.
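A minimal sketch of that idea, again with hypothetical names: the core logic depends on an abstraction it owns, and the concrete database-backed implementation is injected from outside, so tests can swap in something faster:

    // PriceList.java - the abstraction the core logic depends on.
    public interface PriceList {
        double priceOf(String productCode);
    }

    // QuoteCalculator.java - core logic; it never sees the database directly.
    public class QuoteCalculator {
        private final PriceList prices;

        public QuoteCalculator(PriceList prices) { // dependency injected
            this.prices = prices;
        }

        public double quoteFor(String productCode, int quantity) {
            return prices.priceOf(productCode) * quantity;
        }
    }

    // A database-backed PriceList implementation would live at the edge of the
    // system and be wired in at runtime; it is the only part that needs an
    // integration test.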

Achieving a design with less duplication, where modules do one job, where components and services know as little as possible about each other, and where external dependencies can be easily stubbed or mocked by dependency injection, is essential if you want your test pyramid to be the right way up. But code doesn’t get that way by accident. There’s significant ongoing effort required to keep the code clean by refactoring. And that gets easier the faster your tests run. Chicken, meet egg.
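With a design like the one sketched above, a fast in-memory test becomes trivial – assuming, say, JUnit 5 on the classpath and the hypothetical QuoteCalculator from the earlier sketch:

    // QuoteCalculatorTest.java - hypothetical example of a millisecond unit test.
    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    public class QuoteCalculatorTest {
        @Test
        public void quoteIsUnitPriceTimesQuantity() {
            // Stub the external dependency in memory - no database required.
            PriceList stubPrices = productCode -> 10.0;
            QuoteCalculator calculator = new QuoteCalculator(stubPrices);

            assertEquals(30.0, calculator.quoteFor("WIDGET", 3), 0.001);
        }
    }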

If we’re lucky enough to be starting from scratch, the best way we know of to ensure a right-way-up test pyramid is to write the tests first. This compels us to design our code in such a way that it’s inherently unit-testable. I’ve yet to come across a team genuinely doing Continuous Delivery who wasn’t doing some kind of TDD.

If you’re working on legacy code, where maybe you’re relying on browser-based tests, or might have no automated tests at all, there’s usually a mountain to climb to get a test pyramid that’s the right way up. You need to write fast-running tests, but you will probably need to refactor the code to make that possible. Egg, meet chicken.

Like all mountains, though, it can be climbed. One small, careful step at a time. Michael Feathers’ book Working Effectively With Legacy Code describes a process for making changes safely to code that lacks fast-running automated tests. It goes something like this (a rough sketch of the dependency-breaking step follows the list):

  • Identify what code you need to change
  • Identify where around that code you’d want unit tests to make the change safely
  • Break any dependencies in that code getting in the way of unit testing
  • Write the unit tests
  • Make the change
  • While you’re there, make other improvements that will help the next developer who needs to change that code (the “boy scout rule” – leave the camp site tidier than you found it)
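Here’s a minimal sketch of the dependency-breaking step, using one of the techniques the book describes – “Subclass and Override Method” – with hypothetical names throughout:

    // OrderApprover.java - legacy logic we need to change. The external call has
    // been moved behind a protected method, creating a "seam".
    public class OrderApprover {
        public boolean approve(double orderTotal) {
            return orderTotal <= fetchCreditLimit();
        }

        // In the legacy code, a direct call to the credit bureau web service
        // happened inline; that call is what made this logic untestable in memory.
        protected double fetchCreditLimit() {
            throw new UnsupportedOperationException("requires the real credit bureau");
        }
    }

    // In the test code, override the seam with a canned value so the approval
    // logic can be unit tested without the external dependency.
    class TestableOrderApprover extends OrderApprover {
        @Override
        protected double fetchCreditLimit() {
            return 500.0;
        }
    }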

Change after change, made safely in this way, will – over time – build up a suite of fast-running unit tests that will make future changes easier. I’ve worked on legacy code bases that went from upside-down test pyramids of mostly GUI-based system tests, that took hours or even days to run, to right-way-up pyramids where most of the code could be tested in under a minute. The impact on the cost and the speed of delivery is always staggering. It can be done.

But be patient. A code base might take a year or two to turn around, and at first the going will be tough. I find I have to be super-disciplined in those early stages. I manually re-test as I refactor, and resist the temptation to make a whole bunch of changes at a time before I re-test. Slow and steady, adding value and clearing paths for future changes at the same time.

The 2 Most Critical Feedback Loops in Software Development

When I’m explaining the inner and outer feedback loops of Test-Driven Development – the “wheels within wheels”, if you like – I make the point that the two most important feedback loops are the outermost and the innermost.

[Image: the inner and outer feedback loops]

The outermost because the most important question of all is “Did we solve the problem?” The innermost because the answer is usually “No”, so we have to go round again. That means the code we delivered will need to change, which raises the second most important question: “Did we break the code?”

The sooner we can deliver something so we can answer “Did we solve the problem?”, the sooner we can feed the lessons learned into the next go-round. The sooner we can re-test the code, the sooner we know if our changes broke it, and the sooner we can fix it ready for the next release.

I realised nearly two decades ago that everything in between – requirements analysis, customer tests, software design, etc. – is, at best, guesswork. A far more effective way of building the right thing is to build something, get folk to use it, and feed back what needs to change in the next iteration. Fast iterations accelerate this learning process. This is why I firmly believe these days that fast iterations – with all that entails – are the true key to building the right thing.

Continuous Delivery – done right, with meaningful customer feedback drawn from real use in the real world (or as close as we dare bring our evolving software to the real world) – is the ultimate requirements discipline.

Fast-running automated tests that provide good assurance that our code’s always working are essential to this. How long it takes to build, test and deploy our software will determine the likely length of those outer feedback loops. Typically, the lion’s share of that build time is regression testing.

About a decade ago, many teams told me “We don’t need unit tests because we have integration tests”, or “We have <insert name of trendy new BDD tool here> tests”. Then, a few years later, their managers were crying “Help! Our tests take 4 hours to run!” A 4-hour build-and-test cycle creates a serious bottleneck, leading to code that’s almost continuously broken without teams knowing. In other words, not shippable.

Turn a 4-hour build-and-test cycle into a 40-second build-and-test cycle, and a lot of problems magically disappear. You might be surprised how many other bottlenecks in software development have slow-running tests as their underlying cause – analysis paralysis, for example. That’s usually a symptom of high stakes in getting it wrong, and that’s usually a symptom of infrequent releases. “We better deliver the right thing this time, because the next go round could be 6 months later.” (Those among us old enough to remember might recall just how much more care we had to take over our code because of how long it took to compile. It’s a similar effect, but on a much larger scale with much higher stakes than a syntax error.)

Where developers usually get involved in this process – user stories and backlogs – is somewhere short of where they need to be involved. User stories – and prioritised queues of user stories – are just guesses at what an analyst or customer or product owner believes might solve the problem. To obsess over them is to completely overestimate their value. The best teams don’t guess their way to solving a problem; they learn their way.

Like pennies to the pound, the outer feedback loop of “Does it actually work in the real world?” is made up of all the inner feedback loops, and especially the innermost loop of regression testing after code is changed.

Teams who invest in fast-running automated regression tests have a tendency to out-learn teams who don’t, and their products have a tendency to outlive the competition.