The 4 Gears of Test-Driven Development

When I explain Test-Driven Development to people who are new to the concept, I try to be clear that TDD is not just about using unit tests to drive design at the internal code level.

Unit tests and the familiar red-green-refactor micro feedback cycle we most commonly associate with TDD – thanks to 1,001 TDD katas that focus on that level – are actually just the innermost feedback loop of TDD. There are multiple outer feedback loops that drive the choice of unit tests. Otherwise, how would we know which unit tests we needed to write?

Outside the rapid unit test feedback loop, there’s a slower customer test feedback loop that drives our understanding of what our units need to do in a particular software usage scenario.

Outside the customer test feedback loop, there’s a slower-still feature feedback loop, which may require us to pass multiple customer tests to complete.

And, most important of all, there’s an even slower goal feedback loop that drives our understanding of what features might be required to solve a business problem.

On the Codemanship TDD course, pairs experience these feedback loops first hand. They’re asked to think of a real-world problem they believe might be solved with a simple piece of software. For example, “It’s hard to find good vegan takeaway in my local area.” We’re now in the first feedback loop of TDD – goals.

Then they imagine a headline feature – a proverbial button the user clicks that solves this problem: what would that feature do? Perhaps it displays a list of takeaway restaurants with vegan dishes on their menu that will deliver to my address, ordered by customer ratings. We’re now in the next feedback loop of TDD – features.

Next, we need to think about what other features the software might require to make the headline feature possible. For example, we need to gather details of takeaway restaurants in the area, including their vegan menus and their locations, and whether or not they’ll deliver to the customer’s address. Our headline feature might require a number of such supporting features to make it work.

We work with our customer to design a minimum feature set that we believe will solve their problem. It’s important to keep it as simple as we can, because we want to have a working prototype ready as soon as we’re able that we can test with real end users in the real world.

Next, for each feature – starting with the most important one, which is typically the headline feature – we drive out a precise understanding of exactly what that feature will do using examples harvested from the real world. We might go online, or grab a phone book, and start checking out takeaway restaurants, collecting their menus and asking what postcode areas they deliver in. Then we would pick addresses in our local area, and figure out – for each address – which restaurants would be available according to our criteria. We could search on sites like Google and Trip Advisor for reviews of the restaurants or – if we can’t find reviews – invent some ratings, so we can describe how the result lists should be ordered.

We capture these examples in a format that’s human readable and machine readable, so we can collaborate directly with the customer on them and also pull the same data into automated executable tests.

We’re now in the customer test feedback loop. Working one customer test at a time, we automate execution of that test so we can continuously check our progress in passing it.

For each customer test, we then test-drive an implementation that will pass the test, using unit tests to drive out the details of how the software will complete each unit of work required. If the happy path for our headline feature requires that we

  • calculate a delivery map location using the customer’s address
  • identify for each restaurant in our list if they will deliver to that location
  • filter the list to exclude the restaurants that don’t
  • order the filtered list by average customer rating

…then that’s a bunch of unit tests we might need to write. We’re now in the unit test feedback loop.
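To give a flavour of what one of those unit tests might look like, here’s a minimal sketch in Java with JUnit. Restaurant and RestaurantFilter are purely illustrative names for units the example might drive out – not a design anyone has to follow.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

// Illustrative sketch only - Restaurant and RestaurantFilter are hypothetical
// names for units the takeaway example might need.
public class RestaurantFilterTest {

    record Restaurant(String name, List<String> deliveryPostcodes) {
        boolean deliversTo(String postcode) {
            return deliveryPostcodes.contains(postcode);
        }
    }

    static class RestaurantFilter {
        // Filter the list to exclude restaurants that don't deliver to the location.
        static List<Restaurant> deliveringTo(String postcode, List<Restaurant> all) {
            return all.stream().filter(r -> r.deliversTo(postcode)).toList();
        }
    }

    @Test
    public void excludesRestaurantsThatDontDeliverToTheCustomersPostcode() {
        Restaurant delivers = new Restaurant("Vegan Villa", List.of("SN1", "SN2"));
        Restaurant doesNotDeliver = new Restaurant("Green Garden", List.of("BS1"));

        List<Restaurant> results =
                RestaurantFilter.deliveringTo("SN1", List.of(delivers, doesNotDeliver));

        assertEquals(List.of(delivers), results);
    }
}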

Once we’ve completed our units and seen the customer test pass, we can move on to the next customer test, passing them one at a time until the feature is complete.

Many dev teams make the mistake of thinking we’re done at this point. This is usually because they have no visibility of the real end goal. To be fair, we’re rarely invited to participate in that conversation – which is a terrible, terrible mistake.

Once all the features – headline and supporting – are complete, we’re ready to test our minimum solution with real end users. We release our simple software to a representative group of tame vegan takeaway diners, who will attempt to use it to find good food. Heck, we can try using it ourselves, too. I’m all in favour of developers eating their own (vegan) dog food, because there’s no substitute for experiencing it for ourselves.

Our end users may report that some of the restaurants in their search results were actually closed, and that they had to phone many takeaway restaurants to find one open. They may report that when they ordered food, it took over an hour to be delivered to their address because the restaurant had been a little – how shall we say? – optimistic about their reach. They may report that they were specifically interested in a particular kind of cuisine – e.g., Chinese or Indian – and that they had to scroll through pages and pages of results for takeaways that were of no interest to them before finding what they wanted.

We gather this real-world feedback and feed that back into another iteration, where we add and change features so we can test again to see if we’re closer to achieving our goal.

I like to picture these feedback loops as gear wheels. The biggest gear – goals – turns the slowest, and it drives the smaller features gear, which turns faster, driving the smaller and faster customer tests wheel, which drives the smallest and fastest unit tests wheel.

[Image: tdd_gears – the feedback loops of TDD as interlocking gear wheels]

It’s important to remember that the outermost wheel – goals – drives all the other wheels. They should not be turning by themselves. I see many teams where it’s actually the features wheel driving the goals wheel, and teams force their customers to change their goals to fit the features they’re delivering. Bad developers! In your beds!

It’s also very, very important to remember that the goals wheel never stops turning, because there’s actually an even bigger wheel making it turn – the real world – and the real world never stops turning. Things change, and there’ll always be new problems to solve, especially as, when we release software into the world, the world changes.

This is why it’s so very important to keep all our wheels well-oiled so they can keep on turning for as long as we need them to. If there’s too much friction in our delivery processes, the gears will grind to a halt – but the real world will keep on turning whether we like it or not.


Adventures In Multi-Threading

I’ve been spending my early mornings buried in Java threading recently. Although we talk often of concurrency and “thread safety” in this line of work, there’s surprisingly little actual multi-threaded code being written. Normally, when developers talk about multi-threading, we’re referring to how we write code to handle asynchronous operations in other people’s code (e.g., promises in JavaScript).

My advice to developers has always been to avoid writing multi-threaded code wherever possible. Concurrency is notoriously difficult to get right, and the safest multi-threaded code is single-threaded.

I’ve been eating my own dog food on that, and it occurred to me a couple of weeks back that I’ve written very little multi-threaded code myself in recent years.

But there is still some multi-threaded code being written in languages like Java, C# and Python for high-performance solutions that are targeted at multi-CPU platforms. And over the last few months I’ve been helping a client with just such a solution for scaling up property-based tests to run on multi-core Cloud platforms.

One of the issues we faced was: how do we test our multi-threaded code?

There’s a practical issue of executing multiple threads in a single-threaded unit test – particularly synchronizing so that we can assert an outcome after all threads have completed their work.

And also, thread scheduling is out of our control and – on Windows and similar platforms – unpredictable and non-repeatable. A race condition or a deadlock might not show up every time we run a test.

Over the last couple of weeks, I’ve been playing with a rough prototype to try and answer these questions. It uses a simple producer-consumer example – loading parcels into a loading bay and then taking them off the loading bay and loading them into a truck – to illustrate the challenges of both safe multi-threading and multi-threaded testing.

When I test multi-threaded code, I’m interested in two properties:

  • Safety – what should always be true while the code is executing?
  • Liveness – what should eventually be achieved?

To test safety, an assertion needs to be checked throughout execution. To test liveness, an assertion needs to be checked after execution.

After writing code to do this, I refactored the useful parts into custom assertion methods, always() and eventually().

always() takes a list of Runnables (Java’s equivalent of functions that accept no parameters and have no return value) that will concurrently perform the work we want to test. It will submit each Runnable to a fixed thread pool a specified number of times (thread count) and then wait for all the threads in the pool to terminate.

On a single separate thread, a boolean function (in Java, Supplier<Boolean>) is evaluated multiple times throughout execution of the threads under test. This terminates after the worker threads have terminated or timed out. If, at any point in execution, the assertion evaluates to false, the test will fail.

In use, it looks like this:
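(What follows is a rough, self-contained approximation rather than the prototype’s actual code – the signature, the parameter order and the cut-down LoadingBay are assumptions for illustration.)

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class LoadingBaySafetyCheck {

    // A cut-down stand-in for the loading bay: a bounded counter of parcels.
    static class LoadingBay {
        private static final int CAPACITY = 50;
        private int parcelCount = 0;

        synchronized boolean tryLoad() {
            if (parcelCount < CAPACITY) { parcelCount++; return true; }
            return false;
        }

        synchronized boolean tryUnload() {
            if (parcelCount > 0) { parcelCount--; return true; }
            return false;
        }

        synchronized int getParcelCount() { return parcelCount; }
    }

    // Sketch of an always()-style assertion: submit each Runnable to a fixed
    // thread pool 'threadCount' times, then poll the safety property while the
    // workers run. Returns false if the property was ever violated.
    static boolean always(Supplier<Boolean> assertion,
                          List<Runnable> actions,
                          int threadCount,
                          long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(actions.size() * threadCount);
        for (Runnable action : actions) {
            for (int i = 0; i < threadCount; i++) {
                pool.submit(action);
            }
        }
        pool.shutdown();

        boolean held = true;
        long deadline = System.currentTimeMillis() + timeoutMillis;
        // The prototype polls on a separate thread; polling on the calling
        // thread keeps this sketch short.
        while (!pool.awaitTermination(5, TimeUnit.MILLISECONDS)
                && System.currentTimeMillis() < deadline) {
            held &= assertion.get();
        }
        pool.shutdownNow();  // interrupt any workers that timed out
        return held && assertion.get();
    }

    public static void main(String[] args) throws InterruptedException {
        LoadingBay bay = new LoadingBay();

        // Each bayLoader loads 100 parcels; each truckLoader takes 100 off again.
        Runnable bayLoader = () -> {
            for (int loaded = 0; loaded < 100 && !Thread.currentThread().isInterrupted(); ) {
                if (bay.tryLoad()) loaded++;
            }
        };
        Runnable truckLoader = () -> {
            for (int unloaded = 0; unloaded < 100 && !Thread.currentThread().isInterrupted(); ) {
                if (bay.tryUnload()) unloaded++;
            }
        };

        // Safety property: the bay never holds more than 50 parcels.
        boolean safe = always(() -> bay.getParcelCount() <= 50,
                              List.of(bayLoader, truckLoader),
                              2,      // 2 of each Runnable = 4 worker threads
                              1000);  // give the workers up to 1000ms

        System.out.println(safe ? "Safety property held" : "Safety property violated");
    }
}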

bayLoader and truckLoader are objects that implement the Runnable interface. They will be submitted to the thread pool 2x each (because we’ve specified a thread count of 2 as our third parameter), so there will be 4 worker threads in total, accessing the same data defined in our set-up.

The bayLoader threads will load parcels on to the loading bay, which holds a maximum of 50 parcels, until all the parcels have been loaded.

The truckLoader threads will unload parcels from the loading bay and load them on to the truck, until the entire manifest of parcels has been loaded.

A safety property of this concurrent logic is that there should never be more than 50 parcels in the loading bay at any time, and that’s what our always assertion checks multiple times during execution:

() -> bay.getParcelCount() <= 50

When I run this test once, it passes. Running it multiple times, it still passes. But just because a test’s passing, that doesn’t mean our code really works. Let’s deliberately introduce an error into our test assertion to make sure it fails.

() -> bay.getParcelCount() <= 49

The first time I run this, the test fails. And the second and third times. But on the fourth run, the test passes. This is the thread determinism problem; we have no control over when our assertion is checked during execution. Sometimes it catches a safety error. Sometimes the error slips through the gaps and the test misses it.

The good news is that if it catches an error just once, that proves we have an error in our concurrent logic. Of course, if we catch no errors, that doesn’t prove they’re not there. (Absence of evidence isn’t evidence of absence.)

What if we run the test 100 times? Rather than sit there clicking the “run” button over and over, I can rig this test up as a JUnitParams parameterised test and feed it 100 test cases. (If you don’t have a parameterised testing feature, you can just loop 100 times).

When I run this, it fails 91/100 times. Changing the assertion back, it passes 100/100. So I can have 100% confidence the code satisfies this safety property? Not so fast. 100 test runs leaves plenty of gaps. Maybe I can be 99% confident with 100 test runs. How about we do 1000 test runs? Again, they all pass. So that gives me maybe 99.9% confidence. 10,000 could give me 99.99% confidence. And so on.

Thankfully, after a little performance engineering, 10,000 tests run in less than 30 seconds. All green.

The eventually() assertion method works along similar lines, except that it only evaluates its assertion once at the end (and therefore runs significantly faster):
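(Again, a sketch under the same assumed signature, not the prototype’s actual code.)

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class LivenessCheck {

    // Sketch of an eventually()-style assertion: run the concurrent work to
    // completion (or until it times out), then check the property just once.
    static boolean eventually(Supplier<Boolean> assertion,
                              List<Runnable> actions,
                              int threadCount,
                              long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(actions.size() * threadCount);
        for (Runnable action : actions) {
            for (int i = 0; i < threadCount; i++) {
                pool.submit(action);
            }
        }
        pool.shutdown();

        // A deadlocked worker never finishes, so this times out instead.
        boolean finished = pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        pool.shutdownNow();

        // The liveness property is evaluated once, after all the work is done.
        return finished && assertion.get();
    }
}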

If my code encounters a deadlock, the worker threads will time out after 1000 milliseconds. If a race condition occurs and our data becomes corrupted, the assertion will fail. Running this 10,000 times shows all the tests are green. I’m 99.99% confident my concurrent logic works.

Finally, speaking of deadlocks and race conditions, how might we avoid those?

A race condition can occur when two or more threads attempt to access the same data at the same time. In particular, we run the risk of a pre-condition paradox when bay loaders attempt to load parcels on to the loading bay, and truck loaders attempt to unload parcels from the bay.

The bay loader can only load a parcel if the bay is not full. A truck loader can only unload a parcel if the bay is not empty.

When I run my tests with this implementation of LoadingBay, 12% of them fail their liveness and safety checks, because there’s a non-zero chance of, say, one bay loader checking that the bay isn’t full and another bay loader loading the 50th parcel in between that check and the load. Similarly, a truck loader might check that the bay isn’t empty, but before it unloads the last parcel another truck loader thread takes it.

To avoid this situation, we need to ensure that pre-condition checks and actions are executed in a single, atomic sequence with no chance of other threads interfering.
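In Java, the simplest way to get that atomicity is to put the check and the action inside one synchronized block. Here’s a sketch of what that might look like – a guess at the shape of the LoadingBay, not the actual implementation:

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: the pre-condition check and the action happen inside one
// synchronized block, so no other thread can sneak in between them.
public class LoadingBay {
    private static final int CAPACITY = 50;
    private final Queue<String> parcels = new ArrayDeque<>();

    public synchronized void load(String parcel) {
        if (parcels.size() < CAPACITY) {   // check...
            parcels.add(parcel);           // ...and act, atomically
        }
        // If the bay is full, the parcel is silently dropped - atomic, but
        // not yet correct, which is the problem described next.
    }

    public synchronized String unload() {
        return parcels.poll();             // returns null if the bay is empty
    }

    public synchronized int getParcelCount() {
        return parcels.size();
    }
}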

When I test this implementation, tests still fail. The problem is that some parcels aren’t getting loaded on to the bay (though the bay loader thinks they have been), and some parcels aren’t getting unloaded, either. Our truck loader may be putting null parcels on the truck.

When loading, the bay must not be full. When unloading, it must not be empty. So our worker threads need to wait until their pre-conditions are satisfied. Now, Java gives us wait() and notify(), but they wait for a notification or a timeout – not for a condition to become true. What we need is to wait until a pre-condition is satisfied.
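Plain conditional synchronisation does the job: poll until the pre-condition holds, and keep the check-and-act atomic. A sketch, with illustrative names:

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: each operation polls until its pre-condition holds, and the
// check-and-act is still atomic.
public class LoadingBay {
    private static final int CAPACITY = 50;
    private final Queue<String> parcels = new ArrayDeque<>();

    public void load(String parcel) {
        while (true) {
            synchronized (this) {
                if (parcels.size() < CAPACITY) {  // wait until the bay is not full...
                    parcels.add(parcel);          // ...then load, atomically
                    return;
                }
            }
            Thread.yield();  // bay is full: let other threads run, then try again
        }
    }

    public String unload() {
        while (true) {
            synchronized (this) {
                if (!parcels.isEmpty()) {         // wait until the bay is not empty...
                    return parcels.remove();      // ...then unload, atomically
                }
            }
            Thread.yield();  // bay is empty: let other threads run, then try again
        }
    }
}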

This passes all 10,000 safety and liveness test runs, so I have 99.99% confidence we don’t have a race condition. But…

What happens when all the parcels have been loaded on to the truck? There’s a risk of deadlock if the bay remains permanently empty.

So we also need a way to stop the loading and unloading process once all the manifest has been loaded.

I’ve dealt with this in a similar way to waiting for pre-conditions to be satisfied, except this time we repeat loading and unloading until the parcels are all on the truck.

You may have already spotted the patterns in these two forms of loops:

  • Execute this action when this condition is true
  • Execute this action until this condition is true

Let’s refactor to encapsulate those nasty while loops.
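Something along these lines – an illustrative sketch rather than the actual helpers from the prototype:

import java.util.function.Supplier;

// Two small helpers that encapsulate the loops.
public class Loops {

    // Execute the action as soon as the condition is true, checking the
    // condition and performing the action atomically on the given lock.
    public static void when(Object lock, Supplier<Boolean> condition, Runnable action) {
        while (true) {
            synchronized (lock) {
                if (condition.get()) {
                    action.run();
                    return;
                }
            }
            Thread.yield();
        }
    }

    // Execute the action repeatedly until the condition is true.
    public static void until(Supplier<Boolean> condition, Runnable action) {
        while (!condition.get()) {
            action.run();
        }
    }
}

A worker’s job then reads as a composition of these helpers – for example, until(allParcelsLoaded, loadNextParcel) – rather than a nest of hand-rolled while loops.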

There. That looks a lot better, doesn’t it? All nice and functional.

I tend to find conditional synchronisation easier to wrap my head around than all the wait() and notify() and callbacks malarkey, and my experience so far with this approach suggests I tend to produce more reliable multi-threaded code.

My explorations continue, but I thought there might be folk out there who’d find it useful to see where I’ve got so far with this.

You can see the current source code at https://github.com/jasongorman/syncloop (it’s just a proof of concept, so provided with no warranty or support, of course.)


The Test Pyramid – The Key To True Agility

On the Codemanship TDD course, before we discuss Continuous Delivery and how essential it is to achieving real agility, we talk about the Test Pyramid.

It has various interpretations, in terms of exactly how many layers there are and exactly what kinds of testing each layer is made of (unit, integration, service, controller, component, UI etc), but the overall sentiment is straightforward:

The longer tests take to run, the fewer of those kinds of tests you should aim to have

[Image: test_pyramid]

The idea is that the tests we run most often need to be as fast as possible (otherwise we run them less often). These are typically described as “unit tests”, but that means different things to different people, so I’ll qualify: tests that do not involve any external dependencies. They don’t read from or write to databases, they don’t read or write files, they don’t connect with web services, and so on. Everything that happens in these tests happens inside the same memory address space. Call them In-Process Tests, if you like.

Tests that necessarily check our code works with external dependencies have to cross process boundaries when they’re executed. As our In-Process tests have already checked the logic of our code, these Cross-Process Tests check that our code – the client – and the external code – the suppliers – obey the contracts of their interactions. I call these “integration tests”, but some folk have a different definition of integration test. So, again, I qualify it as: tests that involve external dependencies.

These typically take considerably longer to execute than “unit tests”, and we should aim to have proportionally fewer of them and to run them proportionally less often. We might have thousands of unit tests, and maybe hundreds of integration tests.

If the unit tests cover the majority of our code – say, 90% of it – and maybe 10% of our code has direct external dependencies that have to be tested, on average we’ll make about 9 changes that need unit testing compared to 1 change that needs integration testing. In other words, we’d need to run our unit tests 9x as often as our integration tests, which is a good thing if each integration test is about 9 times slower than a unit test.

At the top of our test pyramid are the slowest tests of all. Typically these are tests that exercise the entire system stack, through the user interface (or API) all the way down to the external dependencies. These tests check that it all works when we plug everything together and deploy it into a specific environment. If we’ve already tested the logic of our code with unit tests, and tested the interactions with external suppliers, what’s left to test?

Some developers mistakenly believe that these system-level tests are for checking the logic of the user experience – user “journeys”, if you like. This is a mistake. There are usually a lot of user journeys, so we’d end up with a lot of these very slow-running tests and an upside-down pyramid. The trick here is to make the logic of the user experience unit-testable. View models are a simple architectural pattern for logically representing what users see and what users do at that level. At the highest level they may be looking at an HTML table and clicking a button to submit a form, but at the logical level, maybe they’re looking at a movie and renting it.

A view model can help us encapsulate the logic of user experience in a way that can be tested quickly, pushing most of our UI/UX tests down to the base of the pyramid where they belong. What’s left – the code that must directly reference physical UI elements like HTML tables and buttons – can be wafer thin. At that level, all we’re testing is that views are rendered correctly and that user actions trigger the correct internal logic (which can easily be done using mock objects). These are integration tests, and belong in the middle layer of our pyramid, not the top.
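Here’s a minimal sketch of the view model idea – the names are purely illustrative:

// Sketch: the logic of the user experience lives in a plain view model that
// can be unit tested in-process.
public class MovieRentalViewModel {

    // Collaborator abstraction for the core rental logic - easily mocked in a test.
    public interface RentalService {
        void rent(String movieId);
    }

    private final RentalService rentals;
    private String statusMessage = "";

    public MovieRentalViewModel(RentalService rentals) {
        this.rentals = rentals;
    }

    // Logically, the user is looking at a movie and renting it - no HTML in sight.
    public void rentMovie(String movieId) {
        rentals.rent(movieId);
        statusMessage = "Enjoy the film!";
    }

    // What the view renders; a thin UI layer just binds to this.
    public String getStatusMessage() {
        return statusMessage;
    }
}

A fast in-process test can drive rentMovie() against a mocked RentalService and assert on the status message, without a browser or an HTML table in sight.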

Another classic error is to check core logic through the GUI. For example, checking that insurance premiums are calculated correctly by looking at what number is rendered on that web page. Some module somewhere does that calculation. That should be unit-testable.

So, if they’re not testing user journeys, and they’re not testing core logic, what do our system tests test? What’s left?

Well, have you ever found yourself saying “It worked on my machine”? The saying goes “There’s many a slip ‘twixt cup and lip.” Just because all the pieces work, and just because they all play nicely together, it’s not guaranteed that when we deploy the whole system into, say, our EC2 instances, nothing will be different from the environments we tested it in. I’ve seen roll-outs go wrong because the servers handled dates differently, or had the wrong locale, or a different file system, or security restrictions that weren’t in place on dev machines.

The last piece of the jigsaw is the system configuration, where our code meets the real production environment – or a simulation of it – and we find out if it really works where it’s intended to work as a whole.

We may need dozens of those kinds of tests, and perhaps only need to run them on, say, every CI build by deploying the outputs to a staging environment that mirrors the production environment (and only if all our unit and integration tests pass first, of course.) These are our “good to go?” tests.

The shape of our test pyramid is critical to achieving feedback loops that are fast enough to allow us to sustain the pace of development. Ideally, after we make any change, we should want to get feedback straight away about the impact of that change. If 90% of our code can be re-tested in under 30 seconds, we can re-test 90% of our changes many times an hour and be alerted within 30 seconds if we broke something. If it takes an hour to re-test our code, then we have a problem.

Continuous Delivery means that our code is always shippable. That means it must always be working, or as near to always as possible. If re-testing takes an hour, that means we’re an hour away from finding out if the changes we made broke the code. It means we’re an hour away from knowing if our code is shippable. And, after an hour’s worth of changes without re-testing, chances are high that it is broken and we just don’t know it yet.

An upside-down test pyramid puts Continuous Delivery out of your reach. Your confidence that the code’s shippable at any point in time will be low. And the odds that it’s not shippable will be high.

The impact of slow-running test suites on development is profound. I’ve found many times that when a team invested in speeding up their tests, many other problems magically disappeared. Slow tests – which mean slow builds, which mean slow release cycles – are like a slow metabolism for a development team. Many health problems can be caused by a slow metabolism. It really is that fundamental.

Slow tests are pennies to the pound of the wider feedback loops of release cycles. You’d be surprised how much of your release cycles are, at the lowest level, made up of re-testing cycles. The outer feedback loops of delivery are made of the inner feedback loops of testing. Fast-running automated tests – as an enabler of fast release cycles and sustained innovation – are therefore highly desirable.

A right-way-up test pyramid doesn’t happen by accident, and doesn’t come at no cost, though. Many organisations, sadly, aren’t prepared to make that investment, and limp on with upside-down pyramids and slow test feedback until the going gets too tough to continue.

As well as writing automated tests, there’s also an investment needed in your software’s architecture. In particular, the way teams apply basic design principles tends to determine the shape of their test pyramid.

I see a lot of duplicated code that contains duplicated external dependencies, for example. It’s not uncommon to find systems with multiple modules that connect to the same database, or that connect to the same web service. If those connections happened in one place only, that part of the code could be integration tested just once. D.R.Y. helps us achieve a right-way-up pyramid.

I see a lot of code where a module or function that does a business calculation also connects to an external dependency, or where a GUI module also contains business logic, so that the only way to test that core logic is with an integration test. Single Responsibility helps us achieve a right-way-up pyramid.

I see a lot of code where a module in one web service interacts with multiple features of another web service – Feature Envy, but on a larger scale – so there are multiple points of integration that require testing. Encapsulation helps us achieve a right-way-up pyramid.

I see a lot of code where a module containing core logic references an external dependency, like a database connection, directly by its implementation, instead of through an abstraction that could be easily swapped by dependency injection. Dependency Inversion helps us achieve a right-way-up pyramid.
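For example – a minimal sketch, with illustrative names – the core logic can own an abstraction that the real database implementation plugs into:

// Core logic depends on an abstraction, not on a concrete database connection.
public class PremiumCalculator {

    public interface RiskData {                    // abstraction owned by the core
        double riskFactorFor(String driverId);
    }

    private final RiskData riskData;

    public PremiumCalculator(RiskData riskData) {  // injected: a real database-backed
        this.riskData = riskData;                  // implementation in production,
    }                                              // a stub or mock in unit tests

    public double premiumFor(String driverId, double basePremium) {
        return basePremium * riskData.riskFactorFor(driverId);
    }
}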

Achieving a design with less duplication, where modules do one job, where components and services know as little as possible about each other, and where external dependencies can be easily stubbed or mocked by dependency injection, is essential if you want your test pyramid to be the right way up. But code doesn’t get that way by accident. There’s significant ongoing effort required to keep the code clean by refactoring. And that gets easier the faster your tests run. Chicken, meet egg.

If we’re lucky enough to be starting from scratch, the best way we know of to ensure a right-way-up test pyramid is to write the tests first. This compels us to design our code in such a way that it’s inherently unit-testable. I’ve yet to come across a team genuinely doing Continuous Delivery who wasn’t doing some kind of TDD.

If you’re working on legacy code, where maybe you’re relying on browser-based tests, or might have no automated tests at all, there’s usually a mountain to climb to get a test pyramid that’s the right way up. You need to write fast-running tests, but you will probably need to refactor the code to make that possible. Egg, meet chicken.

Like all mountains, though, it can be climbed. One small, careful step at a time. Michael Feathers’ book Working Effectively With Legacy Code describes a process for making changes safely to code that lacks fast-running automated tests. It goes something like this:

  • Identify what code you need to change
  • Identify where around that code you’d want unit tests to make the change safely
  • Break any dependencies in that code getting in the way of unit testing
  • Write the unit tests
  • Make the change
  • While you’re there, make other improvements that will help the next developer who needs to change that code (the “boy scout rule” – leave the camp site tidier than you found it)

Change after change, made safely in this way, will – over time – build up a suite of fast-running unit tests that will make future changes easier. I’ve worked on legacy code bases that went from upside-down test pyramids of mostly GUI-based system tests, that took hours or even days to run, to right-side-up pyramids where most of the code could be tested in under a minute. The impact on the cost and the speed of delivery is always staggering. It can be done.

But be patient. A code base might take a year or two to turn around, and at first the going will be tough. I find I have to be super-disciplined in those early stages. I manually re-test as I refactor, and resist the temptation to make a whole bunch of changes at a time before I re-test. Slow and steady, adding value and clearing paths for future changes at the same time.

How Agile Works

After 18 years of talk and hype about Agile, I find that it’s easy to lose sight of what Agile means in essence, and – importantly – how it works.

I see it as an inescapable reality of software development – or any sufficiently complex endeavour – that we shouldn’t expect to get it right first time. The odds of our first solution being the best solution are vanishingly small – the proverbial “hole in one”.

So we should expect to take multiple passes at a solution, learning with each iteration of the design what works and what doesn’t, and progressively getting it less wrong.

If Agile is an algorithm, then it’s a search algorithm. It searches an effectively infinite solution space for a design that best fits our problem. The name of this search algorithm is evolution.

Starting with the simplest input, it tests that design against one or more fitness functions. The results of this test are fed back into the next iteration of the design. And around and around we go, adding a little, changing a little, and testing again and again.

In nature, evolution takes tiny steps forward. If a viable organism produces offspring that are too different from itself, chances are the next generation will be non-viable. Evolution doesn’t take big, risky leaps. Instead, it edges forward one tiny, low-risk change at a time.

The Agile design process doesn’t make 100 changes to a solution and then test for fitness. It makes one or two changes, and sees how they work out before making more.

The speed of this search algorithm depends on three things:

  • The frequency of iterations
  • The amount of change in each iteration
  • The quality of feedback into the next iteration

If releases of working software are too far apart, we learn too slowly about what works and what doesn’t.

If we change too much in each release, we increase the risk of making the solution non-viable. We also take on a much higher risk and cost if a release has to be rolled back, as we lose a tonne of changes. It’s in the nature of software that it works as a connected whole. It’s easy to roll back 1 of 1 changes. It’s very hard to roll back 1 of 100 changes.

The lessons we learn with each release will depend on how it was tested. We find that feedback gathered from real end users using the software for real is usually the most valuable feedback. Everything else is just guesswork until our code meets the real world.

“Agile” teams who do weekly show-and-tells, but release working software into production less frequently, are missing out on the best feedback. Our code’s just a hypothesis until real people try to use it for real.

This is why our working relationship with our customer is so important – critical, in fact. Far too many teams who call themselves “Agile” don’t get to engage with the customer and end users directly, and the quality of the feedback suffers when we’re only hearing someone’s interpretation of what the feedback was. It works best when the people writing the code get to see and hear first-hand from the people using it.

For me, it’s not Agile if it doesn’t fully embrace those fundamental principles, because they’re the engine that makes it work. Agile teams do small, frequent releases of working software to real customers and end users who they work with directly.

To achieve this, there are some technical considerations. If it takes a long time to check that the software’s fit for release, then you will release less often. If it takes a long time to build and deploy the software, then you’ll release less often. If the changes get harder and harder to make, then you’ll release less often.

And even after we’ve solved the problem, the world doesn’t stand still. The most common effect of releasing software into the world is that – if the software gets used – the world changes. Typically, it changes in ways we weren’t expecting. Western democracies are still struggling with the impact of social media, for example. But on a smaller scale, releasing software into any environment can have unintended consequences.

It’s not enough to get it right once. We have to keep learning and keep changing the software, normally for its entire operational lifetime (which, on average, is about 8 years). So we have to be able to sustain the pace of releases pretty much indefinitely.

All this comes with a bunch of technical challenges that have to be met in order to achieve small, frequent releases at a sustainable pace. Most “Agile” teams fail to master these technical disciplines, and their employers resist making the investment in skills, time and tools required to build a “delivery engine” that’s up to the job.

Most “Agile” teams don’t have the direct working relationship with the people using their software required to gain the most useful feedback.

To put it more bluntly, most “Agile” teams aren’t really Agile at all. They mistake Jira and Jenkins and stand-up meetings and backlogs and burn-down charts for agility. None of those things are, in and of themselves, Agile.

Question is: are you?

The 2 Most Critical Feedback Loops in Software Development

When I’m explaining the inner and outer feedback loops of Test-Driven Development – the “wheels within wheels”, if you like – I make the point that the two most important feedback loops are the outermost and the innermost.

[Image: feedbackloops – the nested feedback loops of TDD]

The outermost because the most important question of all is “Did we solve the problem?” The innermost because the answer is usually “No”, so we have to go round again. This means that the code we delivered will need to change, which raises the second most important question: “Did we break the code?”

The sooner we can deliver something so we can answer “Did we solve the problem?”, the sooner we can feedback the lessons learned on the next go round. The sooner we can re-test the code, the sooner we can know if our changes broke it, and the sooner we can fix it ready for the next release.

I realised nearly two decades ago that everything in between – requirements analysis, customer tests, software design, etc etc – is, at best, guesswork. A far more effective way of building the right thing is to build something, get folk to use it, and feedback what needs to change in the next iteration. Fast iterations accelerate this learning process. This is why I firmly believe these days that fast iterations – with all that entails – is the true key to building the right thing.

Continuous Delivery – done right, with meaningful customer feedback drawn from real use in the real world (or as close as we dare bring our evolving software to the real world) – is the ultimate requirements discipline.

Fast-running automated tests that provide good assurance that our code’s always working are essential to this. How long it takes to build, test and deploy our software will determine the likely length of those outer feedback loops. Typically, the lion’s share of that build time is regression testing.

About a decade ago, many teams told me “We don’t need unit tests because we have integration tests”, or “We have <insert name of trendy new BDD tool here> tests”. Then, a few years later, their managers were crying “Help! Our tests take 4 hours to run!” A 4-hour build-and-test cycle creates a serious bottleneck, leading to code that’s almost continuously broken without teams knowing. In other words, not shippable.

Turn a 4-hour build-and-test cycle into a 40-second build-and-test cycle, and a lot of problems magically disappear. You might be surprised how many other bottlenecks in software development have slow-running tests as their underlying cause – analysis paralysis, for example. That’s usually a symptom of high stakes in getting it wrong, and that’s usually a symptom of infrequent releases. “We better deliver the right thing this time, because the next go round could be 6 months later.” (Those among us old enough to remember might recall just how much more care we had to take over our code because of how long it took to compile. It’s a similar effect, but on a much larger scale with much higher stakes than a syntax error.)

Where developers usually get involved in this process – user stories and backlogs – is somewhere short of where they need to be involved. User stories – and prioritised queues of user stories – are just guesses at what an analyst or customer or product owner believes might solve the problem. To obsess over them is to completely overestimate their value. The best teams don’t guess their way to solving a problem; they learn their way.

Like pennies to the pound, the outer feedback loop of “Does it actually work in the real world?” is made up of all the inner feedback loops, and especially the innermost loop of regression testing after code is changed.

Teams who invest in fast-running automated regression tests have a tendency to out-learn teams who don’t, and their products have a tendency to outlive the competition.


How to Beat Evil FizzBuzz

On the last day of the 3-day Codemanship TDD training workshop, participants are asked to work as a team to solve what would – for an individual developer – be a very simple exercise.

The FizzBuzz TDD kata is well known, and a staple in many coding interviews these days. Write a program that outputs the numbers 1…100 as a single comma-delimited string. Any numbers that are divisible by 3, replace with ‘Fizz’. Any numbers that are divisible by 5, replace with ‘Buzz’. And any numbers that are divisible by 3 and 5, replace with ‘FizzBuzz’. Simples.

An individual can usually complete this in less than half an hour. But what if we make it evil?

We split the problem up into five parts, and then assign each part to a pair or individual in the group, who can only work on the code for their part:

  • Generate a list of integers from 1 to 100
  • Replace integers divisible by 3 with ‘Fizz’
  • Replace integers divisible by 5 with ‘Buzz’
  • Replace integers divisible by 3 and 5 with ‘FizzBuzz’
  • Output the resulting list as a comma-delimited string

Working as a single team to produce a single program that passes my customer test – seeing the final string with all the numbers, Fizzes, Buzzes and FizzBuzzes in the right places produced by their program run on my computer – the group has to coordinate closely to produce a working solution. They have one hour, and no more check-ins are allowed after their time’s up. They must demonstrate whatever they’ve got in the master branch of their GitHub repository at the end of 60 minutes.

This is – on the surface of it – an exercise in Continuous Integration. They need to create a shared repository, and each work on their own copy, pushing directly to the master branch. (This is often referred to as trunk-based development.) They must set up a CI server that runs a build – including automated tests – whenever changes are pushed.

Very importantly, once the CI server is up and running, and they’ve got their first green build, the build must never go red again. (Typically it takes a few tries to get a build up and running, so they often start red.)

Beyond those rules:

  • Produce a single program that passes the customer’s test on the customer’s machine
  • Only write code for the part they’ve been assigned
  • Push directly to master on a single GitHub repository – no branching
  • CI must run a full build – including tests – on every push
  • Must not break the build once it’s gone green for the first time
  • Last push must happen before the end of the hour

They can do whatever they need to. It’s their choice of programming language, application type (console, web app, desktop app etc) and so on. They choose which CI solution to use.

90% of groups who attempt Evil FizzBuzz fail to complete it within the hour. The three most common reasons they fail are:

  1. Too long shaving yaks – many groups don’t get their CI up and running until about 30-40 minutes in. In some cases, they never get it up and running.
  2. Lack of a bigger picture – many groups fail to establish a shared vision for how their program will work, and – importantly – how the pieces will fit together
  3. Integrating too late – with cloud-based CI, the whole process of checking your code in can take 2-3 minutes minimum. Times that by 5, and groups often discover that everyone deciding to push their changes with just five minutes to go means their ship has sailed without them.

On the first point, it’s important to have a game plan and to keep things simple. I can illustrate using a Node and JavaScript example.

First, one of the pairs needs to create a skeleton Node project, with a dummy test for the build server to run. We need to get our delivery pipeline up and running quickly, before anyone even thinks about writing any solution code.

[Image: skeleton_node_project]

This is just an empty Node project, with a single dummy Mocha unit test. Make sure the test passes, then create a GitHub repository and push this skeleton project to it.

[Image: initial_commit]

Now, let’s set up a CI server. I’m going to use circleci.com. Logging in with my GitHub account, I can easily see and add a build project for my new evil_fizzbuzz repository.

[Image: add_circleci_project]

It helps enormously to go with the popular conventions for your project. I’m using Node, which is widely supported, Mocha for tests which are named and located where – by default – the build tool would expect to find them, and it’s all very Yarn-friendly. Well, maybe. We’ll see. I add a .circleci/config.yml file to my project and paste in the default settings recommended for my project by CircleCI.

[Image: circleci_config]

Then I push this new file to master, and instruct CircleCI to start a build. This first build fails. They usually do. Looking at the output, the part of the workflow where it fell over has the error message:

The engine "node" is incompatible with this module. Expected version "6.* || 8.* || >= 10.*"

I’m not proud. Don’t sit there trying to figure things like this out. Just Google the error message and see if anyone has a fix for it. Turns out it’s common, and there’s a simple fix you can do in the config.yml file. So I fix it, push that change, and wait for a second build.

[Image: green_build]

The build succeeds, but I need to make sure the test was actually run before we can continue.

[Image: tests_ran]

Looks like we’re in business. Time to start working on our solution.

Next, you’ll need to invite all your team mates to contribute to your GitHub project. This is where team skills help: someone needs to get all the necessary user IDs, make sure everyone is aware that invites are being sent out, and ensure everyone accepts their invite ASAP. Coordination!

While this is going on, someone should be thinking about how the finished program will be demonstrated on the customer’s laptop. Do they have a compatible version of Node.js installed already? And how will they resolve dependencies – in this case, Mocha?

Effective software design begins and ends with the user experience. The pair responsible for the final output should take care of this, I think.

Time to complete our end-to-end “Hello, world!” so our delivery pipeline joins all the dots.

The output pair add a JavaScript file that will act as the entry point for the program, and have it write “Hello, world!” to the console.

[Image: hello_world]

After checking program.js works on the local command line, push it to master.

We establish that our customer – me, in this case – happens to have Git and Node.js installed, so possibly the simplest way to demonstrate the program running on my computer might be to clone the files from master into a local folder, run npm install to resolve the Mocha dependency, and then we can just run node program.js in our customer demo. (We can tidy that up later if need be, but it will pass the test.)

rmdir teamjason /s /q
mkdir teamjason
cd teamjason
git clone https://github.com/jasongorman/evil_fizzbuzz.git
cd evil_fizzbuzz
npm install

We test that it works on the customer’s laptop, and now we’re finally ready to start implementing our FizzBuzz solution.

Phew. Yaks shaved.

But where to start?

This is the second place a lot of teams go wrong. They split off into their own pairs, clone the GitHub repository, and start working on their part of the solution straight away with no overall understanding of how it will all fit together to solve the problem.

This is where mob programming can help. Before splitting off, get everyone around one computer (there’s always a projector or huge TV in the room they can use). The pair responsible for writing the final output write the code (which satisfies the rules), while the rest of the group give input on the top-level design. In simpler terms, the team works outside-in, to identify what parts will be needed and see how their part fits in.

In my illustration, I’m thinking maybe a bit of functional composition might be the way to go.

This is the only code the pair who are responsible for outputting the result are allowed to write, according to the rules of Evil FizzBuzz. But the functions used here don’t exist, so we can’t push this to master without breaking the build.

Here’s where we get creative. Each of the other four pairs takes their turn at the keyboard to declare their function – just an empty one for now.

We can run this and see that it is well-formed, and produces an empty output, as we’d expect at this point. Let’s push it to master.

It’s vital for everyone to keep one eye on the build status, as it’s a signal – a pulse, if you like – every developer on a team needs to be aware of. This build succeeds.

[Image: builds]

So, we have an end-to-end delivery pipeline, and we have a high-level design, so everyone can see how their part fits into the end solution.

This can be where pairs split off to implement their part. Now is the time to make clones and here’s where the CI skills come into play.

Let’s say one pair is working on the Fizz part. They take a clone of master, and – because it is a TDD course, after all – write and pass their first Mocha test.

On a green light, it’s time maybe for a bit of refactoring. The pair decide to pull the fizz function into its own file, to keep what they’re doing more separate from everyone else.

Having refactored the structure of the solution a little, they feel this might be a good time to share those changes with the rest of the team. This helps avoid the third mistake teams make – integrating too late, with too many potentially conflicting changes. (Many Evil FizzBuzz attempts end with about 15 minutes of merge hell.) Typically this ends with them breaking the build and the team disqualified.

But before pushing to master, they run all of the tests, just to be sure.

[Image: fizz_test]

With all tests passing, it should be safe to push. Then they wait for a green build before moving on to the next test case.

[Image: build_in_progress]

While builds are in progress, other members of the team must be mindful that it’s not safe to push their changes until the whole process has completed successfully. They must also ensure they don’t pull changes that break the build, so everyone should be keeping one eye on the build status.

Phew. It’s green.

When you see someone else’s build succeed, that would be a good time to consider pulling any changes that have been made, and running all of the tests locally. Keeping in step with master, when working in such close proximity code-wise, is very important.

Each pair continues in this vein: pass a test, maybe do some refactoring, check in those changes, wait for a green build, pull any changes other pairs have made when you see their builds go green, and keep running those tests!

It’s also a very good idea to keep revisiting the customer test to see what visible progress is being made, and to spot any integration problems as early as possible. Does the high-level design actually work? Is each function playing its part?

Let’s pay another visit to the team after some real progress has been made. When we run the customer test program, what output do we get now?

[Image: command_line_inprogress]

Okay, it looks like we’re getting somewhere now. The list of 100 numbers is being generated, and every third number is Fizz. Work is in progress on Buzz and FizzBuzz. If we were 45 minutes in at this point, we’d be in with a shot at beating Evil FizzBuzz.

Very quickly, the other two pieces of our jigsaw slot into place. First, the Buzzes…

[Image: command_line_inprogress_buzz]

And finally the FizzBuzzes.

[Image: command_line_complete]

At this point, we’re pretty much ready for our real customer test. We shaved the yaks, we established an overall design, we test-drove the individual parts and are good to go.

So this is how – in my experience – you beat Evil FizzBuzz.

  1. Shave those yaks first! You need to pull together a complete delivery pipeline, that includes getting it on to the customer’s machine and ready to demo, as soon as you can. The key is to keep things simple and to stick to standards and conventions for the technology you’ve chosen. It helps enormously, of course, if you have a good amount of experience with these tools. If you don’t, I recommend working on that before attempting Evil FizzBuzz. “DevOps” is all the rage, but surprisingly few developers actually get much practice at it. Very importantly, if your delivery pipeline isn’t up and running, the whole delivery machine is blocked. Unshaved yaks are everybody’s problem. Don’t have one pair “doing the build” while the rest of you go away and work on code. How’s your code going to get into the finished solution and on to the customer’s machine?
  2. Get the bigger picture and keep it in sight the whole time. Whether it’s through mob programming, sketching on a whiteboard or whatever – involve the whole team and nail that bird’s-eye view before you split off. And, crucially, keep revisiting your final customer test. Lack of visibility of the end product is something teams working on real products and projects cite as a major barrier to getting the right thing done. Invisible progress often turns out to be no progress at all. As ‘details people’, we tend to be bad at the bigger picture. Work on getting better at it.
  3. Integrate early and often. You might only have 3 unit tests to pass for your part in a one-hour exercise, but that’s 3 opportunities to test and share your changes with the rest of the team. And the other side of that coin – pull whenever you see someone else’s build succeed, and test their changes on your desktop straight away. 5 pairs trying to merge a bunch of changes in the last 15 minutes often becomes a train wreck. Frequent, small merges work much better on average.


Code Craft : Part III – Unit Tests are an Early Warning System for Programmers

Before I was introduced to code craft, my way of checking that the programs I wrote worked was to run them and use them and see if they did what I expected them to do.

Consider this command line program I wrote that does some simple maths:

I can run this program with different inputs to check if the results of the calculations are correct.

C:\Users\User\PycharmProjects\pymaths>python maths.py sqrt 4.0
The square root of 4.0 = 2.0

C:\Users\User\PycharmProjects\pymaths>python maths.py factorial 5
5 factorial = 120

C:\Users\User\PycharmProjects\pymaths>python maths.py floor 4.7
The floor of 4.7 = 4.0

C:\Users\User\PycharmProjects\pymaths>python maths.py ceiling 2.3
The ceiling of 2.3 = 3.0

Testing my code by using the program is fine if I want to check that it works first time around.

These four test cases, though, don’t give me a lot of confidence that the code really works for all the inputs my program has to handle. I’d want to cover more examples, perhaps using a list to remind me what tests I should do.

  • sqrt 0.0 = 0.0
  • sqrt -1.0 -> should raise an exception
  • sqrt 1.0 = 1.0
  • sqrt 4.0 = 2.0
  • sqrt 6.25 = 2.5
  • factorial 0 = 1
  • factorial 1 = 1
  • factorial 5 = 120
  • factorial -1 -> should raise an exception
  • factorial 0.5 -> should raise an exception
  • floor 0.0 = 0.0
  • floor 4.7 = 4.0
  • floor -4.7 = -5.0
  • ceiling 0.0 = 0.0
  • ceiling 2.3 = 3.0
  • ceiling -2.3 = -2.0

Now, that’s a lot of test cases (and we haven’t even thought about how we handle incorrect command line arguments yet).

To run the program and try all of these test cases once seems like quite a bit of work, but if it’s got to be done, it’s got to be done. (The alternative is not doing all these tests, and then how do we know our program really works?)

But what if I need to change my maths code? (And if we know one thing about code, it’s that it changes). Then I’ll need to perform these tests again. And if I change the code again, I have to do the tests again. And again. And again. And again.

If we don’t re-test the code after we’ve changed it, we risk not knowing if we’ve broken it. I don’t know about you, but I’m not happy with the idea of my end users being lumbered with broken software. So I re-test the software every time it changes.

It took me about 5-6 minutes to perform all of these tests using the command line. That’s 5-6 minutes of testing every time I need to change my code. And maybe 5-6 minutes of testing doesn’t sound like a lot, but this program only has about 40 lines of code. Extrapolate that testing time to 1,000 lines of code. Or 10,000 lines. Or a million.

Testing programs by using them – what we call manual testing – simply doesn’t scale up to large amounts of code. The time it takes to re-test our program when we’ve changed the code becomes an obstacle to making those changes safely. If it takes hours or days or even weeks to re-test it, then change will be slow and difficult. It may even be impractical to change it at all, and far too many programs lots of people rely on end up in this situation. The time taken to test our code has a profound impact on the cost of making changes.

Studies have shown that the effort required to fix a bug rises dramatically the longer that bug goes undiscovered.

[Image: Cost of Correcting Defects – Boehm and Basili]

If it takes a week to re-test our program, then the cost of fixing the bugs that testing discovers will be much higher than if we’d been alerted a minute after we made that error. The average programmer can introduce a lot of bugs in a week.

Creating good working software depends heavily on our ability to check that the code’s working very frequently – almost continuously, in fact. So we have to be able to perform our tests very, very quickly. And that’s not possible when we perform them manually.

So, how could we speed up testing to make changes quicker and easier? Well, we’re computer programmers – so how about we write a computer program to test our code?

A few things to note about my test code:

  • Each test case has a unique name to make it easy to identify which test failed
  • There are two helper functions that ask if the actual result matches the expected result – either an expected output, or an expected exception that should have been raised
  • The script counts the total number of tests run and the number of tests passed, so it can summarise the result of running this suite of tests
  • My test code isn’t testing the whole program from the outside, like I was doing at the command line. Some code just tests the sqrt function, some just tests the factorial function, and so on. Tests that only test parts of a program are often referred to as unit tests. A ‘unit’ could be an individual function or a method of a class, or a whole class or module, or a group of these things working together to do a specific job. Opinions vary, but what we mostly all agree is that a unit is a discrete part of a program, and not the whole program.

The advantages of testing units instead of whole programs are important:

  1. When a test fails, it’s much easier to pinpoint the source of the problem
  2. Less code is executed in order to check a specific piece of logic works, so unit tests tend to run much faster
  3. By invoking functions directly, there’s usually less code involved in writing a unit test

When I run my test script, if all the tests pass, I get this output:

Running math tests…
Tests run: 16
Passed: 16 , Failed: 0

Phew! All my tests are passing.

This suite of tests ran in a fraction of a second, meaning I can run them as many times as I like, as often as I want. I can change a single line of code, then run my tests to check that change didn’t break anything. If I make a boo-boo, there’s a high chance my tests will alert me straight away. We say that these automated tests give me high assurance that – at any point in time – my code is working.

This ability to re-test our code after just a single change can make a huge difference to how we program. If I break the code, very little has changed since the code was last working, so it’s much easier to pinpoint what’s gone wrong and much easier to fix it. If I’ve made 100 changes before I re-test the code, it could be a lot of work to figure out which change(s) caused the problem. I have found, after 25 years of writing unit tests, that I need to spend very little time in my debugger.

If any tests fail, I get this kind of output:

Running math tests…
sqrt of 0.0 failed – expected 1.0 , actual 0
sqrt of -1.0 failed – expected Exception to be raised
Tests run: 16
Passed: 14 , Failed: 2

It helpfully tells me which tests failed, and what the expected and actual results were, to make it easier for me to pin down the cause of the problem. Since I only made a small change to the code since the tests last all passed, it’s easy for me to fix.

Notice that I’ve grouped my tests by the function that they’re testing. There’s a bunch of tests for the sqrt function, a bunch for factorial, and more for floor and for ceiling. As my maths program grows, I’ll add many more tests. Keeping them all in one big module will get unmanageable, so it makes sense to split them out into their own modules. That makes them easier to manage, and also allows us to run just the tests for, say, sqrt, or just the tests for factorial – if we only changed code in those parts of the program – if we want to.

Here I’ve split the tests for sqrt into their own test module, which we call a test fixture. It can be run by itself, or can be invoked as part of the main test suite along with the other test fixtures.
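
Sketched below is roughly how that split might look. The file name sqrt_tests.py, and the idea that the assert_equals and assert_raises helpers (plus a summarise function) have been moved into a shared test_helpers module, are assumptions:

# sqrt_tests.py – the test fixture for the sqrt function
from maths import sqrt
from test_helpers import assert_equals, assert_raises, summarise

def run():
    assert_equals("sqrt of 4.0", 2.0, sqrt(4.0))
    assert_equals("sqrt of 1.0", 1.0, sqrt(1.0))
    assert_equals("sqrt of 0.25", 0.5, sqrt(0.25))
    assert_equals("sqrt of 0.0", 0.0, sqrt(0.0))
    assert_raises("sqrt of -1.0", Exception, lambda: sqrt(-1.0))

if __name__ == "__main__":
    # Run just this fixture on its own; the main test suite imports run() instead
    run()
    summarise()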

The two helper functions I wrote that check and record the result of each test – assert_equals and assert_raises – could be reused in other suites of tests, since they’re quite generic. What I’ve created here could be the beginnings of a reusable library for writing test scripts in Python.

As my maths program grows, and I add more and more tests, there’ll likely be more helper functions I’ll find useful. But, in computing, before you set out to write a reusable library to help you with something, it’s usually a good idea to check if someone’s already written one.

For a problem as common as automating program tests, you won’t be surprised to learn that such libraries already exist. Python has several, but the most commonly used test automation library actually comes as part of Python’s standard modules – unittest (formerly known as PyUnit).

Here are the sqrt tests I wrote, translated into unittest tests.
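
A sketch of that fixture – the maths module being imported and the exact error message matched by the regular expression are assumptions:

import unittest
from maths import sqrt

class SqrtTest(unittest.TestCase):

    # Each test method starts with test_ so that unittest recognises it as a test
    def test_sqrt_0(self):
        self.assertEqual(0.0, sqrt(0.0))

    def test_sqrt_1(self):
        self.assertEqual(1.0, sqrt(1.0))

    def test_sqrt_4(self):
        self.assertEqual(2.0, sqrt(4.0))

    def test_sqrt_quarter(self):
        self.assertEqual(0.5, sqrt(0.25))

    def test_sqrt_minus1(self):
        # Also checks the message the exception is raised with, using a regular expression
        self.assertRaisesRegex(Exception, "negative",
                               lambda: sqrt(-1.0))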

There’s a lot to unittest, but this test fixture uses just some of its basic features.

To create a test fixture, you just need to declare a class that inherits from unittest.TestCase. Individual tests are methods of your fixture class whose names start with test_ – so that unittest knows they’re tests – and they take no parameters (other than self) and return no data.

The TestCase class defines many useful helper methods for making assertions about the result of a test. Here, I’ve used assertEqual and assertRaisesRegex.

assertEqual takes an expected result value as the first parameter, followed by the actual result, and compares the two. If they don’t match, the test fails.

assertRaisesRegex is like my own assert_raises, except that it also matches the error message the exception is raised with using regular expressions – so we can check that it was the exact exception we expected.

I don’t need to write a test suite that directly invokes this test fixture’s tests. The unittest test runner will examine the test code, find the test fixtures and test methods, and build the suite out of all the tests it finds. This saves me a fair amount of coding.

I can run the sqrt tests from the command line:

C:\Users\User\PycharmProjects\pymaths\test>python -m unittest sqrt_test.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.002s

OK

If any tests fail, unittest will tell me which tests failed and provide helpful diagnostic information.

C:\Users\User\PycharmProjects\pymaths\test>python -m unittest sqrt_test.py
F...F
======================================================================
FAIL: test_sqrt_0 (sqrt_test.SqrtTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pymaths\test\sqrt_test.py", line 8, in test_sqrt_0
    self.assertEqual(1.0, sqrt(0.0))
AssertionError: 1.0 != 0

======================================================================
FAIL: test_sqrt_minus1 (sqrt_test.SqrtTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pymaths\test\sqrt_test.py", line 13, in test_sqrt_minus1
    lambda: sqrt(1))
AssertionError: Exception not raised by <lambda>

----------------------------------------------------------------------
Ran 5 tests in 0.002s

FAILED (failures=2)

I can run all of the tests in my project folder at the command line using unittest’s test discovery feature.

C:\Users\User\PycharmProjects\pymaths\test>python -m unittest discover -p "*_test.py"
................
----------------------------------------------------------------------
Ran 16 tests in 0.004s

OK

The test runner finds all tests in files matching ‘*_test.py’ in the current folder and runs them for me. Easy as peas!

You may have noticed that my tests are in a subfolder C:\Users\User\PycharmProjects\pymaths\test, too. It’s a very good idea to keep your test code separate from the code it’s testing, so you can easily see which is which.

Note how each test method has a meaningful name that identifies the test case, just like the test names in my hand-rolled unit tests before.

Note also that each test only asks one question – is the sqrt of 4 equal to 2? Is the factorial of 5 equal to 120? And so on. When a test fails, it can only really be for one reason, which makes debugging much, much easier.

When I’m programming, I put in significant effort to make sure that as much of my code as possible is tested by automated unit tests. And, yes, this means I may well end up writing as much unit test code as solution code – if not more.

A common objection inexperienced programmers have to unit testing is that they have to write twice as much code. Surely this takes twice as long? Surely we could add twice as many features if we didn’t waste time writing unit test code?

Well, here’s the funny thing: as our program grows, we tend to find – if we rely on slow manual testing to catch the bugs we’ve introduced – that the proportion of the time we spend fixing bugs grows too. Teams who do testing the hard way often end up spending most of their time bug fixing.

Because bugs can cost exponentially more to fix the longer they go undiscovered, we find that the effort we put in up-front to write fast tests that will catch them more than pays for itself later on in time saved.

Sure, if the program you’re writing is only ever going to be 100 lines long, extensive unit tests might be a waste (although I would still write a few, as I’ve found that even on relatively simple programs some unit testing has saved me time). But most programs are much larger, and therefore unit tests are a good idea most of the time. You wouldn’t fit a smoke alarm in a tiny Lego house, but in a real house that people live in, you might be very grateful for one.

One final thought about unit tests. Consider this code that calculates rental prices of movies based on their IMDb ratings:
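
A sketch of the kind of code being described – the class and method names, the web service URL, the base rental price of £3.95 and the rating thresholds are all assumptions:

import json
import urllib.request

class Pricer:
    def price(self, imdb_id):
        # Fetch the video's details from an external web service using its IMDb ID
        url = "https://videoinfo.example.com/videos/" + imdb_id
        with urllib.request.urlopen(url) as response:
            video_info = json.loads(response.read())

        # Start from a standard rental price, then adjust it by IMDb rating
        price = 3.95
        if video_info["imdb_rating"] >= 8.0:
            price += 1.0   # £1 premium for a highly rated video
        elif video_info["imdb_rating"] <= 4.0:
            price -= 1.0   # £1 off for a poorly rated video
        return price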

This code fetches information about a video, using its IMDb ID, from a web service. Using that information, it decides whether to charge a premium of £1 because the video has a high IMDb rating or knock off £1 because the video has a low IMDb rating.

If we wrote a unittest test for this, then when it ran, our code would connect to an external web service to fetch information about the video we’re pricing. Connecting to web services is slow compared to things that happen entirely in memory. But we want our unit tests to run as fast as possible.

How could we test that prices are calculated correctly without connecting to this external service?

Our pricing logic requires movie information that comes from someone else’s software. Could we fake that somehow, so a rating is available for us to test with?

What if, instead of the price method connecting directly to the web service itself, we were to provide it with an object that fetches video information for it? i.e., what if we made fetching video information somebody else’s problem? The object is passed in as a parameter of Pricer‘s constructor like this.
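
Something along these lines – again only a sketch, keeping the same assumed prices and thresholds:

class Pricer:
    def __init__(self, videoInfo):
        # videoInfo can be any object with a fetch_video_info(imdb_id) method
        self.videoInfo = videoInfo

    def price(self, imdb_id):
        # Fetching the video's details is now somebody else's problem
        title, rating = self.videoInfo.fetch_video_info(imdb_id)

        price = 3.95
        if rating >= 8.0:
            price += 1.0   # £1 premium for a highly rated video
        elif rating <= 4.0:
            price -= 1.0   # £1 off for a poorly rated video
        return price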

Because videoInfo is passed as a constructor parameter, Pricer only knows what that object looks like from the outside. It knows it has to have a fetch_video_info method that accepts an IMDb ID as a parameter and returns the title and IMDb rating of that video.

Thanks to Python’s duck typing – if it walks like a duck and quacks like a duck etc – any object that has a matching method should work inside Pricer, including one that doesn’t actually connect to the web service.

We could write a class that provides whatever title and IMDb rating we tell it to, and use that in a unit test for Pricer.
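
For example – a sketch that assumes Pricer lives in a module called pricer, and uses made-up titles, ratings and IMDb IDs:

import unittest
from pricer import Pricer

class VideoInfoStub:
    def __init__(self, title, rating):
        self.title = title
        self.rating = rating

    def fetch_video_info(self, imdb_id):
        # Looks like the real video information object from the outside,
        # but returns canned data instead of calling the web service
        return self.title, self.rating

class PricerTest(unittest.TestCase):
    def test_highly_rated_video_is_charged_a_premium(self):
        pricer = Pricer(VideoInfoStub("A Great Film", 8.9))
        # assertAlmostEqual avoids any floating-point rounding surprises
        self.assertAlmostEqual(4.95, pricer.price("tt0000001"))

    def test_poorly_rated_video_is_discounted(self):
        pricer = Pricer(VideoInfoStub("A Terrible Film", 2.5))
        self.assertAlmostEqual(2.95, pricer.price("tt0000002"))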

When I run this test, it checks the pricing logic just as thoroughly as if we’d fetched the video information from the real web service. How video titles and ratings are obtained has nothing to do with how rental prices are calculated. We achieved flexibility in our design by cleanly separating those concerns. (Separation of Concerns is fancy software architecture-speak for “make it someone else’s problem”.)

The object that fetches video information is passed in to the Pricer. We call this dependency injection. Pricer depends on VideoInfo, but because the dependency is passed in as a parameter from the outside, the calling code can decide which implementation to use – the stub, or the real thing.

A stub is a kind of what we call a test double. It’s an object that looks like the real thing from the outside, but has a different implementation inside. The job of a stub is to provide test data that would normally come from some external source – like video titles and IMDb ratings.

Test doubles require us to introduce flexibility into our code, so that objects (or functions) can use each other without knowing exactly which implementation they’re using – just as long as they look the same as the real thing from the outside. This not only helps us to write fast-running unit tests, but is good design generally. What if we need to fetch video information from a different web service? Because we provide video information by dependency injection, we can easily swap in a different web service with no need to rewrite Pricer.

This is what we really mean by ‘separation of concerns’ – we can change one part of the program without having to change any of the other parts. This can make changing code much, much easier.

Let’s look at one final example that involves an external dependency. Consider this code that totals the number of copies of a song sold on a digital download service, then sends that total to a web service that compiles song charts at the end of each day.
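
A sketch of the kind of code being described – the class names, the shape of the sales data and the web service details are assumptions:

import json
import urllib.request

class Charts:
    def send(self, song, sales):
        # Posts the day's sales total for a song to the external charts web service
        data = json.dumps({"song": song, "sales": sales}).encode("utf-8")
        request = urllib.request.Request(
            "https://charts.example.com/sales",
            data=data,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)

class SalesReporter:
    def __init__(self, charts):
        self.charts = charts

    def sales_of(self, song, downloads):
        # downloads is the day's list of downloaded song titles;
        # each matching download counts as one copy sold
        total = sum(1 for downloaded in downloads if downloaded == song)
        self.charts.send(song, total)
        return total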

How can we unit test that song sales are calculated correctly without connecting to the external web service? Again, the trick here is to separate those two concerns – to make sending sales information to the charts somebody else’s problem.

Before we write a unit test for this, notice how this situation is different to the video pricing example. Here, our charts object doesn’t return any data. So we can’t use a stub in this case.

When we want to swap in a test double for an object that’s going to be used, but doesn’t return any data that we need to worry about, we can choose from two other kinds of test double.

A dummy is an object that looks like the real thing from the outside, but does nothing inside.
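
A sketch of what that might look like here, consistent with the earlier assumptions – DummyCharts stands in for the real Charts:

import unittest
from sales import SalesReporter   # hypothetical module containing the code sketched earlier

class DummyCharts:
    def send(self, song, sales):
        # Looks like the real Charts from the outside, but does nothing inside
        pass

class SalesReporterTest(unittest.TestCase):
    def test_total_counts_every_download_of_the_song(self):
        reporter = SalesReporter(DummyCharts())
        downloads = ["My Song", "Another Song", "My Song", "My Song"]
        self.assertEqual(3, reporter.sales_of("My Song", downloads))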

In this test, we don’t care if the sales total for the song is sent to the charts. It’s all about calculating that total.

But what if we do care if the total is sent to the charts once it’s been calculated? How could we write a test that will fail if charts.send isn’t invoked?

A mock object is a test double that remembers when its methods are called so we can test that call happened. Using the built-in features of the unittest.mock library, we can create a mock charts object and verify that send is invoked with the exact parameter values we want.
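
A sketch of such a test, again consistent with the earlier assumptions:

import unittest
from unittest.mock import MagicMock
from sales import Charts, SalesReporter   # hypothetical module containing the code sketched earlier

class SalesAreSentToChartsTest(unittest.TestCase):
    def test_sales_total_is_sent_to_the_charts(self):
        charts = Charts()
        # Replace the real send method with a MagicMock that records calls to it
        charts.send = MagicMock()

        reporter = SalesReporter(charts)
        reporter.sales_of("My Song", ["My Song", "Another Song", "My Song", "My Song"])

        # Fails unless send was called with exactly this song and sales total
        charts.send.assert_called_with("My Song", 3)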

In this test, we create an instance of the real Charts class that connects to the web service, but we replace its send method with a MagicMock that records when it’s invoked. We can then assert at the end that when sales_of is executed, charts.send is called with the correct song and sales total.

 

So there you have it. Unit tests – tests that test part of our program, and execute without connecting to any external resources like web services, file systems, databases and so on – are fast-running tests that allow us to test and re-test our program very frequently, ensuring as much as possible that our code’s always working.

As you’ll see in later posts, good, fast-running unit tests are an essential foundation of code craft, enabling many of the techniques we’ll be covering next.