Srikanth Sastry A Techie in Boston

In unit tests, I favor Detroit over London

Recall the two schools of thought around unit tests: Detroit and London. Briefly, the Detroit school considers a ‘unit’ of software to be tested as a ‘behavior’ that consists of one or more classes, and its unit tests replace only shared and/or external dependencies with test doubles. In contrast, the London school considers a ‘unit’ to be a single class, and replaces all dependencies with test doubles.

School  | Unit     | Isolation                                                                     | Speed
Detroit | Behavior | Replace shared and external dependencies with test doubles                    | ‘fast’
London  | Class    | Replace all dependencies (internal, external, shared, etc.) with test doubles | ‘fast’

See this note for a more detailed discussion on the two schools.

Each school has its proponents, and each has its advantages. Personally, I prefer the Detroit school over the London school. I have noticed that following the Detroit school has made my test suites more accurate and complete.

Improved Accuracy (when refactoring)

In the post on attributes of a unit test suite, I defined accuracy as the measure of how likely it is that a test failure denotes a bug in your diff. I have noticed that unit test suites that follow the Detroit school tend to have high accuracy when your codebase has a lot of classes that are public de jure, but private de facto.

Codebases I have worked in typically have hundreds of classes, but only a handful of those classes are actually referenced by external classes/services. Most of the classes are part of a private API that is internal to the service. Let’s take a concrete illustration. Say there is a class Util that is used only by classes Feature1 and Feature2 within the codebase, and has no other callers; in fact, Util exists only to help classes Feature1 and Feature2 implement their respective user journeys. Here, although Util is a class with public methods, in reality it represents the common implementation details of Feature1 and Feature2.

In London

According to the London school, all unit tests for Feature1 and Feature2 should replace Util with a test double. Thus, tests for Feature1 and Feature2 look as follows.
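A minimal sketch of such a London-style test, in Python with the standard library's unittest.mock. The method names (normalize, greet) and the bodies of Util and Feature1 are hypothetical, invented only for illustration:

```python
from unittest.mock import Mock


class Util:
    """Hypothetical helper used only by Feature1 and Feature2."""

    def normalize(self, text: str) -> str:
        return text.strip().lower()


class Feature1:
    def __init__(self, util: Util) -> None:
        self._util = util

    def greet(self, name: str) -> str:
        return "hello, " + self._util.normalize(name)


def test_feature1_greet_london_style() -> None:
    # London school: Util is replaced with a test double, so the test
    # hard-codes Util's API (method name, argument, and return value).
    util_double = Mock(spec=Util)
    util_double.normalize.return_value = "ada"

    assert Feature1(util_double).greet("  Ada ") == "hello, ada"
    util_double.normalize.assert_called_once_with("  Ada ")
```

Note that the test knows about Util.normalize. Any change to that method's name or signature forces a matching change to the test double, even if Feature1's behavior stays the same.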

Now, say we want to do some refactoring that spans Feature1, Feature2, and Util. Since Util really has a private API with Feature1 and Feature2, we can change the API of Util in concert with Feature1 and Feature2 in a single diff. Since the tests for Feature1 and Feature2 use test doubles for Util, and we have changed Util’s API, we need to change the test doubles’ implementation to match the new API. After making these changes, say the tests for Util pass, but the tests for Feature1 fail.

Now, does the test failure denote a bug in our refactoring, or does it denote an error in how we modified the tests? This is not easy to determine except by stepping through the tests manually. Thus, the test suite does not have high accuracy.

In Detroit

In contrast, according to the Detroit school, the unit tests for Feature1 and Feature2 can use Util as is (without test doubles). The tests for Feature1 and Feature2 look as follows.
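A minimal sketch of Detroit-style tests, in Python. As before, the method names and class bodies are hypothetical and exist only to make the example runnable:

```python
class Util:
    """Hypothetical helper used only by Feature1 and Feature2."""

    def normalize(self, text: str) -> str:
        return text.strip().lower()


class Feature1:
    def __init__(self) -> None:
        self._util = Util()  # private, non-shared dependency

    def greet(self, name: str) -> str:
        return "hello, " + self._util.normalize(name)


class Feature2:
    def __init__(self) -> None:
        self._util = Util()  # private, non-shared dependency

    def farewell(self, name: str) -> str:
        return "goodbye, " + self._util.normalize(name)


def test_feature1_greet() -> None:
    # Detroit school: the real Util runs; the test only states behavior.
    assert Feature1().greet("  Ada ") == "hello, ada"


def test_feature2_farewell() -> None:
    assert Feature2().farewell(" GRACE") == "goodbye, grace"
```

Neither test mentions Util at all, so Util's API can change freely as long as the observable behavior of Feature1 and Feature2 is preserved.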

If we do the same refactoring across Feature1, Feature2, and Util classes, note that we do not need to make any changes to the tests for Feature1 and Feature2. If the tests fail, then we have a very high signal that the refactoring has a bug in it; this makes for a high accuracy test suite!

Furthermore, since Util exists only to serve Feature1 and Feature2, you can argue that Util doesn’t even need any unit tests of its own; the tests for Feature1 and Feature2 cover the spread!

Improved Completeness (around regressions)

In the post on attributes of a unit test suite, I defined completeness as the measure of how likely a bug introduced by your diff is caught by your test suite. I have seen unit tests following the Detroit school catching bugs/regressions more easily, especially when the bugs are introduced by API contract violations.

It is easier to see this with an example. Say there is a class Outer that uses a class Inner, and Inner is an internal non-shared dependency. Let’s say that, for correctness, the class Outer depends on a specific contract (let’s call it alpha) that Inner’s API satisfies. Recall that, in practice, we trade off between the speed of a test suite and its completeness; let us posit that the incompleteness here is that we do not have a test for Inner satisfying contract alpha.

In London

Following the London school, the tests for Outer replace the instance of Inner with a test double, and since the test double is a replacement for Inner, it also satisfies contract alpha. See the illustration below for clarity.

[Illustration not available: London-School-Completeness-Before.png]

Now, let’s assume that we have a diff that ‘refactors’ Inner, but in that process, it introduces a bug that violates contract alpha. Since we have assumed an incompleteness in our test suite around contract alpha, the unit test for Inner does not catch this regression. Also, since the tests for Outer use a test double for Inner (which satisfies contract alpha), those tests do not detect this regression either.
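To make this concrete, here is a small Python sketch. Contract alpha is assumed, purely for illustration, to be "Inner.ids() returns ids in ascending order"; the classes and method names are hypothetical:

```python
from typing import List
from unittest.mock import Mock


class Inner:
    def ids(self) -> List[int]:
        # Contract alpha (hypothetical): ids come back in ascending order.
        # This is the state AFTER the buggy 'refactoring': the sort was lost.
        return [3, 1, 2]


class Outer:
    def __init__(self, inner: Inner) -> None:
        self._inner = inner

    def smallest_id(self) -> int:
        # Relies on contract alpha: the first id is the smallest.
        return self._inner.ids()[0]


def test_outer_smallest_id_london_style() -> None:
    inner_double = Mock(spec=Inner)
    inner_double.ids.return_value = [1, 2, 3]  # the double still honors alpha
    assert Outer(inner_double).smallest_id() == 1
```

The test passes even though Outer(Inner()).smallest_id() now returns the wrong id in production, because the double keeps satisfying a contract that the real Inner no longer does.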

In Detroit

If we were to follow the Detroit school instead, then the unit tests for Outer instantiate and use Inner when testing the correctness of Outer, as shown below. Note that the test incompleteness w.r.t. contract alpha still exists.

Here, like before, assume that we have a diff that ‘refactors’ Inner and breaks contract alpha. This time around, although the test suite for Inner does not catch the regression, the test suite for Outer will catch the regression. Why? Because the correctness of Outer depends on Inner satisfying contract alpha. When that contract is violated, Outer fails to satisfy correctness, and therefore its unit tests fail.
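A small Python sketch of this scenario. Contract alpha is assumed, for illustration only, to be "Inner.ids() returns ids in ascending order"; BuggyInner is a hypothetical stand-in for Inner after the breaking diff:

```python
from typing import Iterable, List, Type


class Inner:
    def __init__(self, ids: Iterable[int]) -> None:
        self._ids = list(ids)

    def ids(self) -> List[int]:
        # Contract alpha (hypothetical): ids come back in ascending order.
        return sorted(self._ids)


class BuggyInner(Inner):
    """Stands in for Inner after a 'refactoring' that breaks alpha."""

    def ids(self) -> List[int]:
        return list(self._ids)  # the sort was lost


class Outer:
    def __init__(self, inner: Inner) -> None:
        self._inner = inner

    def smallest_id(self) -> int:
        # Relies on contract alpha: the first id is the smallest.
        return self._inner.ids()[0]


def check_outer_smallest_id(inner_cls: Type[Inner]) -> None:
    # Detroit school: Outer is exercised together with a real Inner.
    outer = Outer(inner_cls([3, 1, 2]))
    assert outer.smallest_id() == 1
```

Here check_outer_smallest_id(Inner) passes, while check_outer_smallest_id(BuggyInner) raises an AssertionError, even though no test targets contract alpha directly.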

In effect, even though we did not have an explicit test for contract alpha, the unit tests written according to the Detroit school tend to have better completeness than the ones written following the London school.

Defining unit tests: two schools of thought

Definitions: What is a unit test?

There are several definitions of unit tests. Roy Osherove defines a unit test as a “piece of code that invokes a unit of work in the system and then checks a single assumption about the behavior of that unit of work”; Kent Beck turns the idea of defining unit tests on its head by simply stating a list of properties, and any code that satisfies those properties is a “unit test”.

I like Vladimir Khorikov’s definition of a unit test in his book Unit Testing Principles, Practices, and Patterns. According to him, a unit test is a piece of code that (1) verifies a unit of software, (2) in isolation, and (3) quickly. The above definition only balkanizes a unit test into three undefined terms: (1) unit of software, (2) isolation, and (3) quick/fast/speed. Of the three, the third one is the easiest to understand intuitively. Being fast simply means that you should be able to run the test and get the results quickly enough to enable interactive iteration on the unit of software you are changing. However, the other two terms, unit of software and isolation, merit more discussion.

Are you from Detroit, or London?

In fact, there are two schools of thought around how the above two terms should be defined: the ‘original/classic/Detroit’ school, and the ‘mockist/London’ school. Not surprisingly, the school of thought you subscribe to has a significant impact on how you write unit tests. For a more detailed treatment of the two schools of thought, I suggest Martin Fowler’s excellent article on the subject of Mocks and Stubs. Chapter 2 of Khorikov’s book Unit Testing Principles, Practices, and Patterns also has some good insights into it. I have distilled their contents as they pertain to unit test definitions.

The Detroit School

The Classical or Detroit school of thought originated with Kent Beck’s “Test Driven Development”.

Unit of software. According to this school, the unit of software to test is a “behavior”. This behavior could be implemented in a single class, or a collection of classes. The important property here is that the code that comprises the unit must be (1) internal to the software, (2) connected with each other in the dependency tree, and (3) not shared by any other part of the software.

Thus, a unit of software cannot include external entities such as databases, log servers, file systems, etc. It also cannot include external (but local) libraries such as system time and timers. Importantly, it is ok to include a class that depends on another class via a private non-shared dependency.

Isolation. Given the above notion of a “unit” of software, isolation simply means that the test is not dependent on anything outside that unit of software. In practical terms, it means that a unit test needs to replace all external and shared dependencies with test doubles.
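As a rough Python sketch of Detroit-style isolation: the system clock is external, so it is replaced with a test double, while the private collaborator is used for real. Greeter and GreetingFormatter are hypothetical names made up for this example:

```python
from datetime import datetime, timezone
from typing import Callable


class GreetingFormatter:
    """Private, non-shared collaborator: part of the unit under test."""

    def format(self, part_of_day: str, name: str) -> str:
        return f"Good {part_of_day}, {name}!"


class Greeter:
    def __init__(self, now: Callable[[], datetime]) -> None:
        self._now = now  # system time is external, so it is injected
        self._formatter = GreetingFormatter()  # internal, used as is

    def greet(self, name: str) -> str:
        part = "morning" if self._now().hour < 12 else "afternoon"
        return self._formatter.format(part, name)


def test_greets_in_the_morning() -> None:
    # Only the external dependency (the clock) gets a test double.
    fixed_clock = lambda: datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    assert Greeter(fixed_clock).greet("Ada") == "Good morning, Ada!"
```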

The London School

The mockist or London school of thought was popularized by Steve Freeman and Nat Pryce in their book “Growing Object-Oriented Software, Guided by Tests”.

Unit of software. Given the heavy bias toward Object-Oriented software, unsurprisingly, the unit of software for a unit test is a single class (in some cases, it can be a single method). This is strictly so. Any other class that the ‘class under test’ depends on cannot be part of the unit being tested.

Isolation. What follows from the above notion of a “unit” is that everything that is not the class under test must be replaced by test doubles. If you are instantiating another class inside the class under test, then you must replace that instantiation with an injected instance or a factory that can be replaced with a test double in the tests.
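A minimal Python sketch of the factory approach, using the standard library's unittest.mock. ReportGenerator and ReportWriter are hypothetical classes invented for this example:

```python
from typing import Callable, Dict
from unittest.mock import Mock, call


class ReportWriter:
    def write(self, line: str) -> None:
        print(line)


class ReportGenerator:
    def __init__(self, writer_factory: Callable[[], ReportWriter] = ReportWriter) -> None:
        # Production code uses the default factory (the real class);
        # tests pass a factory that returns a test double instead.
        self._writer_factory = writer_factory

    def generate(self, totals: Dict[str, int]) -> None:
        writer = self._writer_factory()
        for name in sorted(totals):
            writer.write(f"{name}: {totals[name]}")


def test_generate_writes_one_line_per_entry() -> None:
    writer_double = Mock(spec=ReportWriter)
    generator = ReportGenerator(writer_factory=lambda: writer_double)

    generator.generate({"b": 2, "a": 1})

    writer_double.write.assert_has_calls([call("a: 1"), call("b: 2")])
    assert writer_double.write.call_count == 2
```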

Here is a quick summary of the definitions of a unit test under the two schools.

School  | Unit     | Isolation                                                                     | Speed
Detroit | Behavior | Replace shared and external dependencies with test doubles                    | ‘fast’
London  | Class    | Replace all dependencies (internal, external, shared, etc.) with test doubles | ‘fast’

What does this mean?

The school of thought you subscribe to can have a significant impact on your software design and testing. There is nothing I can say here that hasn’t already been explained by Martin Fowler in his article “Mocks aren’t stubs”. So, I highly recommend you read it for yourself.

Primary attributes of unit test suites and their tradeoffs

Unit test suites have three primary attributes.

  1. accuracy,
  2. completeness, and
  3. speed.

Accuracy says that if a test fails, then there is a bug. Completeness says that if there is a bug, then a unit test will fail. Speed says that tests will run ‘fast’. These three attributes are in opposition with each other, and you can only satisfy any two of the three attributes!

Before discussing these attributes, it is important to note that they are not properties of a test suite at rest, but rather of the test suite during changes. That is, these attributes are measured only when you are making changes to the code and running the test suite in response to those changes. Also, these attributes are not applicable to a single unit test. Instead, they apply to the test suite as a whole. Furthermore, the quality of your test suite is determined by how well the suite measures up along these attributes.

Attributes’ descriptions

Let’s describe each of these attributes, and then we can see how any unit test suite is forced to trade off these attributes.

  1. Accuracy. It is a measure of robustness of the test suite to changes in the production code. If you make a change to the production code without changing your unit tests, and your test suite has a failure, then how likely is it that your changes introduced a bug? Accuracy is a measure of this likelihood. High quality unit tests typically have very good accuracy. If your test suite has poor accuracy, then it suggests that either your tests are brittle, they are actually testing implementation details instead of functionality, or your production code is poorly designed with leaky abstractions. Inaccurate tests reduce your ability to detect regressions. They fail to provide early warning when a diff breaks existing functionality (because the developer cannot be sure that the test failure is a genuine bug, and not an artifact of test brittleness). As a result, developers are more likely to ignore test failures, or modify the tests to make them ‘pass’, and thus introduce bugs in their code.
  2. Completeness. This is a measure of how comprehensive the test suite really is. If you make a change to the production code without changing your unit tests, and you introduce a bug in an existing functionality, then how likely is it that your test suite will fail? Completeness is a measure of this likelihood. A lot of the test coverage metrics try to mimic the completeness of your test suite. However, we have seen how coverage metrics are often a poor proxy for completeness.
  3. Speed. This is simply a measure of how quickly a test suite runs. If tests are hermetic with the right use of test doubles, then each test runs pretty quickly. However, if the tests are of poor quality or the test suite is very large, then they can get pretty slow. It is most noticeable when you are iterating on a feature, and with each small change, you need to run a test suite that seems to take forever to complete. Slow tests can have a disproportionate impact on developer velocity. They make developers less likely to run tests eagerly, they increase the time between iterations, and they increase the CI/CD latency to where the gap between your code landing and the changes making it to prod can be unreasonably large. If this gets bad enough, it will discourage developers from running tests as needed, and thus allow bugs to creep in.

Attribute constraints and trade offs

There is a tension among these attributes, and it shapes how they contribute to overall unit test suite quality.

Among accuracy, completeness, and speed, you cannot maximize all three; that is, you cannot have a fast test suite that will fail if and only if there is a bug. Maximizing any two will minimize the third.

  • A perfect test suite with high accuracy and completeness will inevitably be huge, and thus very slow.
  • A fast test suite with high accuracy will often test only the most common user journeys, and thus be incomplete.
  • A test suite with very high coverage is often made ‘fast’ through extensive use of test doubles and ends up coupling tests with the implementation details, which makes the tests brittle, and therefore inaccurate.
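The last point can be illustrated with a small Python sketch. Checkout and TaxTable are hypothetical classes; the scenario assumes a behavior-preserving refactoring that looks up the tax rate once per order instead of once per item:

```python
from typing import List
from unittest.mock import Mock


class TaxTable:
    def rate_percent(self, region: str) -> int:
        return {"EU": 20}.get(region, 0)


class Checkout:
    def __init__(self, taxes: TaxTable) -> None:
        self._taxes = taxes

    def total_cents(self, prices_cents: List[int], region: str) -> int:
        # After the refactoring: one rate lookup per order, same result.
        rate = self._taxes.rate_percent(region)
        return sum(prices_cents) * (100 + rate) // 100


def check_total_behavior() -> None:
    # Asserts on the outcome only, so it survives the refactoring.
    assert Checkout(TaxTable()).total_cents([1000, 2000], "EU") == 3600


def check_total_coupled_to_old_implementation() -> None:
    taxes = Mock(spec=TaxTable)
    taxes.rate_percent.return_value = 20
    Checkout(taxes).total_cents([1000, 2000], "EU")
    # Written against the old per-item lookup; now fails without any bug.
    assert taxes.rate_percent.call_count == 2
```

The second check raises an AssertionError after the refactoring even though the total is still correct. That is a test failure that does not denote a bug, which is exactly what lowers accuracy.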

What’s the right trade off?

A natural follow up to the trade offs among accuracy, completeness, and speed is “What is the right trade off?”. It helps to notice that, empirically, we are always making this trade off and naturally settling on some point in the trade-off surface. What is this natural resting point for these trade offs? Let’s examine a few things to help us answer the above question.

  1. From experience, we know that bugs in software are inevitable, and we have learned to deal with them. While bug-free code might be the ideal, no one reasonably expects bug-free software, and we accept some level of incorrectness in our implementations.
  2. Flaky/brittle tests can have very significant negative consequences. Such tests are inherently untrustworthy, and therefore, serve no useful purpose. In the end, we tend to ignore such tests, and for all practical purposes they just don’t exist in our test suite.
  3. While extremely slow tests are an issue, we have figured out ways to improve test speeds through infrastructure developments. For instance, our CI/CD systems can run multiple tests in the test suite in parallel, and thus we are delayed only by the slowest tests in the test suite; we have figured out how to prune the affected tests in a diff by being smart about the build and test targets affected by the changes, and thus, we need not run the entire test suite for a small change; the machines that execute tests have just gotten faster, thus alleviating some of the latency issues, etc.

From the above three observations, we can reasonably conclude that we cannot sacrifice accuracy. Accurate tests are the bedrock of trustworthy (and therefore, useful) test suites. Once we maximize accuracy, that leaves us with completeness and speed. Here there is a sliding scale between completeness and speed, and we could potentially rest anywhere on this scale.

So, is it ok to rest anywhere on the tradeoff spectrum between completeness and speed? Not quite. If you dial completeness all the way up and ignore speed, then you end up with a test suite that no one wants to run, and that is therefore not useful at all. On the other hand, if you ignore completeness in favor of speed, then you are likely going to see a lot of regressions in your software and completely undermine consumer confidence in your product/service. In effect, the quality of your test suite is determined by the lowest score among the three attributes. Therefore, it is important to rest between completeness and speed, depending on the tolerance to errors and the minimum developer velocity you can sustain. For instance, if you are developing software for medical imaging, then your tolerance to errors is very, very low, and so you should be favoring completeness at the expense of speed (and this is evident in how long it takes to make changes to software in the area of medical sciences). On the other hand, if you are building a web service that can be rolled back to a safe state quickly and with minimal external damage, then you probably want to favor speed over completeness (but only to a point; remember that your test quality is now determined by the completeness, or the lack thereof).

Thus, in conclusion, always maximize accuracy, and trade off between completeness and speed, depending on your tolerance of failures in production.

The big WHY about unit tests

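When you ask “why do we need to write unit tests?”, you will get several answers, including finding bugs, protecting against regressions, documenting the code, improving software design, and improving API usability.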

These seem like a collection of very good reasons, but it seems inelegant to state that the common phenomenon of unit testing has such disparate causes. There must be a ‘higher’ cause for writing unit tests. I argue that this cause is “maintainability”.

Maintainability

Here is a potentially provocative statement: “The final cause of unit tests is software maintainability”. To put it differently, if your software was immutable and could not be altered in any way, then that software would not need any unit tests.

Given that almost all software is mutable, unit tests exist to ensure that we can mutate the software to improve upon its utility in a sustainable manner. All the aforementioned answers to the question “why do we write unit tests” are ultimately subsumed by the cause of maintainability.

  • Unit tests help you find bugs in your code, thus allowing safe mutations that add functionality.
  • Unit tests protect against regression, especially when refactoring, thus allowing safe mutation of the software in preparation for functional changes.
  • Unit tests act as de facto documentation. They allow developers who change the code to communicate across time and space on how best to use existing code for mutating other code.
  • Unit tests help improve software design. If some code/class is difficult to unit test, then the software design is poor. So, you iterate until unit testing becomes easier.
  • Unit tests help improve the usability of your API. Unit tests are the first customers of your API. If unit tests using your API are inelegant, then you iterate towards more usable APIs. A more usable API is often a more used API, and thus, aids software evolution.

Interestingly, looking at maintainability as the primary motivation for unit tests allows us to look at some aspects of unit tests differently.

Looking at unit tests differently

Unit tests incur a maintenance cost.

If code incurs a maintenance cost, and unit tests help reduce that cost, then you can naturally ask the following: since unit tests are also code, do they not incur a maintenance cost?

Obviously the answer to the question above is an unequivocal “yes!”. Thus, unit tests are only useful if the savings they provide as a buttress for production code exceed the cost of maintaining them. This observation has significant implications for how to design and write unit tests. For instance, unit tests must be simple straight line code that is human readable, even at the expense of performance and redundancy. See the post on DRY unit tests for a more detailed treatment on this topic.

Unit tests can have diminishing returns.

If unit tests incur a maintenance cost, then their utility is the difference between the maintainability they provide and the cost they incur. Since software is a living/evolving entity, this utility changes over time. Consequently, if you are not careful with your tests, they could become the proverbial albatross around your neck. It is therefore important to tend to your unit test suite and pay attention when the utility of a test starts to diminish. Importantly, refactor your tests to ensure that you do not hit the point of diminishing, or even negative, returns on your unit tests.

Unit tests should be cognitively simple.

An almost necessary way to reduce the maintenance cost of a unit test is to make it very simple to read and understand. This helps with maintenance in two ways. First, it makes it easy to understand the intent of the test and the coverage that the test provides. Second, it makes it easy to modify the test (if needed) without having to worry about any unintended consequences such modifications might have. A degenerate case is that of tests that have hit the point of diminishing returns: the simpler a test is, the easier it is to refactor and/or delete it. See the post on DRY unit tests for more details.
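As a rough illustration of what ‘cognitively simple’ can look like, here is a Python sketch with straight-line tests that state their inputs and expected outputs literally. The function shipping_fee_cents and its fee schedule are hypothetical:

```python
def shipping_fee_cents(weight_kg: int) -> int:
    """Hypothetical function under test."""
    if weight_kg <= 1:
        return 500
    if weight_kg <= 5:
        return 900
    return 1500


# Each test is a single readable statement: no loops, helpers, or
# computed expectations. The repetition is deliberate.
def test_parcel_up_to_1kg_costs_500_cents() -> None:
    assert shipping_fee_cents(1) == 500


def test_parcel_up_to_5kg_costs_900_cents() -> None:
    assert shipping_fee_cents(5) == 900


def test_parcel_over_5kg_costs_1500_cents() -> None:
    assert shipping_fee_cents(6) == 1500
```

Any one of these tests can be changed or deleted without touching the others, which keeps the cost of maintaining them low.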

A bad unit test is worse than no unit test.

If unit tests incur a maintenance cost, then a bad unit test has all the costs associated with unit tests and none of the benefits. It is a net loss. Your code base is much better off without that unit test. In fact, a bad unit test can have an even higher cost if it sends developers on a wild goose chase looking for bugs when such unit tests fail. So, unless a unit test is of high quality, don’t bother with it. Just delete it.

A flaky unit test is the worst.

This is a corollary of the previous observation, but deserves some explanation. Flaky tests have the side effect of undermining the trust in the entire test suite. If a test is flaky, then developers are more likely to ignore red builds, because ‘that flaky test is the culprit, and so the failure can be ignored’. However, inevitably, some legitimate failure does occur. But, at this point, developers have been conditioned to ignore build/test failures. Consequently, a buggy commit makes its way to prod and causes a regression, which would never have happened if you didn’t have that flaky test.

Unit test the brains and not the nerves

Note: This is inspired by the book “Unit Testing: Principles, Practices, and Patterns” by Vladimir Khorikov.

Unit tests are typically your first line of defense against bugs. So, it is tempting to add unit tests for all functionality that your code supports. But that raises the following question: “Why do we need integration and end-to-end tests?”

Categorizing production code

To better understand the primary motivations for unit tests vs. integration (and end-to-end) tests, it is helpful to categorize your production code into four categories along two dimensions: thinking, and talking.

  • Thinking code. There are parts of your codebase that are focused mostly on the business logic and the complex algorithmic computations. I refer to these as the thinking code.
  • Talking code. There are parts of your codebase that are focused mostly on communicating with other dependencies such as key-value stores, log servers, databases, etc. I refer to these as talking code.

Each part of your codebase can be either thinking, talking, or both. Based on that observation, we can categorize each unit of code into one of four categories (in keeping with the biology theme).

Thinking | Talking | Category
Yes      | No      | Brain
No       | Yes     | Nerves
Yes      | Yes     | Ganglia
No       | No      | Synapse

Testing for each category

Each category needs a distinct approach to testing.

Brains → Unit Tests

Brains are among the most complex parts of your codebase, and they often require the most technical skill and domain knowledge to author, read, and maintain. Consequently, they are best tested with unit tests. Furthermore, they also have very few direct external dependencies, and as a result require limited use of test doubles.

Nerves → Integration Tests

Nerves have very little logic, and focus mostly on external communication with dependencies. As a result, there isn’t much to unit test here, except perhaps that the protocol translation from the outside world into the brains is happening correctly. By their very nature, the correctness of nerves cannot be tested hermetically, and therefore nerves are not at all well suited to unit testing. Nerves should really be tested in your integration tests, where you hook your production code up to real test instances of external dependencies.

Ganglia → Refactor

Ganglia are units of code that have both complex business logic and significant external dependencies. It is very difficult to unit test them thoroughly because such unit tests require heavy use of test doubles, which can make the tests less readable and more brittle. You could try to test ganglia through integration tests, but it becomes very challenging to test low probability code paths, which are usually the source of difficult-to-debug issues. Therefore, my suggestion is to refactor such code into smaller pieces of code, each of which is either a brain or a nerve, and test each of those as described above.

See Chapter 7 of “Unit Testing: Principles, Practices, and Patterns” for suggestions on how to refactor your code to make it more testable.
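One possible shape for such a refactoring, sketched in Python. The discount rule, the OrderStore interface, and all names here are hypothetical; this is not taken from the book:

```python
from typing import Protocol


class OrderStore(Protocol):
    """Hypothetical external dependency (e.g. a database client)."""

    def load_total_cents(self, order_id: str) -> int: ...

    def save_discount_cents(self, order_id: str, discount: int) -> None: ...


# Brain: pure business logic with no external dependencies.
def discount_cents(total_cents: int) -> int:
    if total_cents >= 10_000:
        return total_cents // 10  # 10% off large orders
    return 0


# Nerve: only moves data between the store and the brain.
def apply_discount(store: OrderStore, order_id: str) -> None:
    total = store.load_total_cents(order_id)
    store.save_discount_cents(order_id, discount_cents(total))


def test_discount_cents() -> None:
    # The brain is unit tested without any test doubles.
    assert discount_cents(9_999) == 0
    assert discount_cents(10_000) == 1_000
```

The nerve, apply_discount, contains no branching of its own, so it can be left to integration tests that run against a real test instance of the store.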

Synapse → Ignore

Synapses are trivial pieces of code (often utilities) that have neither complex business logic, nor do they have any external dependencies. My recommendation is to simply not focus on testing them. Adding unit tests for them simply increases the cost of testing and maintenance without really providing any benefit. They are often simple enough to be verified visually, and they exist only to serve either the brains or the nerves, and so will be indirectly tested via unit tests or integration tests.