Srikanth Sastry A Techie in Boston

Defining unit tests: two schools of thought

Definitions: What is a unit test?

There are several definitions for unit tests. Roy Osherove defines it as “piece of code that invokes a unit of work in the system and then checks a single assumption about the behavior of that unit of work”; Kent Beck turns the idea of defining unit tests on it’s head by simply stating a list of properties, and any code that satisfies those properties in a “unit test”.

I like Vladimir Khorikov’s definition of a unit test in his book Unit Testing Principles, Practices, and Patterns. According to him, a unit test is a piece of code that (1) verifies a unit of software, (2) in isolation, and (3) quickly. The above definition only balkanizes a unit test into three undefined terms: (1) unit of software, (2) isolation, and (3) quick/fast/speed. Of the three, the third one is the easiest to understand intuitively. Being fast simply means that you should be able to run the test in real time and get the results quickly enough to enable interactive iteration of modifying the unit of software you are changing. However, the other two terms: unit of software, and isolation merit more discussion.

Are you from Detroit, or London?

In fact, there are two schools of thought around how the above two terms should be defined. The ‘original/classic/Detroit’ school, and the ‘mockist/London’ school. Not surprisingly, the school of thought you subscribe to has a significant impact on how you write unit tests. For a more detailed treatment of the two schools of thought, I suggest Martin Folwer’s excellent article on the subject of Mocks and Stubs. Chapter 2 of Khorikov’s book Unit Testing Principles, Practices, and Patterns also has some good insights into it. I have distilled their contents as it pertains to unit test definitions.

The Detroit School

The Classical or Detroit school of thought originated with Kent Beck’s “Test Driven Development”.

Unit of software. According to this school, the unit of software to test is a “behavior”. This behavior could be implemented in a single class, or a collection of classes. The important property here is that the the code that comprises the unit must be (1) internal to the software, (2) connected with each other in the dependency tree, and (3) not shared by another other part of the software.

Thus, a unit of software cannot include external entities such as databases, log servers, file systems etc. They also cannot include external (but local) libraries such as system time and timers. Importantly, it is ok to include a class that depends on another class via a private non-shared dependency.

Isolation. Given the above notion of a “unit” of software, isolation simply means that the test is not dependent on anything outside that unit of software. In practical terms, it means that a unit test needs to replace all external and shared dependencies with test doubles.

The London School

The mockist or London school of thought was popularized by Steve Freeman (twitter) and Nat Pryce in their book “Growing Object- Oriented Software, Guided by Tests”.

Unit of Software. Given the heavy bias Object-Oriented software, unsurprisingly, the unit of software for a unit test is a single class (in some cases, it can be a single method). This is strictly so. ANy other class that this the ‘class under test’ depends on cannot be part of the unit being tested.

Isolation. What follows from the above notion of a “unit” is that everything that is not the class under test must be replaced by test doubles. If you are instantiating another class inside the class under test, then you must replace that instantiation with an injected instance or a factory that can be replaced with a test double in the tests.

Here is a quick summary of the definitions of a unit tests under the two schools.

School Unit Isolation Speed
Detroit Behavior Replace shared and external dependencies with test doubles ‘fast’
London Class Replace all dependencies (internal, external, shared, etc.) with test doubles ‘fast’

What does this mean?

The school of thought you subscribe to can have a significant impact on your software design and testing. There is nothing I can say here that hasn’t already been explained by Martin Fowler in his article “Mocks aren’t stubs”. So, I highly recommend you read it for yourself.

Primary attributes of unit test suites and their tradeoffs

Unit test suites have three primary attributes.

  1. accuracy,
  2. completeness, and
  3. speed.

Accuracy says that if a test fails, then there is a bug. Completeness says that if there is a bug, then a unit test will fail. Speed says that tests will run ‘fast’. These three attributes are in opposition with each other, and you can only satisfy any two of the three attributes!

Before discussing these attributes, it is important to note that they are not properties of test suite at rest, but rather, of the test suite during changes. That is, these attributes are measured only when you are making changes to the code and running the test suite in response to those changes. Also, these attributes are not applicable to a single unit test. Instead, they apply to the test suite as a whole. Furthermore, the quality of your test suite is determined by how well the suite measures up along these attributes.

Attributes’ descriptions

Let’s describe each of these attributes, and then we can see any unit test suite is forced to trade off these attributes.

  1. Accuracy. It is a measure of robustness of the test suite to changes in the production code. If you make a change to the production code without changing your unit tests, and your test suite has a failure, then how likely is it that your changes introduced a bug? Accuracy is a measure of this likelihood. High quality unit tests typically have very good accuracy. If your test suite has poor accuracy, then it suggests that either your tests are brittle, they are actually testing implementation details instead of functionality, or your production code is poorly designed with leaky abstractions. Inaccurate tests reduce your ability to detect regressions. They fail to provide early warning when a diff breaks existing functionality (because the developer cannot be sure that the test failure is a genuine bug, and not an artifact of test brittleness). As a result, developers are more likely to ignore test failure, or modify the tests to make it ‘pass’, and thus introduce bugs in their code.
  2. Completeness. This is a measure of how comprehensive the test suite really is. If you make a change to the production code without changing your unit tests, and you introduce a bug in an existing functionality, then how likely is it that your test suite will fail? Completeness is a measure of this likelihood. A lot of the test coverage metrics try to mimic the completeness of your test suite. However, we have seen how coverage metrics are often a poor proxy for completeness.
  3. Speed. This is simply a measure of how quickly a test suite runs. If tests are hermetic with the right use of test doubles, then each test runs pretty quickly. However, if the tests are of poor quality or the test suite is very large, then they can get pretty slow. It is most noticeable when you are iterating on a feature, and with each small change, you need to run the test suite that seems to take forever to complete. Slow tests can have a disproportionate impact on developer velocity. It will make developer less likely to run tests eagerly, it increases the time between iterations, and it increases the CI/CD latency to where the gap between your code landing and the changes making it to prod can be unreasonably large. If this gets bad enough, it will discourage developers from running tests as needed, and thus allow bugs to creep in.

Attribute constraints and trade offs

There is a tension among attributes, and how these attributes contribute to overall unit test suite quality.

Among accuracy, completeness, and speed, you cannot maximize all three; that is, you cannot have a fast test suite that will fail if and only if there is a bug. Maximizing any two will minimize the third.

  • A prefect test suite with high accuracy and completeness will inevitably be huge, and thus very slow.
  • A fast test suite with high accuracy will often only test only the most common user journeys, and thus be incomplete.
  • A test suite with very high coverage is often made ‘fast’ through extensive use of test doubles and ends up coupling tests with the implementation details, which makes the tests brittle, and therefore inaccurate.

What’s the right trade off?

Image not found: /images/balance-scale.jpg

A natural follow up to the trade offs among accuracy, completeness, and speed is “What is the right trade off?”. It helps to notice that, empirically, we are always making this trade off and naturally settling on some point in the trade-off surface. What is this natural resting point for these trade offs? Let’s examine a few things to help us answer the above question.

  1. From experience, we know that bugs in software are inevitable, and we have learned to deal with it. While bug-free code might be the ideal, no one reasonably expects bug-free software, and we accept some level of incorrectness in our implementations.
  2. Flaky/brittle tests can have very significant negative consequences. Such tests are inherently untrustworthy, and therefore, serve no useful purpose. In the end, we tend to ignore such tests, and for all practical purposes they just don’t exist in our test suite.
  3. While extremely slow tests are an issue, we have figured out ways to improve test speeds through infrastructure developments. For instance,our CI/CD systems can run multiple tests in the test suite in parallel, and thus we are delayed only by the slowests tests in the test suite; we have figured out how to prune the affected tests in a diff by being smart about the build and test targets affected by the changes, and thus, we need not run the entire test suite for a small change; the machines that execute tests have just gotten faster, thus alleviating some of the latency issues, etc.

From the above three observations, we can reasonably conclude that we cannot sacrifice accuracy. Accurate tests are the bedrock of trustworthy (and therefore, useful) test suites. Once we maximize accuracy, that leaves us with completeness and speed. Here there is a sliding scale between completeness and speed, and we could potentially rest anywhere on this scale.

So, is it ok to rest anywhere on the tradeoff spectrum between completeness and accuracy? Not quite. If you dial completeness all the way up and ignore speed, then you end up with a test suite that no one wants to run, and therefore, not useful at all. On the other hand, if you ignore completeness in favor of speed, then you are likely going to see a lot of regressions in your software and completely undermine consumer confidence in your product/service. In effect, the quality of your test suite is determined by the lowest score among the three attributes. Therefore, it is important to rest between completeness and speed, depending on the tolerance to errors and the minimum developer velocity you can sustain. For instance, if you are developing software for medical imaging, then your tolerance to errors is very very low, and so you should be favoring completeness at the expense of speed (and this is evident in how long it takes to make changes to software in the area of medical sciences). On the other hand, if you are building a web service that can be rolled back to a safe state quickly and with minimal external damage, then you probably want to favor speed over completeness (but only to a point; remember that your test quality is now determined by the completeness, or the lack thereof).

Thus, in conclusion, always maximize accuracy, and trade off between completeness and speed, depending on your tolerance of failures in production.

The big WHY about unit tests

Why unit test? When you ask “why do we write need unit tests?”, you will get several answers including

These seems like a collection of very good reasons, but it seems inelegant to state that the common phenomenon of unit testing has such disparate causes. There must be a ‘higher’ cause for writing unit tests. I argue that this cause is “maintainability”.

Maintainability

Maintainable software Here is a potentially provocative statement; “The final cause of unit tests is software maintainability”. To put it differently, if your software was immutable and could not be altered in any way, then that software does not need any unit tests.

Given that almost all software is mutable, unit tests exist to ensure that we can mutate the software to improve upon its utility in a sustainable manner. All the aforementioned answers to the question “why do we write unit tests” are ultimately subsumed by the cause of maintainability.

  • Unit tests help you find bugs in your code, thus allowing safe mutations that add functionality.
  • Unit tests protect against regression, especially when refactoring, thus allowing safe mutation of the software in preparation for functional changes.
  • Unit tests act as de facto documentation. It allows developers who change the code to communicate across time and space on how best to use existing code for mutating other code.
  • Unit tests help improve software design. It some code/class is difficult to unit test, then the software design is poor. So, you iterate until unit testing becomes easier.
  • Unit test help improve the usability of your API. Unit tests are the first customers of your API. If unit tests using your API are inelegant, then you iterate towards more usuable APIs. A more usable API is often a more used API, and thus, aids software evolution.

Interestingly, looking at maintainability as the primary motivation for unit tests allows us to look at some aspects of unit tests differently.

Looking at unit tests differently

Unit tests incur a maintenance cost.

If it code incurs a maintenance cost, and unit tests help reduce that cost, then you can naturally ask the following; since unit tests are also code, do they not incur a maintenance cost?

Obviously the answer to the question above is an unequivocal “yes!”. Thus, unit tests are only useful if the cost of maintaining them exceeds the savings they provide as a buttress against production code. This observation has significant implications for how to design and write unit tests. For instance, unit tests must be simple straight line code that is human readable, even at the expense of performance and redundancy. See the post on DRY unit tests for a more detailed treatment on this topic.

Unit tests can have diminishing returns.

If unit tests incur a maintenance cost, then their utility is the difference between the maintainability they provide and the cost they incur. Since software is a living/evolving entity, both this utility changes over time. Consequently, if you are not careful with your tests, then could become the proverbial Albatross across your neck. Consequently, it is important to tend to your unit test suite and pay attention when the utility of a test starts to diminish. Importantly, refactor your tests to ensure that you do not hit the point of diminishing, or even negative returns on your unit test.

Unit tests should be cognitively simple.

An almost necessary way to reduce the maintenance cost of a unit tests is to make it very simple to read and understand. It helps with maintenance in two ways. First, it makes it easy to understand the intent of the test, and the coverage that the test provides. Second, it makes it easy to modify the test (if needed) without having to worry about an unintended consequences such modifications might have; a degenerate case is that of tests that have hit the point of diminishing returns; more simple a test is, the easier it is to refactor and/or delete it. See the post on DRY unit tests for mote details.

A bad unit test is worse than no unit test.

If unit test incur a maintenance cost, then a bad unit test has all the costs associated with unit tests and none of the benefits. It is a net loss. Your code base is much better off without that unit test. In fact, a bad unit test can have an even higher cost if it sends developers on a wild goose chase looking for bugs when such unit tests fail. So, unless a unit test is of high quality, don’t bother with it. Just delete it.

A flaky unit test is the worst.

This is a corollary of the previous observation, but deserves some explanation. Flaky tests have the side effect of undermining the trust in the entire test suite. If a test is flaky, then developers are more likely to ignore red builds, because ‘that flaky test is the culprit, and so the failure can be ignored’. However, inevitably, some legitimate failure does occur. But, at this point, developers have been conditioned to ignore build/test failures. Consequently, a buggy commit makes it’s way to prod and causes a regression, which would never have happened if you didn’t have that flaky test.

Unit test the brains and not the nerves

Note: This is inspired from the book “Unit Testing: Principles, Practices, and Patterns” by Vladimir Khorikov.

brain

Unit tests are typically your first line of defense against bugs. So, it is tempting to add unit tests for all functionality that your code supports. But that begs the following question. “Why do we need integration and end-to-end tests?”

Categorizing production code

To better understand the primary motivations for unit tests vs. integration (and end-to-end) tests, it is helpful to categorize your production code into four categories along two dimensions: thinking, and talking.

  • Thinking code. There are parts of your codebase that are focused mostly on the business logic and the complex algorithmic computations. I refer to these as the thinking code.
  • Talking code. There are parts of your codebase that are focused mostly on communicating with other dependencies such as key-value stores, log servers, databases, etc. I refer to these as talking code.

Each part of your codebase can be either thinking, talking, or both. Based on that observation, we can categorize each unit of code into one of four categories (in keeping with the biology theme).

Thinking Talking Category
Yes No Brain
No Yes Nerves
Yes Yes Ganglia
No No Synapse

Testing for each category

Each category needs a distinct approach to testing.

Brains → Unit Tests

Brains are one of the most complex parts of your codebase that often requires the most technical skill and domain knowledge to author, read, and maintain. Consequently, they are best tested with unit tests. Furthermore, they also have very few direct external dependencies, and as a result require limited use of test doubles.

Nerves → Integration Tests

Nerves have very little logic, but focus mostly on external communication with dependencies. As a result, there isn’t much to unit test here, except perhaps that the protocol translation from the outside world into the brains is happening correctly. By their very nature, the correctness of nerves cannot be tested hermetically, and therefore, are not at all well suited to be unit tested. Nerves should really be tested in your integration tests, where you hook your production code with real test instances of external dependencies.

Ganglia → Refactor

Ganglia are units of code that have both complex business logic and have significant external dependencies. It is very difficult to unit test them thoroughly because such unit tests require heavy use of test doubles which can make the tests less readable and more brittle. You could try to test ganglia through integration tests, but it becomes very challenging to test low probability code paths, which is usually the source of difficult-to-debug issues. Therefore, my suggestion is to refactor such code into smaller pieces of code each of which are either a brain or a nerve, and tests each of those as described above.

See Chapter 7 of “Unit Testing: Principles, Practices, and Patterns” for suggestions on how to refactor your code to make it more testable.

Synapse → Ignore

Synapses are trivial pieces of code (often utilities) that have neither complex business logic, nor do they have any external dependencies. My recommendation is to simply not focus on testing them. Adding unit tests for them simply increases the cost of testing and maintenance without really providing any benefit. They are often simple enough to be verified visually, and they exist only to serve either the brains or the nerves, and so will be indirectly tested via unit tests or integration tests.

Mocks, Stubs, and how to use them

Photo by Polina Kovaleva from Pexels Photo by Polina Kovaleva from Pexels

Test doubles are the standard mechanism to isolate your System-Under-Test (SUT) from external dependencies in unit tests. Unsurprisingly, it is important to use the right test double for each use case for a maintainable and robust test suite. However, I have seen a lot of misuse of test doubles, and suffered through the consequences of it enough number of times to want to write down some (admittedly subjective) guidelines on when an how to use test doubles.

Briefly, test doubles are replacements for a production object used for testing. Depending on who you ask, there are multiple different categorizations of test doubles; but two categories that appears in all of these categorizations are mocks and stubs. So I will focus on on these two. I have seen mocks and stubs often conflated together. The problem is made worse by all the test-double frameworks’ terminology: they are often referred to as ‘mocking’ frameworks, and the test doubles they generate are all called ‘mocks’.

Mocks

woman wearing an emoji mask

Image by Andii Samperio from Pixabay

Mocks are objects that are used to verify ‘outbound’ interactions of the SUT with external dependencies. This is different from the notion of ‘mocks’ that ‘mocking frameworks’ generate. Those ‘mocks’ are more correctly the superclass of test doubles. Examples where mocks are useful include the SUT logging to a log server, or sending an email, or filing a task/ticket in response to a given input/user journey. This becomes clearer with an illustration.

from unittest.mock import MagicMock

class TestSUT(unittest.TestCase):
    def test_log_success(self) -> None:
        mock_log_server = MagicMock(spec=LogServerClass, autospec=True)
        mock_log_server.log = MagicMock(return_value=True)
        sut = SUT(log_server=mock_log_server)
        
        sut.test_method(input="foo")
        
        # This is ok!
        mock_log_server.log.assert_called_once_with(message="foo")

Note that in the above illustration, we verify that the message is sent to the the log server exactly once. This is an important part of the SUT’s specification. It the SUT were to start logging multiple messages/records for the request, then it could pollute the logs or even overwhelm the log server. Here, even though logging appears to be a side effect of test_method, this side effect is almost certainly part of SUT’s specification, and needs to be verified correctly. Mocks play a central role in such verifications.

Stubs

Robot imitating family

Unlike mocks, stubs verify ‘inbound’ interactions from external dependencies to the SUT. Stubs are useful when replacing external dependencies that ‘send’ data to the SUT in order for the SUT to satisfy its specification. Examples include key value stores, databases, event listeners, etc. The important note here is that the outbound interaction to the stub should not be asserted in the tests; that’s an anti pattern (it results in over-specification)! Here is an illustration.

from unittest.mock import MagicMock

class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -> None:
        stub_key_value_store = MagicMock(spec=KeyValueStoreClass, autospec=True)
        stub_key_value_store.get = MagicMock(return_value="user@special_domain.com")
        sut = SUT(key_value_store=stub_key_value_store)
        
        email_domain = sut.get_user_email_domin(username="foo")
        
        # This is ok!
        self.assertEquals("special_domain.com", email_domain)
        
        # THIS IS NOT OK!
        stub_key_value_store.get.assert_called_once_with(username="foo")

In the above illustration, we create a stub for the key value store (note that this is a stub even thought the object is a ‘mock’ class) that returns "user@special_domain.com" as a canned response to a get call. The test verifies that the SUT’s get_user_email_domain is called, it returns the correct email domain. What is important here is that we should not assert that there was a get call to the stub. Why? Because the call to the key value store is an implementation detail. Imagine a refactor that causes a previous value to be cached locally. If the unit tests were to assert on calls to the stubs, then such refactors would result in unit test failures, which undermines the utility, maintainability, and robustness of unit tests.

Fakes, instead of stubs

A small detour here. When using a stub, always consider if you can use a fake instead. There are multiple definitions of a fake, and the one I am referring to is the following. A fake is a special kind of stub that implements the same API as the production dependency, but the implementation is much more lightweight. This implementation may be correct only within the context of the unit tests where it is used. Let’s reuse the previous illustration of using a stub, and replace the stub with a fake. Recall that we stubbed out the get method of KeyValueStoreClass to return the canned value "user@special_domain.com". Instead, we can implement a fake KeyValueStoreClass that uses a Dict as follows.

from unittest.mock import MagicMock
from typing import Dict

# We assume a simplistic API for KeyValueStoreClass with just
# update and get methods.
class KeyValueStoreClass:
    def update(self, k: str, v: str) -> None:
        ...
    def get(self, k: str) -> str:
        ...

class FakeKeyValueStoreClassImpl:
    def __init__(self):
        self.kvs: Dict[str, str] = {}
    
    def update(self, k:str, v:str) -> None:
        self.kvs[k] = v

    def get(self, k: str) -> str:
        return self.kvs[k]


class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -> None:
        FakeKeyValueStoreClass = MagicMock(return_value=FakeKeyValueStoreClassImpl())
        fake_key_value_store = FakeKeyValueStoreClass()
        fake_key_value_store.update(k="foo", v="user@special_domain.com")
        sut = SUT(key_value_store=fake_key_value_store)
        
        email_domain = sut.get_user_email_domin(username="foo")
        
        self.assertEquals("special_domain.com", email_domain)

The advantage of using a fake is that the test becomes much more robust and is more resistant to refactoring. It also becomes more extensible. When using a stub, if we wanted to test a different user journey, we would need to inject a new return value for KeyValueStoreClass.get method. We would in one of two ways: (1) resetting the mock, which is a bad anti-pattern, or (2) initialize the stub to return a preconfigured list of canned values, in order, which makes the test more brittle (consider what happens if the SUT chooses to call get for the same key twice vs. calls get for different keys once each). Using a fake sidesteps these issues.

But my dependency has both inbound and outbound interactions!

Photograph of man double exposure

Despite all your efforts to separate out the test cases that need stubs and the ones that need mocks, you will inevitably find yourself needing to test a scenario in which you need to verify both inbound and outbound interactions with an external dependency. How do we address that?

First, if you need to assert on the outbound interaction of the same call that is stubbed, then you really don’t need that test. Just use a stub/fake and do not assert on the outbound interaction. Next, the only legitimate case of needing to verify both inbound and outbound interactions is if they are on distinct APIs of the same dependency. For example, the SUT could be reading from a file, and you need to test that (1) the contents of the file were read correctly, and (2) the file object was closed after the file was read. In this case, it is perfectly ok to stub the file read method while mocking the close method. Here is an illustration.

from unittest.mock import MagicMock, patch

class TestSUT(unittest.TestCase):
    def test_file_read(self) -> None:
        file_mock_stub_combo = MagicMock()
        # Using this as a stub by injecting canned contents of the file
        file_mock_stub_combo.__iter__.return_value = ["1234"]
        
        # Next, we treat the file open call as a mock.
        with patch("builtins.open",
                   return_value=file_mock_stub_combo, 
                   create=True
                  ) as mock_file:
            sut = SUT(filename="foo")
            file_contents = sut.get_contents()
            
            # Assertions on call to file open.
            # Treating the 'open' call as a mock.
            mock_file.assert_called_once_with("foo")
        
            # Assertion on the contents returned.
            # Treating the `read` as a stub.
            self.assertEquals("1234", file_contents)
        
            # Assertion on the outbound interaction of file close.
            # Treating the 'close' call as a mock.
            file_mock_stub_combo.close.assert_called_once()