
Primary attributes of unit test suites and their tradeoffs

Unit test suites have three primary attributes.

  1. accuracy,
  2. completeness, and
  3. speed.

Accuracy says that if a test fails, then there is a bug. Completeness says that if there is a bug, then a unit test will fail. Speed says that tests will run ‘fast’. These three attributes are in tension with one another, and you can fully satisfy at most two of the three!

Before discussing these attributes, it is important to note that they are not properties of a test suite at rest, but rather of the test suite under change. That is, these attributes are measured only when you are making changes to the code and running the test suite in response to those changes. Also, these attributes do not apply to a single unit test; instead, they apply to the test suite as a whole. Furthermore, the quality of your test suite is determined by how well the suite measures up along these attributes.

Attributes’ descriptions

Let’s describe each of these attributes, and then we can see why any unit test suite is forced to trade them off against one another.

  1. Accuracy. This is a measure of the robustness of the test suite to changes in the production code. If you make a change to the production code without changing your unit tests, and your test suite has a failure, then how likely is it that your change introduced a bug? Accuracy is a measure of this likelihood. High-quality unit tests typically have very good accuracy. If your test suite has poor accuracy, then it suggests that your tests are brittle, that they are actually testing implementation details instead of functionality, or that your production code is poorly designed with leaky abstractions (see the sketch after this list). Inaccurate tests reduce your ability to detect regressions. They fail to provide early warning when a diff breaks existing functionality, because the developer cannot be sure that the test failure is a genuine bug and not an artifact of test brittleness. As a result, developers are more likely to ignore test failures, or modify the tests to make them ‘pass’, and thus introduce bugs into their code.
  2. Completeness. This is a measure of how comprehensive the test suite really is. If you make a change to the production code without changing your unit tests, and you introduce a bug in existing functionality, then how likely is it that your test suite will fail? Completeness is a measure of this likelihood. Most test coverage metrics try to approximate the completeness of your test suite. However, we have seen how coverage metrics are often a poor proxy for completeness.
  3. Speed. This is simply a measure of how quickly a test suite runs. If tests are hermetic, with the right use of test doubles, then each test runs pretty quickly. However, if the tests are of poor quality or the test suite is very large, then the suite can get pretty slow. This is most noticeable when you are iterating on a feature and, with each small change, you need to run a test suite that seems to take forever to complete. Slow tests can have a disproportionate impact on developer velocity. They make developers less likely to run tests eagerly, they increase the time between iterations, and they increase CI/CD latency to the point where the gap between your code landing and the changes reaching prod can be unreasonably large. If this gets bad enough, it discourages developers from running tests as needed, and thus allows bugs to creep in.
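To make the accuracy discussion concrete, here is a minimal sketch using a hypothetical ShoppingCart class (the class and both tests are illustrative assumptions, not from the original post). The first test couples itself to an internal attribute and fails on harmless refactors; the second asserts only observable behavior.

import unittest

class ShoppingCart:
    def __init__(self) -> None:
        self._items = []  # Internal detail; could become a dict later.

    def add(self, item: str) -> None:
        self._items.append(item)

    def count(self) -> int:
        return len(self._items)

class TestShoppingCart(unittest.TestCase):
    def test_add_brittle(self) -> None:
        cart = ShoppingCart()
        cart.add("apple")
        # Inaccurate: this fails if _items is renamed or becomes a dict,
        # even though the cart still behaves correctly.
        self.assertEqual(["apple"], cart._items)

    def test_add_accurate(self) -> None:
        cart = ShoppingCart()
        cart.add("apple")
        # Accurate: this fails only if observable behavior breaks.
        self.assertEqual(1, cart.count())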

Attribute constraints and trade offs

There is a tension among these attributes, and that tension determines how they contribute to overall unit test suite quality.

Among accuracy, completeness, and speed, you cannot maximize all three; that is, you cannot have a fast test suite that will fail if and only if there is a bug. Maximizing any two will minimize the third.

  • A perfect test suite with high accuracy and completeness will inevitably be huge, and thus very slow.
  • A fast test suite with high accuracy will often test only the most common user journeys, and thus be incomplete.
  • A test suite with very high coverage is often made ‘fast’ through extensive use of test doubles and ends up coupling tests with the implementation details, which makes the tests brittle, and therefore inaccurate.

What’s the right trade off?


A natural follow up to the trade offs among accuracy, completeness, and speed is “What is the right trade off?”. It helps to notice that, empirically, we are always making this trade off and naturally settling on some point in the trade-off surface. What is this natural resting point for these trade offs? Let’s examine a few things to help us answer the above question.

  1. From experience, we know that bugs in software are inevitable, and we have learned to deal with them. While bug-free code might be the ideal, no one reasonably expects bug-free software, and we accept some level of incorrectness in our implementations.
  2. Flaky/brittle tests can have very significant negative consequences. Such tests are inherently untrustworthy, and therefore, serve no useful purpose. In the end, we tend to ignore such tests, and for all practical purposes they just don’t exist in our test suite.
  3. While extremely slow tests are an issue, we have figured out ways to improve test speeds through infrastructure developments. For instance, our CI/CD systems can run the tests in a test suite in parallel, so we are delayed only by the slowest tests in the suite; we have figured out how to prune the set of tests affected by a diff by being smart about the build and test targets the changes touch, so we need not run the entire test suite for a small change; and the machines that execute tests have simply gotten faster, alleviating some of the latency issues.

From the above three observations, we can reasonably conclude that we cannot sacrifice accuracy. Accurate tests are the bedrock of trustworthy (and therefore, useful) test suites. Once we maximize accuracy, that leaves us with completeness and speed. Here there is a sliding scale between completeness and speed, and we could potentially rest anywhere on this scale.

So, is it ok to rest anywhere on the tradeoff spectrum between completeness and speed? Not quite. If you dial completeness all the way up and ignore speed, then you end up with a test suite that no one wants to run, and that is therefore not useful at all. On the other hand, if you ignore completeness in favor of speed, then you are likely to see a lot of regressions in your software and completely undermine consumer confidence in your product/service. In effect, the quality of your test suite is determined by the lowest score among the three attributes. Therefore, it is important to choose a point between completeness and speed based on your tolerance for errors and the minimum developer velocity you can sustain. For instance, if you are developing software for medical imaging, then your tolerance for errors is very, very low, and so you should favor completeness at the expense of speed (and this is evident in how long it takes to make changes to software in the medical sciences). On the other hand, if you are building a web service that can be rolled back to a safe state quickly and with minimal external damage, then you probably want to favor speed over completeness (but only to a point; remember that your test quality is now determined by the completeness, or the lack thereof).

Thus, in conclusion, always maximize accuracy, and trade off between completeness and speed, depending on your tolerance of failures in production.

The big WHY about unit tests

Why unit test? When you ask “why do we write unit tests?”, you will get several answers, including finding bugs, protecting against regressions, documenting behavior, improving software design, and making APIs more usable.

These seem like a collection of very good reasons, but it seems inelegant that the common phenomenon of unit testing should have such disparate causes. There must be a ‘higher’ cause for writing unit tests. I argue that this cause is “maintainability”.

Maintainability

Here is a potentially provocative statement: “The final cause of unit tests is software maintainability”. To put it differently, if your software were immutable and could not be altered in any way, then that software would not need any unit tests.

Given that almost all software is mutable, unit tests exist to ensure that we can mutate the software to improve upon its utility in a sustainable manner. All the aforementioned answers to the question “why do we write unit tests” are ultimately subsumed by the cause of maintainability.

  • Unit tests help you find bugs in your code, thus allowing safe mutations that add functionality.
  • Unit tests protect against regression, especially when refactoring, thus allowing safe mutation of the software in preparation for functional changes.
  • Unit tests act as de facto documentation. They allow developers who change the code to communicate across time and space about how best to use existing code when mutating other code.
  • Unit tests help improve software design. If some code/class is difficult to unit test, then the software design is likely poor. So, you iterate until unit testing becomes easier.
  • Unit tests help improve the usability of your API. Unit tests are the first customers of your API. If unit tests using your API are inelegant, then you iterate towards more usable APIs. A more usable API is often a more used API, and thus aids software evolution.

Interestingly, looking at maintainability as the primary motivation for unit tests allows us to look at some aspects of unit tests differently.

Looking at unit tests differently

Unit tests incur a maintenance cost.

If code incurs a maintenance cost, and unit tests help reduce that cost, then you can naturally ask the following: since unit tests are also code, do they not incur a maintenance cost of their own?

Obviously, the answer to the question above is an unequivocal “yes!”. Thus, unit tests are only useful if the cost of maintaining them DOES NOT EXCEED the savings they provide as a buttress for the production code. This observation has significant implications for how to design and write unit tests. For instance, unit tests should be simple, straight-line code that is human readable, even at the expense of performance and redundancy. See the post on DRY unit tests for a more detailed treatment of this topic.

Unit tests can have diminishing returns.

If unit tests incur a maintenance cost, then their utility is the difference between the maintainability they provide and the cost they incur. Since software is a living, evolving entity, this utility changes over time. Consequently, if you are not careful with your tests, they could become the proverbial albatross around your neck. It is therefore important to tend to your unit test suite and pay attention when the utility of a test starts to diminish. In particular, refactor your tests to ensure that you do not hit the point of diminishing, or even negative, returns on your unit tests.

Unit tests should be cognitively simple.

An almost necessary way to reduce the maintenance cost of a unit test is to make it very simple to read and understand. This helps with maintenance in two ways. First, it makes it easy to understand the intent of the test, and the coverage that the test provides. Second, it makes it easy to modify the test (if needed) without having to worry about the unintended consequences such modifications might have. A degenerate case is that of tests that have hit the point of diminishing returns: the simpler a test is, the easier it is to refactor and/or delete it. See the post on DRY unit tests for more details.

A bad unit test is worse than no unit test.

If unit tests incur a maintenance cost, then a bad unit test has all the costs associated with unit tests and none of the benefits. It is a net loss. Your code base is much better off without that unit test. In fact, a bad unit test can have an even higher cost if it sends developers on a wild goose chase looking for bugs whenever it fails. So, unless a unit test is of high quality, don’t bother with it. Just delete it.

A flaky unit test is the worst.

This is a corollary of the previous observation, but deserves some explanation. Flaky tests have the side effect of undermining trust in the entire test suite. If a test is flaky, then developers are more likely to ignore red builds, because ‘that flaky test is the culprit, and so the failure can be ignored’. However, inevitably, some legitimate failure does occur. But, by that point, developers have been conditioned to ignore build/test failures. Consequently, a buggy commit makes its way to prod and causes a regression, which would never have happened if you didn’t have that flaky test.

Unit test the brains and not the nerves

Note: This is inspired by the book “Unit Testing: Principles, Practices, and Patterns” by Vladimir Khorikov.


Unit tests are typically your first line of defense against bugs. So, it is tempting to add unit tests for all the functionality that your code supports. But that raises the following question: “Why do we need integration and end-to-end tests?”

Categorizing production code

To better understand the primary motivations for unit tests vs. integration (and end-to-end) tests, it is helpful to categorize your production code into four categories along two dimensions: thinking and talking.

  • Thinking code. There are parts of your codebase that are focused mostly on the business logic and the complex algorithmic computations. I refer to these as the thinking code.
  • Talking code. There are parts of your codebase that are focused mostly on communicating with other dependencies such as key-value stores, log servers, databases, etc. I refer to these as talking code.

Each part of your codebase can be thinking, talking, both, or neither. Based on that observation, we can categorize each unit of code into one of four categories (in keeping with the biology theme), as shown in the table and the sketch below.

| Thinking | Talking | Category |
| -------- | ------- | -------- |
| Yes      | No      | Brain    |
| No       | Yes     | Nerves   |
| Yes      | Yes     | Ganglia  |
| No       | No      | Synapse  |
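Here is a hedged sketch of what each category can look like in code. All of the names and the schema (compute_discount, fetch_loyalty_years, the loyalty table, and so on) are hypothetical illustrations, not from the original post.

import sqlite3

# Brain: pure business logic, no external dependencies.
def compute_discount(subtotal: float, loyalty_years: int) -> float:
    rate = min(0.05 * loyalty_years, 0.25)  # 5% per year, capped at 25%.
    return round(subtotal * rate, 2)

# Nerve: thin communication with a dependency, almost no logic.
def fetch_loyalty_years(conn: sqlite3.Connection, user_id: int) -> int:
    row = conn.execute(
        "SELECT years FROM loyalty WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else 0

# Ganglion: thinking *and* talking in one unit; a refactoring candidate.
def discount_for_user(conn: sqlite3.Connection, user_id: int, subtotal: float) -> float:
    row = conn.execute(
        "SELECT years FROM loyalty WHERE user_id = ?", (user_id,)
    ).fetchone()
    rate = min(0.05 * (row[0] if row else 0), 0.25)
    return round(subtotal * rate, 2)

# Synapse: trivial glue; it neither thinks nor talks.
def as_percent(rate: float) -> str:
    return f"{rate:.0%}"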

Testing for each category

Each category needs a distinct approach to testing.

Brains → Unit Tests

Brains are the most complex parts of your codebase, and they often require the most technical skill and domain knowledge to author, read, and maintain. Consequently, they are best tested with unit tests. Furthermore, they have very few direct external dependencies, and as a result require limited use of test doubles.
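Continuing the hypothetical sketch above, a brain such as compute_discount can be unit tested directly, with no test doubles at all:

import unittest

class TestComputeDiscount(unittest.TestCase):
    def test_discount_is_capped_at_25_percent(self) -> None:
        # 10 loyalty years would be a 50% rate, but the cap is 25%.
        self.assertEqual(25.0, compute_discount(subtotal=100.0, loyalty_years=10))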

Nerves → Integration Tests

Nerves have very little logic and focus mostly on external communication with dependencies. As a result, there isn’t much to unit test here, except perhaps that the protocol translation from the outside world into the brains is happening correctly. By their very nature, nerves cannot be tested hermetically, and therefore they are not well suited to unit testing. Nerves should really be covered by your integration tests, where you hook your production code up to real test instances of its external dependencies.
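For instance, the hypothetical fetch_loyalty_years nerve above can be exercised against an in-memory SQLite database acting as the real test instance of the dependency:

import sqlite3
import unittest

class TestFetchLoyaltyYears(unittest.TestCase):
    def test_returns_years_for_known_user(self) -> None:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE loyalty (user_id INTEGER, years INTEGER)")
        conn.execute("INSERT INTO loyalty VALUES (42, 3)")
        self.assertEqual(3, fetch_loyalty_years(conn, user_id=42))

    def test_defaults_to_zero_for_unknown_user(self) -> None:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE loyalty (user_id INTEGER, years INTEGER)")
        self.assertEqual(0, fetch_loyalty_years(conn, user_id=7))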

Ganglia → Refactor

Ganglia are units of code that have both complex business logic and significant external dependencies. It is very difficult to unit test them thoroughly, because such unit tests require heavy use of test doubles, which can make the tests less readable and more brittle. You could try to test ganglia through integration tests, but it becomes very challenging to exercise low-probability code paths, which are usually the source of difficult-to-debug issues. Therefore, my suggestion is to refactor such code into smaller pieces, each of which is either a brain or a nerve, and test each of those as described above.
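In the hypothetical sketch above, that refactor amounts to rewriting the discount_for_user ganglion as a composition of the nerve and the brain:

import sqlite3

def discount_for_user(conn: sqlite3.Connection, user_id: int, subtotal: float) -> float:
    years = fetch_loyalty_years(conn, user_id)  # Nerve: covered by integration tests.
    return compute_discount(subtotal, years)    # Brain: covered by unit tests.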

See Chapter 7 of “Unit Testing: Principles, Practices, and Patterns” for suggestions on how to refactor your code to make it more testable.

Synapse → Ignore

Synapses are trivial pieces of code (often utilities) that have neither complex business logic nor any external dependencies. My recommendation is to simply not focus on testing them. Adding unit tests for them increases the cost of testing and maintenance without really providing any benefit. They are often simple enough to be verified visually, and they exist only to serve either the brains or the nerves, and so they will be tested indirectly via the unit tests and integration tests for those.

Mocks, Stubs, and how to use them


Test doubles are the standard mechanism for isolating your System-Under-Test (SUT) from external dependencies in unit tests. Unsurprisingly, it is important to use the right test double for each use case if you want a maintainable and robust test suite. However, I have seen a lot of misuse of test doubles, and I have suffered through the consequences often enough to want to write down some (admittedly subjective) guidelines on when and how to use test doubles.

Briefly, test doubles are replacements for production objects, used for testing. Depending on who you ask, there are multiple different categorizations of test doubles; but two categories that appear in all of them are mocks and stubs. So I will focus on these two. I have often seen mocks and stubs conflated. The problem is made worse by the test-double frameworks’ terminology: they are often referred to as ‘mocking’ frameworks, and the test doubles they generate are all called ‘mocks’.

Mocks



Mocks are objects that are used to verify ‘outbound’ interactions of the SUT with external dependencies. This is different from the notion of ‘mocks’ that ‘mocking frameworks’ generate; those ‘mocks’ are, more correctly, test doubles in general. Examples where mocks are useful include the SUT logging to a log server, sending an email, or filing a task/ticket in response to a given input/user journey. This becomes clearer with an illustration.

import unittest
from unittest.mock import create_autospec

class TestSUT(unittest.TestCase):
    def test_log_success(self) -> None:
        # create_autospec builds a double that enforces LogServerClass's API.
        mock_log_server = create_autospec(LogServerClass)
        mock_log_server.log.return_value = True
        sut = SUT(log_server=mock_log_server)

        sut.test_method(input="foo")

        # This is ok!
        mock_log_server.log.assert_called_once_with(message="foo")

Note that in the above illustration, we verify that the message is sent to the log server exactly once. This is an important part of the SUT’s specification. If the SUT were to start logging multiple messages/records for the request, then it could pollute the logs or even overwhelm the log server. Here, even though logging appears to be a side effect of test_method, this side effect is almost certainly part of the SUT’s specification, and needs to be verified. Mocks play a central role in such verifications.

Stubs


Unlike mocks, stubs emulate ‘inbound’ interactions: they supply data from external dependencies to the SUT. Stubs are useful for replacing external dependencies that ‘send’ data the SUT needs in order to satisfy its specification. Examples include key-value stores, databases, event listeners, etc. The important note here is that the SUT’s calls to the stub should not be asserted in the tests; that’s an anti-pattern (it results in over-specification)! Here is an illustration.

import unittest
from unittest.mock import create_autospec

class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -> None:
        stub_key_value_store = create_autospec(KeyValueStoreClass)
        stub_key_value_store.get.return_value = "user@special_domain.com"
        sut = SUT(key_value_store=stub_key_value_store)

        email_domain = sut.get_user_email_domain(username="foo")

        # This is ok!
        self.assertEqual("special_domain.com", email_domain)

        # THIS IS NOT OK!
        stub_key_value_store.get.assert_called_once_with(username="foo")

In the above illustration, we create a stub for the key-value store (note that this is a stub even though the object is produced by a ‘mock’ class) that returns "user@special_domain.com" as a canned response to a get call. The test verifies that when the SUT’s get_user_email_domain is called, it returns the correct email domain. What is important here is that we should not assert that there was a get call to the stub. Why? Because the call to the key-value store is an implementation detail. Imagine a refactor that causes a previously fetched value to be cached locally. If the unit tests were to assert on calls to the stub, then such refactors would result in unit test failures, which undermines the utility, maintainability, and robustness of the unit tests.
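To see why, here is a hypothetical sketch of such a refactor (the SUT and its cache are illustrative assumptions):

class SUT:
    def __init__(self, key_value_store) -> None:
        self._kvs = key_value_store
        self._cache: dict = {}

    def get_user_email_domain(self, username: str) -> str:
        # Refactor: cache the looked-up email to avoid repeated calls.
        if username not in self._cache:
            self._cache[username] = self._kvs.get(username)
        return self._cache[username].split("@")[1]

The SUT’s observable behavior is unchanged, so the assertEqual check still passes. But a test that exercises repeated lookups and asserts on the stub’s get calls now fails, even though nothing is broken.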

Fakes, instead of stubs

A small detour here. When using a stub, always consider if you can use a fake instead. There are multiple definitions of a fake, and the one I am referring to is the following. A fake is a special kind of stub that implements the same API as the production dependency, but the implementation is much more lightweight. This implementation may be correct only within the context of the unit tests where it is used. Let’s reuse the previous illustration of using a stub, and replace the stub with a fake. Recall that we stubbed out the get method of KeyValueStoreClass to return the canned value "user@special_domain.com". Instead, we can implement a fake KeyValueStoreClass that uses a Dict as follows.

import unittest
from unittest.mock import MagicMock
from typing import Dict

# We assume a simplistic API for KeyValueStoreClass with just
# update and get methods.
class KeyValueStoreClass:
    def update(self, k: str, v: str) -> None:
        ...
    def get(self, k: str) -> str:
        ...

class FakeKeyValueStoreClassImpl:
    def __init__(self) -> None:
        self.kvs: Dict[str, str] = {}

    def update(self, k: str, v: str) -> None:
        self.kvs[k] = v

    def get(self, k: str) -> str:
        return self.kvs[k]


class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -> None:
        # Stand-in for however the fake gets injected in place of the
        # production class (e.g., by patching the class's constructor).
        FakeKeyValueStoreClass = MagicMock(return_value=FakeKeyValueStoreClassImpl())
        fake_key_value_store = FakeKeyValueStoreClass()
        fake_key_value_store.update(k="foo", v="user@special_domain.com")
        sut = SUT(key_value_store=fake_key_value_store)

        email_domain = sut.get_user_email_domain(username="foo")

        self.assertEqual("special_domain.com", email_domain)

The advantage of using a fake is that the test becomes much more robust and resistant to refactoring. It also becomes more extensible. When using a stub, if we wanted to test a different user journey, we would need to inject a new return value for the KeyValueStoreClass.get method. We could do this in one of two ways: (1) reset the mock, which is a bad anti-pattern, or (2) initialize the stub to return a preconfigured list of canned values, in order, which makes the test more brittle (consider what happens if the SUT calls get for the same key twice vs. once each for different keys), as sketched below. Using a fake sidesteps these issues.
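Here is a brief sketch of option (2), using unittest.mock’s side_effect to return canned values in order:

from unittest.mock import create_autospec

stub_key_value_store = create_autospec(KeyValueStoreClass)
stub_key_value_store.get.side_effect = [
    "user@special_domain.com",
    "user@other_domain.com",
]
# The stub now returns the canned values in a fixed order. If a
# refactored SUT calls get twice for the same key, the second call
# silently returns the wrong value; a third call raises StopIteration.
# The fake above has neither problem.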

But my dependency has both inbound and outbound interactions!


Despite all your efforts to separate out the test cases that need stubs and the ones that need mocks, you will inevitably find yourself with a scenario in which you need to verify both inbound and outbound interactions with the same external dependency. How do we address that?

First, if you need to assert on the outbound interaction of the same call that is stubbed, then you really don’t need that test. Just use a stub/fake and do not assert on the outbound interaction. Next, the only legitimate case of needing to verify both inbound and outbound interactions is if they are on distinct APIs of the same dependency. For example, the SUT could be reading from a file, and you need to test that (1) the contents of the file were read correctly, and (2) the file object was closed after the file was read. In this case, it is perfectly ok to stub the file read method while mocking the close method. Here is an illustration.

import unittest
from unittest.mock import MagicMock, patch

class TestSUT(unittest.TestCase):
    def test_file_read(self) -> None:
        file_mock_stub_combo = MagicMock()
        # Using this as a stub by injecting canned contents of the file.
        file_mock_stub_combo.__iter__.return_value = ["1234"]

        # Next, we treat the file open call as a mock.
        with patch("builtins.open",
                   return_value=file_mock_stub_combo
                  ) as mock_file:
            sut = SUT(filename="foo")
            file_contents = sut.get_contents()

            # Assertion on the call to file open.
            # Treating the 'open' call as a mock.
            mock_file.assert_called_once_with("foo")

            # Assertion on the contents returned.
            # Treating the read (iteration) as a stub.
            self.assertEqual("1234", file_contents)

            # Assertion on the outbound interaction of file close.
            # Treating the 'close' call as a mock.
            file_mock_stub_combo.close.assert_called_once()

DRY unit tests are bad... mkay


“Don’t Repeat Yourself” (DRY) is arguably one of the most important principles in software engineering. It is considered a truism among many. A consequence of such dogmatic allegiance to DRYness is that we see a lot of DRY unit tests; this is where the utility of the DRY principle breaks down and starts causing more problems than it solves.

TL;DR. Simplicity should be a core property of unit tests. This is motivated both by the arguments in this post against DRY unit tests, and by software maintainability being the primary motivation for unit tests. Unit tests should be as simple as reasonable. They should be easy to read, understand, and modify (it should be possible to modify any single test in isolation). It is perfectly acceptable for this simplicity to come at the expense of code reuse, performance, and efficiency.

So, what’s wrong with DRY Unit Tests?

Presumably, we are all convinced of the benefits of DRYing out your code (interested readers can visit the Wikipedia page). It does have some downsides, which is why we have the notions of the DAMP/MOIST/AHA principles. Interestingly, the reasons why DRYness doesn’t always work out in production code are different from the reasons why it is a bad idea to write DRY unit tests. I see five ways in which test code differs from production code, each of which contributes to why test code should not be DRY.

  1. Tests (conceptually) do not yield well to common abstractions.
  2. Test code’s readability always takes precedence over performance, but not so for production code.
  3. Production code enjoys the safety net of test code, but test code has no such backstop.
  4. DRY production code can speed up developer velocity, but DRY test code hinders developer velocity.
  5. Complex changes to production code can be reviewed faster with pure green/pure red test code changes, but complex changes to test code cannot be reviewed easily.

Let’s explore each one in more detail.

DRYness and Abstraction

In practice, DRYing out code means building abstractions that collect semantically identical operations into a common procedure. If done prematurely, DRYing can result in poorer software; in fact, premature DRYing is the motivation for advocating the AHA principle. While that argument against DRYness works well for production code, it does not apply to test code.

Test code is often a collection of procedures, where each procedure steps the System-Under-Test (SUT) through a distinct user journey and compares the SUT’s behavior against pre-defined expectations. Thus, almost by design, test code does not lend itself to semantically meaningful abstractions. The mistake I have seen software engineers make is to take syntactic similarity for semantic similarity. Just because the tests’ ‘Arrange’ sections look similar does not mean that they are doing semantically the same thing in both places; in fact, they are almost certainly doing semantically different things, because otherwise the tests would be duplicates of each other!

By DRYing out such test code, you are effectively forcing abstractions where none exist, and that leads to the same issues that DRYness leads to in production code (See [1], [2], [3], [4] for examples).

Readability

Most code is read more often than it is written/edited. Unsurprisingly, it is important to favor code readability, even in production code. However, in production code, if readability comes at a steep cost in performance and/or efficiency, then it is common (and prudent) to favor performance. Test code, on the other hand, is less subject to this tension between readability and performance. Yes, unit tests need to be ‘fast’, but given the minuscule amount of data/inputs that unit tests process, speed is not an issue for hermetic unit tests. The upshot is that there is no practical drawback to keeping test code readable.

DRYing out test code directly affects its readability. Why? Remember that we read unit tests to understand the expected behavior of the system-under-test (SUT), and we do so in the context of a user journey. So, a readable unit test needs to explain the user journey it is executing, the role played by the SUT in realizing that user journey, and what a successful user journey looks like. This is reflected in the Arrange-Act-Assert structure of the unit test. When you DRY out your unit tests, you are also obfuscating at least one of those sections in your unit test. This is better illustrated with an example.

A common form of DRYing I have seen in unit tests looks as follows:

import typing
import unittest

from parameterized import parameterized

class TestInput(typing.NamedTuple):
    param1: str
    param2: typing.Optional[int]
    ...

class TestOutput(typing.NamedTuple):
    status: SomeEnum
    return_value: typing.Optional[int]
    exception: typing.Optional[Exception]
    ...

class TestCase(typing.NamedTuple):
    input: TestInput
    expected_output: TestOutput

class TestSequence(unittest.TestCase):

    @parameterized.expand([
        [test_input1, expected_output1],
        [test_input2, expected_output2],
        ...
    ])
    def test_somethings(self, test_input: TestInput, expected_output: TestOutput) -> None:
        self._run_test(test_input, expected_output)

    def _run_test(self, test_input: TestInput, expected_output: TestOutput) -> None:
        sut = SUT(...)
        prepare_sut_for_tests(sut, test_input)
        output = sut.do_something(test_input.param2)
        test_output = make_test_output(output, sut)
        self.assertEqual(expected_output, test_output)

On the face of it, this looks like well-organized DRY code. But for someone reading this test to understand what the SUT does, it is very challenging. They have no idea why this set of test inputs was chosen, what the material differences among the inputs are, which user journeys each of those test cases represents, what preconditions need to be satisfied before running sut.do_something(), why the expected output is the specified output, and so on.

Instead, consider a non-DRY alternative.

import unittest

class TestSequence(unittest.TestCase):

    def test_foo_input_under_bar_condition(self):
        """
        This test verifies that when condition bar is true, calling `do_something()`
        with input foo results in sigma behavior.
        """
        sut = SUT()
        ensure_precondition_bar(sut, param1=bar1, param2=bar2)
        output = sut.do_something(foo)
        self.assertEqual(sigma, output)

This code tests one user journey and is human readable at a glance by someone who does not have an in-depth understanding of the SUT. We can similarly define all the other test cases, trading code duplication for greater readability, with negligible negative impact.

Who watches the watchmen?


Production code has the luxury of being fine-tuned, optimized, DRY’d out, and subjected to all sorts of gymnastics mostly because production code is defended by tests. For instance, if, to improve performance, you replaced a copy with a reference and accidentally mutated that reference inside a function, a unit test can catch the unintended mutation. Test code has no such backstop. If you introduce a bug in your test code, then only careful inspection by a human will catch it. The upshot is the following: the less simple/obvious the test code is, the more likely it is that a bug in that test code will go undetected, at least for a while. If a buggy test passes, then a bug in your production code may be going undetected. Conversely, if a test fails, it might just denote a bug in the test code. Either way, you lose confidence in your test suite, and nothing good can come from that.
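A minimal sketch of the copy-vs-reference point, with hypothetical names:

import unittest

def normalize(scores: list) -> list:
    total = sum(scores)
    # Perf 'optimization': mutate the list in place instead of copying.
    for i in range(len(scores)):
        scores[i] /= total  # BUG: silently mutates the caller's list.
    return scores

class TestNormalize(unittest.TestCase):
    def test_does_not_mutate_input(self) -> None:
        scores = [1.0, 3.0]
        normalize(scores)
        # This test fails, catching the unintended mutation in the
        # production code. Nothing equivalent guards the test code itself.
        self.assertEqual([1.0, 3.0], scores)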

DRY code inevitably asks the reader to jump from one function to another, and requires the reader to hold the previous context in mind while navigating these functions. In other words, it increases the cognitive burden on the reader compared to straight-line duplicated code. That makes it difficult to verify the correctness of the test code quickly and easily. So, when you DRY out your test code, you are increasing the odds that bugs creep into your test suite and that developers lose confidence in the tests, which in turn significantly reduces the utility of your tests.

Developer Velocity


Recall from the previous section that while tests might contain duplicate code, they do not actually represent semantic abstractions replicated in multiple places. If you do mistake them for common semantic abstractions and DRY them out, then eventually there will be an addition to the production code whose test breaks the assumed abstraction. At that point, the developer adding the feature will run into issues when trying to modify the existing test code to add the new test case. For instance, consider a class that is hermetic, stateless, and does not throw exceptions. It would not be surprising to organize DRY tests for this class that assume exceptions are never thrown. Now a new feature is added to this class that requires an external dependency and can throw exceptions. Adding a new test case to the DRY’d out unit test suite will not be easy or straightforward. The sunk cost fallacy associated with the existing test framework makes it more likely that the developer will try to force-fit the new test case(s) into the existing framework. As a result:

  1. It slows the developer down because they now have to grok the existing test framework, think of ways in which to extend it for a use case that it was not designed for, and make those changes without breaking existing tests.
  2. Thanks to poor abstractions, you have now incurred more technical debt in your test code.

Code Reviews


DRY’d out tests not only impede developer velocity, they also make code/diffs/pull requests harder to review. This is a second-order effect of DRYing out your test code. Let’s revisit the example where we are adding a new feature to an existing piece of code, and the change is a pure addition of behavior (not a modification of existing behavior). If the tests were not DRY’d out, then adding tests for this new feature would involve just adding new test cases, and thus just green lines in the generated diff. In contrast, recall from the previous subsection that adding tests to DRY test code is likely to involve modifying existing code and then adding new test cases. In the former case, reviewing the tests is much easier, and as a result, verifying that the new feature behaves correctly is also that much easier. Reviewing the diff in the latter case is cognitively more taxing, because not only does the reviewer need to verify that the new feature is implemented correctly, they also have to verify that the changes to the test code are correct and are not introducing new holes for bugs to escape testing. This can significantly slow down code reviews in two ways: (1) it takes more time to review the code, and (2) because the review takes longer, reviewers are more likely to delay even starting it.