Srikanth Sastry A Techie in Boston

Do not index on test coverage metrics

Coverage Chart

We live in a data driven world, and as the saying goes “[…] What is not measured, cannot be improved. […]”

What is not defined cannot be measured. What is not measured, cannot be improved. What is not improved, is always degraded.

– William Thomson Kelvin

The temptation, therefore, is to measure everything. Even the quality of your unit tests, and there where the trouble usually begins. For an detailed explanation of why indexing on the test coverage metrics is a bad idea, I highly recommend Jason Rudolph’s collection of posts on this topic here. To drive home the point more explicitly (and motivate you to actually go read Jason’s posts), here are some illustrative explanations.

There are many coverage metrics including function coverage, statement coverage, line coverage, branch coverage, condition coverage, etc. Here, we will only look at line coverage and branch coverage, because those are the most popular.

Line coverage

Let’s start with line coverage, which is the number of lines of code executed by tests vs. the total number of lines of code. The most common target for the line coverage metric is 80%. That is, 80% of your code should be executed by your tests. While that might seem like a good idea, indexing on this metric can actually take you away from good quality test coverage! How? Consider the following (contrived example).

def has_three_digits(value: int) -> bool:
    strlen = len(str(value))
    if strlen == 3:
        return True
    return False

class TestHasThreeDigits(unittest.TestCase):
    def test_has_three_digits_234(self) -> None:
        output_value = has_three_digits(234)
        self.assertTrue(output_value)

Clearly TestHasThreeDigits inadequate as a test suite for testing has_three_digits. Tests only the True case, and misses the False cases completely! The line coverage of the test suite is 3/4 = 75%. You could say that the test coverage is less than 80%, and therefore not adequate. Here, it appears that the line coverage metric does indeed point of inadequate testing. However, this confidence in the metric is severely misplaced! Consider the following refactoring of has_three_digits

def has_three_digits(value: int) -> bool:
    value_as_str = str(value)
    strlen = len(value_as_str)
    if strlen == 3:
        return True
    return False

Now, TestHasThreeDigits line coverage magically improves to 4/5 = 80%, and as per the 80% target, the metrics seems to suggest adequate coverage! In fact, you can play this game some more and refactor has_three_digits to

def has_three_digits(value: int) -> bool:
    value_as_str = str(value)
    strlen = len(value_as_str)
    return (strlen == 3)

Now, with the same test suite TestHasThreeDigits now has 100% coverage! Recall that semantically the test still do the same thing; they still test only the True case, and ignore the False case completely.

Branch coverage

An easy retort to the above example is that line coverage is not a sufficiently nuanced metric, and what you really need is branch coverage, which is the number of branches executed by the tests vs. the number of branches in the code.

Looking at the branch coverage of TestHasThreeDigits, we can see that it has a 50% branch coverage, which is inadequate. Well, here’s an easy way to improve that.

class TestHasThreeDigits(unittest.TestCase):
    def test_has_three_digits_true(self) -> None:
        true_output_value = has_three_digits(234)

    def test_has_three_digits_false(self) -> None:
        false_output_value = has_three_digits(23)

See, now the test suite has 100% branch coverage! However, not that it has no assertions at all. So, despite having 100% line and branch coverage, this test suite is completely useless! (This is a form of incidental coverage anti-pattern.)

Here is a more nuanced example:

class HasThreeDigits:
    def __init__(self) -> None:
        self.counter = 0
        
    def test(x: int) -> bool:
        self.counter += 1
        return (len(str(x)) == 3)
    
class TestHasThreeDigits(unittest.TestCase):
    def test_has_three_digits_234(self) -> None:
        output_value = has_three_digits(234)
        self.assertTrue(output_value)

    def test_has_three_digits_23(self) -> None:
        output_value = has_three_digits(23)
        self.assertFalse(output_value)

The code coverage is 100%, branch coverage is 100%. But self.counter is never verified!

Wait, there’s more!

Coverage metrics only consider the code the are under your project, and ignore all external libraries. However, your code is correct only if you are satisfying the preconditions of your external library calls, and test coverage metrics do not capture any of that. Here is an illustration with an contrived example.

from external.lib import convert_to_num

def has_three_digits(value: str) -> bool:
    v = convert_to_num(value)
    return v is not None and v > -1000 and v < 1000

The above code is expected return True if value is and integer with 3 digits. Here is test suite.

class TestHasThreeDigits(unittest.TestCase):
    def test_has_three_digits_234(self) -> None:
        output_value = has_three_digits('234')
        self.assertTrue(output_value)

    def test_has_three_digits_23(self) -> None:
        output_value = has_three_digits('23')
        self.assertFalse(output_value)

    def test_has_three_digits_minus_23(self) -> None:
        output_value = has_three_digits('-23')
        self.assertFalse(output_value)

    def test_has_three_digits_minus_234(self) -> None:
        output_value = has_three_digits('-234')
        self.assertTrue(output_value)

    def test_has_three_digits_ten_times_ten(self) -> None:
        output_value = has_three_digits('10*10')
        self.assertTrue(output_value)

The test suite looks reasonable. You line and branch coverage is a 100%, and so nothing in the metrics suggestg anything is amiss. Except that we have said nothing about how convert_to_num is implemented. It is easy to imagine some preconditions for the input to convert_to_num; for instance, it throws a ValueError exception if you pass in an input of the form 3/0. Now, you can see how the test suite is not adequate! (has_three_digits('10/0') will throw an exception). But your test coverage metrics will never be able to help here.

Beware of using patch.object to test your Python code

Software Testing

Liskov substitution principle states that a class and its subclass must be interchangeable without breaking the program. Unfortunately, Python’s patch.object breaks this principle in a big way. In fact, this can make your tests untrustworthy and become a maintenance headache with failures every time you extended your base class. Here is a contrived, but concrete example.

Say, you decide to build a special class called ImmutableList with a factory that looks as follows:

from typing import List, Sequence
class ImmutableList:
  @staticmethod
  def create_list(input: List[int]) -> "ImmutableList":
    return ImmutableList(input)

  def __init__(self, input: List[int]) -> None:
    self._inner_list = tuple(input)

  def get_inner_list(self) -> Sequence[int]:
    self._inner_list

Next, your system under test is a class SUT that uses an instance of ImmutableList as an injected dependency.

class SUT:
  def __init__(self, wrapper: ImmutableList) -> None:
    self.wrapper = wrapper

  def get_wrapper_length(self) -> int:
    return len(self.wrapper.get_inner_list())

Now, when testing SUT, say, we patch the get_inner_list() method with patch.object:

from unittest import mock

with mock.patch.object(ImmutableList, 'get_inner_list', return_value=[1, 2, 3]) as mock_method:
  sut = SUT(ImmutableList.create_list([]))
  assert sut.get_wrapper_length() == 3, "FAILURE"
  print("SUCCESS")

When you run this test, it does print SUCCESS, and therefore, works as intended.

Now, let’s say that we found a special case of ImmutableList that corresponds to a zero length list, and we implement it as follows:

class ZeroLengthImmutableList(ImmutableList):
  def __init__(self):
    super().__init__([])
  
  def get_inner_list(self) -> Sequence[int]:
    return tuple()

Next, we modify the factory method to return this ZeroLengthImmutableList, when the input is an empty list, as follows:

  @staticmethod
  def create_list(input: List[int]) -> "ImmutableList":
    if len(input) == 0:
      return ZeroLengthImmutableList()
    return ImmutableList(input)

Thus, the two classes look as follows:

class ImmutableList:
  @staticmethod
  def create_list(input: List[int]) -> "ImmutableList":
    if len(input) == 0:
      return ZeroLengthImmutableList()
    return ImmutableList(input)

  def __init__(self, input: List[int]) -> None:
    self._inner_list = input

  def get_inner_list(self) -> Sequence[int]:
    self._inner_list

class ZeroLengthImmutableList(ImmutableList):
  def __init__(self):
    super().__init__([])
  
  def get_inner_list(self) -> Sequence[int]:
    return tuple()

Now, let’s go back to our test, which is still

from unittest import mock

with mock.patch.object(ImmutableList, 'get_inner_list', return_value=[1, 2, 3]) as mock_method:
  sut = SUT(ImmutableList.create_list([]))
  assert sut.get_wrapper_length() == 3, "FAILURE"
  print("SUCCESS")

Since sut.wrapper is still an ImmutableList, by the Liskov Substitution Principle, mock.patch.object(ImmutableList, 'get_inner_list', return_value=[1, 2, 3]) should still return [1, 2, 3] when sut.get_wrapper_length(). However, this does not happen! The above test fails with

AssertionError                            Traceback (most recent call last)

<ipython-input-21-1c1b12b89ff3> in <module>()
     23 with mock.patch.object(ImmutableList, 'get_inner_list', return_value=[1, 2, 3]) as mock_method:
     24   sut = SUT(ImmutableList.create_list([]))
---> 25   assert sut.get_wrapper_length() == 3, "FAILURE"
     26   print("SUCCESS")

AssertionError: FAILURE

This forces you to change the tests every time you refactor ImmutableList.create_list to return a ‘better’ implementation of ImmutableList!

When I first realized my privilege

Oppressed

Ironically, a sure marker of privilege is not realizing your own privilege. I grew up being told that I am a Brahmin, which is the highest caste, and that it makes us superior and better than others. Unsurprisingly, I was taught that we were, in fact, an oppressed minority. The government reservations for the so-called Scheduled Castes and Scheduled Tribes were often cited as evidence of such oppression. So, naturally, I grew up knowing nothing about the privilege that I enjoyed.

For the first 23 years of my life, I was convinced that everything that came my way was hard earned, and it was despite the oppression against our community. At 23, I was working as a young software engineer in Bangalore. I needed the house deep cleaned, and a contractor said that he would send a couple of people over who would take care of it for us. I was about to get my first glimpse into the privilege that I had enjoyed all my life.

The two people that the contractor send over were completely clueless. They had no idea how to go about the job. They showed up empty handed and asked us what they should be using to clean the house. They required constant supervision and direction, and it consumed most of our time, and defeated the purpose of hiring them in the first place. By the end of the day, less than half the work was done, I was completely frustrated.

It is important to mention that these two folks’ work ethic was never in doubt. They worked as hard and diligently as you could be expected. They were from a village some hours away, quite illiterate, and desperate for work. They hadn’t seen houses in cities before, and so no idea what is involved in the upkeep of a proper house. They probably lived in shanty small houses, and this was all alien to them.

Coming back to the main story, it was close to evening, and very little of the work was done. At this point, the they said that they had to leave because if they didn’t leave right away, then they’d miss the last bus to their village and they’d have to walk home. So if we could pay them, then they will be on their way (and settle the account with the contractor later).

I was pretty upset at this point, and I told them that they hadn’t completed even half the work, and so I’ll pay them only half. They just looked at each other and simply nodded at me. They had this look of someone who has always been helpless and has resigned themselves to this fate for so long that they couldn’t imagine life any other way. I saw all of that, and but my frustration overrode that, and I gave them just half the agreed upon amount and sent them their way.

As soon as they closed the gate behind them, I felt incredibly sorry for them. It wasn’t their fault that they were not skilled. It wasn’t their fault that the contractor sent them our way. And us paying them just half the amount simply means that the contractor will take a larger cut of the money. Despite all of that, all they did was meekly nod. Next, I felt shame and guilt. Me paying them the entire amount would not make a slightest difference to my finances. I spend more going out with friends on a weekend evening.

All of this took about a minute to register, and I immediately called them back and gave them the full amount that was promised to them.

That look of helplessness and resignation stayed with me a long time. But I simply couldn’t understand it in it’s larger context. For a while, I saw this as a fault in their character that will get them swindled, and almost felt good about me not being one of the many who cheated them out of what they earned. Nonetheless, this event stayed with me, as I learned more, I kept recasting that experience with new sociological lenses. It took me years to recognize it as a natural consequence of multi-generational societal oppression, and with that recognize my own privilege.

Reuse Code, Not Objects

We all know of the importance of code reuse and DRY code. It is important to remember that this applies to code, and not to objects!
A common anti-pattern I see in stateful classes is something along the following lines:

class SomeStatefulClass {
  private ResponseType responseType;
  private final ErrMessages errMessages;
  
  public SomeStatefulClass() {
    this.ResponseType = null;
    this.errMessages = new ErrMessages();
  }
  
  public void init() {
    errMessages.clear();
  }
  
  public Response process(Request request) {
    try {
      // Process the request.
      // Update stats.
      // Update otherState.
      Response response = computeResponse();
      responseType = computeResponseType();
      return response;
    } catch (Exception e) {
      this.errMessages.append(e);
    }
    return null;
  }
  
  public LogMessages getErrMessages() {
    return this.errMessages;
  }
  
  public ResponseType getResponseType() {
    return responseType;
  }
}

This design pattern is a major code smell. Ironically, such classes are prone to be misused through reuse. A common example of this is reusing the object within a loop:

public void process(List<Request> requests) {
  final SomeStatefulClass statefulObject
      = new SomeStatefulClass();
  Response response;
  for (Request request: requests) {
    statefulObject.init();
    response = statefulObject.process(request);
    appendResponse(response, statefulObject.getResponseType());
    
  }
}

The issue here is subtle, but dangerous. Consider what happens if one of the requests in the list of requests passed to process() causes statefulObjectto throw an exception inside computeResponse(). Dutifully, this exception is caught by process() and it returns null. However, note that the value of responseType in statefulObject was never modified by processing of this request, and therefore, it still hold the ResponseType of the previous request! Therefore, the line appendResponse(response, statefulObject.getResponseType()); is now passing in a null response and the response type of the previous request!
These types of bugs are subtle and a pain to track down.
And this happened because we chose to reuse the statefulObject. If we were to use a new instance each time, this would not really be an issue.
Moral: If feasible, do not reuse objects; create new instances and throw them away when done!

Code reuse can prevent bugs

I am as surprised as you are at the title. The statement does seem like "Duh!", but much to my embarrassment, I have seen enough code in many code bases to need to reiterate this.

The latest example was the following bug that I spotted. First some background. The caller to this function can choose any subset of a given set of 'modes' or 'features'. The caller represents this via a bit-mask. So 1 represents feature 1, 2 represents feature 2, 4 represents feature 3, 8 represents feature 4, and so on. So, if the caller requests with bit-mask 3, then it represents features 1 and 2, and bit-mask 10 represents features 2 and 4.

The following piece of code is support to filter out requests depending on the requested set of 'modes/'features'.

if ((bitMask & ModeEnum.FEATURE_FOO.getValue()) == 0) {
  // FEATURE_FOO has already taken care off the by some other module. So do nothing here.
  return null;
}

/* Some more code here */

if ((bitMask & ModeEnum.FEATURE_BAR.getValue()) == 0) {
  // For now, we process the request only if caller explicitly specified FEATURE_BAR.
  return null;
}
/* more code to process the request with FEATURE_BAR. */

The code above has a bug because the comment inside the first if-check does not match the if-check's logic. The intent was to skip processing the request if FEATURE_FOO is enabled. Instead, it does the exact opposite.

The naive way to fix it would be to replace

if ((bitMask & ModeEnum.FEATURE_FOO.getValue()) == 0) {

with

if ((bitMask & ModeEnum.FEATURE_FOO.getValue()) != 0) {

However, this misses the more important point of why this bug occurred in the first place. The simple answer to that question is that this bug occurred because the author ignored the principle of code reuse. By putting that principle into practice, a cleaner way to write this code (and therefore fix this bug and prevent similar bugs in the future) is as follows.

We first encapsulate the logic to detecting various modes via this function

boolean hasRequestedFeature(int bitMask, ModeEnum feature) {
  return (bitMask & feature.getValue()) != 1;
}

With that function in place, the new code looks as follows

if (!hasRequestedFeature(bitMask, ModeEnum.FEATURE_FOO)) {
  // FEATURE_FOO has already taken care off the by some other module. So do nothing here.
  return null;
}

/* Some more code here */

if (hasRequestedFeature(bitMask, ModeEnum.FEATURE_BAR)) {
  // For now, we process the request only if caller explicitly specified FEATURE_BAR.
  return null;
}
/* more code to process the request with FEATURE_BAR. */

This make the code more readable, and as long as we reuse the hasRequestedFeature() function, such bitwise operation fragility will not reoccur in the code.

Is this obvious? I think so. Was it necessary to belabor the point? Empirical evidence seems to scream "YES!".