RE: Model Myopia in Testing

Chris McMahon asks:

How do I find a useful model when unit tests are passing and show 100% code coverage?

Chris asked me this a bit tongue-in-cheek because he has heard me rant about these sorts of things before. I’m glad he asked the question, though, because it brings up a different kind of model myopia. Yesterday’s post dealt with modeling myopia with regard to the application we are testing; this question brings up model myopia regarding our potential testing contexts.

This is a common question, particularly on agile teams that rely on automated unit testing and are struggling with a tricky bug. Let’s examine this situation in more detail. What does the claim in the question tell us about testing?

This claim provides information about a certain testing context, the code context. We know that within this testing context, a coverage measure claims 100%, and the tests that achieve that coverage are reported as passing. Is this enough information? We have a quantitative claim, but it doesn’t tell us much from a qualitative perspective.
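To make that concrete, here is a small, hypothetical sketch in Python (the shipping_cost function and its “50.00 or more ships free” rule are invented for illustration). The tests pass and a coverage tool would report 100% statement and branch coverage, yet the boundary the rule actually cares about is never exercised, and it is wrong:

```python
def shipping_cost(order_total):
    # Intended rule (invented for this example): orders of 50.00 or more ship free.
    if order_total > 50.00:   # bug: should be >= 50.00
        return 0.00
    return 4.95

def test_shipping_cost_free():
    assert shipping_cost(60.00) == 0.00   # covers the "free shipping" branch

def test_shipping_cost_paid():
    assert shipping_cost(10.00) == 4.95   # covers the "paid shipping" branch

# Both tests pass, every line and branch is covered, and shipping_cost(50.00)
# still charges 4.95. The numbers are true; they just aren't the whole story.
```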

Why does this matter?

One project I saw had 300 automated unit tests that achieved a high level of code coverage on a small application. When we did testing in another context (the user context, that is, through the Graphical User Interface), the application couldn’t pass basic tests. On further investigation, we found that most of the unit tests were programmed to pass, no matter what the result of the test was. The developers were measured on the number of automated unit tests they wrote and on the code coverage reported by a coverage tool, and they could only check in code when the tests passed. As a measurement maxim says: show me how people are measured, and I’ll show you how they behave.

This is an extreme example, but it underlines how dangerous numbers can be without a context. Sure, we had a number of passing tests with a high degree of code coverage on a simple application, but the numbers were almost meaningless from a management perspective because the tests were of poor quality, and many of them didn’t really test anything. I’ve seen reams of automated test code that didn’t test anything too many times to take a quantitative claim at face value. (I’ve also seen a lot of manual tests that didn’t test much either.)
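What does a test that “doesn’t really test anything” look like? Here is a minimal, hypothetical sketch in Python, not code from that project (the calculate_invoice_total function is invented). Both tests exercise the code, so a coverage tool counts the lines, but neither one can ever fail:

```python
def calculate_invoice_total(line_items):
    # Imagine this is the production code "under test".
    return sum(price * qty for price, qty in line_items)

def test_invoice_total():
    try:
        result = calculate_invoice_total([(10.0, 2), (5.0, 1)])
        assert result == 999.0   # the expectation is wrong, but it doesn't matter...
    except Exception:
        pass                     # ...because every failure is silently swallowed

def test_invoice_total_runs():
    # Drives the code so the coverage tool is happy, but checks nothing at all.
    calculate_invoice_total([(10.0, 2), (5.0, 1)])
```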

Now what happens if we agree that the quality of the unit tests is sufficiently good? We don’t have ridiculous measures like SMART goals where we “pay for performance” on automated unit test counts, code coverage, bug counts, etc., so we don’t have this “meet the numbers” mentality that frustrates our goals. We know that our tests are well vetted, well thought out, and have helped steer our design. Our developers are intrinsically motivated to write solid tests, and they are talented and trustworthy. What now? All the tests passed. Where can the problem be?

It’s important to understand that this means “all the tests we could think of at a particular time passed”, not “all the possible tests we could ever think of, express in program code, and run have passed”. This is an important distinction. Brian Marick has good stories about faults of omission, and reminds us that a test is only as good as what it tests: it doesn’t test what we didn’t write in the first place. Narrow modeling is usually the culprit, and once we learn more about the problem we can write a test for it. It’s hard to write a test for a particular problem that has us stumped when we don’t yet have enough information, or to predict all possible problems up front when doing TDD.
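Here is a hypothetical sketch of a fault of omission in Python (the parse_discount function and its input format are invented). Every line is covered and the test passes, but the interesting failure lives in a check we never wrote, so there is no uncovered line for a coverage tool to point at:

```python
def parse_discount(record):
    # Omission: we never wrote a "what if 'discount' is missing?" branch,
    # so there is nothing here for a coverage report to flag as untested.
    return float(record["discount"]) / 100.0

def test_parse_discount():
    assert parse_discount({"discount": "15"}) == 0.15   # passes; 100% coverage

# Only once a real record arrives without a discount field do we learn enough
# to add the handling, and the test for it:
#
# def test_parse_discount_missing_field():
#     assert parse_discount({}) == 0.0   # today this raises KeyError instead
```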

In other cases, I have seen tests pass in an automated framework because the framework was being used incorrectly, which caused false positives. This is rare, but I have seen it happen.
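One hypothetical example of what that misuse can look like, using Python’s unittest (the apply_tax function is invented): assertTrue takes a value and an optional failure message, so passing the expected result as the second argument silently turns the check into “is the result truthy?”:

```python
import unittest

def apply_tax(amount, rate):
    return amount + amount * rate   # imagine a defect hiding in here

class TaxTests(unittest.TestCase):
    def test_apply_tax(self):
        # Misuse: 105 is treated as the failure *message*, not an expected value.
        # This passes for any non-zero result, right or wrong.
        self.assertTrue(apply_tax(100, 0.05), 105)

        # What was intended:
        # self.assertEqual(apply_tax(100, 0.05), 105)

if __name__ == "__main__":
    unittest.main()
```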

Michael Bolton has summed up automated unit testing assumptions eloquently:

When someone says “all the unit tests passed”, we should hear “the unit tests that we thought of, for theories of error of which we were aware, and that we considered important enough to write given the time we had, and that we presume to have run on the units of the product that we assume to have been built, and that we presume actually test something, and that we presume provide a trustworthy result 100% of the time, passed.”

Testing Contexts

Another important distinction is that we know this information comes from one testing context, the code context. We know there are others. I split an application with a user interface into three broad contexts: the code, the system the application is installed in, and the user interface. We know that at the user interface layer, the application is greater than the sum of its parts; it will often react differently when tested from this context than it will from the code context. There are many potential testable interfaces within each context that can provide different information when testing. From what interfaces, and in what contexts, are we gathering test-generated information?

When I see a claim like this: “we have 100% of our automated unit tests passing with 100% code coverage”, that only tells me about one of three possible model categories for testing. If we haven’t done testing in other contexts, we are only seeing part of the picture. We might be falling into a different type of modeling myopia – looking too narrowly at one testing execution model.

It’s common to view the application only through the context we are used to. I tend to use the term “interface within a context” rather than “black box” and “white box” after my experience with TDD. The interface we use most to interact with the software can cause us to narrow our focus when testing. We then model the testing we can do with a preference for that interface. Mike Kelly has a blog post that deals with bounded awareness and inattentional blindness, and how they can affect our decisions.

I see this frequently with developers who work predominantly in the code context and testers who test only in the user interface context. Frequently, each treats testing in the other context with some disdain. Testing diversity pays off, and when we work together, bringing expertise and ideas from whatever testable interfaces we have been using, we can combat model myopia.

Testing in one context does not mean testing in another context is duplicated effort or unnecessary. Often, testing with a different model reveals a bug that isn’t obvious, or easily reproducible, in the other context.

Not too long ago, I had two meetings. One was with two developers who asked why I bothered to test against the UI when we had so many other interfaces that were easier to test against behind the GUI. In another meeting, a tester asked why I bothered doing any testing behind the GUI when the end user only uses the GUI anyway. In both cases, my answer was the same: “When I stop finding bugs that you appreciate and find useful in one interface that I couldn’t find more easily in the other, I’ll stop testing against that interface.”

When testing in a context fails to provide us with useful information about the product we are testing, we can lower that model’s importance in our testing. Knowing about a testing model and choosing not to use it is different from thinking we have covered all our testing models and still being stumped as to why we have problems.