Tracking Intermittent Bugs

Recognizing Patterns of Behavior

In my last post, the term “patterns” caused strong responses from some readers. When I use the term “pattern,” I do not mean a design pattern, or a rule to apply when testing. For my purposes, patterns are the rhythms of behavior in a system.

When we start learning to track down bugs, we learn to repeat exactly what we were doing prior to the bug occurring. We repeat the steps, repeat the failure, and then weed out extraneous details to create a concise bug report. However, with an intermittent bug, the sequences of steps may vary greatly, even to the point where they seem unrelated to each other. In many cases, I’ve seen the same bug logged in separate defect reports, sometimes spanning two or three years. But there may be a pattern to the bug’s behavior that we miss because we are so close to the operation of the program. This is when we need to take a step back and see if there is a pattern that occurs in the system as a whole.

Intermittent or “unrepeatable” bugs come into my testing world when:

  1. Someone tells me about an intermittent problem they have observed and asks for help.
  2. I observe an intermittent problem when I’m testing an application.

How do I know when a bug is intermittent? In some cases, repeating the exact sequence of actions that caused the problem in the first place doesn’t cause the failure to reoccur. Later on, I run into the problem again, but again am frustrated in my attempts to repeat it. In other cases, the problem occurs in a seemingly random fashion: as I am doing other kinds of testing, failures might occur in different areas of the application. Sometimes I see error messages that are very similar; for example, a stack trace from a web server may be identical across several errors that come from different areas of the application. In still other cases, my confidence that several bugs reported in isolation are really the same bug is based on inference: a gut feeling grounded in experience (abductive inference).

When I looked back at how I have tracked down intermittent bugs, I noticed that I moved out to view the system from a different perspective. I took a “big picture” view instead of focusing on the details. To help frame my thinking, I sometimes visualize what something in the physical world looks like from high above. If I am in traffic, driving from traffic light to traffic light, I don’t see the behavior of the traffic itself. I go through some intersections, wait at others, and can only see the next few lights ahead. But if I were to observe the same traffic from a tall building, patterns would emerge in the traffic’s flow that I couldn’t possibly see from below.

One system-level behavior I see with intermittent bugs is that the problem seems resistant to the fixes the team completes. The original fault doesn’t occur after a fix, but the problem pops up in another scenario not described in the test case. When several bugs are not getting fixed, or keep occurring in different places, it is a sign to me that one intermittent bug may be behind them. Sometimes a fault is reported by customers but is not something I can repeat in the test lab. Sometimes different errors occur over time, but a common thread appears: similar error messages, similar back-end processing, and so on.

Once I observe a pattern that I am suspicious about, I work on test ideas. Sometimes it can be difficult to convince the team that it is a single, intermittent bug instead of several similar bugs. With one application, several testers were occasionally seeing a bug we couldn’t repeat in a certain part of the application we were testing. At first we were frustrated at not being able to reproduce it, but I started to take notes and save error information whenever I stumbled upon it. I also talked to others about their experiences when they saw the intermittent failure. Once I had saved enough information from error logs, the developer felt he had enough to attempt a fix. He applied the fix, I tested it, and I didn’t see the error again. We shipped the product thinking we had solved the intermittent problem. But we hadn’t. To my shock and dismay, after the release I stumbled across it again in a slightly different area than the one we had been testing.

It took a while to realize that the bug only occurred after working in one area of the application for a period of time. I kept notes of my actions prior to the failure, but I still couldn’t reliably repeat it. I talked to the lead developer on the project, and he noticed a pattern in my failure notes. He told me that the program was using a third-party tool through an API in that area of the application. The error logs I had saved pointed to problems with memory allocation, so he had a hunch that the API was running out of allocated space and not handling the error condition gracefully. We had several other bug reports that were related to actions in that area of the application, and a sales partner who kept calling to complain about the application crashing after being used for an hour. When I asked the sales partner what area of the application they were using when it crashed, it turned out to be the same problem area.
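
Here is a minimal sketch of the failure mode we suspected, in C. Everything in it is hypothetical: the 64 KB pool size, the lib_alloc and lib_handle_action names, and the 256-byte cost per action are stand-ins I made up for illustration, not the real product or the real third-party API. The point is only that when a fixed pool fills up and nothing checks the exhaustion path, the call that finally crashes depends on how much work came before it, not on the call itself.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define POOL_SIZE (64 * 1024)    /* pretend the library reserves 64 KB up front */

    static char pool[POOL_SIZE];
    static size_t used = 0;

    /* Imagined allocator inside the third-party library. It returns NULL
       when the pool is exhausted, but nothing downstream ever checks. */
    static void *lib_alloc(size_t n)
    {
        if (used + n > POOL_SIZE)
            return NULL;             /* the exhaustion path nobody handles */
        void *p = pool + used;
        used += n;
        return p;
    }

    /* Imagined API call made on every user action in that area of the product. */
    static void lib_handle_action(const char *label)
    {
        char *buf = lib_alloc(256);
        /* Deliberately missing: if (buf == NULL) ... */
        strcpy(buf, label);          /* crashes here once the pool is full */
        printf("handled: %s\n", buf);
    }

    int main(void)
    {
        /* Any mix of actions succeeds for the first 256 calls... */
        for (int i = 0; i < 1000; i++)
            lib_handle_action(i % 2 ? "edit record" : "run report");
        /* ...then the 257th call gets NULL back and the unchecked strcpy
           dereferences it. Which user action that happens to be is
           irrelevant; only the amount of prior work matters. */
        return 0;
    }

Run it and the crash lands on whichever action happens to be the 257th call, which is exactly why the exact steps in any one bug report looked irrelevant.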

The team was convinced we had several intermittent bugs in that area of the application, based on their experience and the bug reports. But the developer and I suspected it was one bug that could be triggered by any number of actions, showing up in slightly different ways. I did more testing and discovered that it didn’t matter exactly what you were doing with the application; what mattered was how the application was handling memory in that one particular area. Our theory was that the failures that appeared after using the application for a while were caused by the allocated memory filling up, leaving the application in an unstable state. To prove the theory, we had to step back and not focus on the details of each individual case. Instead, we filled up the memory quickly by performing memory-intensive actions. Then we could demonstrate to others on the team that various errors could occur using different types of test data and inputs within that one area of the application. Once I recorded the detailed steps required to reproduce the bug, other testers and developers could consistently repeat it as well. And once we fixed that one bug, the other, supposedly unrelated, intermittent errors went away too.
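
Here is a sketch of that reproduction strategy under the same made-up setup (a fixed 64 KB pool; the action names and byte costs are invented for illustration): exhaust the pool as fast as possible with one heavy action, then show that ordinary actions fail no matter what they are or what data they use.

    #include <stdio.h>

    #define POOL_SIZE (64 * 1024)

    static size_t used = 0;

    /* Stand-in for any action that draws on the suspect pool: consume some
       bytes and report whether there was still room for them. */
    static int do_action(const char *name, size_t cost)
    {
        if (used + cost > POOL_SIZE) {
            printf("FAIL  %-12s (pool exhausted at %zu bytes)\n", name, used);
            return 0;
        }
        used += cost;
        printf("ok    %-12s (%zu bytes used)\n", name, used);
        return 1;
    }

    int main(void)
    {
        /* Step 1: fill the pool quickly with one deliberately heavy action,
           instead of waiting for an hour of normal use. */
        while (do_action("bulk import", 8 * 1024))
            ;

        /* Step 2: every ordinary action now fails, no matter what it is or
           what data it uses, which is why it looked like several bugs. */
        do_action("edit record", 256);
        do_action("run report", 512);
        do_action("print form", 128);
        return 0;
    }

The output tells the story on its own: eight heavy actions fill the pool, and after that the ordinary edit, report, and print actions all fail for the same underlying reason.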

Testers sometimes tell me that my thinking is “backwards,” because I only fill in the exact steps to reproduce the bug after I have found a repeatable case. Until then, the details can distract me from the real bug.