Who Do We Serve? – Intermittent Problems Revisited

I seeded a problem for readers to spot in my recent post “Who Do We Serve?”. George Dinwiddie was the only reader who not only spotted the problem but also emailed me about it. George says:

…there’s [a] red flag that occurs early in your story that I think deserves more attention. “Sometimes the tests fail sporadically for no apparent reason, but over time the failures appear less frequently.” As Bob Pease (an analog EE) used to say in Electronic Design and EDN magazines, “If you notice anything funny, record the amount of funny.” When something fails intermittently or for an unknown reason, it’s time to sit up and take notice. The system should not be doing something the designers don’t understand.

George spotted the problem that I introduced but didn’t expand on, and, as I hoped a reader would, emailed me to point out the omission. Good job, George, and thanks for emailing me when you spotted it. Your thoughts are absolutely in line with my experience.

One area of expertise I have developed is the ability to track down repeatable cases for intermittent bugs. These are difficult problems for any team to deal with. They don’t fit the regular model, often requiring off-the-wall experimentation, lots of thinking time, tons of creativity, and a good deal of patience. You need to look at the potentially enormous amount of data in front of you and figure out how to weed out what isn’t important. There is trial and error involved as well, and a good deal of exploration. Naturally, exploratory testing is a fit when tracking intermittent problems.

On Agile teams, I have found we can be especially prone to ignoring intermittent problems. The first time I discovered an intermittent problem on a team using XP and Scrum, we did exactly as Ron Jeffries has often recommended: we came to a complete stop, and the whole team jumped in to work on the problem. We worked on it for a day straight, but people got tired. We couldn’t track down a repeatable case, and we couldn’t figure out the problem from the code. After the second day of investigation, the team decided to move on. The automated tests were all passing, and running through the manual acceptance tests didn’t reveal the problem. In fact, the only person who regularly saw the problem was me, the pesky tester, who shouldn’t have been on an XP team to begin with.

I objected to letting it go, but the team had several reasons, and they had more experience on Agile projects than I did. Their chief concern was that tracking down this problem was dragging velocity down. They wanted to get working on the new story cards so we could meet our deadline, or finish early and impress the customer. Their other concern was that we weren’t following the daily XP practices while we were investigating; they said it would be better to get back into the process. Furthermore, the automated tests were passing, and the customer had no problem with the functionality. I stood down and let it go, but I watched for the problem with extra vigilance and took it upon myself to find a repeatable case.

Our ScrumMaster talked to me and said I had a valid concern, but the code base was changing so quickly that we had probably fixed the bug already. This was a big red flag to me, but I submitted to the wishes of the team, helped test newly developed stories, and tried to give the programmers feedback on their work as quickly as I could. That involved a lot of exploratory testing, and within days it revealed that the intermittent bug had not gone away. We were at the end of a sprint, and it was almost time for a demo for all the business stakeholders. The team asked me to test the demo, run the acceptance tests, and support them in the demo. We would look at the intermittent bug as soon as I had a repeatable case for it.

Our demo was going well. The business stakeholders were blown away by our progress. They thought it was some sort of trick – it was hard for them to understand that in so short a time we had developed working software that was running in a production-like environment. Then the demo hit a snag. Guess what happened? The intermittent bug appeared during the demo, on the huge projector screen, in front of managers, executives, decision makers and end users. It was still early in the project, so no one got upset, but one executive stood up and said: “I knew you guys didn’t have it all together yet.” He chuckled and walked out. The sponsor for the project told us that the problem had better be fixed as soon as possible.

I don’t think management was upset about the error, but our pride was hurt, and we knew we had a problem. They expected us to fix it, and we didn’t know how. Suddenly the developers went from saying “That works on my machine” to “Jonathan, how can I help you track down the cause?” In the end, I used a GUI test automation tool as a simulator. Based on behavior I had seen in the past and on patterns I had documented during the demo, I came up with a theory, and through exploratory testing I designed and executed an experiment. Running the simulator alongside manual test scenarios helped me find a cause quite quickly.
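
The particular tool I used isn’t the point; here is a rough modern sketch of the same simulator idea using Selenium WebDriver, with an invented URL, element IDs, and failure signature. The approach is to drive the suspect scenario over and over, vary the pacing, and record every failure so a pattern can emerge.

```python
# Sketch: use a GUI automation tool as a simulator to hunt an intermittent bug.
# Selenium, the URL, the element IDs, and the failure check are all stand-ins.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import WebDriverException

RUNS = 200
failures = []

driver = webdriver.Firefox()
try:
    for run in range(RUNS):
        try:
            driver.get("http://localhost:8080/app")  # hypothetical application URL
            driver.find_element(By.ID, "searchField").send_keys("order 42")
            driver.find_element(By.ID, "searchButton").click()
            body = driver.find_element(By.TAG_NAME, "body").text
            if "Internal Server Error" in body:  # stand-in failure signature
                failures.append((run, time.strftime("%H:%M:%S")))
        except WebDriverException as exc:
            failures.append((run, time.strftime("%H:%M:%S"), str(exc)))
        time.sleep(run % 5 * 0.2)  # vary pacing to probe timing-related patterns
finally:
    driver.quit()

print(f"{len(failures)} failures in {RUNS} runs")
for record in failures:
    print(record)
```

While the simulator hammered away, I ran manual scenarios alongside it, and that combination is what exposed the pattern.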

My experiment failed in that it showed my theory was wrong, but a happy side effect was that I found the pattern that was causing the bug. It turned out to be an intermittent bug in the web application framework we were using. The programmers wrote tests for that scenario, wrote code to bypass the bug in the framework, and logged a bug with the framework project. At the next demo, we ran manual tests and automated tests showing what the problem was and how we had fixed it. The client was pleased, and the team learned a valuable lesson: an XP team is not immune to general software problems. From then on, we practiced what George describes:

When something fails intermittently or for an unknown reason, it’s time to sit up and take notice.
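
To give a flavour of that bypass-and-pin approach, here is a generic sketch; the framework call, the defect, and every name in it are invented for illustration, not the actual bug we hit. The shape is a thin wrapper that works around the flaky call, plus a test that pins the workaround so it can’t silently regress.

```python
# Sketch: bypass a (hypothetical) framework bug and pin the workaround with a test.
def load_session(framework_client, session_id):
    """Fetch a session, retrying once to work around a hypothetical framework
    defect that occasionally returns an empty session object under load."""
    session = framework_client.get_session(session_id)
    if session is not None and not session.data:
        # Workaround: a second read clears the (hypothetical) framework bug.
        session = framework_client.get_session(session_id)
    return session


def test_load_session_retries_when_framework_returns_empty_session():
    """Regression test documenting the workaround, in the spirit of the tests
    the programmers wrote for the real scenario."""
    class FakeSession:
        def __init__(self, data):
            self.data = data

    class FakeClient:
        def __init__(self):
            self.calls = 0

        def get_session(self, session_id):
            self.calls += 1
            # First call reproduces the buggy empty session; the retry succeeds.
            return FakeSession({} if self.calls == 1 else {"user": "jonathan"})

    client = FakeClient()
    session = load_session(client, "abc123")
    assert session.data == {"user": "jonathan"}
    assert client.calls == 2
```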

Other Agile teams I’ve been on have struggled with intermittent errors, particularly teams with high velocity and productivity. Regular xUnit tools and acceptance tests aren’t necessarily that helpful unless they are used as part of a larger test experiment. Running them until they turn green and then checking in code will not make the problem go away if you never understand and fix it. An unfixed intermittent bug sits in the shadows, ticking away like a time bomb, waiting for a customer to find it.
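
Bob Pease’s advice to “record the amount of funny” suggests a simple discipline here: instead of rerunning a flaky suite until it goes green, run it many times and tally the outcomes. A minimal sketch, assuming a pytest suite (the test path below is a placeholder):

```python
# Sketch: quantify a flaky test instead of rerunning it until it passes.
# The pytest invocation and test path are placeholders for your own suite.
import subprocess
from collections import Counter

RUNS = 50
results = Counter()

for run in range(RUNS):
    proc = subprocess.run(
        ["python", "-m", "pytest", "tests/test_checkout.py", "-q"],
        capture_output=True,
        text=True,
    )
    outcome = "pass" if proc.returncode == 0 else "fail"
    results[outcome] += 1
    if outcome == "fail":
        print(f"run {run}: FAILED")
        print(proc.stdout[-500:])  # keep the tail of the output for later analysis

print(dict(results))  # e.g. {'pass': 47, 'fail': 3} -- a 6% failure rate is a signal
```

Even a handful of failures in fifty runs is exactly the kind of “funny” worth recording rather than explaining away.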

Intermittent problems are difficult to deal with, and there isn’t a formula to follow that guarantees success. I’ve written about them before: Repeating the Unrepeatable Bug, and James Bach has the most thorough set of ideas on dealing with intermittent bugs here: How to Investigate Intermittent Problems. I haven’t seen intermittent bugs just disappear on their own, so as George says, take the time to look into the issue. The system should not be doing something the designers don’t understand.