
Exploratory Testing: More than Superficial Bug Hunting

Sometimes people define exploratory testing (ET) quite narrowly, such as only going on a short-term bug hunt in a finished application. I don’t define ET that narrowly; in fact, I do ET during development whether I have a user interface to use or not. I often ask for and get some sort of testing interface that I can design and execute tests around before a UI appears. I’ll also execute traditional black-box testing through a UI as a product is being developed, and at the end, when development feels it is “code complete”. I’m not alone. Cem Kaner mentioned this on the context-driven testing mailing list, which prompted this blog post.
Cem wrote:

To people like James and Jon Bach and me and Scott Barber and Mike Kelly and Jon Kohl, I think the idea is that if you want useful exploratory testing that goes beyond the superficial bugs and the ones that show up in the routine quicktests (James Whittaker’s attacks are examples of quicktests), then you want the tester to spend time finding out more about the product than its surface and thinking about how to fruitfully set up complex tests. The most effective exploratory testing that I have ever seen was done by David Farmer at WordStar. He spent up to a week thinking about, researching, and creating a single test, which then found a single showstopper bug. On this project, David developed exploratory scenario tests for a database application for several months, finding critical problems that no one else on the project had a clue how to find.

In many cases, when I am working on a software development project, a good deal of analysis and planning go into my exploratory testing efforts. The strategies I outline for exploratory testing reflect this. Not only can they be used as thinking strategies in the moment, at the keyboard, testing software, but they can guide my preparation work prior to exploratory testing sessions. Sometimes, I put in a considerable amount of thought and effort modeling the system, identifying potential risk areas and designing tests that yield useful results.

In one case, a strange production bug occurred in an enterprise data aggregation system every few weeks. It would last for several days, and then disappear. I spent several days researching the problem, and learned that the testing team had only load tested the application through the GUI, and the real levels of load occurred through the various aggregation points communicating in. I had a hunch that there were several factors at work here and it took time to analyze them. It took several more days working with a customer support representative who had worked on the system for years before I had enough information to work with the rest of the team on test design. We needed to simulate not only the load on the system, but the amount of data that might be processed and stored over a period of weeks. I spent time with the lead developer, and the lead system administrator to create a home-grown load generation simulation tool we could run indefinitely to simulate production events and the related network infrastructure.

While the lead developer was programming the custom tool and the system administrator was finding old equipment to set up a testing environment, I created test scenarios against the well-defined, public Web Services API, and used a web browser library that I could run in a loop to help generate more light load.
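A loop of this kind might look something like the sketch below. It is only an illustration in Python, not the tooling used on the project: `send_request`, the pacing value, and the `fake_service` stand-in are all hypothetical, and a real version would wrap an actual Web Services client call.

```python
import time
from typing import Callable, Dict, List


def generate_light_load(send_request: Callable[[int], bool],
                        iterations: int,
                        pause_seconds: float = 0.0) -> Dict[str, object]:
    """Call send_request repeatedly and tally the outcomes.

    send_request is a hypothetical callable wrapping one API call; it
    returns True on success and may raise on failure.
    """
    failures: List[str] = []
    successes = 0
    for i in range(iterations):
        try:
            if send_request(i):
                successes += 1
            else:
                failures.append(f"request {i}: unsuccessful response")
        except Exception as exc:  # keep looping, but record what went wrong
            failures.append(f"request {i}: {exc!r}")
        time.sleep(pause_seconds)  # pacing keeps the load light
    return {"successes": successes, "failures": failures}


def fake_service(i: int) -> bool:
    """Stand-in for the system under test: every 10th request fails."""
    if i % 10 == 9:
        raise RuntimeError("simulated server error")
    return True


result = generate_light_load(fake_service, iterations=30)
print(result["successes"], len(result["failures"]))
```

Keeping the failures as a list rather than stopping at the first one means the loop can run for a long time while still leaving notes to review afterwards.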

Once we had completed all of these tasks, started the new simulation system, and waited for the data and traffic statistics to reach the level that I wanted to generate, I began testing. After executing our first exploratory test, the system fell over, and it took several days for the programmers to create a fix. During this time, I did more analysis and we tweaked our simulation environment. I repeated this with the help of my team for several weeks, and we found close to a dozen show-stopping bugs. When we were finished, we had an enhanced, reusable simulation environment we could use for all sorts of exploratory testing. We also figured out how to generate the required load in hours rather than days with our home-grown tools.

I also did this kind of thing with an embedded device that was under development. I asked the lead programmer to add a testable interface into the new device he was creating firmware for, so he added a telnet library for me. I used a Java library to connect to the device using telnet, copied all the machine commands out of the API spec, and wrapped them in JUnit tests in a loop. I then created code to allow for testing interactively, against the API. The first time I ran a test with a string of commands in succession in the IDE, the device failed because it was writing to the input, and reading from the output. This caused the programmer to scratch his head, chuckle, and say: “so that’s how to repeat that behavior…”
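The idea of firing a string of commands at a device in succession can be sketched as follows. The original harness was Java and JUnit over telnet; this Python version is only an illustration, and the command names, the "OK" reply convention, and the `FakeDevice` class are assumptions made up for the example rather than anything from the real API spec.

```python
from typing import Callable, List, Tuple


def run_commands_in_succession(send: Callable[[str], str],
                               commands: List[str],
                               rounds: int) -> List[Tuple[int, str, str]]:
    """Send every command back to back for several rounds.

    Returns (round, command, response) for each response that does not
    start with "OK". The "OK" convention is an assumption for this sketch.
    """
    problems = []
    for round_number in range(rounds):
        for command in commands:
            response = send(command)
            if not response.startswith("OK"):
                problems.append((round_number, command, response))
    return problems


class FakeDevice:
    """Hypothetical device that faults after too many commands in a row."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0

    def send(self, command: str) -> str:
        self.count += 1
        if self.count > self.limit:
            return f"ERR buffer fault on {command}"
        return f"OK {command}"


device = FakeDevice(limit=5)
problems = run_commands_in_succession(
    device.send, ["STATUS", "RESET", "READ"], rounds=2)
print(problems)
```

Because `send` is just a callable, the same loop could be pointed at a real connection or at a simulator without changing the test logic.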

It took several design sessions with the programmer, and a couple days of my time to be able to set up an environment to do exploratory testing against a non-GUI interface using Eclipse, a custom Java class, and JUnit. Once that was completed, the other testers used it interactively within Eclipse as well. We also used a simulator that a test toolsmith had created for us to great effect, and were able to do tests we just couldn’t do manually.

We also spent about a week creating test data that we piped in from real-life scenarios (which took a lot of effort to create as well, but was well worth it). We learned a good deal from the test data creation about the domain the device would work in.

Recently, I had a similar experience – I was working with a programmer who was porting a system to a Java Enterprise Edition stack and adding a messaging service (JMS). I had been advocating testability (visibility and control – thanks James) in the design meetings I had with the programmer. As a result, he decided to use a topic reader on JMS instead of a queue so that we could see what was going on more easily, and added support for the testers to see the mappings and SQL that the Object-Relational Mapping tool (JPA) automatically generates at run-time. (By default, all you see is the connection information and annotations in Java code, which doesn’t help much when there is a problem.)

He also created a special testing interface for us, and provided me with a simple URL that passes arguments to begin exercising it. For my first test, I used JMeter to send messages to it asynchronously, and the system crashed. This API was so far below the UI that it would be difficult to do much more than scratch the surface of the system if you only tested through the GUI. With this testable interface, I could use several testing tools as simulators to help drive my own and other testers’ ET sessions. Without the preparation through design sessions, we’d be trying to test this through the UI, and wouldn’t have nearly the power or flexibility in our testing.
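Sending messages asynchronously to a test interface can be approximated in a few lines. The sketch below is not JMeter and does not talk to JMS; it uses Python's standard thread pool, and `fake_endpoint` is a hypothetical stand-in for whatever the testing URL would actually accept.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple


def send_concurrently(handler: Callable[[str], str],
                      messages: List[str],
                      workers: int = 8) -> Tuple[List[str], List[str]]:
    """Submit all messages at once and separate replies from errors."""
    replies, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(handler, m): m for m in messages}
        for future in as_completed(futures):
            message = futures[future]
            try:
                replies.append(future.result())
            except Exception as exc:
                errors.append(f"{message!r}: {exc}")
    # Completion order is not deterministic, so sort for stable notes.
    return sorted(replies), sorted(errors)


def fake_endpoint(message: str) -> str:
    """Hypothetical test interface: rejects empty messages."""
    if not message:
        raise ValueError("empty message")
    return f"ack:{message}"


replies, errors = send_concurrently(fake_endpoint, ["a", "", "b", "c"])
print(replies, errors)
```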

Some people complain that exploratory testing only seems to focus on the user interface. That isn’t the case. In some of my roles, early in my career, I was designated the “back end” tester because I had basic programming and design skills. The less technical testers who had more knowledge of the business tested through the UI. I had to get creative to ask for and use testable interfaces for ET. I found a place in the middle that facilitated integration tests while simulating a production environment, which was much faster than trying to do all the testing through the UI.

I often end up working with programmers to get some combination of tools to simulate the kinds of conditions I’m thinking about for exploratory testing sessions, with the added benefit of hitting some sort of component in isolation. These testing APIs allow me to do integration tests in a production-like environment, which complements the unit testing the programmers are doing, and the GUI-level testing the black box testers are doing. In most cases, the programmers also adopt the tools and use them to stress their components in isolation, or as I often like to use them for, to quickly generate test data through a non-user interface while still exercising the path the data will follow in production. This is a great way to smoke test minor database changes, or database driver or other related tool upgrades. Testing something like this through the UI alone can take forever, and many of the problems that are obvious at the API level are seemingly intermittent through the UI.

Exploratory testing is not limited to quick, superficial bug hunts. The learning, analyzing, executing, test idea generation and execution are parallel activities, but sometimes we need to focus harder in the learning and analyzing areas. I frequently spend time with programmers helping them design testable interfaces to help with exploratory testing at a layer behind the GUI. This takes preparation work including analysis and design, and testing of the interface itself, which all help feed into my learning about the system and into the test ideas I may generate. I don’t do all of my test idea generation on the fly, in front of the keyboard.

In other cases, I have tested software that was developed for very specialized use. In one case, the software was developed by scientists to be used by scientists. It took months to learn how to do the most basic things the software supported. I found some bugs in that period of learning, but I was able to find much more important bugs after I had a basic grasp of the fundamentals of the domain the software operated in. Jared Quinert has also had this kind of experience: “I’ve had systems where it took 6 months of learning before I could do ‘real’ testing.”

Tracking Intermittent Bugs

Recognizing Patterns of Behavior

In my last post, the term “patterns” caused strong responses from some readers. When I use the term “pattern,” I do not mean a design pattern, or a rule to apply when testing. For my purposes, patterns are the rhythms of behavior in a system.

When we start learning to track down bugs, we learn to repeat exactly what we were doing prior to the bug occurring. We repeat the steps, repeat the failure, and then weed out extraneous details to create a concise bug report. However, with an intermittent bug, the sequence of steps may vary greatly, even to the point where they seem unrelated. In many cases, I’ve seen the same bug logged in separate defect reports, sometimes spanning two or three years. But there may be a pattern to the bug’s behavior that we are missing when we are so close to the operation of the program. This is when we need to take a step back and see if there is a pattern that occurs in the system as a whole.

Intermittent or “unrepeatable” bugs come into my testing world when:

  1. Someone tells me about an intermittent problem they have observed and needs help.
  2. I observe an intermittent problem when I’m testing an application.

How do I know when a bug is intermittent? In some cases, repeating the exact sequence of actions that caused the problem in the first place doesn’t cause the failure to reoccur. Later on, I run into the problem again, but again am frustrated in attempts to repeat it. In other cases, the problem occurs in a seemingly random fashion. As I am doing other kinds of testing, failures might occur in different areas of the application. Sometimes I see error messaging that is very similar; for example, a stack trace from a web server may be identical across several errors that come from different areas of the application. In still other cases, my confidence that several bugs reported in isolation are the same bug is based on inference – I have a gut feeling based on experience (called abductive inference).
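Spotting that several reports share the same stack trace can be partly mechanized. The following is a rough sketch, assuming the traces are available as plain text; the report IDs and trace contents are invented for illustration, and the normalization rules (masking hex addresses and numbers, comparing only the top frames) are one arbitrary choice among many.

```python
import re
from collections import defaultdict
from typing import Dict, List


def signature(stack_trace: str, depth: int = 3) -> str:
    """Reduce a trace to its top frames with volatile details masked."""
    lines = [ln.strip() for ln in stack_trace.strip().splitlines() if ln.strip()]
    normalized = []
    for line in lines[:depth]:
        line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", line)  # memory addresses
        line = re.sub(r"\d+", "<n>", line)                # line numbers, ids
        normalized.append(line)
    return " | ".join(normalized)


def group_reports(reports: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each trace signature to the report IDs that share it."""
    groups = defaultdict(list)
    for report_id, trace in reports.items():
        groups[signature(trace)].append(report_id)
    return dict(groups)


# Made-up reports: two filed against different areas, one unrelated.
reports = {
    "BUG-101": "OutOfMemoryError at Parser.read(Parser.java:88)\n"
               "  at Importer.run(Importer.java:40)",
    "BUG-245": "OutOfMemoryError at Parser.read(Parser.java:91)\n"
               "  at Importer.run(Importer.java:40)",
    "BUG-300": "NullPointerException at Login.check(Login.java:12)",
}
groups = group_reports(reports)
for sig, ids in groups.items():
    print(ids, "<-", sig)
```

A grouping like this only suggests candidates. Deciding that two reports really are the same bug still takes the kind of inference described above.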

When I looked back at how I have tracked down intermittent bugs, I noticed that I moved out to view the system from a different perspective. I took a “big picture” view instead of focusing on the details. To help frame my thinking, I sometimes visualize what something in the physical world looks like from high above. If I am in traffic, driving from traffic light to traffic light, I don’t see the behavior of the traffic itself. I go through some intersections, wait at others, and can only see the next few lights ahead. But if I were to observe the same traffic from a tall building, patterns would begin to emerge in the traffic’s flow that I couldn’t possibly see from below.

One system behavior that I see with intermittent bugs is that the problem is seemingly resistant to fixes that are completed by the team. The original fault doesn’t occur after a fix, but it pops up in another scenario not described in the test case. When several bugs are not getting fixed or are occurring in different places it is a sign to me that there is possibly one intermittent bug behind them. Sometimes a fault is reported by customers, but is not something I can repeat in the test lab. Sometimes different errors occur over time, but a common thread appears: similar error messages, similar back end processing, etc.

Once I observe a pattern that I am suspicious about, I work on test ideas. Sometimes it can be difficult to convince the team that it is a single, intermittent bug instead of several similar bugs. With one application, several testers were seeing a bug we couldn’t repeat occur infrequently in a certain part of the application we were testing. At first we were frustrated with not being able to reproduce it, but I started to take notes and save error information whenever I stumbled upon it. I also talked to others about their experiences when they saw the intermittent failure. Once I had saved enough information from error logs, the developer felt he had a fix. He applied a fix, and I tested it and didn’t see the error again. We shipped the product thinking we had solved the intermittent problem. But we hadn’t. To my shock and dismay, I stumbled across it again after the release, in a slightly different area than we had been testing.

It took a while to realize that the bug only occurred after working in one area of the application for a period of time. I kept notes of my actions prior to the failure, but I couldn’t reliably repeat it. I talked to the lead developer on the project, and he noticed a pattern in my failure notes. He told me that the program was using a third-party tool through an API in that area of the application. The error logs I had saved pointed to problems with memory allocation, so he had a hunch that the API was running out of allocated space and not handling the error condition gracefully. We had several other bug reports that were related to actions in that area of the application, and a sales partner who kept calling to complain about the application crashing after using it for an hour. I asked the sales partner what area of the application they were using when it crashed, and it turned out to be the same problem area.

The team was convinced we had several intermittent bugs in that area of the application, based on their experience and bug reports. But the developer and I were suspicious it was one bug that could be triggered by any number of actions showing up in slightly different ways. I did more testing, and discovered that it didn’t matter what exactly you were doing with the application, it had to do with how the application was handling memory in one particular area. Our theory was that the failures occurring after a period of time passing while using the application had to do with memory allocation filling up, causing the application to get into an unstable state. To prove our theory, we had to step back and not focus on the details of each individual case. Instead, we quickly filled up the memory by doing actions that were memory intensive. Then, we could demonstrate to others on the team that various errors could occur using different types of test data and inputs within one area of the application. Once I recorded the detailed steps required to reproduce the bug, other testers and developers could consistently repeat it as well. Once we fixed that one bug, the other, supposedly-unrelated, intermittent errors went away as well.
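The reasoning here can be illustrated with a toy model. In the sketch below, a fixed-size buffer that is never freed stands in for the third-party component. The class, the action names, and the sizes are all hypothetical, and the real application was certainly more complicated. The point is only that one underlying limit can surface as a failure in whichever action happens to run last.

```python
from typing import Callable, List, Optional


class BoundedBuffer:
    """Hypothetical component with fixed space that is never released."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0

    def allocate(self, size: int) -> None:
        if self.used + size > self.capacity:
            raise MemoryError("allocation failed")
        self.used += size


# Made-up user actions, each consuming a different amount of space.
def open_report(buf: BoundedBuffer) -> None:
    buf.allocate(10)


def import_file(buf: BoundedBuffer) -> None:
    buf.allocate(25)


def render_chart(buf: BoundedBuffer) -> None:
    buf.allocate(40)


def first_failure(actions: List[Callable[[BoundedBuffer], None]],
                  capacity: int = 100) -> Optional[str]:
    """Run actions in order; return the name of the first one that fails."""
    buf = BoundedBuffer(capacity)
    for action in actions:
        try:
            action(buf)
        except MemoryError:
            return action.__name__
    return None


# Different sessions appear to fail in different features...
session_a = first_failure([render_chart, render_chart, open_report, import_file])
session_b = first_failure([import_file, import_file, import_file, render_chart])
# ...but memory-intensive actions alone reproduce the failure quickly.
quick_repro = first_failure([render_chart, render_chart, render_chart])
print(session_a, session_b, quick_repro)
```

Seen one session at a time, the model suggests an import bug and a charting bug. Seen as a whole, it is one exhausted resource.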

I am sometimes told by testers that my thinking is “backwards” because I fill in the details of exact steps to repeat the bug only after I have a repeatable case. Until then, the details can distract me from the real bug.

User Profiles and Exploratory Testing

Knowing the User and Their Unique Environment

As I was working on the Repeating the Unrepeatable Bug article for Better Software magazine, I found consistent patterns in cases where I have found a repeatable case to a so-called “unrepeatable bug”. One pattern that surprised me was how often I do user profiling. Often, one tester or end-user sees a so-called unrepeatable bug more frequently than others. A lot of my investigative work in these cases involves trying to get inside an end-user’s head (often a tester) to emulate their actions. I have learned to spend time with the person to get a better perspective on not only their actions and environment, but their ideas and motivations. The resulting user profiles fuel ideas for exploratory testing sessions to track down difficult bugs.

Recently I was assigned the task of tracking down a so-called unrepeatable bug. Several people with different skill levels had worked on it with no success. With a little time and work, I was able to get a repeatable case. Afterwards, when I did a personal retrospective on the assignment, I realized that I was creating a profile of the tester who had come across the “unrepeatable” cases that the rest of the dev team did not see. Until that point, I hadn’t realized to what extent I was modeling the tester/user when I was working on repeating “unrepeatable” bugs. My exploratory testing for this task went something like this.

I developed a model of the tester’s behaviour through observation and some pair testing sessions. Then, I started working on the problem and could see the failure very sporadically. One thing I noticed was that this tester did installations differently than others. I also noticed what builds they were using, and that there was more of a time delay between their actions than with other testers (they often left tasks mid-stream to go to meetings or work on other tasks). Knowing this, I used the same builds and the same installation steps as the tester; I figured out that part of the problem had to do with a Greenwich Mean Time (GMT) offset that was set incorrectly in the embedded device we were testing. Upon installation, the system time was set behind our Mountain Time offset, so the system time was back in time. This caused the system to reboot in order to reset the time (known behavior, working properly). But, as the resulting error message told me, there was also a kernel panic in the device. With this knowledge, I could repeat the bug about every two out of five times, but it still wasn’t consistent.

I spent time in that tester’s work environment to see if there was something else I was missing. I discovered that their test device had connections that weren’t fully seated, and that they had stacked the embedded device on both a router and a power supply. This caused the device to rock gently back and forth when you typed. So, I went back to my desk, unseated the cables so they barely made a connection, and, while installing a new firmware build, tapped my desk with my knee to simulate the rocking. Presto! Every time I did this with the same build that this tester had been using, the bug appeared.

Next, I collaborated with a developer. He went from, “that can’t happen,” to “uh oh, I didn’t test if the system time is back in time, *and* that the connection to the device is down during installation to trap the error.” The time offset and the flakey connection were causing two related “unrepeatable” bugs. This sounds like a simple correlation from the user’s perspective, but it wasn’t from a code perspective. These areas of code were completely unrelated and weren’t obvious when testing at the code level.

The developer thought I was insane when he saw me rocking my desk with my knee while typing to repeat the bug. But when I repeated the bugs every time, and explained my rationale, he chuckled and said it now made perfect sense. I walked him through my detective work, how I saw the device rocking out of the corner of my eye when I typed at the other tester’s desk. I went through the classic conjecture/refutation model of testing where I observed the behavior, set up an experiment to emulate the conditions, and tried to refute my proposition. When the evidence supported my proposition, I was able to get something tangible for the developer to repeat the bug himself. We moved forward, and were able to get a fix in place.

Sometimes we look to the code for sources of bugs and forget about the user. When one user out of many finds a problem, and that problem isn’t obvious in the source code, we dismiss it as user error. Sometimes my job as an exploratory tester is to track down the idiosyncrasies of a particular user who has uncovered something the rest of us can’t repeat. Often, there is a kind of chaos-theory effect that happens at the user interface, where only a particular user has the right unique recipe to cause a failure. Repeating the failure accurately not only requires having the right version of the source code and having the test system deployed in the right way, it also requires that the tester knows what that particular user was doing at that particular time. In this case, I had all three, but emulating an environment I assumed was the same as mine was still tricky. The small differences in test environments, when coupled with slightly different usage by the tester, made all the difference between repeating the bug and not being able to repeat it. The details were subtle on their own, but put together, the nuances amplified each other until the application had something it couldn’t handle. Simply testing the same way we had been in the tester’s environment didn’t help us. Putting all the pieces together yielded the result we needed.

Note: Thanks to this blog post by Pragmatic Dave Thomas, this has become known as the “Knee Testing” story.


How many testing projects have you been on that seemed to be successful only to have a high impact bug be discovered by a customer once the software is in production? Where did the testing team go wrong? I would argue that the testing team didn’t necessarily do anything wrong.

First of all, a good tester knows (as Cem Kaner points out) that it is impossible to completely test a program. Another reason we get surprised is underdetermination. The knowledge about the entire system is gathered by testers throughout the life of the project. It is not complete when the requirements are written, and it probably isn’t complete when the project ships. The knowledge can be difficult to obtain, and depends on many factors, including but not limited to: access to subject experts, the skills of the testers involved, and their ability to extract the right information on an ongoing basis. Realizing that you are dealing with a situation where you probably do not have all the information is key. This helps guide your activities and helps you keep an open mind about what you might be missing.

Underdetermination is usually used to describe how scientific theories go far beyond empirical evidence (what we can physically observe and measure), yet they are surprisingly accurate. One example of underdetermination is described by Noam Chomsky. He states that the examples of language that a child has in their environment underdetermines the actual language that they learn to speak. Languages have rule sets and many subtleties that are not accurately represented by the common usage which the child learns from.

Testers regularly face problems of underdetermination. The test plan document underdetermines the actual testing strategies and techniques that will be employed. The testers’ knowledge of the system underdetermines what the actual system looks like. Often, key facts about the system come in very late in the testing process, which can send testing efforts into a tailspin.

Just knowing that the testing activities on a project underdetermine what could possibly be tested is a good start. Test coverage metrics are at best a very blunt measurement. Slavishly sticking to these kinds of numbers, or signing off that testing is only complete when a certain percentage of coverage is complete can be misleading at best, and at worst dangerous.

If testing efforts do fail to catch high impact bugs, there are a couple of things to remember:

  1. It wasn’t necessarily a failure – it is impossible to test everything
  2. Testing knowledge at any given point in a project underdetermines what could be tested

If this happens to your testing team, instead of just incorporating this problem as a test case to ensure this bug doesn’t occur again, evaluate *why* it happened. What information were the testers missing, and why were they missing it? How could they have got this information when testing? The chances of this particular bug cropping up again are pretty slim, but the chances of one like it popping up in another area of the program are probably much greater than one might initially think.

Instead of evaluating solely on coverage percentages on a project, be self-critical of your testing techniques. Realize that the coverage percentages do not really give you much information. They tell you nothing about the tests you haven’t thought to run – and those tests could be significant in number as well as in potential impact. Evaluate what and how you are testing throughout the project, and periodically call in experts from other parts of the system to help you evaluate what you are doing. Think of what you could be missing and realize that you can do a very good job, even without all the information.
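A tiny example shows how a coverage number can look complete while an obvious test is missing. The function and test below are made up for illustration. The single test executes every line of `average`, so a line-coverage tool would report 100%, yet an input nobody thought to try still fails.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    total = sum(values)
    return total / len(values)


def existing_test():
    # This one test runs every line of average(): full line coverage.
    assert average([2, 4, 6]) == 4


existing_test()

# The test nobody thought to run: an empty list.
try:
    average([])
    outcome = "no error"
except ZeroDivisionError:
    outcome = "ZeroDivisionError"
print(outcome)
```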

The scientific community does quite well even though they frequently only work with a small part of the whole picture. Testers should be able to as well. One interesting side note is that many significant discoveries come about by accident. Use these “testing accidents” to learn more about the system you are testing, the processes you are using, and more importantly what they tell you about *you*, your testing and what you can learn from it.


My wife and I have friends and family in the health-care profession who tell us about “superbugs” – bacteria which are resistant to antibiotics. In spite of all the precautions, new technology and the enormous efforts of health care professionals, bugs still manage to mutate and respond to the environment they are in and still pose a threat to human health. In software development projects, I have encountered bugs that at least on the surface appear to exhibit this “superbug” behavior.

Development environments that utilize test-driven development, automated unit testing tools and other test-infected development techniques, in my experience, tend to generate very robust applications. When I see how much the developers are testing, and how good the tests are, I wonder if I’ll be able to find any bugs in their code at all. I do find bugs (sometimes to my surprise), but it can be much harder than in traditional development environments. Gone are the easy bugs an experienced tester can find in minutes in a newly developed application component. These are the bugs found by boundary condition tests, integration tests and others that may not be what first comes to mind for a developer testing their own code. However, in a test-infected development environment, most of these have already been thought of, and tested for by developers. As a tester, I have to get creative and inventive to find bugs in code that has already been thoroughly tested by the developers.

In some cases, I have collaborated with the developer to help in their unit test development efforts. <shameless plug> I talk about this more in next month’s edition of Better Software. </shameless plug> The resulting code is very hard for me to find bugs in. Sometimes to find any bugs at all, I have to collaborate with the developer to generate new testing ideas based on their knowledge of interactions in the code itself. The bugs that are found in these efforts are often tricky, time consuming and difficult to replicate. Nailing down the cause of these bugs often requires testers and developers pair testing. These bugs are not only hard to find, they are often difficult to fix, and seem to be resistant to the development efforts which are so successful in catching many bugs during the coding process. I’ve started calling these bugs “superbugs”.

It may be the case that certain bugs are resistant to the developer testing techniques, but I’m not sure if this is the case or not. I’ve thought until recently that these bugs also exist in traditionally developed code, but since testers spend so much time dealing with the bugs that test-infected development techniques tend to catch, they don’t have the time in the life of the project to find these types of bugs as frequently. Similarly, since they are difficult to replicate, they may not get reported as much by actual users, or several users may report the same problem in the form of several “unrepeatable” bugs.

Another reason *I* find them difficult to find might be due to my own testing habits and rules of thumb, particularly if the developer and I are working together quite closely. When we test together, I teach the developer some of my techniques, and they teach me theirs. When I finally test the code, both of our usual techniques have been tested quite well in development. Now I’m left with usability problems, some integration bugs that the unit testing doesn’t catch, and these so-called “superbugs”. Maybe the superbugs aren’t superbugs at all. Another tester might think of them as regular bugs, and may find them much more easily than I can because of their own toolkit of testing techniques and rules of thumb.

This behavior intrigues me nonetheless. Are we now able to find bugs that we didn’t have the time to find before, or are we now having to work harder as testers and push the bounds of our knowledge to find bugs in thoroughly developer-tested code? Or is it possible that our test-infected development efforts have resulted in a new strain of bugs?