Tim’s Comments on Software Testing and Scientific Research

Tim Van Tongeren commented on one of my recent posts, building on my thoughts on software testing and the philosophy of science. I like the parallel he drew between scripted and exploratory testing on the one hand and quantitative and qualitative scientific research on the other. When testing, which do we value more on a project? Tim says that depends on project priorities.

He recently expanded on this topic, discussing the similarities between qualitative research and exploratory testing.

Tim researches and writes about a discipline that can teach us a lot about software testing: the scientific process.

The Kick in the Discovery

“Why do you like software testing?” is a question that I get asked frequently. A phrase from Richard Feynman comes to mind. When Feynman was asked how he felt about winning the Nobel Prize, he said that one of the real rewards of his work was “the kick in the discovery.”1 This has stuck with me. As a software tester, I enjoy discovering bugs. I seem to be one of those people who enjoys seeing how a system behaves when stressed to its limits. I get a kick out of discovering something new in a system, or being one of the first people to use a new system. Scientists like Feynman fascinate me, and a lot of what they say resonates with my thoughts on testing. Software testing can learn a lot from scientific theory; the parallels are very interesting.

Exploratory Testing: Exploring Unintended Test Results

Many of the great scientific discoveries have come about by accident during a typical scientific process of conjecture and refutation. A controlled experiment is an attempt to refute a hypothesis, to expose its fallibility. When an experiment has unintended consequences, however, some scientists do a great job of handling the things that don’t go according to plan, and this often leads to great discoveries.

One parallel to software testing is found in Ernest Rutherford and his work in nuclear physics. (Someone who spent time “smashing atoms” sounds like someone who might be good at software testing.) Rutherford observed unintended consequences, built on the work of others and collaborated with peers. When experiments done by his peers produced unexpected results, he developed patterns of thought around what was being observed.

The work that eventually led to the discovery of the nucleus of the atom is an interesting topic for software testers to study. If those running the experiment of firing alpha particles at gold foil had executed it like routine-school testers, scripting a test case with the expected result that all of the particles would pass through the foil, would they have noticed the ones that bounced back? If so, what would they have done with particles that weren’t in the plan and weren’t in the test script’s expected results? What if they had been so focused on the particles that were supposed to go through the foil that they didn’t notice the ones that did not?

Not knowing exactly how much Rutherford and his colleagues had formalized the experiment, I can’t make any claims about exactly what they did. However, we can see the results of the way Rutherford thought about scientific experimentation. What he and his colleagues observed changed the initial hypothesis, and subsequent experiments led to the discovery of the nucleus of the atom. They had an idea, tested it out, got unintended results, and Rutherford explored around those unintended results. He found something new that contradicted conventional knowledge and transformed the face of modern physics. We might infer that, like a good exploratory tester, Rutherford was more concerned with thinking about what he was doing than with following a formula by rote to prove his hypothesis.

Software testers don’t make discoveries that transform scientific knowledge, but the discoveries they do make can transform project knowledge. At the very least, these discoveries potentially save companies a lot of money. The value of a bug discovery is hard to measure, but every high-impact bug that is found and fixed prior to shipping saves the vendor money. Discoveries of high-impact bugs may be minimized by the team at first, but many times those discoveries are the difference between project success and failure.

James Bach and Cem Kaner say that exploratory testing is a way of thinking about testing. Exploratory testing, like scientific experimentation, allows for improvisation and for the exploration of unintended results. Those unintended results are often where the real discoveries lie in science, and where the bugs lie in software testing. Detailed test plans and pre-scripted test cases based on limited knowledge may discourage discovery. Tim Van Tongeren and others have researched directed observation and the weaknesses associated with it.

One way of thinking about exploratory testing is to see it as a way of observing unintended consequences, exploring the possibilities, forming a hypothesis or theory about the results, and experimenting again to see if the new theory holds under certain circumstances. This cycle continues, and testing becomes as much about making new discoveries as it is about confirming intended behaviours. Pre-scripting steps and intended consequences can discourage observing those unintended consequences in the first place. A hypothesis, testing mission or test case is fine to detail prior to testing, but slavishly sticking to pre-scripted results can stifle discovery.
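
To make that cycle concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: Scenario, Surprise and run_scenario stand in for whatever a real project would use. The point is only the shape of the loop: results that don’t match the scripted expectation are captured as first-class output to explore, rather than discarded.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    description: str   # the test idea, e.g. "fire alpha particles at the foil"
    inputs: dict       # whatever the system under test needs
    expected: object   # the pre-scripted result, if there is one

@dataclass
class Surprise:
    scenario: Scenario
    observed: object
    notes: list = field(default_factory=list)  # hypotheses and follow-up test ideas

def explore(scenarios, run_scenario):
    """Run each scenario and return the unintended results worth exploring further."""
    surprises = []
    for scenario in scenarios:
        observed = run_scenario(scenario.inputs)   # execute against the system under test
        if observed != scenario.expected:          # the particle that bounced back
            surprises.append(Surprise(scenario, observed))
    return surprises
```

Each surprise then seeds the next round: form a hypothesis about the behaviour, design new scenarios around it, and run the loop again.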

I have had some testers call exploratory testing “unscientific”. To them, a good scientific experiment is like a carefully scripted test case that outlines every step and its expected results. However, much of the time science doesn’t really work that way. A good deal of care is put into the variables in an experiment, but a lot of exploration goes on as well. What is important is not necessarily the formula, but how you deal with unintended consequences. Scientific practice is often more about thinking, dealing with empirical data, and making inferences based on experiments than about following a formula.

Scientific theories go far beyond empirical data, and new experiments confirm and disconfirm theories all the time. Yesterday’s scientific truth becomes today’s scientific joke. “Can you believe that people once thought the world was flat?” As a software tester, I’ve known “zero defect” project managers who thought the software was bug-free when it shipped. It wasn’t funny when they were proved wrong, and yet the software testers who provided disconfirming information prior to release were treated like “round ballers”.

Good scientists deal with a lot of uncertainty. Good software testers need to be comfortable with uncertainty as well. Software systems are becoming so complicated that it is impossible to predict all the consequences of system interactions. Directed observation requires predictability, and runs the risk of missing the results that weren’t predicted.

Exploratory testing is a way of thinking about testing that can be modelled after the scientific method. It doesn’t need to be some ad hoc, fly-by-the-seat-of-your-pants kind of testing that lacks discipline. Borrow a little thinking from the scientific community, and you can have very disciplined, adaptable, discovery-based testing that can reliably cope with unintended consequences.

1. Richard Feynman, The Pleasure of Finding Things Out, p. 12.

Javan Gargus on Underdetermination

Javan Gargus writes:

I was a bit taken aback by your assertion that the testing team may not have done anything wrong by missing a large defect that was found by a customer. Then, I actually thought about it for a bit. I think I was falling into the trap of considering Testing and Quality Assurance to be the same thing (that is a tricky mindset to avoid!! New testers should have to recite “Testing is not QA” every morning.). Obviously, the testers are no more culpable than the developers (after all, they wrote the code, so blaming the testers is just passing the buck). But similarly, it isn’t fair to blame the developers either (or even the developer who wrote the module), simply because trying to find blame itself is wrongheaded. It was a failure of the whole team. It could be the result of an architecture problem that wasn’t found, or something that passed a code review, after all.

Clearly, there is still something to learn from this situation – there may be a whole category of defect that you aren’t testing for, as you mention. However, this review of process should be performed by the entire team, not just the testing team, since everyone missed it.

Javan raises some good points here, and I think his initial reaction is a common one. The key for me is that people should be blamed last – the first thing to evaluate is the process. I think Javan is right on the money when he says that reviews should be performed by the entire team. After all, as Deming said, quality is everyone’s responsibility. What the development team (testers, developers and other stakeholders) should strive to do is become what I’ve seen James Bach call a “self-critical community”. This is what has served the Open Source world so well over the years. The people are self-critical in a constructive sense, and the process they follow flows from how they interact and create working software.

Underdetermination

How many testing projects have you been on that seemed to be successful, only to have a high-impact bug discovered by a customer once the software was in production? Where did the testing team go wrong? I would argue that the testing team didn’t necessarily do anything wrong.

First of all, a good tester knows (as Cem Kaner points out) that it is impossible to completely test a program. Another reason we get surprised is underdetermination. Knowledge about the entire system is gathered by testers throughout the life of the project. It is not complete when the requirements are written, and it probably isn’t complete when the project ships. That knowledge can be difficult to obtain, and it depends on many factors, including access to subject-matter experts, the skills of the testers involved, and their ability to extract the right information on an ongoing basis. Realizing that you are dealing with a situation where you probably do not have all the information is key. This helps guide your activities and helps you keep an open mind about what you might be missing.

Underdetermination is usually used to describe how scientific theories go far beyond the empirical evidence (what we can physically observe and measure), yet are surprisingly accurate. One example of underdetermination is described by Noam Chomsky. He states that the examples of language a child encounters in their environment underdetermine the actual language they learn to speak. Languages have rule sets and many subtleties that are not fully represented in the everyday usage the child learns from.

Testers regularly face problems of underdetermination. The test plan document underdetermines the actual testing strategies and techniques that will be employed. The testers’ knowledge of the system underdetermines what the actual system looks like. Often, key facts about the system come in very late in the testing process, which can send testing efforts into a tailspin.

Just knowing that the testing activities on a project underdetermine what could possibly be tested is a good start. Test coverage metrics are at best a very blunt measurement. Slavishly sticking to these kinds of numbers, or signing off on testing only when a certain coverage percentage has been reached, can be misleading at best and dangerous at worst.
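
As a small, contrived illustration of how blunt those numbers are (the function and test below are invented for the example): a single test can execute every line of a function, giving 100% line coverage, while an input nobody scripted still crashes it.

```python
import math

def per_item_discount(total, items):
    """Split a volume discount evenly across the items in an order."""
    rate = 0.10 if total > 100 else 0.05
    return total * rate / len(items)   # fails when items is empty

def test_per_item_discount():
    # This one test executes every line of per_item_discount(), so a
    # line-coverage report shows 100% for the function...
    assert math.isclose(per_item_discount(200, ["book", "pen"]), 10.0)
    # ...yet per_item_discount(0, []) still raises ZeroDivisionError.
    # No coverage percentage points at the test nobody thought to run.
```

The coverage report here is accurate and useless in equal measure: it counts the lines that were run, and says nothing about the inputs that never were.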

If testing efforts do fail to catch high impact bugs, there are a couple of things to remember:

  1. It wasn’t necessarily a failure – it is impossible to test everything
  2. Testing knowledge at any given point in a project is underdetermined by what could be tested

If this happens to your testing team, instead of just adding this problem as a test case to ensure the bug doesn’t occur again, evaluate *why* it happened. What information were the testers missing, and why were they missing it? How could they have got this information when testing? The chances of this particular bug cropping up again are pretty slim, but the chances of one like it popping up in another area of the program are probably much greater than one might initially think.

Instead of evaluating a project solely on coverage percentages, be self-critical of your testing techniques. Realize that the coverage percentages do not really give you much information. They tell you nothing about the tests you haven’t thought to run – and those tests could be significant in number as well as in potential impact. Evaluate what and how you are testing throughout the project, and periodically call in experts from other parts of the system to help you evaluate what you are doing. Think about what you could be missing, and realize that you can do a very good job even without all the information.

The scientific community does quite well even though it frequently works with only a small part of the whole picture. Testers should be able to do the same. One interesting side note is that many significant discoveries come about by accident. Use these “testing accidents” to learn more about the system you are testing, the processes you are using, and, more importantly, what they tell you about *you* and your testing, and what you can learn from them.

Describing Software Testing Using Inference Theories

I am re-reading Peter Lipton’s Inference To The Best Explanation, which I first encountered in an Inductive Logic class I took in university. Lipton explores this model to help shed some light on how humans observe phenomena, explain what has been observed, and come to conclusions (make inferences) about what they have observed. Lipton says on p. 1:

We are forever inferring and explaining, forming new beliefs about the way things are and explaining why things are as we have found them to be. These two activities are central to our cognitive lives, and we usually perform them remarkably well. But it is one thing to be good at doing something, quite another to understand how it is done or why it is done so well. It’s easy to ride a bicycle, but very hard to describe how to do it. In the cases of inference and explanation, the contrast between what we can do and what we can describe is stark, for we are remarkably bad at principled description. We seem to have been designed to perform the activities, but not to analyze or defend them.

I had studied Deductive Logic and worked very hard to master various techniques in previous courses. I was taken aback in the first lecture on Inductive Logic when the professor told us that humans are terrible at Deductive Logic, and instead rely much more on Inductive Logic when making decisions. Deductive Logic is structured, has a nice set of rules, is measurable and can be readily explained. Inductive Logic is difficult to put parameters around, and inductive activities are usually explained in terms of themselves. The result of explaining inductive reasoning is often a circular argument. This is the problem David Hume raised against induction in the 18th century, and attempts through the years to counter Hume rarely get much further than he did.
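
To make the contrast concrete, here is a textbook illustration of the two forms (my example, not the professor’s or Hume’s). The deductive form guarantees its conclusion if the premises are true; the inductive form only supports its conclusion, and any justification of that support ends up leaning on induction itself.

```latex
% Deduction (modus ponens): if the premises hold, the conclusion must hold.
\[
\frac{P \rightarrow Q \qquad P}{\therefore\; Q}
\]

% Induction (enumerative generalization): the premises support the conclusion
% without guaranteeing it.
\[
\frac{F(a_1),\ F(a_2),\ \ldots,\ F(a_n)}{\therefore\; \text{probably, } \forall x\, F(x)}
\]
```

In testing terms, “the last ten builds installed cleanly, so this one probably will too” has the second shape, and no amount of formal machinery turns it into a guarantee.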

This all sounds familiar from a software testing perspective. Describing software testing projects in terms of a formalized theory is much easier than describing what people actually do on testing projects most of the time. It’s nice to put parameters around testing projects and use a set of formal processes to justify the conclusions, but are the formalized policies an accurate portrayal of what actually goes on? My belief is that software testing relies much more on inference than on deduction, and that attempts to formalize testing into a nice set of instructions or policies are not a reflection of what good testing actually is.

What constitutes good software testing is very difficult to describe. I’m going to go out on a limb and use some ideas from Inductive Logic and see how they match software testing activities from my own experiences. Feel free to challenge my conclusions regarding inference and testing as I post them here.