“Don’t Shout at your Disks” and Performance Testing

I get asked about performance and load testing a great deal. It’s fascinating work, and sometimes the strangest things turn out to be the cause of performance problems in a system. When people ask me what goes into performance and load testing, it’s difficult to explain simply. Here is a related video that will provide a tiny glimpse into what performance latency can look like: Unusual disk latency

The rest of the story with performance testing is how people actually find things like this in systems without relying on chance or fluke: an enormous amount of analysis (using basic probability, statistics, and counting rules) and the use of the scientific method to try to explain data anomalies.

When you see something strange in the data and recreate those conditions using simulation, you start to look at points in the system where failures can occur. When, during an experiment, you observe a phenomenon that triggers the strange behavior in the system, you work to replicate it accurately through emulation or simulation. That’s why, when you see the problem repeated for a bug report, you see someone knocking a device with their knee or yelling at the disk drive. It looks crazy, random, and like it was achieved by luck after the fact, but getting to that point is quite logical – you do whatever it takes to simulate those conditions to make the problem repeatable.

Often, looking at performance data in various ways leads you to the problem areas. In other words, creating and staring at graphs, histograms, and mean/median/mode/percentile data is how you first notice an area to target. From there, you use tools to simulate those conditions and observe parts of the system under simulation. When you see strange behavior, you figure out how to repeat it. Believe it or not, I’ve used my knee, cold soda pop cans, soldering irons (with hardware) and other software (load testing tools and others) to get systems into the desired problem state in a repeatable fashion.
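To make the “staring at percentile data” part concrete, here is a minimal sketch (not from the original post; the function name and sample data are my own) of the kind of summary you might compute over latency samples. The point it illustrates: a handful of slow outliers, like a disk that occasionally stalls, barely moves the mean or median, but shows up loudly in p99 and max.

```python
import statistics

def latency_summary(samples_ms):
    """Summarize a list of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile: the value at rank ceil(p/100 * n).
        k = max(1, -(-p * n // 100))  # ceiling division via negation
        return ordered[int(k) - 1]

    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": percentile(95),
        "p99": percentile(99),
        "max": ordered[-1],
    }

# 97 fast responses plus 3 slow outliers, like an intermittently
# stalling disk: median stays at 5 ms, but p99 and max jump.
samples = [5] * 97 + [250, 300, 400]
print(latency_summary(samples))
```

With these samples the median is 5 ms and the mean only about 14 ms, while p99 is 300 ms and the max 400 ms, which is exactly the shape of anomaly that tells you where to point a simulation.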

I’m not really a performance tester by trade, and I get my advice from people like Scott Barber, Ben Simo, Mike Kelly and Danny Faught. Each one of them has helped me in my performance testing work this year. As Scott says, 20% of a performance tester’s work is generating load; the other 80% is analysis, which can lead to strange and interesting conclusions, and the odd weird look from other team members. 🙂

Thanks to Aaron West for pointing out the video. It brought back several memories of various testing experiences I’ve had over the past couple of years.