In the Part 2 story, we saw what a load testing tool can do when it is used by someone who doesn’t have the right knowledge and skill about the tool and underlying systems. However, you also need to understand the environment where you would need to use the tool. Creating and using test environments that are optimized for load and performance testing is a must. If you use these tools on a regular network, you will likely disrupt everyone else at the office, causing lost productivity and extra work for IT staff. The last thing you want to do is try them out at home, and end up blacklisted by your ISP (internet service provider).
Bye Bye Network!
After a while, I was an old hand at load and performance testing. To bolster my hands-on experience, I attended workshops on how to overcome technical restrictions, how to accurately analyze the data and find problems others would miss, how to write reports and describe risk and problems, and I was adept with a handful of tools. I started to get hired for performance and load testing gigs, and under the right circumstances, I had some rewarding and fun projects. I worked with a lot of talented people with vastly different skills, and learned from each of them.
Since I had a lot of retail and telco experience, a work friend asked me to come in to help him with a large retail system that was going through an upgrade. One of my tasks was to provide load testing help, since they were upgrading all the software and hardware for their back end system. I was given a lot of freedom to choose the tools, to interview everyone I could about any backend system issues, how to simulate credit card processing, etc. I was given a lot of freedom to research and design exactly what they needed. However, I was not given a test network to run the tests, so I never used any load. I verified my load tests would work with only one user.
To find potential areas of concern, we set up monitoring at several key areas on the system, and I had test results output in a format we could utilize with statistical analysis software. We also monitored server utilization, and recommended moving some processes around to better utilize the system. We learned a lot, but I wasn’t ready to unleash full load testing capabilities without a dedicated test network. There was no way I wanted to use this on the corporate network, even though we knew it would only run against our internal test system. I knew from experience that we could overload the internal network and cause problems for others. My friend, the dev manager, ignored my concerns. He was confident that the internal network would handle the extra traffic, since the IT admins had shown him that it was perpetually under-utilized.
Despite my objections, the dev manager insisted I run the load tests on the regular internal network. To start, he wanted to run the tests with 1000 simultaneous users, but I suggested we try something smaller. I wanted to try 10, he insisted we try 100. Still objecting, I hit the “Enter” key on my machine to start the tests. Immediately, a collective howl started to swell across the entire floor of the office. Then people started calling out that they had no network access. The dev manager and the IT manager ran to the server room, and when they unlocked it, all we could see in the dark rook was a sea of blinking red and yellow lights. Clearly, my load tests had overwhelmed the entire network, and every piece of hardware was in an error state. No one in the office was able to do work until all of the equipment was restarted. It took about a half hour to get the network up and running again, and the first thing my friend said was: “TRY IT AGAIN!!!!” He insisted the network outage was coincidental.
I refused to run the tests again, and made him tap the button on my machine. No sooner had his hand lifted from my keyboard, when the collective howl swelled again. The IT admin opened the server room door, and again, it was all blinky lights, and no network access for the company. It was remarkable how quickly the network was getting overwhelmed. Technically, the dev manager and IT team felt it was impossible, but they agreed not to run the tests again until we had investigated the source of the problem. Furthermore, permission and a budget for a test network specifically for load and performance testing was immediately approved by stakeholders.
It turned out that it was an extraordinary event that caused the outage, but it was something that would have happened in production without us catching it internally first. In simple terms, the network cards on the new servers had been set to a default to broadcast to each other when under load, to try to load balance. This was a new feature, that looked good on paper. However there was already had a load balancing system in place, so this was redundant, and harmful. In effect, the servers spammed each other because they were all under load, and the traffic increased exponentially. Machine one would find itself under too much load, so it would message machine two to get it to process excess. Unfortunately, Machine two was also under extreme load and was also messaging machine one, who was messaging machine two for help, as were Machines three and four, messaging each other over and over and over with more and more messages.
To visualize what they were trying to process and the traffic they created themselves, imagine a geometric or hockey stick curve on a graph, or an infinite series in mathematics. The load tests were already creating a huge amount of traffic, but the servers themselves were generating more network traffic at an exponential rate. This traffic generation behavior instantly overwhelmed every component in the corporate network. We quickly turned off that setting in the network cards of the test servers, and then waited for a test network we could safely run the tests on.
The next time we ran the tests, I had several managers breathing down my neck, but the server outages they caused did not cause any network outages. There was no collective howl, no server room full of blinky error lights. We all breathed a sigh of relief, and we went on a find and fix cycle for a few weeks to get the systems ready for a production launch. We were able to ship with a lot of confidence due to this work, and the load tests were part of pre-production tests for years after that launch.
This was a relatively small company, and the impact was fairly low. The entire development team and IT team sat together, and the infrastructure was in a server room on the same floor as the office. We were able to deal with the outages quickly, and the incident became a part of office lore, brought up when a laugh was needed. It wasn’t without political fallout though, since it was disruptive and problematic. Now imagine if this was a larger company, with IT departments in another location, servers at a hosting provider or on the cloud, etc. There could be considerable downtime, and increased costs with hosting providers, etc. While this situation was more lighthearted due to friendships and a tight knit office environment, it could have been extremely serious.
Stay tuned for part 4…