In the Part 1 story, time, money and effort were wasted. This story is much more serious. Load and performance testing tools can be simple to get started with, but that simplicity belies a good deal of complexity. In other words, a little knowledge can be a dangerous thing. While a tool may look simple, as though there isn’t a lot going on, it has a lot of power and can unleash mayhem on a system. To simulate adequate load, these tools generate a lot of traffic, which can have unintended consequences unless you know what you’re doing. Record/playback can be handy in the hands of someone who has the skill to understand what they are doing, but in unskilled hands it can unleash absolute misery. Just because you can use a tool and generate load doesn’t mean that you should.
A Complete Clusterfuck
A year after the Part 1 story, I was brought in to work with some Agile teams that were helping an overwhelmed IT department. Load and performance testing were brought up, but since I had been down that road before, I explained the work and potential pitfalls to stakeholders. They agreed we should treat it as a separate project, and use a cross-functional team. However, a high-powered consultancy had brought in a team who were desperate to show their mettle. They were skilled, they had a great reputation for turning projects around, but they were extremely arrogant. I was pulled into a meeting with sneering programmers who mocked my experience and my concerns about load testing without analysis and careful planning. After my treatment in the meeting, my manager told me to decline further invitations, and let them “sink or swim.”
I didn’t hear much about what they were doing for a few weeks, but then one day a concerned executive assistant called the CTO. The CTO called the IT manager, who in turn called the people on my team. I was on a small cross-functional team that worked on development projects, but we would get pulled in to help fix any difficult production issues. The problem was that the CEO couldn’t access their work email. After rolling our eyes and asking if they had forgotten their password, we realized that webmail access for the entire company was down. The lead IT Admin and I sat next to each other, and he provided me with a play-by-play of what he was doing. He found that the webmail service was hanging, so he restarted it. Webmail briefly came up again, but then the service started to hang again. Then more reports came in of poor performance on the corporate network, and of some services becoming unavailable. He had to restart the mail servers, which in a large organization is not a simple task. It requires communicating to all staff, issuing timed warnings over a few minutes, doing the restart, then communicating again and monitoring. Similarly, certain areas of the network seemed to be under some sort of attack. Was it a security breach? Did someone have a virus or trojan horse?
Eventually, we tracked down the excess traffic to a particular machine, and it belonged to one of the staff consultants from the arrogant consultancy. The IT Admin blocked his IP from the network, and we went to management to figure out what to do next. We wandered over and initiated a chat with a now angry group of consultants who were furious that one of their team members had lost network access. After a brief explanation, and a query as to why they were nuking our network, they admitted they had tasked one of their junior consultants with researching load testing tools. He had downloaded an open source tool, recorded HTTP traffic, played it back, and then kept adding more simultaneous users. There were several problems here, and senior management were furious. The consultant was kicked off the project and escorted out the door, and the consultancy was warned that they were in breach of contract. They had ignored several directives that they had pledged to follow when they signed the contract. As time went on, problems beyond the CEO’s webmail access started to emerge.
Internally, there were formal complaints to the IT team about a lack of access and downtime. IT was in violation of their commitments for network and tool availability, and management had to spend time mollifying angry managers in other groups. You have to imagine what can happen in an internal network when someone starts generating hundreds of simultaneous requests over and over. Devices get saturated and stop functioning, others go into error mode, and everything slows to a crawl. IT technicians need to identify areas of the network that need intervention, and try to remotely restart services. In some cases, they had to physically go and restart network infrastructure manually. This resulted in thousands of dollars worth of lost time that day.
Remember when I said that if you record traffic for a load testing scenario, it will capture ALL the protocol level traffic on your machine? It turns out that this programmer didn’t know that or think of that. Later that day, the consultancy found out that they were locked out of their corporate messaging system. This is a core tool for a company that has most of its employees distributed at various customer sites. The load test against our system included all the instant message traffic that occurred while he was recording the scenario. They were without their system for days, while they negotiated with the vendor and tried to explain why one of their employees had essentially executed a denial of service attack. They were able to reinstate their corporate account, but that employee was banned from using it.
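As a rough sketch of the kind of check that would have caught this, a recorded scenario can be filtered against an allowlist of hosts before anything is replayed. This is not how any particular load testing tool works; the function name `filter_recorded_requests`, the `ALLOWED_HOSTS` set, and the example hostnames are all hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: only hosts you own and have permission to load test.
ALLOWED_HOSTS = {"app.staging.example.test"}


def filter_recorded_requests(recorded_urls, allowed_hosts=ALLOWED_HOSTS):
    """Split recorded URLs into (kept, dropped) based on an allowlist of hosts.

    Anything whose hostname is not explicitly allowed is dropped, including
    URLs that cannot be parsed into a hostname at all.
    """
    kept, dropped = [], []
    for url in recorded_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in allowed_hosts:
            kept.append(url)
        else:
            dropped.append(url)
    return kept, dropped


if __name__ == "__main__":
    # A made-up recording that picked up background traffic from the machine.
    recording = [
        "https://app.staging.example.test/login",
        "https://chat.vendor.example.test/poll",
        "https://app.staging.example.test/search?q=widgets",
    ]
    kept, dropped = filter_recorded_requests(recording)
    print("will replay:", kept)
    print("review before replaying:", dropped)
```

Reviewing the dropped list by hand before a run matters as much as the filter itself, since it shows exactly what else was happening on the machine during recording.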
A few weeks went by, and an IT Manager came storming into our development area with a credit card bill. There were several thousand dollars worth of mystery expenses on it. It turned out that on the day of the tests, he had given the consultancy his corporate credit card number “to run a few tests”, and assumed that they would let him know what they had done so that he could call to cancel the charges. The day of the load test disaster, the credit card company called to let him know they had frozen his account, but he assured them it was ok, people were running a few tests. By the time he had approached the staff consultants, the load testing had been stopped. Unfortunately, no one thought to connect the dots and tell him how his corporate card had been used. Thankfully, the credit card company found the problem and shut down his card, but the damage was done. He had to get a new corporate card, and it took time to dispute the payments and get them refunded. It took time and energy, and other managers had to use their cards on his behalf.
In the end, the consultancy lost their MSA with the company, and they lost credibility due to one person ruining it for everyone else. Unfortunately, a consultancy with people who weren’t as skilled was hired instead, but they were much nicer to deal with. Internally in IT we had hoped the prior consultancy would work out, because they had the skills and experience to deliver. Due to their arrogance, we all lost out. Furthermore, IT lost credibility with the business for allowing a consultant to wreak that much havoc. Because of the sudden, repeated excess traffic from that location, even our corporate ISP had flagged us, and that required finessing and promises that it would not happen again. If we suggested a vendor, stories about this ridiculous situation would be recalled, and we would get stuck with less ideal providers that other groups chose for us. All of this, plus thousands of dollars of costs and all the staff work to clean up the mess, happened because someone without the knowledge and skills used a tool they didn’t understand and ran it on our network. Depending on who retells this story, it can even sound amusing, but it was extremely serious. This person downloaded an unauthorized tool against the client’s corporate policy, recorded some HTTP traffic, then ran it over and over with various sizes of payloads. A few hours of playing around with something they didn’t understand had extremely serious effects.