Load Testing Your Web Infrastructure: Please Be Careful. Part 4

Earlier, we looked at different ways that load testing can go wrong if you aren’t informed or don’t know what you’re doing. In part 1, we talked about a well-meaning person who inadvertently created meaningless tests. In part 2, we saw the disastrous effects of someone with a little knowledge creating a mess. In part 3, we read about what can happen to a network if you unleash load tests while other people are working. In this section, we will talk a bit about some of the underlying math we need to use with load and performance testing. (On second thought, “underlying” is a bit misleading as a term; it is actually foundational. It’s also lots of fun, even for math phobics, as long as you get help from time to time.)

NOTE: I am simplifying the math descriptions here for brevity. If you are a stats expert, please don’t be offended by my glossing over the details. The point here is to provide a basic amount of information so people get the gist of it.

What? We Need Math?

It’s one thing to generate load and point out potential issues, but the real key to performance and load testing is an understanding of probability and statistics. A lot of problems are uncovered through basic statistical analysis, and reports on this testing are also used to help with forecasting, service commitments and purchase decisions. Communicating anything useful and actionable about performance requires knowledge of, and skill with, statistics and probability. It’s important to highlight that generating load and successfully taxing a test system is the easy part of load and performance testing. The hard part, and the time-consuming part, is figuring out what the results data is telling us, or not telling us. This requires a working knowledge of statistics, including:

  • Averages
  • Means, Medians, Modes
  • Standard deviation
  • Confidence intervals
  • Distribution types: normal vs uniform
  • Statistical significance, equivalence, and outliers
  • Percentiles
  • Probability

It’s also important to have a good knowledge of elementary math:

  • Addition and Multiplication
  • Exponentiation
  • Combinatorics

You don’t need deep expertise in these concepts, but a working knowledge is important, as well as the ability to work with these concepts in popular productivity or math tools.

It’s one thing to manage the math; it is quite another to communicate what the math means to stakeholders clearly, honestly, and with context. It’s also important to be able to explain the limitations of what your math work has revealed.

While I’m not an expert in probability and statistics, I have worked at conferences and workshops with performance testing luminaries Scott Barber and Ben Simo. I once spent hours in a conference hotel lounge with Ben Simo as he dumped game pieces on the table and asked me to observe and describe what I saw. Little did I know that this data visualization practice would help me track down a nasty performance bug months later. I also took online courses, attended other workshops and talks, and tried out various tools. Once I was comfortable with generating suitable levels of load, working with the numbers started to take precedence in my work.

Basic Math and Exponentiation

Performance and load testing requires dealing with large numbers, and calculating and observing the effects of addition and multiplication. While this sounds simple, it can be deceptively complex.

At its simplest, generating load against a test server requires generating multiple simulated users, which in turn requires counting and observing. For example, if you generate 10 simulated users with a testing tool, you need to observe your test environment and see what effect that has on it. Does the machine work harder? What do CPU usage, I/O and other measurable aspects look like? For most systems, ten is a small number and may not even register, so what happens if you simulate 100 users? Furthermore, can the network infrastructure you are using handle that much load, or will it limit traffic in unintended ways?
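
To make the counting-and-observing idea concrete, here is a minimal sketch in Python using only the standard library. The URL and user count are placeholders, and a real project would more likely use a dedicated tool (JMeter, Gatling, k6, Locust and the like), but the shape of the exercise is the same: generate a known number of simulated users, then watch what the test environment does.

```python
# Minimal load sketch using only the Python standard library.
# TEST_URL is a placeholder -- point this at a disposable test
# environment you own, never at production or someone else's server.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://test-server.example/"   # hypothetical test endpoint
SIMULATED_USERS = 10                       # start small and observe

def one_request(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TEST_URL, timeout=30) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=SIMULATED_USERS) as pool:
    results = list(pool.map(one_request, range(SIMULATED_USERS)))

errors = sum(1 for ok, _ in results if not ok)
times = [elapsed for _, elapsed in results]
print(f"{SIMULATED_USERS} simulated users: {errors} errors, "
      f"slowest response {max(times):.2f}s")
```

While something like this runs, the interesting numbers are on the other side: CPU, memory, I/O and network utilization on the test server, not just the client-side timings.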

Once you are absolutely sure that yes, your 100 simulated users are exercising the test server more or less like 100 real users would exercise your production server, now you can start to add more. What happens with the 101st user? Nothing much? Ok, let’s add more and observe. The trick here is to find the point where unintended behaviors start to occur when you add that nth user to the tests. The temptation is to think of this as a linear graph, where n amount of load adds a proportional n amount of server utilization, but that isn’t how this tends to work. What often happens is that the nth user causes a surge in server activity, which looks like a geometric curve, or a hockey-stick effect. Adding that nth user causes I/O to go out of control, or CPU utilization to stay at 100%, or memory to be exhausted, and so on. In other words, that nth test user causes the system to get overwhelmed, rather than incrementing resource usage the way all the previous ones did. This forces us to move from thinking about addition and multiplication, or simple product calculations, and to start looking at exponentiation.
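
Here is one way to spot that hockey-stick point in the numbers rather than eyeballing raw output. This is a sketch with invented measurements: take the median response time at each load step, project roughly linear growth from the first couple of steps, and flag the step where the measurement pulls far away from the projection.

```python
# Hypothetical step-load results: (simulated users, median response time in seconds).
# The numbers are invented for illustration; collect real ones from your tool.
steps = [(10, 0.21), (20, 0.23), (40, 0.26), (80, 0.31),
         (160, 0.42), (320, 0.95), (640, 4.80)]

# Use the first two steps to project roughly linear growth, then flag the
# step where the measured time pulls far away from that projection.
(u0, t0), (u1, t1) = steps[0], steps[1]
slope = (t1 - t0) / (u1 - u0)

for users, measured in steps[2:]:
    projected = t0 + slope * (users - u0)
    if measured > 3 * projected:   # arbitrary "hockey stick" threshold
        print(f"Knee around {users} users: "
              f"measured {measured:.2f}s vs ~{projected:.2f}s projected")
        break
```

The threshold is arbitrary; the point is that the knee shows up when you compare measurements against a linear expectation instead of trusting your gut about what “looks big.”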

Exponentiation in simplest terms deals with the rapid increase of numbers. This can occur in distributed systems for a lot of different reasons. There can be a massive influx of users for unpredictable reasons, there can be massive increases in utilization of hardware components, there can be data that grows unbearably large quickly … the possibilities are numerous. In other words, something unexpected happens, and suddenly there are huge numbers that are impacting things, and we get called in because these rapid increases upset the status quo, making things worse. This is a complicated topic with lots of discrete math concepts, but it is fun and rewarding to study, as long as you aren’t learning during a production outage.

Even simple product-based calculations can be tricky, especially when small numbers quickly lead to large ones. Without some thought and analysis, this can lead to poor results. Our brains struggle with large numbers (hence the need to create computers in the first place), and our shorthand for dealing with them can get us in trouble.

How many servers do we need??!??!??!!

One project I worked on required a backend overhaul due to the addition of a suite of mobile apps. The mobile apps used the existing server infrastructure differently than the legacy suite of web apps, and there were some nasty load-related surprises. Trouble was, these surprises were major bugs that required architectural changes in the code base, as well as the server hardware. There was little appetite to address those issues due to cost and politics, so they were deferred for a later release. In the short term, that meant that they had to severely curtail the estimates of simultaneous users per server with the addition of mobile app usage. (Note, when I say severe, I mean severe, as in a factor-of-ten reduction in users.) The thinking was to get a couple of friendly existing customers to take on the mobile app product as beta testers, and then slowly bring on more organizations as the existing code base and infrastructure was updated. Trouble was, some of the sales people weren’t on board with this, because they wanted the potentially lucrative sales and commissions now, not months in the future. One salesperson returned from a trip with a friendly, major customer, who had signed up for an early release of our mobile app suite. There was great rejoicing. However …

One of the most important things I do when I take on performance and load testing projects is to read all the published claims about the system. That includes the README files, the release notes, the website and other publications, blog posts, and most importantly, any contracts with user and performance commitments and SLAs (service level agreements). I asked for the contract that the sales people had signed with the customer, and I was horrified. They agreed to an enormous number of licensed users, starting modestly, but increasing at three-month intervals over two years. The numbers didn’t look too bad at a glance, but when you factored in that they committed to doubling, tripling, quadrupling, and so on over time, it was cause for concern. The lead architect and I spent a few minutes calculating what these commitments looked like in server requirements, and the numbers were insane. If we were to support that number of users without substantial work and massive performance increases, it would require thousands of web servers to support the commitment of one customer.
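
As a hedged illustration of why the numbers got so scary, here is the kind of back-of-the-envelope arithmetic involved, with entirely made-up figures rather than the real contract numbers. Doubling every interval is exponentiation in disguise: eight doublings is a factor of 256.

```python
# Hypothetical back-of-the-envelope check of a compounding commitment.
# None of these numbers are from the real contract; they only show how
# quickly "double every interval" outruns the hardware.
users = 5_000                 # users committed at launch (made up)
users_per_server = 250        # sustainable users per server today (made up)
growth_per_quarter = 2        # "doubling" each 3-month interval

for quarter in range(1, 9):   # two years of quarterly increases
    users *= growth_per_quarter
    servers = -(-users // users_per_server)   # ceiling division
    print(f"Quarter {quarter}: {users:,} users -> {servers:,} servers")
```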

Getting to the bottom of this required a bit of digging.

It turned out that the lead sales person who had signed the agreement said he had approached QA for information about how many simultaneous users we could support on the test server. He then went to IT and asked how much more powerful the production server was. Since they said it was at least 10x more powerful, he took the QA quote and multiplied it by 10. He then massaged the numbers upward to extreme levels to sweeten the sales offer, assuming a massive increase in performance every six months for two years. Of course, when he talked to QA and IT, he did not make it clear what he needed the numbers for. We had to explain that you can’t take raw numbers that a server can sustain for a short period of time before crashing, multiply them, and assume some sort of “half Moore’s law” for the product.

In the end, legal and senior managers had to approach the customer and try to salvage the sale. They were able to renegotiate the contract SLA into something achievable and sensible. It wasn’t pretty, and the company lost money, but they thankfully didn’t lose the customer. It could have been far worse though, with lawsuits and other potentially calamitous outcomes.

Calculating and Communicating Probability and Statistics

The real fun of performance and load testing for me is in the various ways we can use math to uncover important problems. It can also get a bit messy, since we aren’t dealing in absolutes, but in likelihoods. There is some experience involved in managing that uncertainty, and it comes with risk. Taking some calculated risks with the math you use can help your clients greatly reduce the risk in the operations of their systems. I used to really enjoy that uncertainty, using mathematical tools, observation and background knowledge to help inform recommendations, and seeing those ideas pay off in better customer service. The only downside is that once you have done in-depth work in this area, you will yell at your computer screen when you see polling data, media articles or marketing campaigns that get it wrong, either purposefully to manipulate, or due to a lack of research.

What metrics can we publish?

One system I was brought in to test was being updated to support a significantly higher number of mobile users. The company needed to publish some of their user metrics, especially within contracts that required licenses. They wanted to provide a safe number of simultaneous users for customers who were hosting the solution themselves, so those customers would know what to expect and could plan accordingly. This sounds straightforward, but from a statistics perspective, it adds a lot of complication and time to our work. It is one thing to find problems to fix and to anticipate what you need for your own systems; it is another to make commitments about that to others. For example, if you have too much traffic on your own system, you can quietly add more capacity and no one needs to know. If a customer who hosts your solution is budgeting for servers, they need specifics. Also, if they end up with more traffic than they can handle, you might be on the hook, depending on what claims you have made in your SLA.

Company leadership understood what I needed and were willing to provide everything, including a safe test network. What I had to do was determine safe, but enticing metrics that marketing could use to publish in advertising, and sales could use in service level agreements for contracts. The key was, how many simultaneous users could they safely advertise, and commit to supporting legally? The way forward with this task involved a lot of simulation, and a lot of math.

I started by analyzing their legacy product and their website traffic metrics. Unfortunately, the data seemed to be off somehow. When I asked for more information, it turned out that the data I wanted came from two different sources. To make up for that, IT had been asked to add the two datasets together and divide by two, providing a sort of average. Unfortunately, this isn’t the way to approach this kind of data. When you are dealing with two separate but related sets of data, it is sometimes called bivariate data. The reason this mattered is a bit complicated, but imagine that you could get a dataset for web browsers only, and then a dataset for operating systems only. You can use some deduction on this data to get a better sense of the reality of the metrics. For example, if you are seeing lots of Safari browsers, then you know you are dealing with Apple devices only. But if you are seeing Chrome browsers, they may be Android devices, but they can also be Apple devices or devices from other operating system providers. The “averaged” data provided earlier skewed the results in unintended ways because it didn’t account for those proportions.

To cope with the bivariate data, I reviewed chi-square analysis from university statistics, and read up on how to analyze bivariate data accurately. I use spreadsheets a lot, so I found some YouTube videos on built-in analysis I could use there. Fortunately, while I was struggling with my calculations, a programmer who had worked with complex statistical systems was sent my way. He happily took over the task and used a more suitable approach. The numbers he generated looked much more realistic. With a bit of research we were able to find the proportions of mobile operating systems and web browsers, and our analysis revealed similar proportions in our own metrics.
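
I won’t reproduce the approach the programmer used, but as a hedged illustration, a chi-square check on a browser-by-operating-system contingency table looks something like the sketch below, using SciPy and entirely invented counts. If the two variables are not independent, a naive “add the datasets and divide by two” average will misstate the mix.

```python
# A minimal chi-square check on invented browser-by-OS counts, using SciPy
# (pip install scipy). The counts are placeholders, not the project's data.
from scipy.stats import chi2_contingency

# Rows: operating system; columns: browser (Safari, Chrome, Other)
observed = [
    [4200,  900,  150],   # iOS
    [  10, 5100,  600],   # Android
    [ 800, 3900, 1200],   # desktop OSes
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, degrees of freedom = {dof}, p = {p_value:.4f}")
# A tiny p-value says browser share and OS are not independent, so the naive
# averaging of the two datasets would have skewed the proportions.
```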

Phew. Our first math problem was out of the way. However, this had implications for our testing. We had to repeat certain tests to increase our confidence in our analysis. I’m simplifying for the sake of brevity here, but essentially, we needed to figure out a realistic sample size and calculate our margin of error, or confidence interval. It got a bit complex, and meant we had to keep a production snapshot available for a few days while we did nothing but re-run subsets of our load tests on it and analyze the results against our prior calculations.
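
A very rough version of that kind of margin-of-error calculation, with an invented sample and a plain normal approximation (a t-distribution multiplier would be more appropriate for small samples), looks like this:

```python
# Rough 95% confidence interval for a mean response time, using a normal
# approximation. The sample here is invented; substitute real measurements.
import math
import statistics

response_times = [0.42, 0.38, 0.55, 0.47, 0.51, 0.44, 0.61, 0.39,
                  0.46, 0.52, 0.48, 0.43, 0.58, 0.45, 0.50, 0.41]

n = len(response_times)
mean = statistics.mean(response_times)
std_err = statistics.stdev(response_times) / math.sqrt(n)
margin = 1.96 * std_err          # ~95% under a normal approximation

print(f"mean {mean:.3f}s, margin of error +/- {margin:.3f}s over {n} samples")
# A wide margin is a hint that more test runs are needed before trusting the mean.
```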

Next, we analyzed the new system that would support much more mobile traffic. What might change now that we had better mobile support? Would the proportions of OS/web browser remain the same, only increase in amounts, or would traffic behaviour change completely? Since most people like to use their mobile devices first, we felt that it could have a much larger impact than just increasing the same traffic as the legacy system. The behaviour and type of traffic could change significantly. This was a prediction, or a hypothesis, and we needed to research published metrics of mobile usage when web sites became more mobile friendly to help bolster that prediction.

While we were researching and adapting our tests to better reflect production data, I was extremely fortunate to be on-site during a system outage. I was able to view errors, request snapshots of server logs, server utilization and other metrics, and anything related to data: what the queues were doing, whether there were problematic processes, tables filling up, and so on. We were also able to gather hardware and network infrastructure information. After the initial problem solving to get the system back up, failure point analysis and bug reports, we were able to pore over the data to get a picture of the weak points in the existing system. This also required some math, since server utilization and other metrics have different formulas. One type of hardware might use one set of metrics, while another might use something that sounds similar, but uses different calculations. In other words, a reading of “one” might be a great measure for one type, while another might use a percentage, like “97% utilization”. Furthermore, “97% utilization” might be a good metric for one service, but a red flag for server CPU usage. Monitoring a web server vs. monitoring an RDBMS vs. network activity can also be very different. And different applications can behave differently, utilizing different infrastructure and services depending on their unique needs and client load. Context and an understanding of what tools to use and what the metrics mean is vital.

We identified problem areas in the existing system, and then created conditions in the test environment to reproduce them at lighter levels of user load. Then, we used real mobile devices with different OS and web browser combinations and captured their traffic information so we could add those into our load tests. We then used simulated mobile clients to analyze the system and observed how and where the increased mobile clients would impact the servers. Next, we figured out how to artificially create some of these unique conditions in key areas of the system. For example, we created tools to eat up machine memory, or to cause database queries to slow down or even hang.

We tried to determine how an influx of mobile users might use the system differently, and created tests based on typical user scenarios mobile users would be interested in. We also determined peaks, such as peak usage by number of simultaneous users, as well as peak usage with regards to system utilization. This is important, since a lot of simultaneous users reading a marketing release is easier to support than fewer users who are taxing the system using applications. From there, we got a good sense of how the system behaved under heavier load vs. lighter load. Once we had a suite of tests that had a good mix of mobile and PC users, doing simple things and more complex things, we were able to simulate our projected system behavior once it was released into the wild. We could also force conditions that could be problematic, so we could determine outcomes with various combinations of things going wrong on the back end. For example, what happens if an influx of mobile users all do the most taxing thing that could be done to the system, from a user workflow perspective? In other words, we were modeling expected server behavior based on both web and mobile application usage.
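
One small piece of that test design, sketched with hypothetical scenario names and weights: a weighted mix of mobile and desktop user scenarios, so each simulated user picks up a workflow in roughly the proportions we expected to see in production.

```python
# Sketch of a weighted scenario mix for a test run. The scenario names and
# weights are hypothetical; derive real weights from production analytics.
import random

scenarios = {
    "mobile_browse_catalog": 0.35,
    "mobile_heavy_workflow": 0.15,
    "desktop_browse":        0.30,
    "desktop_reporting_app": 0.20,
}

names = list(scenarios)
weights = list(scenarios.values())

# Assign a scenario to each of 1,000 simulated users for this run.
assignments = random.choices(names, weights=weights, k=1000)
for name in names:
    print(f"{name}: {assignments.count(name)} users")
```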

Finally, we worked on what areas we were going to measure. Management had asked for the greatest number of simultaneous users that the system would support, but this is a bit too vague. It is one thing to measure how many users can connect to the home page, versus how many users can use the supported apps, versus a combination of browsing, lightweight processing and apps that require heavy processing. Furthermore, while a server might be able to handle many users without crashing, if the performance is poor, people will get frustrated. Similarly, a server may handle a certain level of traffic for a period of time, and then stop performing adequately, either by slowing down considerably, hanging or crashing. Or, a server may manage a great many users, but become unreliable, also negatively impacting the user experience. To determine what to measure, we needed to utilize the following related testing approaches:

  • Load testing
  • Stress testing
  • Duration testing
  • Performance testing

Load testing is about generating a number of simulated users, and analyzing the system. Stress testing involves simulating enough traffic to push the server to its limits, or to failure, in order to learn its limitations, what behavior to be aware of in production, and so on. Duration testing involves load testing over time. Finally, performance testing is all about the measurement. It’s one thing to survive load, stress and testing over a duration, but qualitatively, how is the performance? What measures can we use to signify “good”, “adequate”, or “poor” performance? We decided to measure average connection times to the website, and the time to complete the most common tasks in the mobile apps. That meant we did the typical web measure of simultaneous users and page load times, but we also timed how long it would take to do important things. That said, we needed to be wary of averaging these values too quickly, since it is important to find outliers and identify their underlying causes. Once we had a reasonable sample size, we performed calculations such as standard deviation, in addition to spotting outliers, repeating the conditions that caused them, and verifying when they were eliminated. For example, one issue we ran into was a nasty database table that required a lot of processing time to read, write and update, and that could impact load times at seemingly random points in user workflows. Once we found a fix, a whole subset of time delays on certain pages was eliminated.
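
A minimal sketch of that outlier-spotting step, with invented page-load times: compute the mean and standard deviation, and flag anything far outside the spread for investigation before it quietly inflates the averages.

```python
# Flag candidate outliers before averaging anything. Times are invented.
import statistics

load_times = [1.2, 1.3, 1.1, 1.4, 1.2, 9.8, 1.3, 1.2, 1.5, 1.1, 10.2, 1.3]

mean = statistics.mean(load_times)
stdev = statistics.stdev(load_times)

outliers = [t for t in load_times if abs(t - mean) > 2 * stdev]
print(f"mean {mean:.2f}s, stdev {stdev:.2f}s, outliers: {outliers}")
# Investigate the cause of each outlier before deciding whether to keep it.
```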

Next, we analyzed the mean, median and mode for each of our measurement points. Mode is one of my favorites for analysis, because it shows the frequency of a result, which can look different when graphed than a mean or median. A mode can show a cluster of points at an unexpected part of a graph, which is a sign that there is a performance problem that needs to be addressed. Once averages of our data are calculated, based on sufficient sample sizes, I then use one of my secret weapons: percentiles. Percentiles can be used in several ways with performance testing. A percentile takes a portion of the results, which you can then analyze as a subset of your full set of data. For example, with the 90th percentile, you eliminate the top ten percent of your result set and look at the remaining 90%. I have found a lot of performance issues in systems using percentiles to analyze and visualize data that weren’t apparent when using the full data set. This works because the top results can skew the overall results, pulling the graph into an area beyond the mode, for example. There are several ways you can use percentiles to find patterns and problems in test data, but this is the one I use a lot to troubleshoot. I often use the 80th, 85th and 90th percentiles in various ways to find unexpected results in the data. Those three work really well for me to find problems that get flattened out when using the 100th. Percentiles are used in other ways in performance testing, but this is a potent analysis tool when you are hunting for problems.
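
Here is a small, hypothetical example of the percentile trick: keep the fastest 80, 85 or 90 percent of the results and compare that view to the full data set. The times are invented, but the effect is the one described above: a slow tail drags the full-set average up and hides where most users actually sit.

```python
# Compare the full-set average against trimmed percentile views. Times are invented.
import statistics

def percentile_slice(data, pct):
    """Keep the fastest pct% of results, dropping the slowest tail."""
    ordered = sorted(data)
    keep = max(1, int(len(ordered) * pct / 100))
    return ordered[:keep]

times = [0.4, 0.5, 0.4, 0.6, 0.5, 0.4, 0.7, 0.5, 3.9, 0.6,
         0.5, 0.4, 4.2, 0.5, 0.6, 0.4, 0.5, 0.7, 0.6, 0.5]

print(f"full set mean: {statistics.mean(times):.2f}s")
for pct in (80, 85, 90):
    subset = percentile_slice(times, pct)
    print(f"{pct}th percentile view: mean {statistics.mean(subset):.2f}s, "
          f"slowest kept {max(subset):.2f}s")
```

With this made-up data the full-set mean is almost double the 90th percentile mean, which is exactly the kind of gap that points at a tail worth investigating.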

Once the system is tuned, anomalies are discovered and reduced, and the response times fit a normal distribution that coincides with the mean, median and mode, then we are ready to measure and communicate metrics. First, we need to create a sample set of test results that is reasonably statistically significant. We don’t necessarily need a great deal of rigour with these calculations (such as formal statistical significance), but we need to run the tests enough times to have confidence in them. For example, running the tests once is not enough for a sample set of data. On your project, running them 100 times with the same build, the same equipment and conditions, etc. might be large enough. Or, you may need to run them a thousand times. In general, the larger the sample size the better, but diminishing returns can kick in too. This requires some experience and judgment. Other projects may budget the time and expense to do an auditable, full set of statistical calculations. I will use percentiles here again, but rather than using them to look for problems, I am using them to assess the validity of the set of test results we are working with. If I find something surprising, then there is either a bug we didn’t encounter, a server misconfiguration, or a problem with the tool or test environment itself. Once we are happy with the sample set data, we can start capturing metrics and generating reports. (Reporting results could take up several blog posts to cover, so I will just touch on it.)

Determining the server performance metrics that we want to commit to isn’t an exact science. Our test environment is rarely identical to a production environment, and no matter what we do to distribute simulated test users, we aren’t completely emulating real world conditions. As a result of the statistical calculations, and analyzing the probabilities of events occurring, we tend to deal with percentages. “We guarantee 99% uptime” is a common one we see in marketing materials. They don’t say “100%” because there are so many factors beyond their control that might temporarily cause downtime. Server uptime is a pretty simple metric to measure and communicate, whereas performance is far less exact. For example, in testing, 90% of users may experience page load times of a certain average, or falling in a certain range, 90% of the time. Furthermore, the metrics we publish to brag about versus the numbers we are legally required to meet might look very different. For example, we may find that a certain type of server configuration is adequate for performance targets, using a certain number of users. An aggressive approach might be to publicize one particular set of data that is attractive: we reached that level once, so we will tell the world we can do it. When it comes to SLAs though, we will likely be much more conservative. In some cases, an average is determined, and then some breathing room is built into those metrics by lowering them, just in case of events in production that weren’t apparent in test.
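
It is worth doing the arithmetic on what those uptime percentages actually permit, because the difference between 99% and 99.9% is enormous once you convert it to hours:

```python
# What an uptime percentage actually allows per year (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60

for uptime in (99.0, 99.9, 99.99):
    allowed = MINUTES_PER_YEAR * (1 - uptime / 100)
    print(f"{uptime}% uptime -> about {allowed / 60:.1f} hours "
          f"({allowed:.0f} minutes) of downtime per year")
```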

Communicating and reporting results requires skill and experience. Figuring out what is useful to measure, and how to accurately analyze and interpret those measurements, is part of the picture, but communicating what that means, what the limitations are, and providing advice on how to proceed is much more difficult. It’s one thing to do the math; it’s altogether another to do something useful and helpful with it.

Lies, Damned Lies and Statistics

One of the great side effects of load and performance testing is how formerly intermittent bugs start to become repeatable. This is due to high-volume test automation, one of the most powerful and useful test automation approaches you can use. While it is often unintended, adding load starts to cause problems to bubble up. This is so common that I always recommend teams schedule time around their load and performance testing efforts to deal with the inevitable issues that crop up. This is a good thing, because it helps improve the overall system and the end user experience with your software. In the short term though, it can be frustrating and might threaten schedules. These problems tend to require time and effort to fix, so while testers get excited, project managers start to get nervous.

One performance testing project I worked on had a particularly nasty “unrepeatable bug.” Once in a while, a tester using one of the web apps would experience a crash. This crash would also cause the test web server to hang, requiring a manual restart. No one was able to repeat it, so it was put into the state where bugs go to get forgotten, otherwise known as: “We’ll monitor it.” One day, the QA team installed a major new build. The team was getting ready to release a new version of the software with some new features and important bug fixes included. We started to run our automated tests, and testers began to work through their daily tasks. Suddenly, there was the familiar crash, and the required server restart. We had four test servers at the time, with one dedicated to our load and performance testing and the other three available for other testing work. The testers moved on to a new server as the frozen one was restarted, and then the bug happened again: a tester saw a crash report, and the server froze up. Now there were two. Once again, a server froze up, and the testers were all on one test server. It crashed, and so did the load testing server. “That’s odd.” At one point, we had all four test servers requiring a restart at the same time, and this was causing serious productivity issues in QA, not to mention the implications for the new release. We raised the issue with the product and project managers, and started to analyze it.

The testers all kept track of what they were doing when they saw the bug, but we quickly set that aside. There was a factor in the system that wasn’t observable through the UI that was the likely culprit. We started to monitor the servers, turned up logging to get more information, and when a crash occurred, we tried to investigate every component of the web infrastructure on that server. We used low-level load testing traffic on each of the servers to cause the bug to occur even more frequently. It took a couple of days, but we realized there was a strange race condition, where two services were utilized at exactly the same time. In the previous version of the software, this happened infrequently, but now it was happening a lot. But at least we had a repeatable case, and with the aid of our automated load tests, we could repeat it on command within five minutes. That gave developers the opportunity to run their debugging tools and track the issue down so they could fix it.

Trouble was, the fix was not an easy one, and it was extremely political. To fix the problem required some major architectural rework, and it re-opened a major debate on the development team. There had been bitter disagreement on a particular direction, and the one that was chosen was not popular. Now that the unpopular architectural decision was shown to be problematic, the issue blew up. There were heated arguments, lots of negative back-channel chatter, and polarization over possible solutions. All of this caused a lot of hurt feelings and resentment on the team. Some minor server setting tweaks were proposed, and each of them helped reduce the frequency of the bug somewhat, but not enough. The team now had a choice: proceed with the release as-is, delay the release to try to find a temporary fix to reduce the occurrence further, or put the release on hold until the rework could be done to remove the problem for good. I was tasked with coming up with an impact assessment to help management determine a course of action. Here is what we observed, as I recorded it:

“Intermittently, a catastrophic bug causes a web server to crash, requiring a reboot. This means that once the bug occurs, the server is not available for users until it has been restarted. It doesn’t corrupt data, but it deletes the work the user had in progress, so they have to start over. The user will see a crash message, and once they refresh and connect to a new server, they have to log in again and start over. In the meantime, there are fewer servers available, which means that at times, some users are unable to connect until someone else logs off. We found that, on average, one in five users who connected to the server would come across this bug. This is a high-probability issue, and it affects more than just the person who triggers the crash: the server is now unavailable for anyone until there is IT intervention. It costs time and money, not to mention the extreme frustration of the users who experience it. With self-hosted equipment, there is time required by IT to go and reboot the server, often several times a day. With cloud-hosted infrastructure, moving to new servers could cause expenses to increase significantly.”

Unfortunately, the people with political power did not want to fix the problem; they wanted to release. They took my 1 in 5 occurrence metric and reframed it. While it wasn’t technically a lie, they greatly minimized the impact of the bug. This is what they told senior management:

“There is a severe bug that QA have found a repeatable case for, but it is going to hold up the release to fix it. The bug only happens 20 percent of the time!”

They also heavily implied that it was happening in the test environment more frequently because the QA team were abusing the system to find more bugs. Technically, we were using load testing tools to generate very light levels of load, but they didn’t say that. “You know how QA are, and they are also running load testing!!!” which made it sound like it would happen more frequently in test than in production. However, we were extremely worried about how often it could occur in production, with thousands of users, instead of the 15 testers and light load we were generating in the lab. Senior management decided to move ahead with the release as it was, and take a risk on the bug not occurring at all, or occurring infrequently. Why did they do this?

A 1 in 5 chance of something occurring is quite high. So is 20%, but twenty percent sounds smaller. If you use that figure without context, and your attitude is to make it seem small and insignificant, people will generally interpret it according to how you spin it. A 1 in 5 chance of the bug occurring in production could mean that 200 people out of the first 1000 could experience this bug. It wasn’t uncommon for client sites to have dozens or even hundreds of simultaneous users, and our servers would peak at 1000 simultaneous users at times. If you think of 200 people seeing this crash, then many people having to log in to a new server and start over until license or server capacity was filled, with the system being unavailable for everyone after them, it starts to look more serious. However, the political players decided to just say “It has a 20% chance of occurring.”
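
The arithmetic behind that concern fits in a few lines. With a 1-in-5 chance per connecting user, the expected number of crashes scales directly with users, and the probability of seeing at least one crash approaches certainty almost immediately:

```python
# What a 1-in-5 per-user crash probability means at scale. Pure arithmetic.
p_crash = 0.20

for users in (10, 50, 100, 1000):
    expected_crashes = users * p_crash
    p_at_least_one = 1 - (1 - p_crash) ** users
    print(f"{users:>5} users: expect ~{expected_crashes:.0f} crashes, "
          f"P(at least one crash) = {p_at_least_one:.6f}")
```

And every one of those crashes takes a server away from everyone else until IT restarts it, so the expected crash count actually understates the impact.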

The product management lead approached me and asked for a second opinion. I had to tread carefully because of the political implications of what they had been told, but I explained that even a 20% chance is sky high. For a bug like this, we could risk a 0.02% (zero point zero two percent) chance. Even a 2 percent chance would result in outages that would anger our customer base. If you were gambling in Vegas, you’d take 20% odds all day long; those are wonderful odds if you are gaming, but terrible odds for a catastrophic bug. To hedge their bets, I advised that they create and rehearse a rollback strategy in case the new release was as bad as we expected it to be. Thankfully, the team followed that advice, because the release was a disaster. Every client site had no access at all by mid-morning, which meant that our IT and customer support teams were busy 24 hours a day, dealing with extremely angry people. The release was rolled back, the difficult architectural change was implemented, and the bug disappeared. It was weeks of effort, but if they had decided to wait on their release, they would have been much better off than unleashing something so unstable on the public. They lost a lot of money, they lost face publicly, and they lost some customers. They also lost months of time on their product roadmaps, since everything ground to a halt to address the customer anger and problems, and then efforts were split between support and fixing the problem.

The most expensive situations involved the cloud-based hosted deployments of the system, in some cases causing a huge increase in hosting bills. When you couple a frequently occurring server outage and a wish to fix the problem quickly with an extremely easy way to add more servers, you can quickly end up over your hosting budget and incur significant costs. As you might imagine, there were some extremely angry customers whose IT teams fell into the “just add more” trap to try to minimize the problem.

What went wrong? Someone decided to use metrics to try to spin a narrative that was counter to reality. This happens all the time in the world! It is almost always done by people who want to minimize the problems highlighted by scientific rigour, or to maximize public support for unpopular policy. Or it is used by people trying to sell you something. The phrase “lies, damned lies and statistics” captures how metrics can be used to spin a narrative. It’s important to question narratives, especially if they lack context. What can go wrong? Who wins and who loses when a particular course of action is taken? Are the methodology’s weaknesses and strengths explained, or are they glossed over? Is the person presenting the data a relevant authority, or are they just a good talker? What happens if you scale up the numbers (if they are small), or scale down the numbers if they are large? Does the message change? These are all important questions to ask yourself when you are shown data that is supposed to convince you of something. The math lesson here is that how you communicate metrics matters. Spin can blunt a serious issue, and problem minimizers can win out if they are clever, albeit dishonest, communicators.
