7 Common Biases That Skew Big Data Results
Flawed data analysis leads to faulty conclusions and bad business outcomes. Beware of these seven types of bias that commonly challenge organizations' ability to make smart decisions.
Data-driven decision-making is considered a smart move, but it can be costly or dangerous when something that appears to be true is not actually true. Even with the best of intentions, some of the world's most famous companies are challenged by skewed results because the data is biased, or the humans collecting and analyzing data are biased, or both.
"There are a lot of perils of data analysis. Some people say, 'The number says X, therefore we should do X,' but there is more nuance, [such as] how the number was calculated and whether you are able to make the inference from the number you think you can make," said Spencer Greenberg, a mathematician and founder of decision-making-tool provider ClearerThinking.org, in an interview.
Various forms of bias are fairly obvious in marketing and political campaigns, but bias is not always easy to spot, and some types can be difficult to identify. Nor are biases mutually exclusive: several may arise in a single use case. Because bias is pervasive and often subtle, business leaders and members of any data team should be aware of it and take steps to avoid it, or at least minimize its effect.
Data teams are familiar with different kinds of bias and their effects on analytical results and conclusions. Business leaders are less likely to know the details of specific bias types, but they likely have experienced the effects of bias firsthand when a project or initiative did not yield the expected results.
"Business leaders can ask data scientists for something as simple as a confusion matrix -- how many false positives are we seeing, how many false negatives are we seeing, and how many members of the population fall into the four boxes [true positive, true negative, false positive, and false negative]," said Keith Schleicher, managing director of decision science at Internet service provider CenturyLink in an interview. "They can also ask what kind of sampling was done and whether the sample size is accurate."
While such terminology has not been used much at the business level, it is becoming more common because business leaders are responsible for the data-based decisions they make and should question things like the quality of data, where it comes from, how it was gathered, and what's behind the numbers.
Here we present seven types of cognitive and data bias that commonly challenge organizations' decision-making. Once you've reviewed these, tell us in the comments section below whether you've experienced any in your organization, and how that worked out for you.
Confirmation bias occurs when there's an intentional or unintentional desire to prove a hypothesis, assumption, or opinion. It is one of the most common types of cognitive bias -- and one that humans easily fall victim to -- because the data often "feels right."
"You'd think producing a statistic is an objective process, but actually there's a lot of flexibility that goes into a process in terms of what variables you calculate and how you calculate them," said ClearerThinking.org founder Spencer Greenberg. "This happens a lot [when] people have an idea of what should be done or [as] a bad result of analysis. They end up finding a way to get the statistic to agree with [their original idea] and rationalize it afterwards."
Patrick Rice, CEO of predictive analytics startup Lumidatum, and others interviewed for this article agree that there is a very real danger of business leaders transferring bias to data scientists and analysts. Quite often, business leaders treat the data team like the business intelligence team, meaning their approach to collaboration is, "Go build this report." In the case of confirmation bias, however, the directive is more along the lines of, "Make sure the data agrees with my point of view."
"[Confirmation bias transfer] can be especially dangerous given that the analysis in these scenarios is often used to make important business decisions," said Lumidatum's Rice. "[It] can lead to some pretty damaging results."
Selection bias commonly occurs when data is selected subjectively rather than objectively or when non-random data has been selected. Because the population selected does not represent the actual population, the results are skewed.
Surveys are a good example of selection bias, because specific questions are selected for the purpose of revealing particular insights. In addition, the surveys are sent to a select group of people, some of whom opt in. Although survey respondents are often regarded as representative of a total population, the layers of selection bias make that unlikely.
"Selection bias is one of the major flaws associated with the increased availability of big data," said Kevin Sheetz, CEO of market intelligence platform provider Powerlytics, in an interview. "Many businesses only capture a small piece of the pie when it comes to data available to their segment or industry, and this means their data and subsequent analysis are skewed. Much of the data companies use to make critical business decisions is incomplete, inaccurate, and of poor quality, and, as you can expect, this leads to inaccurate analysis and benchmarking."
Outliers are extreme data values that fall significantly above or below the range of normal values or the pattern of a normal distribution. They are common, and they are particularly dangerous for people whose day-to-day job does not involve statistics, because it is easy to trust a simple average, such as annual revenue per customer, in which values are added up and divided by the number of customers. Taken at face value, such a figure may not paint an accurate picture.
"Sometimes we encounter data where outliers are so extreme that they completely bias the results of an analysis. In such cases, if the outliers are not properly removed, it could lead to a totally false and misleading analysis," according to ClearerThing.org's Spencer Greenberg.
A simple adjustment made a tenfold difference in one case at ClearerThinking.org. Before the outliers were removed from that dataset, the mean of the data was 68.5 and the standard deviation was 319.7. After the outliers were removed, the mean fell to 6.7 and the standard deviation fell to 7.0.
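The effect is easy to reproduce with made-up numbers (the data below is not ClearerThinking.org's): a few extreme values are enough to drag the mean and standard deviation far from where the bulk of the data sits.

```python
# Sketch: how a handful of extreme outliers distort the mean and standard
# deviation. The values are invented, not ClearerThinking.org's data.
import numpy as np

rng = np.random.default_rng(1)
typical = rng.normal(loc=7, scale=7, size=995)               # bulk of the data
outliers = np.array([2_000, 5_000, 8_000, 12_000, 20_000])   # a few extremes
data = np.concatenate([typical, outliers])

print(f"with outliers:    mean={data.mean():.1f}, std={data.std():.1f}")
print(f"without outliers: mean={typical.mean():.1f}, std={typical.std():.1f}")
```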
Max Galka, a data scientist who runs the Metrocosm data science website, warned that outlier bias is particularly prevalent in big data because the larger the dataset, the harder it is to find outliers.
"When you're working with really large data sets, they're often too big to do manual checks," said Galka in an interview. "In data sets where there are millions of records, manual checks can't even scratch the surface of what you would actually need to do to check for outliers, so, just practically speaking, the way you'd look for outliers with small data sets you just can't do when a data set gets too large."
Removing outliers is prudent in some circumstances and irresponsible in others, as in the case of insurance fraud.
"[I]n the case of insurance claims, you can't just throw out the outliers. In situations where you have extreme claims, you have to do something about them," said Keith Schleicher, CenturyLink's managing director of data science. "You have to analyze them separately and recognize they're a real risk to the business; you can't just throw them out because you might find out a handful of values are driving all of the bottom-line impact on the firm."
A trend that appears in separate groups of data can reverse when the groups are combined, a phenomenon known as Simpson's Paradox. It is one reason medical studies and other research sometimes report one finding and later the opposite. It is also one reason why seemingly successful marketing campaigns prove not to be successful after all.
"The most common bias in data analysis is called the Simpson's Paradox," said Rado Kotorov, chief innovation officer at business intelligence and analytics provider Information Builders. "It's important to realize with big data, using descriptive statistics or just data visualization can lead to bias and wrong decisions. The data analysts need to know when to evaluate the trends statistically to determine that the trend is real and also that the factors that contribute to the trend are significant and not random."
A sales and marketing campaign may result in little or no ROI when the customer incentives are based on faulty conclusions. Kotorov tried warning a former employer that more analysis was necessary to validate a trend upon which a marketing campaign was based, but the warning was ignored.
"Instead of driving sales up and increasing the profit, [the campaign] drove margin down and increased the loss. I couldn't convince anyone to take the time to investigate the trend and validate whether it was correct or wrong. When the testing control group measurements came, that's when we had to look at it and see what happened, and we found out that the trend was the wrong trend."
Today's marketers use marketing analytics tools alongside multivariate testing to slice and dice data, which results in different levels of aggregation. However, the averages can be misleading.
"The typical fallacy is if you do things at a fine level of aggregation, but do not find contradictions immediately, then you'll follow your instinct that the trend is a valid trend," Kotorov said.
Overfitting involves an overly complex model that captures noise along with the signal. Underfitting is the opposite: the model is too simple to capture the underlying pattern. Either skews results.
"[Overfitting] is one of the most common (and worrisome) biases. It comes about from checking lots of different hypotheses in data. If each hypothesis you check has, say, a 1 in 20 chance of being a false positive, then if you check 20 different hypotheses, you're very likely to have a false positive occur at least once," Greenberg said.
He tested the effect that various behavioral interventions had on site participants. At first, it appeared that a particular behavior outperformed the control. However, when a correction was applied that adjusted for the number of hypotheses tested, the statistical significance vanished.
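The arithmetic behind Greenberg's warning is simple, and a Bonferroni adjustment is one common correction, shown below as an assumed example; the article does not say which correction ClearerThinking.org actually applied.

```python
# The arithmetic behind the multiple-comparisons warning, with a Bonferroni
# correction as one common (assumed) adjustment; the article does not say
# which correction Greenberg actually applied.
alpha = 0.05      # chance of a false positive on any single test
n_tests = 20      # number of hypotheses checked

p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {p_at_least_one:.0%}")    # ~64%

# Bonferroni: demand a much smaller p-value on each individual test.
print(f"Bonferroni-corrected threshold = {alpha / n_tests:.4f}")   # 0.0025
```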
"If you're using a high-dimensionality, non-linear predictive algorithm that has lots of degrees of freedom that allow you to sift data to the tee, you can essentially take any function and map it point-by-point so that the model does a tremendous job at looking at the data that you fitted on it. It's excellent on that, but if you go beyond that realm, it does awfully in terms of predicting data points that are outside the spectrum you looked up," said CenturyLink's Schleicher. "We inevitably split up our data sets into training data sets and testing data sets, and do cross validation across such multiple sets to make sure we don't overfit."
Sometimes a perceived relationship between two variables turns out to be partially or entirely false because a confounding variable has been omitted, often simply because it was overlooked.
"It could be that different populations are collected or reported differently or by different people, a causal variable that affects the behavior of each population, or an inherent quality that leads to autocorrelation," said Metrocosm's Max Galka.
Schleicher once worked on a survey that asked respondents which credit card brands they would consider. Over a three-year period, the data indicated that the consideration numbers for one credit card company nearly doubled, while those of several other companies remained flat. The obvious conclusion turned out to be the wrong conclusion.
"A confounding variable is cardholders have higher consideration for their current credit card companies than people who are not customers," said CenturyLink's Schleicher. "The company had gone through several M&As, and their portfolio had grown enormously over that three-year period. They hadn't improved their consideration, or customer experience, or how their customers valued them. They just acquired more customers through portfolio mergers.
Some statistical tests, such as a t-test, assume that a bell curve (normal distribution) exists, but if that is not the case the results may be biased and misleading.
When ClearerThinking.org's Greenberg examines people's moods following the completion of a training program, the assumption of a bell curve proves highly inaccurate. If he tried to force-fit the data into a bell curve, the distribution would not be symmetrical; it would be significantly skewed.
"A t-test is a statistical examination of two population means. A two-sample test examines whether two samples are different, and it is commonly used when the variances of two normal distributions are unknown, and when an experiment uses a small sample size," he said. "[F]or one of our interventions, we got p=0.03 using the t-test. On the other hand, we get a p=0.06 when we do a non-parametric analysis that doesn't assume that the data is normal."