Data Outliers: 10 Ways To Prevent Big Data Damage
Most business decision-makers aren't trained to understand data outliers, but they can learn the basics. Executives, managers, and employees without math degrees can ask smarter questions about analyses they're basing crucial judgments on. Here are some things to know.
![](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/blte05d6d250cc73fe4/64cb40b315a7e1a63982fe14/light-bulb-549090_1280.jpg?width=700&auto=webp&quality=80&disable=upscale)
Data analytics has its own vocabulary that business decision-makers are under pressure to learn. Beware, though, because technical terms are often used loosely, sometimes to the detriment of individuals and their companies. An outlier is a good example. A lot of people are talking about outliers, but not a lot of people understand why they exist, what causes them, and what should be done with them, if anything.
"An outlier is a member of a defined dataset which has a dramatically different value than the other members of the set. It can be the result of measurement or recording errors, or the unintended and truthful outcome resulting from the set's definition," said Tom Bodenberg, chief economist and data consultant at market research firm Unity Marketing in an interview.
Outliers make their way into reported statistics every day. Sometimes their inclusion or exclusion is obvious, and sometimes it isn't. For example, in 1984 the University of Virginia reported that the average starting salary of Rhetoric and Communications graduates was $55,000. However, an outlier was skewing the analysis. The dataset included one hundred graduates with $25,000 salaries and one more graduate, NBA first draft pick Ralph Sampson, whose starting salary exceeded $1 million.
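To see how a single value can pull an average away from what is typical, here is a minimal sketch that compares the mean and the median. It uses the figures mentioned above as illustrative inputs: one hundred salaries of $25,000 and one salary of exactly $1,000,000, which is an assumption because the text says only that the salary "exceeded $1 million." With these inputs the mean will not match the reported $55,000, since the exact figures behind that average are not given here.

```python
from statistics import mean, median

# Illustrative inputs only: one hundred graduates at $25,000 plus one
# salary assumed to be exactly $1,000,000 (the article says "exceeded").
salaries = [25_000] * 100 + [1_000_000]

print(f"mean:                     {mean(salaries):,.0f}")
print(f"median:                   {median(salaries):,.0f}")
print(f"mean without the outlier: {mean(salaries[:-1]):,.0f}")
```

The median stays at the typical salary, while the mean moves by thousands of dollars because of one person. Asking which of the two was reported is a reasonable first question about any "average."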
Outliers can pop up for different reasons. Some are caused by mistakes made by humans or machines. Others represent actual data. Most business professionals haven't considered the difference, and many don't know what to do with either kind.
One tactic is to include outliers in a dataset or exclude outliers from a dataset as a matter of course, without considering the potential consequences. While it's true that the inclusion or removal of outliers may have little or no effect on an analysis, the opposite may also be true.
"If you're working with data, or other people are giving you results based on data, it's useful to consider how outliers are detected and handled, and what you can learn from them," said Spencer Greenberg, mathematician and founder of decision-making tool provider ClearerThinking.org, in an interview. "Important questions to ask are, 'Were there outliers in the data? Why did they occur? What can we learn from them?' And 'How were they dealt with?'"
Some organizations analyze outliers to detect such things as fraudulent transactions, criminal activity, security breaches, and disease outbreaks. In fact, outliers can sometimes tell interesting stories that might not otherwise have been considered.
"Anyone who is trying to interpret data needs to care about outliers. It doesn't matter if the data is financial data, sociological data, medical data, or even qualitative data like a relationship. Any analysis of data or information must consider the presence and effect of outliers," said Sham Mustafa, founder and CEO of data scientist marketplace Correlation One, in an interview.
Some outliers are easy to spot. Others are more difficult. Here are a few things to consider.
People and machines may be responsible for poor quality data that makes its way into an analysis. Someone may have typed a wrong number or transposed digits in a number. Alternatively, a piece of equipment may report an erroneous value that skews the analysis. It is also possible for data to be corrupted while it is transported across a network.
"The importance of removing or fixing an outlier depends on how extreme it is. If it was caused by a minor mistake, it may not matter very much. But if the outlier is extreme, it may negatively impact your analysis and lead to the wrong conclusion," said Spencer Greenberg. "If your outlier is caused by a mistake, you want to remove the value or fix it. And the more extreme it is, the more important it is to do that."
Outliers can be hard to detect just by looking at the numbers. Data visualizations can make them immediately obvious.
"Visualizing data can make an outlier jump out at you," said Greenberg. "If you can see it, you can attempt to understand it. And if you can understand why an outlier is occurring, then you have the opportunity to learn from it and decide what you should do about it."
Most business professionals are aware of "the bell curve," or normal distribution, because they were taught something about it in high school or college. The concept is popular because it applies to many things in everyday life and in business, such as the ambient temperature range of certain equipment.
Bell curves tend to appear whenever a variable is the sum of many small influences, for instance the combined effects of many genes that each adjust one's height a little bit. In a normal, bell-shaped distribution, the majority of a population clusters toward the middle. For example, the average height of an adult human male is 5'10". (Roughly 68% of males, those within one standard deviation of that average, are between 5'7" and 6'1" tall.)
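The "68 percent" figure is a general property of the normal distribution: about 68% of values fall within one standard deviation of the mean and about 95% within two. The sketch below checks this with Python's standard library. The mean of 70 inches comes from the 5'10" figure above; the standard deviation of 3 inches is an assumption chosen for illustration.

```python
from statistics import NormalDist

# Mean of 70 in (5'10") is from the text; the 3 in standard deviation
# is a hypothetical value used only for this illustration.
heights = NormalDist(mu=70, sigma=3)

def share_between(dist, low, high):
    """Fraction of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)

within_1sd = share_between(heights, 67, 73)
within_2sd = share_between(heights, 64, 76)
z_score = (82 - heights.mean) / heights.stdev  # someone 6'10" tall

print(f"within 1 SD: {within_1sd:.1%}")
print(f"within 2 SD: {within_2sd:.1%}")
print(f"z-score of an 82 in height: {z_score:.1f}")
```

Under these assumptions, a value four standard deviations from the mean would be extremely rare, which is why such a point deserves a second look when the data is believed to be bell-shaped.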
"The mean tells you about the center of the data. The standard deviation tells you how wide the data is or the scale of the data, but it's extremely sensitive to outliers," said Greenberg. "You have to be sure that you've carefully looked at your data, and you know that the result isn't massively affected by just one data point."
Another fairly common type of distribution is the so-called "fat-tailed distribution," which is more prone to extreme values than a normal distribution. Fat-tailed distributions are of particular interest in the financial community because they can model extreme outcomes, such as financial market crashes, more accurately than a bell curve can.
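To get a feel for how much more often extreme values appear under fat tails, the sketch below compares the probability of landing far from the center under a standard normal distribution and under a standard Cauchy distribution. The Cauchy distribution is used here only as a textbook fat-tailed example; it is an assumption for illustration, not a model the article recommends for financial data.

```python
import math

def normal_tail(k):
    """P(|X| > k) for a standard normal distribution."""
    return math.erfc(k / math.sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy distribution (a fat-tailed case)."""
    return 1 - 2 * math.atan(k) / math.pi

for k in (2, 5, 10):
    print(f"|x| > {k:2}: normal={normal_tail(k):.2e}  cauchy={cauchy_tail(k):.2e}")
```

Under the bell curve, the chance of an extreme value shrinks very quickly as the threshold grows. Under the fat-tailed distribution it shrinks slowly, so values that a normal model treats as nearly impossible remain quite plausible.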
Outliers occur for a reason. The question is whether or not they belong in a particular dataset and the subsequent analysis.
"What causes an outlier is actually the critical question. Some outliers are clearly data errors and do not belong in an analysis. For the harder cases, we look to see what degree of influence a particular point is having on the overall analysis," said John Johnson, founder of Edgeworth Economics and author of Everydata: The Misinformation Hidden in the Little Data You Consume Every Day, in an interview.
Johnson and his team also look at the robustness of the results, since robust results are not prone to large swings when the outlier data points are removed or tested. Identifying which points have the most influence and their potential effect on the results helps people become smarter consumers of data, Johnson said.
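One simple way to gauge influence and robustness is a leave-one-out check: recompute the statistic with each point removed and see how far it moves. The sketch below is a generic illustration of that idea, not a description of the procedure Johnson's team uses; the function name `leave_one_out_influence` and the deal sizes are hypothetical.

```python
from statistics import mean

def leave_one_out_influence(values, stat=mean):
    """For each point, report how much `stat` shifts when that point is dropped."""
    full = stat(values)
    shifts = []
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]
        shifts.append((v, stat(rest) - full))
    # Most influential points (largest absolute shift) come first.
    return sorted(shifts, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical deal sizes in thousands of dollars.
deals = [12, 15, 11, 14, 13, 95, 12, 16]
for value, shift in leave_one_out_influence(deals)[:3]:
    print(f"dropping {value:>3} shifts the mean by {shift:+.2f}")
```

If removing a single point moves the result far more than removing any other, the conclusion rests heavily on that point, and it is worth confirming that the point is real before acting on the analysis.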
The explosion of data and data analysis tools is causing more people to think about and work with data. Being able to think critically about data is extremely important.
"Everyone has to be aware and careful when they think about data. Outliers are lurking everywhere, and at least heightening one's awareness of the possibilities is important," said John Johnson, founder of Edgeworth Economics and author of Everydata: The Misinformation Hidden in the Little Data You Consume Every Day. "Evidence-based approaches don't have to be unnecessarily complicated or complex. [B]eing able to look at information and spot places where something looks unique or unusual and ascertaining how that fits into your analysis is key."
An outlier is merely an extreme or unexpected value. It can represent risk, opportunity, a mistake, an anomaly, or something else. Whether the indicator is positive or negative depends on the context, the purpose of the analysis, and the company's goals.
"The term, 'outlier' has a negative connotation, so some people naturally assume that outliers must be undesirable," said Innovizo's Bichutskiy. "In fact, outliers can be your 'needle in the haystack.' For example, in business, outliers could be customers who spend far more on your products than most other customers."
As with most things, assumptions about data can be misleading, and biases can impact the outcome of an analysis.
"Sampling mistakes and wrong assumptions about underlying statistical distributions are typical mistakes people make. It's also common to see people use statistical tests or analytical packages without understanding what the underlying assumptions are," said Parvez Ahammad, head of the data science and machine learning group at application delivery platform provider Instart Logic, in an interview.
"Check your prior assumptions or beliefs about the data and make sure they are valid, be open-minded about what the data tells you, collect as much data as you can so you have a good enough sample set to arrive at the decision, and, if you encounter outliers, check with other folks who may have the expertise [to understand them] and offer an alternative explanation."
Outliers can be more or less extreme, simple, or complex. There are many ways to define and describe them. While the average business person can't be expected to understand all the nuances, they should have a grasp of the basics and seek a data scientist's help with interpretation and validation.
"Outliers can be characterized in many ways: from single characteristics to complex parametric characteristics and behaviors," said Sean McClure, director of Data Science Data visualization and predictive analytics solution provider Space-Time Insight, in an interview. Outliers can and do affect and impact all aspects of organizational management and business operations. Workers, managers, and executives all need to be aware of outliers to effectively manage their organizations and business operations."