6 Causes Of Big Data Discrepancies
The same data can yield wildly different results. Here are some of the reasons for these fascinating, frustrating, or even dangerous discrepancies.
As the universe of big data continues to explode, organizations struggle to identify and leverage the data that matters most. Each new data source a company adds to the mix multiplies the number of potential data sets, and with them the opportunities for new and different insights. Because data can be combined in so many different ways, many outcomes are possible.
"People have different choices from the same data set," said Kirk Borne, principal data scientist at Booz Allen Hamilton, in an interview. "People will choose different things they think are informative, so the search is to find the most important variables."
As a result, the same data can result in very different interpretations.
"If I run a supernova simulation where the resolution is too low and two supernova scientists analyze that, if one knows the simulation is not at sufficient resolution and the other doesn't, they would come to very different conclusions," said Tony Mezzacappa, chair of theoretical and computational astrophysics at the University of Tennessee, in an interview. "Data completeness is part of data quality. People should understand what the dangers are in extracting conclusions based on such data."
Whether or not data is complete enough may not be obvious until later. For example, measurements of cosmic microwave background polarization appeared to confirm cosmic inflation, at least until the European Space Agency discovered that dust in the universe emits polarized microwaves of its own that can mimic the same kind of signal.
"That was a very big deal when it was announced. If [cosmic inflation] had been confirmed, it would have spoken volumes about the nature of our universe, its beginning, its evolution, and its end," said Mezzacappa. "Further analysis, and likely further astronomical data collection, will be required to include the effect of dust."
There are inherent uncertainties in algorithms, models, outcomes, and sometimes the data itself that can impact conclusions. Human nature also plays a part. Here, we explain six of the many reasons the same data can result in different interpretations.
Businesses often have a glut of redundant data stored in multiple systems and in different formats. While many companies realize the importance of normalizing the data to minimize data redundancy, transformation is also important.
"Some people will take the same data and not realize they have to do some data transformation before they can make sense of it," said Booz Allen Hamilton principal data scientist Kirk Borne. "Let's say you're looking for purchasing patterns based on income. You may have collected data sets from different data sources, and even though it's the same data, some portion of the data had the monthly income of families and another part of it included the annual income of families. If you don't do the transformation, you end up completely misusing the data."
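Borne's income example boils down to putting every record on the same scale before combining sources. A minimal sketch (field names and figures are illustrative, not from the article):

```python
# Hypothetical records merged from two sources: one reports monthly
# income, the other annual. Combining them without transformation
# would compare figures on different scales.
records = [
    {"family": "A", "income": 4_000, "period": "monthly"},
    {"family": "B", "income": 60_000, "period": "annual"},
]

def to_annual(record):
    """Convert a record's income figure to an annual amount."""
    if record["period"] == "monthly":
        return record["income"] * 12
    return record["income"]

# After transformation, both families are on the same annual scale.
annual_incomes = {r["family"]: to_annual(r) for r in records}
print(annual_incomes)
```

Without the `to_annual` step, family A's 4,000 would be compared directly against family B's 60,000, which is exactly the misuse Borne describes.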
Different algorithms tend to yield different results. One algorithm may be better suited to a particular task, more efficient, or introduce less uncertainty than another. Take the Latent Dirichlet Allocation (LDA) algorithm, for example, which is used to identify related topics in unstructured text. Luis N.A. Amaral, a professor at the McCormick School of Engineering and the Feinberg School of Medicine at Northwestern University, tested the popular algorithm and found that it was 90% accurate and 80% reproducible. Amaral said the LDA algorithm is less accurate than it should be, given that it solves a relatively simple problem.
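The reproducibility figure reflects a general property of stochastic algorithms: the same method on the same data can converge to different answers depending on its random initialization. LDA itself requires a topic-modeling library, so as a minimal stand-in (my illustration, not Amaral's method), a tiny 1-D k-means shows the same concern:

```python
import random

def kmeans_1d(data, k, seed, iters=20):
    """A minimal 1-D k-means; `seed` controls the random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # random starting centers
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.1, 5.0, 5.1, 9.0, 9.1]
for seed in range(4):
    # Different seeds may converge to different partitions of the same data.
    print(seed, kmeans_1d(data, 2, seed=seed))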
Acceptable margins of error also vary depending on several factors, including the level of risk a given application can tolerate.
Data models have parameters and other conditions that can cause results to differ. In a simulation, parameter values must be specified, and the output applies only to those particular values.
When University of Tennessee theoretical and computational astrophysics chair Tony Mezzacappa runs a 3D simulation of supernovae, the outcomes of the simulations depend on the spatial resolution of the model.
"Mother Nature is a continuum. When we model it, we grid it into something that is manageable so we have a discrete set of points in 3D space or an arbitrary number of dimensions of some abstract space, and we model the phenomenon on that limited subset of spatial points or whatever that may be in abstract space," said Mezzacappa. He and his team rerun the simulations many times using finer and coarser resolutions to see whether the outcomes in the model change.
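A toy numerical analogue (not a supernova code) makes the resolution point concrete: the same quantity computed on a coarse grid can miss a narrow feature entirely, while a fine grid resolves it, so the two grids support different conclusions about the same underlying function.

```python
def f(x):
    # A smooth background plus a narrow spike near x = 0.5.
    spike = 1.0 if abs(x - 0.5) < 0.01 else 0.0
    return 1.0 + spike

def riemann(f, a, b, n):
    """Midpoint Riemann sum of f over [a, b] with n grid cells."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

coarse = riemann(f, 0.0, 1.0, 10)      # grid spacing 0.1: spike missed
fine = riemann(f, 0.0, 1.0, 10_000)    # grid spacing 1e-4: spike resolved
print(coarse, fine)
```

An analyst who only ever saw the coarse result would conclude the spike does not exist, which is the danger Mezzacappa describes; rerunning at several resolutions exposes the discrepancy.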
Data models can be too complex or too simple. When they are overly complex, they may include noise; if they are too simple, they may omit important data such as trends.
"Overfitting is a prime example of where people choose a model that uses every nuance and wiggle in the data, and they try to use all of that to predict something," said Booz Allen Hamilton principal data scientist Kirk Borne. "Underfitting the data is not really looking at the trends -- you're just looking at part of the data. From a data-science perspective, choosing the right algorithm is one of the biggest challenges."
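Borne's two failure modes can be sketched side by side. In this toy illustration (my example, with made-up numbers), three fits to the same noisy samples of a linear trend are evaluated at a held-out point: a constant that ignores the trend (underfit), a least-squares line, and a polynomial that passes through every point, noise included (overfit).

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial passing exactly through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.1, 0.9, 2.2, 2.8, 4.1]   # noisy samples of y = x
test_x, test_y = 5.0, 5.0             # held-out point on the true trend

# Underfit: ignore the trend and always predict the training mean.
underfit_pred = sum(train_y) / len(train_y)

# Reasonable fit: ordinary least-squares line (closed form, one feature).
mx = sum(train_x) / len(train_x)
my = sum(train_y) / len(train_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
linear_pred = my + slope * (test_x - mx)

# Overfit: a degree-4 polynomial through every point, noise included.
overfit_pred = lagrange_predict(train_x, train_y, test_x)

for name, pred in [("underfit", underfit_pred),
                   ("linear", linear_pred),
                   ("overfit", overfit_pred)]:
    print(f"{name}: prediction {pred:.2f}, error {abs(pred - test_y):.2f}")
```

On the held-out point, the line lands near the true value while both the constant and the interpolating polynomial miss badly: the underfit model never saw the trend, and the overfit one chased every wiggle in the noise.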
Two individuals can draw different conclusions based on the same analysis of the same data, often because one or both of them have biased views of the outcome.
"Most of the time we're blind to our own biases, so it's really good if your model differs from mine. Then we can discuss biases and why they are different," said Booz Allen Hamilton principal data scientist Kirk Borne. "If you stay within boundaries where everything is a yes answer, then you're not going to learn anything. You want to get to the point when your model and your algorithm fails, the point where things went from being good to not being good, [because] that's where the real knowledge is discovered."
When interpretations differ, it's wise to figure out why.