In 1996, Gerber aired ads that told parents: "Four out of five pediatricians who recommend baby food recommend Gerber."
Four out of five sounds pretty good. But in fact, only 12% of the pediatricians Gerber had surveyed actually recommended Gerber -- which the average consumer of data (or baby food) likely wouldn't have realized, had the US Federal Trade Commission not called out the company on the claim.
The bigger question in our data-centric world is: How did the ad get to "four out of five"?
In Everydata: The Misinformation Hidden in the Little Data You Consume Every Day, authors John H. Johnson and Mike Gluck explain, "Gerber didn't just cherry-pick the data. It cherry-picked data that had already been cherry-picked."
Gerber started with 562 pediatricians: 408 responded that they recommended baby food in general, 76 recommended a specific brand, and 67 of those 76 recommended Gerber. Sixty-seven out of 76 is roughly four out of five; 67 out of all 562 surveyed is just 12%. According to the book:
If you're out in the cherry orchard with your bucket and ladder, your job is to fill the bucket with cherries that you can sell at the market. So you're going to skip any cherries that look bruised … and you're going to fill your bucket with the best-looking cherries you can pick. Hence, cherry picking -- when you're selecting only the data (cherries) that other people want.
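The survey arithmetic above can be checked in a few lines. Both figures come from the same 67 pediatricians; only the denominator changes:

```python
# Arithmetic behind the Gerber claim, using the figures cited above.
surveyed = 562           # pediatricians surveyed
recommended_brand = 76   # recommended a specific brand
recommended_gerber = 67  # of those 76, recommended Gerber

# "Four out of five": Gerber's share among brand-specific recommendations
print(f"{recommended_gerber / recommended_brand:.0%}")  # 88%

# Gerber's share of everyone surveyed
print(f"{recommended_gerber / surveyed:.0%}")  # 12%
```

Same numerator, two very different stories, depending on which denominator you pick.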
Making Sense Of Data
Coauthor Johnson, a professional economist, is president and CEO of Edgeworth Economics and a frequent expert witness. He notes that even for him, studies cited in the media can be confusing.
"One thing I've gotten into is eating healthier, and every day there's a study about how you should eat avocados 12 times a day or not at all. Or that coffee both causes and prevents cancer," Johnson told InformationWeek. "I'm a trained statistician, and I struggle with these reports."
Big data is a common topic, and so is the fact that the amount of data each of us consumes is multiplying exponentially and has been for years, said Johnson.
"But we think 'everydata,' or all the little data each of us encounters throughout the day, is more important to your everyday life. But most people don't understand it or know how to think about it," he said, explaining that he's often called as an expert witness to explain data in courtrooms.
"I want people to be empowered by data, so they can make smarter decisions," said Johnson, "from what type of car to buy to what type of employee do you hire."
Everydata breaks down a number of ways that data is misinterpreted or presented in misleading ways -- from cherry picking to sampling and correlation versus causation -- and how each of us can be smarter about using data.
"When you're asking a question, the thing to ask is: What is the right data to answer this question?" said Johnson. He offered the example of the Challenger space shuttle explosion, which came to be blamed on a data-sampling failure -- or, more literally, a failure of O-rings.
NASA scientists had looked at the correlations between temperatures and O-rings. But, the book explains, "By focusing only on flights with O-ring incidents, people were truncating the data set -- a fancy way of saying that they weren't looking at all of the data. And that error in how the data was analyzed would have significant repercussions."
Another example of a spurious correlation: murder rates go up in the summertime, and so do ice cream sales. As logic tells us, though the numbers may seem to say otherwise, one doesn't cause the other.
"In this case, the correlation is spurious because another variable (warm weather) exists -- it's just omitted when people show the correlation between ice cream sales and murder," states Everydata.
The book also points to the wonderful website Spurious Correlations, by Harvard Law student Tyler Vigen, which shows striking but unrelated correlations between things like the revenue generated by bowling alleys in the US and per-capita consumption of sour cream; or the divorce rate in Maine and per-capita margarine consumption.
Data can also be misinterpreted through "omitted variable bias."
Consider that the No. 1 spot on a Google search result gets almost twice the traffic of the No. 2 spot. So you start to wonder what Google considers when it ranks results -- which is 200 "unique signals or 'clues,'" according to Google.
"If you click over to Moz.com, you'll see charts showing how more than 160 factors correlate to search engine rankings," states Everydata. "It's interesting stuff, and probably very useful if you're looking for ways to increase your page ranking. But it's not definitive, because it's based largely on correlations."
(To Moz.com's credit, the book adds, it uses the word "correlation" 12 times.)
Learning Your Data Lessons
Every chapter of Everydata ends with a list of good practices for being smarter about data.
How to be a good consumer of correlation and causation?
Everydata advises: Take note of whether an article uses the word causal; step back and apply common sense; ask yourself whether something else could be driving the action; and look out for reverse causality.
To the last point, the book clarifies:
Do smart people stay up later? Or do people stay up later because they're smart? Don't discount the possibility of a feedback loop, where X affects Y and Y affects X at the same time (e.g., smart people stay up later, which gives them more time to get smart, which makes them stay up later…).
Johnson offered, "I solve business problems for clients, and I explain statistical concepts to juries. As a result, I've seen, over and over, the power of a straightforward explanation."
When it comes to data, he added, "Sometimes people just haven't been taught how to think about it."