10/9/2013
10:37 AM

5 Data Science Sins To Beware

Repent, ye data scientists! Avoid these five big data evils -- or pay with your immortal soul.

OK, perhaps our fire-and-brimstone headline goes a bit overboard. Then again, maybe it is time for a dose of data science atonement, particularly if you're guilty of any of the five deadly sins summarized below.

According to Michael Walker, founder and president of the nonprofit Data Science Association, a professional organization of data scientists with more than 500 members, these big-data sins are all too common. In fact, the Association's recently penned Code of Professional Conduct is designed to establish a set of ethical standards for the burgeoning data-science industry.

Not all big-data professionals are guilty of these sins, of course. Walker summarized them in a phone interview with InformationWeek, and here they are. Do any of these data-science transgressions hit home?

Sin #1: Cherry Picking

This is where a data scientist includes only data that confirms a particular position and ignores evidence of a contradictory position. "I see this all the time," Walker said.


Cherry picking is all too common in university research, according to Walker, who referenced a 2005 paper, "Why Most Published Research Findings Are False," by Stanford professor John Ioannidis. "What [Ioannidis] argues, in a nutshell, is that the overwhelming majority of research that he reviewed could not be replicated," said Walker.

Here's a hypothetical scenario that illustrates cherry picking in action:

"[Researchers] create a hypothesis they want to test out," Walker said. "So they run it 999 times, and it fails. There's no evidence to confirm their hypothesis. Then they tweak it, run it again, and all of a sudden they find evidence to confirm their hypothesis." But when these same researchers publish a paper proclaiming their success, they don't mention the 999 times they failed. "I think that's very unethical," Walker said.
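To see why the unreported failures matter, here is a rough Python sketch of the arithmetic behind Walker's scenario. It is an illustration, not anything Walker or the Association supplied: the function names, the 0.05 significance cutoff, and the assumption that each rerun is an independent test of a hypothesis that is actually false are all ours.

```python
import random


def chance_of_false_positive(attempts: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `attempts` independent tests of a
    hypothesis with no real effect still looks 'significant' at level `alpha`."""
    return 1.0 - (1.0 - alpha) ** attempts


def first_lucky_attempt(max_attempts: int, alpha: float, rng: random.Random):
    """Simulate rerunning a no-effect experiment until it 'succeeds'.

    Returns the 1-based attempt number of the first false positive,
    or None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        # When there is no real effect, the p-value is uniform on [0, 1],
        # so it falls below alpha with probability alpha.
        if rng.random() < alpha:
            return attempt
    return None


if __name__ == "__main__":
    for attempts in (1, 14, 100, 1000):
        print(attempts, round(chance_of_false_positive(attempts), 4))
    print("first lucky run:", first_lucky_attempt(1000, 0.05, random.Random(2013)))
```

Under these assumptions, a single test has a 5% chance of a false positive, but the chance of at least one passes 50% by the 14th try. A paper that reports only the winning run hides exactly the information a reader needs to judge it.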

Sin #2: Confirmation Bias

This is where researchers favor data that confirms their hypothesis.

"When you're dealing with very large data sets, you're going to find more relationships, more correlations," said Walker. And that can lead researchers to mistake correlation for causation, especially in environments of high causal density, where many interacting factors influence each outcome.
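Walker's point about large data sets can be checked with a small experiment. The Python sketch below is our own illustration, with hypothetical function names and arbitrary settings (20 rows per column, a correlation cutoff of 0.5). It fills a table with pure random noise and counts how many pairs of columns appear strongly correlated anyway.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(num_columns, num_rows=20, threshold=0.5, seed=0):
    """Generate unrelated random columns and count the column pairs whose
    correlation exceeds `threshold` in absolute value.

    Returns (strong_pairs, total_pairs).
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(num_rows)]
        for _ in range(num_columns)
    ]
    strong = 0
    for i in range(num_columns):
        for j in range(i + 1, num_columns):
            if abs(pearson(columns[i], columns[j])) > threshold:
                strong += 1
    total_pairs = num_columns * (num_columns - 1) // 2
    return strong, total_pairs


if __name__ == "__main__":
    for k in (5, 20, 100):
        strong, total = count_spurious_pairs(k)
        print(f"{k} columns: {strong} of {total} pairs look correlated")
```

The number of pairs grows roughly with the square of the number of columns (100 columns give 4,950 pairs), so even a small per-pair chance of a fluke produces plenty of "relationships" that mean nothing.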
