OK, perhaps our fire-and-brimstone headline goes a bit overboard. Then again, maybe it is time for a dose of data science atonement, particularly if you're guilty of any of the five deadly sins summarized below.
According to Michael Walker, founder and president of the nonprofit Data Science Association, a professional organization of data scientists with more than 500 members, these big-data sins are all too common. In fact, the Association's recently penned Code of Professional Conduct is designed to establish a set of ethical standards for the burgeoning data-science industry.
Not all big-data professionals are guilty of the five deadly sins, of course. Walker summarized them in a phone interview with InformationWeek, and here they are. Do any of these data-science transgressions hit home?
Sin #1: Cherry Picking
This is where a data scientist includes only data that confirms a particular position and ignores evidence of a contradictory position. "I see this all the time," Walker said.
Cherry picking is all too common in university research, according to Walker, who referenced a 2005 paper, "Why Most Published Research Findings Are False," by Stanford professor John Ioannidis. "What [Ioannidis] argues, in a nutshell, is that the overwhelming majority of research that he reviewed could not be replicated," said Walker.
Here's a hypothetical scenario that illustrates cherry picking in action:
"[Researchers] create a hypothesis they want to test out," Walker said. "So they run it 999 times, and it fails. There's no evidence to confirm their hypothesis. Then they tweak it, run it again, and all of a sudden they find evidence to confirm their hypothesis." But when these same researchers publish a paper proclaiming their success, they don't mention the 999 times they failed. "I think that's very unethical," Walker said.
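Walker's scenario can be sketched with a short simulation. This is a minimal illustration, not anything Walker or the Association provided. It assumes a made-up experiment (100 flips of a fair coin) and a conventional cutoff of roughly 5 percent, and the function names are hypothetical. Because the coin really is fair, every "significant" run is a false positive, and reporting only those runs is cherry picking.

```python
import random


def null_experiment(rng, n_flips=100):
    """Flip a fair coin n_flips times; return True if the result looks 'significant'."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    # Normal-approximation z-score under the null hypothesis of a fair coin.
    z = (heads - n_flips * 0.5) / (n_flips * 0.25) ** 0.5
    # |z| > 1.96 is roughly a two-sided 5% threshold (an assumed convention).
    return abs(z) > 1.96


def count_false_positives(n_runs=1000, seed=0):
    """Repeat the no-effect experiment n_runs times and count the 'successes'."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return sum(null_experiment(rng) for _ in range(n_runs))


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 runs 'confirmed' an effect that does not exist")
```

With enough attempts, some runs will clear the threshold by chance alone, which is why the failed attempts belong in the write-up alongside the one that worked.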
Sin #2: Confirmation Bias
This is where researchers favor data that confirms their hypothesis.
"When you're dealing with very large data sets, you're going to find more relationships, more correlations," said Walker. And that can lead to confusing correlation with causation, especially in high-causal-density environments, where many interacting factors influence an outcome.
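Walker's point about large data sets can be made concrete with a rough sketch. The code below is an illustration under stated assumptions: the columns are pure random noise, the 0.3 correlation cutoff and the 50-row sample size are arbitrary choices, and the function names are hypothetical. It counts how many pairs of unrelated columns appear correlated as the number of columns grows.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_spurious_pairs(n_columns, n_rows=50, threshold=0.3, seed=0):
    """Count column pairs whose |correlation| exceeds threshold in pure noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Every column is independent noise, so any strong correlation is spurious.
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]
    return sum(abs(pearson(a, b)) > threshold for a, b in combinations(columns, 2))


if __name__ == "__main__":
    for k in (10, 100):
        print(f"{k} columns: {count_spurious_pairs(k)} spurious pairs")
```

Ten columns yield 45 pairs to compare, while 100 columns yield 4,950, so the number of chance correlations rises quickly even though nothing in the data is causally related.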