Data falsification at research institutions to make results look better is nothing new. Here's what it can teach us about misuse of big data in business.
Publicizing infractions eliminates some repeat offenders, but there are also obvious warning signs and common-sense measures that companies can use to prevent or reduce data fabrication. The following red flags, drawn from funded academic research, likely apply to big-data applications as well.
1. Data Emperor.
When just one person has access to and control of the data and this person blocks others from looking at it, there might be a problem. If the "emperor" is a department that resists data inspection, this too could be a red flag. Stapel is the poster boy of data emperors, faking data from the comfort of his office; 51 retractions and counting.
2. Superhuman Output.
Astonishing, almost unbelievable output beyond what seems humanly possible should raise suspicions. Although there might be geniuses at your company, output that is three or more times that of anyone else is worth exploring. At a Massachusetts crime lab, Annie Dookhan was processing 9,000 samples per year while her colleagues did on the order of 3,000. She was faking the results and was fired along with her clueless supervisors. Another faker, Robert Slutsky, was publishing a paper on average every 10 days. John Darsee was considered a brilliant cardiologist at Harvard Medical School until it was discovered that much of his data was faked. Darsee started his meteoric rise at Notre Dame as an undergraduate, reporting experiments on rats that would have been impossible to conduct. With funding and accolades rolling in, his supervisors apparently operated under the "ignorance is bliss" paradigm.
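As a rough illustration of the "three or more times" rule of thumb, here is a minimal Python sketch that compares each person's output with the median of his or her peers. The function name, the analyst labels, and the 3x threshold are my own assumptions for illustration, and the counts are hypothetical figures loosely modeled on the crime-lab numbers above. A flag is a reason to look closer, not proof of fabrication.

```python
from statistics import median


def flag_outsized_output(counts, ratio=3.0):
    """Return the names whose output is at least `ratio` times
    the median output of everyone else in `counts`."""
    flagged = []
    for name, count in counts.items():
        # Compare each person against peers only, so one extreme
        # producer does not inflate his or her own baseline.
        peers = [c for other, c in counts.items() if other != name]
        if not peers:
            continue  # nobody to compare against
        baseline = median(peers)
        if baseline > 0 and count >= ratio * baseline:
            flagged.append(name)
    return flagged


# Hypothetical annual sample counts (illustrative only).
annual_samples = {
    "analyst_a": 9000,
    "analyst_b": 3000,
    "analyst_c": 3100,
    "analyst_d": 2900,
}

print(flag_outsized_output(annual_samples))
```

The median is used rather than the mean so that a single prolific (or fraudulent) producer does not drag the baseline upward and hide in it.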
3. Chaos As Cover.
Disorganized data structures, combined with a lack of traceability of materials, make it difficult to manage and later audit a project, much less detect fabrication or falsification. The "sommelier" Dipak Das provides an example of this style.
4. Cherry Picking.
A devious alternative to fabricating data is to filter the data and select a subset that fits the desired hypothesis. When this is the sole motivation for subsetting a data set, the action rises to the level of fabrication. The difficulty in assessing the "crime" is distinguishing incompetence from intent to deceive.
5. Too Good To Be True.
If the results and conclusions are spectacularly wonderful and pleasing in an area where previous successes were epsilon-incremental, perhaps a closer look is warranted. The cloning wizard -- Hwang Woo Suk of Seoul National University, who became a temporary national hero before his humiliation -- and the cold-fusion guys, Stanley Pons and Martin Fleischmann, come to mind. Hwang, along with 24 rapidly distancing coauthors, was ultimately revealed to have faked his cloning, resulting in the retraction of articles from the journal Science. Pons and Fleischmann were shown to be sloppy in the laboratory, but there was no evidence of fabrication. Carl Sagan popularized the phrase, "extraordinary claims require extraordinary evidence," so those with such claims ought to be eager and willing to provide the evidence.
What if there are no red flags? How can data fabrication and falsification be detected? Whistleblowers could help, but the corporate culture must be willing to protect those who come forward. Protection from false accusations -- which mire honest analysts in distraction while the cheats zoom ahead via their shortcuts -- is also needed. A third-party audit could serve as both a detection tool and a deterrent.
Some situations are not so black and white. Recently I started my data ethics seminar with a plot summary of a book. A young man leaves his home country on a large ship that is carrying wild animals. The ship sinks and the boy ends up in a lifeboat, which he shares with a wild cat. After many days at sea, he is rescued, and the disposition of the cat is unknown to the rescuers. Sounds like Yann Martel's Life of Pi, right? What you might not know is that Martel was inspired by a review of Dr. Moacyr Scliar's book, Max and the Cats, published in 1981 in Portuguese. Scliar's story had a panther rather than a tiger, and Max was fleeing from Germany rather than India, among other differences. There was no specific plagiarism, but the plots are very similar. Scliar graciously complimented Martel's book, while the Brazilian press was less generous. I wonder if the Man Booker Prize would have gone to Martel if the panel had known about this inspirational precedent. Had Martel acknowledged Scliar's prior work initially, the controversy might have been averted.
In a follow-up column I plan to talk about data fabrication and falsification in the corporate world. If you have any examples of big data fabrication in the business world -- suitably sanitized for anonymity, of course -- please share them with me either via the comments section below or by email. Thanks in advance!