Big Data's Human Error Problem

We humans are our own worst enemies in the quest for better data quality, says one expert. Think false memory syndrome, typos, slips of the tongue and confirmation bias.

Jeff Bertolucci, Contributor

June 10, 2013

Has the problem of bad data grown worse in the era of big data? No, not really, says author and industry analyst Joe Maguire, one of the organizers of the MIT Chief Data Officer and Information Quality (CDOIQ) Symposium, to be held July 17-19 in Cambridge, Mass.

The event, now in its 7th year, focuses on the issues of information quality and the need for a chief data officer (CDO) role within enterprises. In addition to being one of the conference organizers, Maguire will moderate a panel on human factors in information quality.

When it comes to information, digital or otherwise, one fact never changes: humans and data quality errors are inseparable, Maguire told InformationWeek in a phone and email interview. Furthermore, data that's too clean -- devoid of any signs of human blunders -- is immediately suspect.

"Sure, bad data touches human lives -- and vice versa. Humans are known to make a certain number of typos. In certain contexts, immaculate data could be a sign of fraud. If humans are involved in the production of data, you should expect it to be imperfect," Maguire wrote via email.

Problems resulting from poor data quality -- some serious, others lighthearted -- are often in the news. On the somber side, foreign names of terrorism suspects often have multiple spellings in U.S. intelligence databases, a common error that makes it difficult for security officials to track potential troublemakers.

On a lighter note, Yiddish scholars last week griped that "knaidel," the winning word in last month's Scripps National Spelling Bee, should actually have been spelled "kneydl." The data quality issue arose because Scripps contestants, including the 13-year-old boy from Queens, New York, who won the event by spelling knaidel, studied for the contest with Webster's Third New International Dictionary rather than, say, a Yiddish-English dictionary with an alternative spelling, The New York Times reported.

"Bad data is first and foremost a human phenomenon: false memory syndrome, typos, slips of the tongue, confirmation bias and too many others to list," Maguire wrote.

Big data also has the potential to expand bad data "shenanigans" by providing people with a much larger mass of data from which to cherry-pick morsels of information that justify their positions, added Maguire.

"Confirmation bias deserves special attention. Besides producing bad data -- as when researchers rationalize discarding inconvenient data points -- it can also yield dismissive responses to good data," he wrote via email. "Think of those who cannot or will not be dissuaded from believing that vaccines cause autism, or those who could not swallow Nate Silver's predictions about the 2012 presidential election. Most noteworthy about confirmation bias is the sincerity felt at the very moment it occurs."

Another big data factor that makes bad data trickier to limit: Enterprises often don't have control over the data sources they're analyzing, including social media feeds and data sets available from public repositories such as Data.gov.

One of the best things about the MIT CDOIQ Symposium, at least according to Maguire, is that it draws a self-selecting group of data quality professionals who are passionate about the topic.

Beyond the usual typos and misspellings that characterize bad data, attendees will discuss other dimensions of information quality, including data timeliness and appropriateness, Maguire said.

Enterprises can take steps to reduce (but not eradicate) bad data, such as implementing data governance guidelines and establishing programs to incentivize employees to value information quality, Maguire added.

About the Author(s)

Jeff Bertolucci

Contributor

Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.
