Is Data Quality Overrated? - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Software // Information Management
10:41 PM
Rajan Chandras
Rajan Chandras

Is Data Quality Overrated?

MetLife and search-and-rescue teams prove that sometimes as-is data is good enough for serving customers and saving lives.

Am I being a heretic here? IT professionals swear by the mantra of data quality. Business is forever hand-wringing over it and willing to invest in a multi-million-dollar initiative to improve data quality (if the last one did not do too well, they're always game to sponsoring another go at it).

Vendors have made -- and are still making -- a killing selling data quality tools, to the extent that Gartner now has a magic quadrant devoted to this segment. The challenge of gaining and maintaining enterprise data quality has given rise to whole new discipline, which some call Data Governance (DG). And DG, in turn, has led to the emergence of a whole new set of tools, of course.

So what madness has come upon me that I seek to storm this formidable fortress?

Relax, I'm doing nothing of the kind. It's true that I have some strong feelings about after-the-fact measures for improving data quality. And yes, the term "data governance" continues to amuse and bemuse me. But what has me excited is refreshing case studies demonstrating a relatively unheralded truth: not every value proposition that involves data requires rigorous and pricey up-front investments in data quality improvement.

Case in point: a recent breakthrough project in which MetLife put together a consolidated customer view using data from more than 70 systems, moving from pilot to rollout in 90 days.

[ Want more on quick, easy data integration? Read MetLife Uses NoSQL For Customer Service Breakthrough. ]

Sure, there will be limitations to what must necessarily be, in some respects, a quick-and-dirty solution. For example, customer data was integrated without using the kind of sophisticated customer matching algorithms found in professional-grade master data management (MDM) tools (although MetLife does have a separate MDM effort under way). Also, I imagine that data cleansing and standardization along the way was minimal at best.

But what's not to like about an inspired three-month initiative involving 70 systems that reduces some customer-service processes from 40 clicks down to one click and as many as 15 different screens to one? The plan is to roll this out to 3,000 call center and research staff within six months. These are compelling numbers.

In a different sort of example, consider a report on about crisis-mapping technologies that can help humanitarian organizations deliver assistance to victims of civil conflicts and natural disasters by receiving and processing eyewitness reports submitted via email, text message and social media, and then building interactive geospatial maps, all in real time.

One such open-source solution, called Ushahidi, was used to crowdsource a live crisis map of the 2010 earthquake in Haiti. The map helped the U.S. Marine Corps locate victims lying under the rubble of collapsed buildings and helped save hundreds of lives.

There was a caveat when the application was deployed in that it did not support automated categorizing and geo-tagging of incoming text. That had to be done manually. This can only go so far. The Japanese earthquake and tsunami in 2011 generated more than 300,000 Tweets every minute. And when Hurricane Sandy hit the U.S. eastern seaboard last year, there were more than 20 million Tweets -- hardly something that can be processed manually.

The inherent limitations of this approach spurred the author of the article, Patrick Meier, and his team to enhance Ushahidi with a set of Twitter classifiers -- algorithms that could automatically identify Tweets that were relevant and informative to the crisis at hand. For example, classifiers automatically categorize eyewitness reports, infrastructure-damage assessments, casualties, humanitarian needs, offers of help and so on.

But given the quality of incoming data -- terse text with an emphasis on emotion rather than nicety of speech -- what results can we expect? Not too bad, as it turns out; initial accuracy rates range between 70% and 90%. Meier and his team are now working on developing more sophisticated algorithms that can be trained to better interpret incoming messages, leading to continued improvements in accuracy.

Both the above examples demonstrate, albeit in slightly different ways, a simple maxim: Sometimes "as is" data quality serves the purpose.

In the first case, the data in the underlying systems has clearly enabled integration to a sufficient extent -- a common customer key, for example, that has migrated to multiple systems (demonstrating that some good can come out of point-to-point integration, too). Equally importantly, the information is being consumed by humans -- call center operators, for example -- so that data-quality issues can be identified as they surface, giving MetLife an opportunity to clean up its information and tie together formerly disparate records.

In the second case, substantial value was derived despite the free-form, low-quality textual data. This is to be expected, as it's precisely the purpose of techniques such as pattern recognition, natural language processing and sentiment analysis.

There are a slew of use cases where granular data quality doesn't matter much. Typical examples include summary-level and statistical reporting/analytics. If a trucking company is looking to identify most frequently used or most-profitable routes, for example, individual discrepancies in transportation records don't really matter.

So, does data quality matter? Of course it does. The problem isn't that we are too obsessed with data quality; the problem is that we (still) aren't taking it seriously enough. Data quality continues to be an after-thought, addressed through ad-hoc and localized measures.

However, not everything needs to wait upon big-bang data quality initiatives. It's not a bad idea to take the occasional step back and ask yourself what business value can be obtained from data as is. Sometimes "good enough" data quality is just that.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Comment  | 
Print  | 
More Insights
Newest First  |  Oldest First  |  Threaded View
User Rank: Author
5/30/2013 | 8:49:33 PM
re: Is Data Quality Overrated?
Data quality can be an obstacle if IT waits to have data perfect and complete before it puts data in employees' hands. IT can't anticipate every use case and know what data's most valuable. The case I've heard from more than one CIO is that people aren't motivated to get totally data accurate and complete until they know it's being used -- until the business case is made. When IT asks for sales per store in the Middle East in order to fill out a data model, it draws yawns, When the CEO asks "why don't we have that data?" it gets done.
InformationWeek Is Getting an Upgrade!

Find out more about our plans to improve the look, functionality, and performance of the InformationWeek site in the coming months.

Remote Work Tops SF, NYC for Most High-Paying Job Openings
Jessica Davis, Senior Editor, Enterprise Apps,  7/20/2021
Blockchain Gets Real Across Industries
Lisa Morgan, Freelance Writer,  7/22/2021
Seeking a Competitive Edge vs. Chasing Savings in the Cloud
Joao-Pierre S. Ruth, Senior Writer,  7/19/2021
White Papers
Register for InformationWeek Newsletters
Current Issue
Monitoring Critical Cloud Workloads Report
In this report, our experts will discuss how to advance your ability to monitor critical workloads as they move about the various cloud platforms in your company.
Flash Poll