Am I being a heretic here? IT professionals swear by the mantra of data quality. Business leaders are forever wringing their hands over it and are willing to invest in multimillion-dollar initiatives to improve it (if the last one did not do too well, they're always game to sponsor another go at it).
Vendors have made -- and are still making -- a killing selling data quality tools, to the extent that Gartner now has a magic quadrant devoted to this segment. The challenge of gaining and maintaining enterprise data quality has given rise to a whole new discipline, which some call Data Governance (DG). And DG, in turn, has led to the emergence of a whole new set of tools, of course.
So what madness has come upon me that I seek to storm this formidable fortress?
Relax, I'm doing nothing of the kind. It's true that I have some strong feelings about after-the-fact measures for improving data quality. And yes, the term "data governance" continues to amuse and bemuse me. But what has me excited is refreshing case studies demonstrating a relatively unheralded truth: not every value proposition that involves data requires rigorous and pricey up-front investments in data quality improvement.
Case in point: a recent breakthrough project in which MetLife put together a consolidated customer view using data from more than 70 systems, moving from pilot to rollout in 90 days.
Sure, there will be limitations to what must necessarily be, in some respects, a quick-and-dirty solution. For example, customer data was integrated without using the kind of sophisticated customer matching algorithms found in professional-grade master data management (MDM) tools (although MetLife does have a separate MDM effort under way). Also, I imagine that data cleansing and standardization along the way was minimal at best.
But what's not to like about an inspired three-month initiative involving 70 systems that reduces some customer-service processes from 40 clicks down to one click and as many as 15 different screens to one? The plan is to roll this out to 3,000 call center and research staff within six months. These are compelling numbers.
In a different sort of example, consider a report on Forbes.com about crisis-mapping technologies that can help humanitarian organizations deliver assistance to victims of civil conflicts and natural disasters by receiving and processing eyewitness reports submitted via email, text message and social media, and then building interactive geospatial maps, all in real time.
One such open-source solution, called Ushahidi, was used to crowdsource a live crisis map of the 2010 earthquake in Haiti. The map helped the U.S. Marine Corps locate victims lying under the rubble of collapsed buildings and helped save hundreds of lives.
There was a caveat: when the application was deployed, it did not support automated categorizing and geo-tagging of incoming text. That had to be done manually, and manual processing can only go so far. The Japanese earthquake and tsunami in 2011 generated more than 300,000 Tweets every minute. And when Hurricane Sandy hit the U.S. eastern seaboard last year, there were more than 20 million Tweets -- hardly something that can be processed manually.
The inherent limitations of this approach spurred the author of the article, Patrick Meier, and his team to enhance Ushahidi with a set of Twitter classifiers -- algorithms that could automatically identify Tweets that were relevant and informative to the crisis at hand. For example, classifiers automatically categorize eyewitness reports, infrastructure-damage assessments, casualties, humanitarian needs, offers of help and so on.
But given the quality of incoming data -- terse text with an emphasis on emotion rather than nicety of speech -- what results can we expect? Not too bad, as it turns out; initial accuracy rates range between 70% and 90%. Meier and his team are now working on developing more sophisticated algorithms that can be trained to better interpret incoming messages, leading to continued improvements in accuracy.
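To make the idea of a classifier concrete, here is a deliberately crude sketch in Python. It is not the approach Meier's team used; it swaps in simple keyword matching for a trained algorithm, and the category names, keyword lists and the function name categorize are all invented for illustration.

```python
import re

# Hypothetical keyword lists, invented for this illustration only.
CATEGORY_KEYWORDS = {
    "casualties": {"injured", "trapped", "dead"},
    "humanitarian_needs": {"water", "food", "shelter", "medicine"},
    "infrastructure_damage": {"collapsed", "bridge", "flooded", "power"},
    "offers_of_help": {"volunteer", "donate", "donating"},
}


def categorize(message):
    """Return the sorted categories whose keywords appear in the message."""
    # Lowercase and split into words so stray punctuation and
    # capitalization in terse, emotional text do not block a match.
    words = set(re.findall(r"[a-z']+", message.lower()))
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if words & keywords
    )


if __name__ == "__main__":
    print(categorize("Bridge collapsed, people trapped!!"))
    print(categorize("need WATER and food pls"))
```

Even a rule this blunt copes with messy input to a degree, and it also shows where the misses come from: a misspelled keyword or an unanticipated phrase simply goes uncategorized, which is the gap that trained classifiers aim to narrow.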
Both the above examples demonstrate, albeit in slightly different ways, a simple maxim: Sometimes "as is" data quality serves the purpose.
In the first case, the data in the underlying systems has clearly enabled integration to a sufficient extent -- a common customer key, for example, that has migrated to multiple systems (demonstrating that some good can come out of point-to-point integration, too). Equally important, the information is being consumed by humans -- call center operators, for example -- so that data-quality issues can be identified as they surface, giving MetLife an opportunity to clean up its information and tie together formerly disparate records.
In the second case, substantial value was derived despite the free-form, low-quality textual data. This is to be expected, as coping with such data is precisely the purpose of techniques such as pattern recognition, natural language processing and sentiment analysis.
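A minimal sketch of what key-based consolidation might look like, assuming each source record carries a shared identifier. The field name customer_id, the function name consolidate and the record layout are hypothetical; nothing here reflects MetLife's actual design.

```python
from collections import defaultdict


def consolidate(records, key="customer_id"):
    """Group records from many systems under a shared customer key.

    Records with no usable key are returned separately so that a
    person can review them, rather than being guessed at or dropped.
    """
    view = defaultdict(list)
    unmatched = []
    for record in records:
        customer = record.get(key)
        if customer:
            view[customer].append(record)
        else:
            unmatched.append(record)
    return dict(view), unmatched


if __name__ == "__main__":
    # Made-up records standing in for extracts from separate systems.
    sample = [
        {"system": "policy", "customer_id": "C1", "name": "A. Smith"},
        {"system": "claims", "customer_id": "C1", "name": "Alice Smith"},
        {"system": "billing", "customer_id": None, "name": "A Smith"},
    ]
    view, unmatched = consolidate(sample)
    print(view["C1"])
    print(unmatched)
```

Note that no cleansing or fuzzy matching happens: the two spellings of the name sit side by side under one key, and the keyless record is set aside for someone to resolve when it comes up.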
There is a slew of use cases where granular data quality doesn't matter much. Typical examples include summary-level and statistical reporting/analytics. If a trucking company is looking to identify its most frequently used or most profitable routes, for example, individual discrepancies in transportation records don't really matter.
So, does data quality matter? Of course it does. The problem isn't that we are too obsessed with data quality; the problem is that we (still) aren't taking it seriously enough. Data quality continues to be an afterthought, addressed through ad hoc and localized measures.
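The trucking example can be sketched in a few lines of Python. The record fields origin and destination and the function name top_routes are assumptions made for illustration, and the trips in the demo are made up.

```python
from collections import Counter


def top_routes(trips, n=3):
    """Rank routes by trip count, skipping records with a missing endpoint."""
    counts = Counter(
        (trip["origin"], trip["destination"])
        for trip in trips
        if trip.get("origin") and trip.get("destination")
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up trip log: 50 + 30 + 10 clean records.
    trips = (
        [{"origin": "Chicago", "destination": "Dallas"}] * 50
        + [{"origin": "Dallas", "destination": "Denver"}] * 30
        + [{"origin": "Denver", "destination": "Boise"}] * 10
    )
    # A handful of bad records: a typo and two missing origins.
    trips += [
        {"origin": "Chcago", "destination": "Dallas"},
        {"origin": None, "destination": "Dallas"},
        {"origin": "", "destination": "Denver"},
    ]
    print(top_routes(trips, n=2))
```

The misspelled city shows up as a separate route with a single trip and the records with no origin are skipped, yet the ranking at the top is unchanged. That holds only while errors are few and scattered; a systematic error affecting one route would be a different matter.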
However, not everything needs to wait upon big-bang data quality initiatives. It's not a bad idea to take the occasional step back and ask yourself what business value can be obtained from data as is. Sometimes "good enough" data quality is just that.