8 Ways To Ensure Data Quality
The quality of your business decisions is only as good as the quality of the data you use to back them up. Here are some tips to help you determine how reliable your data actually is.
The changing volume and variety of data are obvious to nearly everyone, but far fewer of us understand the concept of veracity. Treating all data as though it were equally accurate and reliable can adversely affect the quality of business decisions and business outcomes.
"There are two core risks: making decisions based on 'information fantasy,' and compliance. If you're not representing the real world, you can be fined and your CFO can be imprisoned. It all comes down to that that one point: If your systems don't represent the real world, then how can you make accurate decisions?" said Steve Jones, global VP of the big data practice at global consultancy Capgemini.
The topic of data quality is not generally well understood, because it has been treated as an IT problem. Collecting, storing, and processing data require a lot of technical expertise to do right -- and achieving data quality targets can take considerably more time than others in the organization expect.
"[Data quality] is the most underappreciated part of a project. It's the part that takes the most time," said Moshe Kranc, CTO of Ness Software Engineering Services. "Once you get the data normalized and all the bad records removed, and the incorrect records are cleaned, the rest of the project is doing the analytics and seeing the results. It's the easier half compared to the 60% [spent] getting data where you want it in a clean, normalized format, so you can use it."
As more people use data and analytics in their everyday jobs, the importance of data quality is leading to new organizational roles, including the chief data officer, data stewards, and data governance teams. Because businesses run on data, it's important that people in the organization understand some of the basics so they can be confident that the data quality is reliable. Here's a guide to what's required to achieve business-driven data quality.
Data quality needs to be good enough for its target use, whether its purpose is to comply with regulatory mandates, improve customer satisfaction, or save lives. Nevertheless, most business professionals tend not to think about data quality, even though the quality of the decisions they make depends on the quality of the data they're using for analysis.
"Data quality means having data in a format you can trust. One of the biggest challenges is figuring out the golden source of data. Data gets copied from source to source. Before you know it, all the people who worked on all of those projects have been fired, and no one is really sure where the data comes from," said Moshe Kranc, CTO of Ness Software Engineering Services. "I've seen business analysts spend months analyzing [the wrong] data source because they thought it was the golden data source."
Data quality isn't only an IT problem -- it's a business problem. The two groups have to work together to balance the appropriate level of data quality with the time and costs it will require to achieve that level of accuracy. Otherwise, people outside IT risk making decisions based on data they think is more accurate than it actually is. Conversely, IT risks implementing data quality standards that are not in line with business requirements.
Data quality can vary significantly depending on how the data was collected, stored, cleansed, and processed. At the same time, no single level of data quality applies to all use cases in all situations. Still, some say "data is data," as if it were homogeneous.
"People think data accuracy is 100% accurate. We're dealing with tens of millions of pieces of content a month, so it's hard to get to 100% accuracy," said James O'Malley, senior VP of analytics at communications agency Porter Novelli. "The cost to achieve an incremental increase in accuracy doesn't necessarily benefit the decision-making process. Improving accuracy from 98% to 99% can involve far more cost and time than is practical, particularly if the difference would not affect the business outcome."
When it comes to data quality, one size does not fit all. Social media data, for example, is roughly 80% accurate (although it varies), which is considered good enough for sentiment analysis. The same level of accuracy applied to other types of data in other contexts -- bank account balances, say -- would be unacceptable.
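One way to put that "fit for purpose" idea into practice is to attach an explicit accuracy target to each use case instead of holding all data to one global standard. The sketch below is a simple illustration; the use-case names and thresholds are hypothetical, with only the rough 80%-for-sentiment figure taken from the example above.

```python
# Illustrative fit-for-purpose check: each use case declares the accuracy it
# actually needs, rather than one global "all data must be perfect" rule.
ACCURACY_TARGETS = {
    "social_sentiment": 0.80,   # rough signal is fine for sentiment analysis
    "marketing_segments": 0.95,
    "account_balances": 1.00,   # financial records must match exactly
}

def fit_for_purpose(use_case: str, measured_accuracy: float) -> bool:
    """Return True if a dataset's measured accuracy meets its use case's target."""
    target = ACCURACY_TARGETS.get(use_case)
    if target is None:
        raise ValueError(f"No accuracy target defined for use case: {use_case}")
    return measured_accuracy >= target

print(fit_for_purpose("social_sentiment", 0.82))   # True: good enough here
print(fit_for_purpose("account_balances", 0.82))   # False: unacceptable here
```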
Many things affect the quality of data throughout its life cycle. Errors can be introduced during data collection, as the data ages, as it is cleansed and transformed, and while it's being moved among disparate systems. In other words, even accurate data can become inaccurate over time. For example, one bank has spent several years trying to come up with a single view of a customer, an effort complicated by its many acquisitions, as well as by the complex and dynamic nature of its business customers' activities. Regardless of where a company's data quality stands at a given point in time, improving it is an ongoing pursuit, not a project that can simply be checked off a list.
"Organizations get overwhelmed quickly when you talk about data quality. It's a journey, a way of doing business that's going to change things, and it needs to be maintained," said Angela Fernandez, VP of retail grocery and food service at information standards organization GS1 US.
A recent test of 24 companies conducted by GS1 US showed that 50% of the data analyzed was inaccurate. One problem with taking a piecemeal approach to data quality is the possibility of introducing errors that affect the system at large.
Ignorance is bliss until data quality issues have obvious consequences. Without understanding why data quality can vary -- and the veracity of a particular data source or dataset -- it's easy to grab some data, start analyzing it, and arrive at a conclusion the data can't actually support.
"Bad science can come into play here. If you're going to do anything serious with data, you need to be concerned with data quality," said James Heires, a data scientist for software estimation and management tool company Quantitative Software Management, in an interview. "There is a temptation to just grab data needed for an analysis, run statistics, and come away with a result -- and not spend the time it takes to understand the data and where it originated."
The best quality data accurately describes something in the real world. To enable that, organizations usually have to do several things, including normalizing their data (getting it into a common format) and verifying its accuracy.
"If you haven't got good master data (the whole concept of cross-reference), and there's one person in the real world, and you have records for them in 20 systems and 30 records in each of those 20 systems, that's your problem," said Steve Jones, global VP of Capgemini's big data practice. "Master data, metadata, and reference data management are the most important things when you look at data quality."
When they're decoupled from business requirements, IT-centric data quality practices don't serve an organization as effectively as they could. Similarly, if business professionals don't understand the importance of data quality, they're inclined to disregard it.
"The number one mistake that people [make is to] lead with data quality initiatives, rather than leading with a business initiative. You see a lot of data quality problems that are internal to IT where the business doesn't engage because it hasn't been involved in a way that it can engage," said Steve Jones, global VP of Capgemini's big data practice. "People take a vanilla approach to data quality, saying all data needs to be high quality. It's not necessary and it adds massive costs. Data quality isn't a business outcome. It's about the business objective that data quality enables."
Data quality can remain IT-focused in some organizations because of the continued focus on schemas -- data warehouse schemas, customer schemas, and product schemas, as examples. As organizations bring in additional data sources from social media and the Internet of Things, they're unable to define a schema that applies equally well to all of them. "Big data has gotten people to realize the phase zero of data quality is being able to navigate among those data sets, not get everyone to conform to schema," Jones said.
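Jones's "phase zero" point can be sketched as schema-on-read: profile what fields each source actually carries so analysts can navigate among datasets, rather than forcing social and IoT data into one warehouse schema up front. The record shapes below are hypothetical.

```python
import json

# Hypothetical records from heterogeneous sources with no shared schema.
social_event = json.loads('{"user": "ann", "text": "love it", "ts": 1718000000}')
iot_reading = json.loads('{"device_id": "th-7", "temp_c": 21.4, "ts": 1718000001}')

def profile(source: str, record: dict) -> dict:
    """Schema-on-read: catalog which fields and types a source actually has,
    instead of demanding it conform to a predefined warehouse schema."""
    return {"source": source, "fields": {k: type(v).__name__ for k, v in record.items()}}

catalog = [profile("social", social_event), profile("iot", iot_reading)]
for entry in catalog:
    print(entry)
# The catalog reveals that both sources share a "ts" field for navigating
# between them, without either one being reshaped to fit the other.
```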
The people responsible for data vary across organizations, depending on their size, business model, fiscal health, and data strategy. These individuals may carry the title of CIO, CTO, chief data officer, chief privacy officer, governance body member, CMO, or line-of-business leader, or hold a combination of these roles. Regardless of titles and organizational structure, it's important to have a combination of business domain expertise, data management expertise, and statistical expertise, as well as oversight governing how data is collected and used.
"I see more companies establishing a governance body, and it's fascinating. It's not something I saw three years ago," said Alex Guazzelli, chief scientist at big data firm Opera Solutions. "Having a governance body to standardize what you have, having a data dictionary for the enterprise, and having a schema that's dynamic enough that it can be updated when new data comes into the data repository or data lake -- all of those things help ensure good quality."
Ultimately, the data quality buck has to stop somewhere. While data quality doesn't appear as a line item on everyone's job description, anyone collecting data, entering it into a system, or using it should be concerned about its quality.
The reputations of data products and the companies who offer them hinge on good-quality data. Survey platform provider SurveyMonkey collects about a million survey responses per month and aggregates them so customers subscribing to its benchmarking service can compare their survey results with the aggregated results of other organizations across industries. A customer might assume the data is reliable as-is, but SurveyMonkey verifies the data across several sources, some of which are outside the company.
"You have to craft different techniques so you can understand if what you have is correct or not," said David Wong, data scientist at SurveyMonkey. "There's no clear answer in every domain because [data quality] is domain-specific, but if you want to make sure you have something that makes sense, you're going to consider things like whether the data will answer the problem, whether it's wildly changing, whether it's unstable, and whether it's something you've been able to verify."