An audit can help you discover the source of data quality problems. Hint: It might be the warehouse.
Data quality problems are often widespread and originate in your source systems, their applications, and operational processes. And, sadly, many are the direct result of inadequate warehouse management.
In order to combat these problems, data warehouse architects must understand their source data. This understanding can come from data profiling. But even though profiling techniques are important, they're still exploratory. They leave it up to the analyst to understand how the data profile fits the business requirements. That's where a business rules-based audit can be useful, if not critical.
Understanding the Source
Within your source systems are two principle areas of concern:
1. Insufficient process controls to ensure data quality and consistency. Epitomizing a lack of process control is the ability to change key values, such as account numbers, that have a direct impact on the historical integrity of the warehouse. When you can change the primary keys, you compromise referential integrity. Ensuring the integrity of the warehouse remains intact requires extremely complex ETL processing when these changes occur at the source level.
2. Limited or no basic edits. Source applications have been known to provide little or no edit checks to ensure that data being entered meets even the most basic standards. Without measures to monitor and control the most fundamental standards at the source systems, the warehouse team will struggle against a constant flow of poor quality data. And although user-defined fields add flexibility to an application, they have painful consequences for statistical analysis and reporting required from warehouses. Essentially, user-defined fields are free-form fields. They can be anything, contain anything, and change anytime the user sees fit. All of which introduces the problem of discerning any value from the fields.
The warehouse, unfortunately, adds to the data quality problems, which can be typically attributed to two factors:
1. Inadequate source data analysis. Guidelines for sourcing data into the warehouse must be established in order to avoid simply sourcing data because it is part of an existing record. This practice leads to a high percentage of data of questionable value being warehoused.
2. Limited warehouse management. A fundamental function of the warehouse team is to actively monitor and tune warehoused data in pursuit of quality assurance and data relevance, specifically:
Warehoused data should be associated with a business requirement. If not, the data is often neglected, neither purged because of irrelevance nor scrubbed for quality.
Reference table maintenance is another common source of warehouse-centric data quality problems. Reference tables often are out of sync with applications or even moot if ETL processes allow invalid codes.
With the source data applications providing an unfettered environment for end users, coupled with incomplete transformation, cleansing, and warehouse maintenance processes, the warehouse data is guaranteed to be less than satisfactory.
So, how bad is your warehoused data? A data quality audit is designed and implemented in order to determine the answer. There are three high-level objectives of a data quality effort:
Produce statistically valid data quality measurements of the data warehouse
Investigate, identify, and document leading data quality causes
Create an assessment and recommendation report.
This audit approach is straightforward. It is designed to be implemented by any warehouse team, on demand.
The Data Quality Audit Approach
The data quality audit is a business rules-based approach that incorporates standard deviation to identify variability in sample test results. You examine the likelihood (confidence levels of 95 to 99 percent) of invalid data values occurring in columns or fields after business rules are applied against sample test data. A business rule may be a simple check, such as "Account_Aggregate_Balance must be greater than zero," but it may also include rules that assure its accuracy. For example:
IF "Account_Aggregate_Balance" is > 0
AND "Account_Type" = "LOAN"
AND "Account_Open" is > 30 days
THEN "Account_Interest_Fee" MUST BE > 0
We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Cybersecurity Strategies for the Digital EraAt its core, digital business relies on strong security practices. In addition, leveraging security intelligence and integrating security with operations and developer teams can help organizations push the boundaries of innovation.