The intent of business intelligence (BI) is to help decision makers make well-informed choices. Therefore, modern BI systems must be able to consume vast quantities of detailed, disparate data and quickly reduce it to meaningful, accurate information that people can confidently act on. Better data quality, in turn, leads to better decision-making.
An immediate challenge for data architects, however, is the rising flood of operational data that must be cleansed, integrated, and transformed. These tasks must address enterprise-scale issues, including the ever-increasing volume and variety of data and, in some cases, near-real-time refresh rates.
Up to the Job?
Most traditional data quality tools, whether they're implemented as stand-alone solutions or used to supplement extract, transform, and load (ETL) processing, are inadequate for enterprise-scale implementations. They simply can't scale to enterprise-level data volumes and refresh frequencies (from batch to continuous).
To illustrate, let's consider the data flow dictated by the cleansing and transformation batch processing that's common to these tools. First, the data must be extracted from the source and temporarily stored in files or tables for processing. The data is then cleansed, transformed, or otherwise prepared according to predefined data quality rules. During this process, the data is moved in and out of temporary files or tables as required. When the data is prepared to the defined specifications, it's temporarily stored again. Finally, the data is moved from the final temporary storage and loaded into the target data warehouse tables or passed on to the ETL technology for further processing. When you consider all the batch data movement required by typical data quality software, it's easy to see that the technology can quickly become a process bottleneck as data volumes or refresh rates increase.
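The staged flow described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular data quality product works: the function names (`write_stage`, `read_stage`, `run_batch`), the CSV staging files, and the trivial standardization rule are all assumptions made for the example. The point is the `moves` counter, which tallies every full pass of the data set in or out of temporary storage.

```python
import csv
import os
import tempfile


def write_stage(rows, directory, name):
    """Persist rows to a temporary staging file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def read_stage(path):
    """Read every row back out of a staging file."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]


def run_batch(source_rows):
    """Hypothetical batch flow: extract, stage, cleanse, stage again, load."""
    moves = 0  # full passes of the data set in or out of temporary storage
    with tempfile.TemporaryDirectory() as tmp:
        # 1. Extract from the source into temporary storage.
        extracted = write_stage(source_rows, tmp, "extract.csv")
        moves += 1

        # 2. Pull the data back out and apply a (very simple) quality rule.
        rows = read_stage(extracted)
        moves += 1
        cleansed = [[name.strip().title(), city.strip().upper()]
                    for name, city in rows]

        # 3. Store the prepared data temporarily again.
        prepared = write_stage(cleansed, tmp, "prepared.csv")
        moves += 1

        # 4. Move the data from the final staging area into the target.
        target = read_stage(prepared)
        moves += 1
    return target, moves


if __name__ == "__main__":
    source = [["  alice smith", " boston "], ["BOB JONES ", "denver"]]
    target, moves = run_batch(source)
    print(target)
    print("full passes over the data:", moves)
```

Even this toy version with a single cleansing rule moves the entire data set four times. Each additional rule that stages its own intermediate result would add more passes, which is why the cost grows with both data volume and refresh frequency.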
The enterprise-scale requirements of data quality, coupled with the limitations of traditional data quality technologies, leave architects with few options. Some architects have merely lowered expectations. They either implement data quality processes for only critical data or constrain quality processing to pedestrian activities, such as simple standardization. Although these approaches may serve as workarounds for data volume issues, they come at the expense of the trustworthiness of the overall warehouse. Poor-quality data compromises virtually all the analytics and, therefore, the data warehouse's value to decision makers.
First Things First
The seasoned architect, however, is aware that certain techniques and technologies can be adapted to meet the requirements of enterprise data quality: specifically, the in-database data mining now offered by leading database vendors. Before I explain how to incorporate in-database mining into your data quality solution, it is critical that you:
- Look beyond the typical applications associated with data mining technology. Data mining is too quickly pigeonholed as an esoteric application used only for prediction or forecasting.
- Appreciate the value in-database data mining brings to the modern warehouse.