5. Where should we correct data?
Business users are aware that data quality is a serious and expensive problem. Thus, most organizations are likely to support initiatives to improve data quality. But most users probably have no idea where data quality problems originate or what should be done to improve data quality. They may think that data quality is a simple execution problem for the ETL team, but data quality cannot be improved by ETL alone. In this environment, the ETL team needs to be agile and proactive, partnering with the business and the IT teams that support the source systems.
The key decision is where to correct the data. The best solution is clearly to have the data captured accurately in the first place. In most cases the data should be corrected back in the source system. Unfortunately, it is inevitable that poor quality data will reach the ETL system, and when this happens, there are three choices:
2) Send the offending record(s) to a suspense file for later processing
3) Tag the data and pass it through.
The third choice is by far the best choice, whenever possible. Halting the process is obviously a pain because it requires manual intervention to diagnose the problem, restart or resume the job, or abort completely. Sending records to a suspense file is often a poor solution because it is not clear when or if these records will be fixed and reintroduced to the pipeline. Until the records are restored to the data flow, the overall integrity of the database is questionable because records are missing. I don't recommend using the suspense file for minor data transgressions. The third option of tagging the data with the error condition often works well. Bad fact table data can be tagged with an audit dimension that describes the overall data quality condition of the offending fact row. Bad dimension data can also be tagged using the audit dimension or, in the case of missing or garbage data, it can be tagged with unique error values in the field itself. From this point, data quality reporting capabilities can flag the offending fact and dimension rows, indicating a need for resolution and, ideally, repair of the data in the source system.
6. What data latency level do we need to support?
Data latency describes how quickly source system data must be delivered to the business users via the DW/BI system. Data latency obviously has a huge effect on the costs and complexity of your ETL environment. Clever processing algorithms, parallelization and potent hardware can speed up traditional batch-oriented data flows. But at some point, if the data latency requirement is urgent, the ETL system architecture must step up from batch mode to streaming orientation. This isn't a gradual or evolutionary change; it's a major paradigm shift in which almost every step of the data delivery pipeline must be reimplemented.
ETL streams for most organizations typically require a data latency that matches the natural rhythm of the business. We find that most organizations require daily updates for most ETL streams and weekly or monthly updates for other ETL streams. However, in some circumstances, more frequent updates or even real-time updates suit the rhythm of the business. The key is to recognize that only a handful of business processes within any organization are appropriate for real-time updating. There's no compelling reason to convert all ETL processing to real time. The rhythm of most business processes simply doesn't demand that treatment.
Be careful: asking business users if they want "real-time" delivery of data is an invitation for trouble. Most business users will request lower-latency data regardless of whether they understand the impact of the request. We recommend dividing the real-time challenge into three categories: daily, frequently and instantaneous. Get your end users to describe their data latency requirements in similar terms and then design your ETL solution appropriately to support each set of requirements:
- Instantaneous means that the data visible on the end user's screen represents the true state of the source transaction system at every instant. When the source system status changes, the online screen must also respond instantly.
- Frequently means that the data visible to the end user is updated many times per day but is not guaranteed to be the absolute current data as of this instant.
- Daily means that the data visible on the screen is valid as of a batch file download or reconciliation from the source system at the end of the previous working day.
Kimball Group Design Tip #89 Real Time Triage, describes the impacts and implementation alternatives for each of these capabilities in more detail.
This article has addressed six key decisions that must be evaluated during the creation of an ETL architecture supporting a dimensional data warehouse. There is seldom a single correct choice for any of these decisions. As always, the right choices are best driven by the unique requirements and characteristics of your organization.