Ideally, the design of your extract, transform, and load (ETL) system begins with one of the toughest challenges: surrounding the requirements. By this we mean gathering in one place all the known requirements, realities, and constraints affecting the ETL system. The list of requirements is pretty overwhelming, but it's essential to lay them on the table before launching a data warehouse project.
The requirements are mostly things you must live with and adapt your system to. Within the framework of your requirements, you'll have many places where you can make your own decisions, exercise your judgment, and leverage your creativity, but the requirements are just what they're named. They are required.
In this column, excerpted from The Data Warehouse ETL Toolkit (Wiley, 2004), we'll discuss the requirements and their implications.
The business needs are the information requirements of the data warehouse's end users. We use the term business needs somewhat narrowly here to mean the information content that end users need to make informed business decisions. Subsequently listed requirements broaden the definition of business needs, but this requirement is meant to identify the extended set of information sources that the ETL team must introduce into the data warehouse.
Taking, for the moment, the view that business needs directly drive the choice of data sources, it's obvious that understanding and constantly examining the business needs are the ETL team's core activities. Elsewhere we have described in detail the process for interviewing end users and gathering business requirements. The result of this process is a set of expectations the users have about what data will do for them.
In many cases, the original interviews with the end users and the original investigations of possible sources don't fully reveal the data's complexities and limitations. The ETL team often makes significant discoveries that affect whether the end user's business needs can be addressed as originally hoped for. And, of course, the ETL team often discovers additional capabilities in the data sources that expand the end users' decision-making capabilities. The lesson here is that even during the most technical back-room development steps of building the ETL system, you must maintain a dialog among the ETL team, data warehouse architects, business analysts, and end users. In a larger sense, the business needs and the content of the data sources are both moving targets that constantly need to be reexamined and discussed.
In recent years, especially with the passage of the Sarbanes-Oxley Act of 2002, organizations have been forced to seriously tighten up what they report and provide proof that the reported numbers are accurate, complete, and untampered with. Of course, data warehouses in regulated businesses such as telecommunications have complied with regulatory reporting requirements for many years. But certainly the whole tenor of financial reporting has become much more serious for everyone.
Some of the financial reporting issues will be outside the scope of the data warehouse, but many others will land squarely within its scope. Typical due diligence requirements for the data warehouse include the following:
- Saving archived copies of data sources and subsequent stagings of data
- Proof of the complete transaction flow that changed any data results
- Fully documented algorithms for allocations, adjustments, and derivations
- Proof of security of the data copies over time, both online and offline.
Data Quality via Data Profiling
As Jack Olson explains so clearly in his book Data Quality: The Accuracy Dimension (Morgan Kaufmann, 2003), data profiling is a necessary precursor to designing any kind of system to use that data. As he puts it: Data profiling "employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure, and quality of the data. A good data profiling [system] can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed."
This perspective is especially relevant to the ETL team that may be handed a data source with content that hasn't really been vetted. For example, Jack points out that a data source that perfectly handles the needs of the production system, such as an order-taking system, may be a disaster for the data warehouse, because the ancillary fields the data warehouse hoped to use weren't central to the success of the order-taking process and were revealed to be unreliable and too incomplete for data warehouse analysis.
Data profiling is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built. At one extreme, a very clean data source that has been well maintained before it arrives at the data warehouse requires minimal transformation and human intervention to load directly into final dimension tables and fact tables.
And at the other extreme, if data profiling reveals that the source data is deeply flawed and can't support the business' objectives, the data warehouse effort should be cancelled. The profiling step not only gives the ETL team guidance as to how much data cleaning machinery to invoke, but protects the ETL team from missing major milestones in the project because of the unexpected diversion to build a system to deal with dirty data. Do the data profiling up front! Use the data profiling results to prepare the business sponsors for the realistic development schedules, the limitations in the source data, and the need to invest in better data capture practices in the source systems.
The general level of security awareness has improved significantly in the last few years across all IT areas, but security remains an afterthought and an unwelcome additional burden to most data warehouse teams. The basic rhythms of the data warehouse are at odds with the security mentality. The data warehouse seeks to publish data widely to decision makers, whereas the security interests assume that data should be restricted to those with a need to know.
Throughout the Toolkit series of books, we've recommended a role-based approach to security where the ability to access the results from a data warehouse is controlled at the final applications delivery point. Security for end users isn't controlled with grants and revokes to individual users at the physical table level, but is controlled through roles defined and enforced on an LDAP-based network resource called a directory server. It is then incumbent on the end users' applications to sort out what the authenticated role of a requesting end user is and whether that role permits the end user to view the particular screen that's being requested.
Additionally, security must be extended to physical backups. If a tape or disk pack can easily be removed from the backup vault, then security has been compromised as effectively as if the online passwords were compromised.
Data Integration and the 360 Degree view
Data integration is a huge topic for IT because, ultimately, it aims to make all systems work together seamlessly. The "360-degree view of the customer" is a familiar name for data integration. In many cases, serious data integration must take place among the organization's primary transaction systems before any of that data arrives at the data warehouse. But rarely is that data integration complete, unless the organization has settled on a single enterprise resource planning (ERP) system, and even then it's likely that other important transaction processing systems exist outside the main ERP system.
Data integration usually takes the form of conforming dimensions and conforming facts in the data warehouse. Conforming dimensions means establishing common dimensional attributes (often textual labels and standard units of measurement) across separated databases so that "drill across" reports can be generated using these attributes.
Conforming facts means making agreements on common business metrics such as key performance indicators (KPIs) across separated databases so that these numbers can be compared mathematically by calculating differences and ratios.
The data latency requirement describes how quickly the data must be delivered to the end users. Data latency obviously has a huge effect on the architecture and system implementation. Up to a point, more clever processing algorithms, parallel processing, and more potent hardware can speed up most of the traditional batch-oriented data flows. But at some point, if the data latency requirement is sufficiently urgent, the ETL system's architecture must convert from batch oriented to streaming oriented. This switch isn't a gradual or evolutionary change; it's a major paradigm shift in which almost every step of the data delivery pipeline must be reimplemented.
Archiving and Lineage
We hinted at these requirements in the previous compliance and security sections. But even without the legal requirements for saving data, every data warehouse needs various copies of old data, either for comparisons with new data to generate change capture records or reprocessing.
We recommend staging the data at each point that a major transformation has occurred. These staging points occur after all four steps: extract, clean, conform, and deliver. So, when does staging (writing the data to disk) turn into archiving (keeping the data indefinitely on some form of permanent media)?
Our simple answer is a conservative answer. All staged data should be archived unless a conscious decision is made that specific data sets will never be recovered in the future. It's almost always less of a headache to read the data back in from permanent media than it is to reprocess the data through the ETL system at a later time. And, of course, it may be impossible to reprocess the data according to the old processing algorithms if enough time has passed.
And while you are at it, each staged/archived data set should have accompanying metadata describing the origins and processing steps that produced the data. Again, the tracking of this lineage is explicitly required by certain compliance requirements, but should be part of every archiving situation.
End-User Delivery Interfaces
The final step for the ETL system is the handoff to the end-user applications. We take a strong and disciplined position on this handoff. We believe the ETL team, working closely with the modeling team, must take responsibility for the content and structure of the data that, as much as we can control, makes the end-user applications simple and fast. This attitude is much more than a vague motherhood statement. We believe it's irresponsible to hand the data off to the end-user application in such a way as to increase the complexity of the application, slow down the final query or report creation, or make the data seem unnecessarily complex to the end users. The most elementary and serious error is to hand across a full-blown normalized physical model and walk away from the job. This is why we go to such lengths to build dimensional structures that comprise the actual final handoff.
In general, the ETL team and data modelers need to work closely with the end-user application developers to determine the exact requirements for the final data handoff. Each end-user tool has certain sensitivities that should be avoided and certain features that can be exploited if the physical data is in the right format. The same considerations apply to data prepared for OLAP cubes.
Some of the big design decisions when building an ETL system must be made on the basis of available resources to build and manage the system. You shouldn't build a system that depends on critical C++ processing modules if those programming skills aren't in house and you can't reasonably acquire and keep those skills. You may be much more confident in building your ETL system around a major vendor's ETL tool if you already have those skills in house and you know how to manage such a project.
You need to look in depth at the big decision of whether to hand code your ETL system or use a vendor's package. Technical issues and license costs aside, you shouldn't go off in a direction that your employees and managers find unfamiliar without seriously considering your decision's long-term implications.
Finally, in many cases, major design decisions will be made for you implicitly by senior management's insistence that you use existing legacy licenses. In many cases, this requirement is one you can live with and for which the advantages in your environment are pretty clear to everyone. But in a few cases, the use of a legacy system for your ETL development is a mistake. This is a difficult position to be in, and if you feel strongly enough about it, you may need to bet your job. If you must approach senior management and challenge the use of an existing legacy system, be well prepared in making your case; be man or woman enough to accept the final decision or possibly seek employment elsewhere.
The combined impact of all these requirements is overwhelming. Perhaps for that reason, many ETL implementations don't have a coherent set of design principles below the basic E and T and L modules. Many ETL systems seem to be the union of a thousand reactions to individual requirements. In the next column, we'll propose a set of unifying principles that will make sense out of your ETL implementation.
Ralph Kimball founder of the Kimball Group, teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. He has four best-selling data warehousing books in print, including the newly released The Data Warehouse ETL Toolkit (Wiley, 2004).
Margy Ross is president of the Kimball Group and an instructor with Kimball University. She co-authored The Data Warehouse Lifecycle Toolkit (Wiley, 1998) and The Data Warehouse Toolkit, 2nd Edition.