Kimball University: The Subsystems of ETL Revisited
These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.
Through education and consulting work, Kimball Group has been exposed to hundreds of successful data warehouses. Careful study of these successes has revealed a set of extract, transformation, and load (ETL) best practices. We first described these best practices in an Intelligent Enterprise column three years ago (see "The 38 Subsystems of ETL" ). Since then we have continued to refine the practices based on client experiences, feedback from students and continued research. As a result, we have carefully restructured these best practices into 34 subsystems that represent the key ETL architecture components required in almost every dimensional data warehouse environment. No wonder the ETL system takes such a large percentage of data warehouse and BI project resources!
The good news is that if you study these 34 subsystems, you'll recognize almost all of them and will be on the way to leveraging your experience as you build your ETL system. While we understand and accept the industry's accepted acronym, the "ETL" process really has four major components: Extracting, Cleaning and Conforming, Delivering and Managing. Each of these components and all 34 subsystems contained therein are explained below.
EXTRACTING: GETTING DATA INTO THE DATA WAREHOUSE
To no surprise, the initial subsystems of the ETL architecture address the issues of understanding your source data, extracting the data and transferring it to the data warehouse environment where the ETL system can operate on it independent of the operational systems. While the remaining subsystems focus on the transforming, loading and system management within the ETL environment, the initial subsystems interface to the source systems to access the required data. The extract-related ETL subsystems include:
Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.
Change Data Capture (subsystem 2) — Isolates the changes that occurred in the source system to reduce the ETL processing burden.
Extract System (subsystem 3) — Extracts and moves source data into the data warehouse environment for further processing.
CLEANING AND CONFORMING DATA
These critical steps are where the ETL system adds value to the data. The other activities, extracting and delivering data, are obviously important, but they simply move and load the data. The cleaning and conforming subsystems change data and enhance its value to the organization. In addition, these subsystems should be architected to create metadata used to diagnose source-system problems. Such diagnoses can eventually lead to business process reengineering initiatives to address the root causes of dirty data and to improve data quality over time.
The ETL data cleaning process is often expected to fix dirty data, yet at the same time the data warehouse is expected to provide an accurate picture of the data as it was captured by the organization’s production systems (see related article, "Data Stewardship 101: First Step to Quality and Consistency). It's essential to strike the proper balance between these conflicting goals. The key is to develop an ETL system capable of correcting, rejecting or loading data as is, and then highlighting, with easy-to-use structures, the modifications, standardizations, rules and assumptions of the underlying cleaning apparatus so the system is self-documenting.
The five major subsystems in the cleaning and conforming step include:
Data Cleansing System (subsystem 4) — Implements data quality processes to catch quality violations.
Error Event Tracking (subsystem 5) — Captures all error events that are vital inputs to data quality improvement.
Audit Dimension Creation (subsystem 6) — Attaches metadata to each fact table as a dimension. This metadata is available to BI applications for visibility into data quality.
Deduplication (subsystem 7) — Eliminates redundant members of core dimensions, such as customers or products. May require integration across multiple sources and application of survivorship rules to identify the most appropriate version of a duplicate row.
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.