Big Data. Big Decisions
InformationWeek
Special Coverage Series


Kimball University: The Subsystems of ETL Revisited

These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.

Through education and consulting work, Kimball Group has been exposed to hundreds of successful data warehouses. Careful study of these successes has revealed a set of extract, transformation, and load (ETL) best practices. We first described these best practices in an Intelligent Enterprise column three years ago (see "The 38 Subsystems of ETL" ). Since then we have continued to refine the practices based on client experiences, feedback from students and continued research. As a result, we have carefully restructured these best practices into 34 subsystems that represent the key ETL architecture components required in almost every dimensional data warehouse environment. No wonder the ETL system takes such a large percentage of data warehouse and BI project resources!

The good news is that if you study these 34 subsystems, you'll recognize almost all of them and will be on the way to leveraging your experience as you build your ETL system. While we understand and accept the industry's accepted acronym, the "ETL" process really has four major components: Extracting, Cleaning and Conforming, Delivering and Managing. Each of these components and all 34 subsystems contained therein are explained below.

EXTRACTING: GETTING DATA INTO THE DATA WAREHOUSE

To no surprise, the initial subsystems of the ETL architecture address the issues of understanding your source data, extracting the data and transferring it to the data warehouse environment where the ETL system can operate on it independent of the operational systems. While the remaining subsystems focus on the transforming, loading and system management within the ETL environment, the initial subsystems interface to the source systems to access the required data. The extract-related ETL subsystems include:

Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.

Change Data Capture (subsystem 2) — Isolates the changes that occurred in the source system to reduce the ETL processing burden.

Extract System (subsystem 3) — Extracts and moves source data into the data warehouse environment for further processing.

CLEANING AND CONFORMING DATA

These critical steps are where the ETL system adds value to the data. The other activities, extracting and delivering data, are obviously important, but they simply move and load the data. The cleaning and conforming subsystems change data and enhance its value to the organization. In addition, these subsystems should be architected to create metadata used to diagnose source-system problems. Such diagnoses can eventually lead to business process reengineering initiatives to address the root causes of dirty data and to improve data quality over time.

The ETL data cleaning process is often expected to fix dirty data, yet at the same time the data warehouse is expected to provide an accurate picture of the data as it was captured by the organization’s production systems (see related article, "Data Stewardship 101: First Step to Quality and Consistency). It's essential to strike the proper balance between these conflicting goals. The key is to develop an ETL system capable of correcting, rejecting or loading data as is, and then highlighting, with easy-to-use structures, the modifications, standardizations, rules and assumptions of the underlying cleaning apparatus so the system is self-documenting.

The five major subsystems in the cleaning and conforming step include:

Data Cleansing System (subsystem 4) — Implements data quality processes to catch quality violations.

Error Event Tracking (subsystem 5) — Captures all error events that are vital inputs to data quality improvement.

Audit Dimension Creation (subsystem 6) — Attaches metadata to each fact table as a dimension. This metadata is available to BI applications for visibility into data quality.

Deduplication (subsystem 7) — Eliminates redundant members of core dimensions, such as customers or products. May require integration across multiple sources and application of survivorship rules to identify the most appropriate version of a duplicate row.

Data Conformance (subsystem 8) — Enforces common dimension attributes across conformed master dimensions and common metrics across related fact tables (see related article, "Kimball University: Data Integration for Real People").

 1 | 23  | Next Page »


Related Links

Related Reading


More Insights




Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

BYTE encourages readers to engage in spirited, healthy debate, including taking us to task. However, BYTE moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing/SPAM. BYTE further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

Follow InformationWeek

By The Numbers

What Are Your Primary Concerns About Using Big Data Software?

Base: 417 respondents at organizations using or planning to deploy data analytics, BI or statistical analysis software
Data: InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012

What Do You Think?

What's your attitude about SQL analysis on top of Hadoop?
We want fast, standard SQL analysis capabilities on Hadoop ASAP
Hadoop is for unstructured data; SQL is for relational databases
We'll give SQL on Hadoop a try, but relational DBs will remain the mainstay
Given strong SQL support on Hadoop, we'd nix the data warehouse
We're not interested in Hadoop
No opinion



Related Content

From Our Sponsor

Five Big Data Challenges and How to Overcome Them with Visual Analytics

Five Big Data Challenges and How to Overcome Them with Visual Analytics

Business leaders often need a visual snapshot of data to quickly grasp and use it. This paper identifies five challenges in presenting data and how visual analytics can resolve them. Solutions are suggested to overcome the challenges of: speed, data clarity, data quality, displaying meaningful results, and dealing with outliers.

Game-Changing Analytics: How IT Executives Can Use Analytics to Create Innovation and Business Success

Game-Changing Analytics: How IT Executives Can Use Analytics to Create Innovation and Business Success

Today's competitive advantage requires a deeper understanding of your business, your market and your customers. As an IT executive, you can drive that knowledge transformation. In this white paper, learn how to make decisions as a strategic business leader and three steps to begin an analytics initiative within your enterprise.

Data Visualization Techniques: From Basics to Big Data with SAS Visual Analytics

Data Visualization Techniques: From Basics to Big Data with SAS Visual Analytics

High-performance data visualization turns sophisticated analyses into meaningful graphics, leading to faster and smarter decision making. In this white paper, learn how visual analytics can transform big data, with additional features such as real-time functionality, mobile compatibility, robust applications for technical groups and accessibility for nontechnical users.

Big Data: Lessons from the Leaders

Big Data: Lessons from the Leaders

Financial performance, competitive advantage, operational efficiency, strategic decision making - every business goal can extract value from big data, and the time for doubt or inaction has long passed. In this Economist Intelligence Unit report, in-depth interviews with data pioneers reveal the link between the effective use of big data and the bottom line among other results.

Decision-Driven Data Management: A Strategy for Better Decisions with Better Data

Decision-Driven Data Management: A Strategy for Better Decisions with Better Data

Which came first, the data or the decision? This white paper makes the case for having a decision in mind, then tailoring big data's volume, variety and velocity to achieve business results such as overcoming customer dissatisfaction or creating well-informed strategies in real time.

Informationweek Reports

Research: The Big Data Management Challenge

Research: The Big Data Management Challenge

The challenge of big data is real, but most organizations don't differentiate 'big data' from traditional data, and nearly 90% of respondents to our survey use conventional databases as the primary means of handling data. We'll help you understand what constitutes big data (it's not just size) and the numerous management challenges it poses.