The extract-transform-load (ETL) system, or more informally, the "back room," is often estimated to consume 70 percent of the time and effort of building a data warehouse. But there hasn't been enough careful thinking about just why the ETL system is so complex and resource intensive. Everyone understands the three letters: You get the data out of its original source location (E), you do something to it (T), and then you load it (L) into a final set of tables for the users to query.
When asked about breaking down the three big steps, many designers say, "Well, that depends." It depends on the source, it depends on funny data idiosyncrasies, it depends on the scripting languages and ETL tools available, it depends on the skills of the in-house staff, and it depends on the query and reporting tools the end users have.
The "it depends" response is dangerous because it becomes an excuse to roll your own ETL system, which in the worst-case scenario results in an undifferentiated spaghetti-mess of tables, modules, processes, scripts, triggers, alerts, and job schedules. Maybe this kind of creative design approach was appropriate a few years ago when everyone was struggling to understand the ETL task, but with the benefit of thousands of successful data warehouses, a set of best practices is ready to emerge.
I have spent the last 18 months intensively studying ETL practices and ETL products. I have identified a list of 38 subsystems that are needed in almost every data warehouse back room. That's the bad news. No wonder the ETL system takes such a large fraction of the data warehouse resources. But the good news is that if you study the list, you'll recognize almost all of them, and you'll be on the way to leveraging your experience in each of these subsystems as you build successive data warehouses.
If you've survived to the end of this list, congratulations! Here are the important observations I'd like you to carry away: It's really difficult to argue that any of these subsystems are unnecessary as this list makes it really clear that without dividing up the task (perhaps 38 ways), the descent into chaos is inevitable. The industry is ready to define best-practices goals and implementation standards for each of these 38 subsystems, and it would be a tremendous contribution for the ETL tool vendors to provide wizards or serious templates for each of these 38 subsystems. We have a lot to talk about. Maybe 38 more columns!
Ralph Kimball founder of the Kimball Group, teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. He has four best-selling data warehousing books in print, including the newly released The Data Warehouse ETL Toolkit (Wiley, 2004).