Kimball University: Six Key Decisions for ETL Architectures
Best-practice advice on using an ETL tool vs. hand coding, where to integrate, how to capture changed data, when to stage data, where to correct data and what latency levels to shoot for.
1. Should We Use an ETL Tool?
One of the earliest and most fundamental decisions you must make is whether to hand code your ETL system or use a vendor-supplied package. Technical issues and license costs aside, you shouldn't go off in a direction that your employees and managers find unfamiliar without seriously considering the decision's long-term implications. This decision will have a major impact on the ETL environment, driving staffing decisions, design approaches, metadata strategies, and implementation timelines for a long time.
In today's environment, most organizations should use a vendor-supplied ETL tool as a general rule. However, the decision must be weighed against the resources available to build and manage the system. ETL tools are really system-building environments that construct the ETL solution with icons, arrows and properties rather than hand-written code. Be careful: if your proposed ETL development team is made up of old-school hand coders, they may not adapt well to an ETL tool. For this reason alone, some organizations find that custom ETL development remains a reasonable solution.
If you decide to use an ETL tool, don't expect a huge payback in your first iteration. The advantages become more apparent in subsequent iterations, as you begin to leverage the tool's development productivity. Over time, you'll also see the benefits of enhanced maintenance capabilities, more complete documentation and improved metadata support. The article "Should You Use an ETL Tool," by Joy Mundy, provides an in-depth look at the pros and cons of using an ETL tool.
2. Where and How Should Data Integration Take Place?
Data integration is a huge topic for IT because, ultimately, it aims to make all systems work together seamlessly. The "360 degree view of the enterprise" is a commonly discussed goal that really means data integration. In many cases, serious data integration should take place among an organization's primary transaction systems before data arrives at the data warehouse. However, rather than dealing with integration in the operational environment, these requirements are often pushed back to the data warehouse and the ETL system.
Most of us understand the core concept that integration means making disparate databases function together in a useful way. We all know we need it; we just don't have a clear idea of how to break it down into manageable pieces. Does integration mean that all parties across a large organization agree on every data element, or only on some data elements? This is the crux of the decision that must be made: at what level in the data will business management agree, or insist, that integration occur? Are they willing to establish common definitions across organizations and to abide by those definitions?
Fundamentally, integration means reaching agreement on the meaning of data from the perspective of two or more databases. With integration, the results of two databases can be combined in a single data warehouse analysis. Without such an accord, the databases will remain isolated stovepipes that can't be linked in an application. In the context of the ETL environment, data integration takes the form of conformed dimensions and conformed facts in the data warehouse. Conforming dimensions means establishing common dimensional attributes across separate fact tables so that "drill-across" reports can be generated using those attributes. Conforming facts means agreeing on common business metrics, such as key performance indicators (KPIs), across separate databases so that these numbers can be compared mathematically, for example by calculating differences and ratios.
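To make the drill-across idea concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The table and column names (`dim_product`, `fact_sales`, `fact_inventory`, `category`) are hypothetical, not from the article: two fact tables from different processes share one conformed product dimension, so each can be aggregated to the common attribute `category` and the two result sets merged side by side.

```python
import sqlite3

# Hypothetical schema for illustration: two fact tables share a
# conformed product dimension, enabling a drill-across report.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sales_amount REAL);
CREATE TABLE fact_inventory (product_key INTEGER, on_hand_qty INTEGER);
INSERT INTO dim_product VALUES (1, 'Widgets'), (2, 'Gadgets');
INSERT INTO fact_sales VALUES (1, 500.0), (1, 250.0), (2, 100.0);
INSERT INTO fact_inventory VALUES (1, 40), (2, 75);
""")

# Drill-across: aggregate each fact table separately to the conformed
# attribute (category), then join the two result sets on that attribute.
rows = conn.execute("""
SELECT s.category, s.total_sales, i.total_on_hand
FROM (SELECT d.category, SUM(f.sales_amount) AS total_sales
        FROM fact_sales f JOIN dim_product d USING (product_key)
       GROUP BY d.category) AS s
JOIN (SELECT d.category, SUM(f.on_hand_qty) AS total_on_hand
        FROM fact_inventory f JOIN dim_product d USING (product_key)
       GROUP BY d.category) AS i
  ON s.category = i.category
ORDER BY s.category
""").fetchall()

for category, total_sales, total_on_hand in rows:
    print(category, total_sales, total_on_hand)
```

The key design point is that neither fact table is joined directly to the other; each is summarized to the conformed attribute first, which is exactly what fails when the dimensions are not conformed (the `category` values wouldn't line up).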