Kimball University: The Subsystems of ETL Revisited
These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.
DELIVERING: PREPARE FOR PRESENTATION
The primary mission of the ETL system is the handoff of the dimension and fact tables in the delivery step. There is considerable variation in source data structures and cleaning and conforming logic, but the delivery processing techniques are more defined and disciplined. Careful and consistent use of these techniques is critical to building a successful dimensional data warehouse that is reliable, scalable and maintainable.
Many of these subsystems focus on dimension table processing. Dimension tables are the heart of the data warehouse. They provide the context for the fact tables and hence for all the measurements. For many dimensions, the basic load plan is relatively simple: perform basic transformations to the data to build dimension rows to be loaded into the target presentation table.
Preparing fact tables is certainly important as they hold the key measurements of the business that users want to see. Fact tables can be very large and time consuming to load. However, preparing fact tables for presentation is typically more straightforward.
The delivery systems in the ETL architecture consist of:
Surrogate Key Generator (subsystem 10) — Produces surrogate keys independently for every dimension.
Hierarchy Manager (subsystem 11) — Delivers multiple, simultaneous, embedded hierarchical structures in a dimension.
Special Dimensions Manager (subsystem 12) — Creates placeholders in the ETL architecture for repeatable processes supporting an organization’s specific dimensional design characteristics, including standard dimensional design constructs such as junk dimensions, mini-dimensions and behavior tags.
Fact Table Builders (subsystem 13) — Construct the three primary types of fact tables: transaction grain, periodic snapshot and accumulating snapshot.
Surrogate Key Pipeline (subsystem 14) — Replaces operational natural keys in the incoming fact table record with the appropriate dimension surrogate keys.
Multi-Valued Bridge Table Builder (subsystem 15) — Builds and maintains bridge tables to support multi-valued relationships.
Late Arriving Data Handler (subsystem 16) — Applies special modifications to the standard processing procedures to deal with late-arriving fact and dimension data.
Dimension Manager (subsystem 17) — Centralized authority who prepares and publishes conformed dimensions to the data warehouse community.
Fact Table Provider (subsystem 18) — Owns the administration of one or more fact tables and is responsible for their creation, maintenance and use.
Aggregate Builder (subsystem 19) — Builds and maintains aggregates to be used seamlessly with aggregate navigation technologies for enhanced query performance.
OLAP Cube Builder (subsystem 20) — Feeds data from the relational dimensional schema to populate OLAP cubes.
Data Propagation Manager (subsystem 21) — Prepares conformed, integrated data from the data warehouse presentation server for delivery to other environments for special purposes.
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.