Clean source data is essential. True hierarchies are often maintained in normalized tables, as illustrated below. Optimally, this maintenance occurs before the data warehouse proper, either in the source transaction system or a master data management system.
You can write an ETL process to move this nicely structured data into the dimension table; it's a two-step process. Start at the top of the hierarchy (department), and perform inserts and updates into normalized tables in the staging area. Work down to the leaf level (product). Your staging tables will look similar to the structures in the sample product hierarchy table presented earlier. Once you've performed the extract step and have staged all the hierarchical data, write a query to join these tables together and perform standard dimension processing from the staging area into the data warehouse dimension.
The product dimension in the data warehouse should be denormalized into a single flattened dimension table. The normalization illustrated above is the design pattern for the source system and staging areas, not the actual dimension table that users query.
Address Dirty Sources
Not everyone has a well-designed source system with normalized hierarchies as illustrated above. It's common in the DW/BI world for hierarchies to be managed by business users. Transaction systems tend to have only enough information to do their job, and business users often have a legitimate need for alternative, richer rollups and attributes. What can you do?
- • Modify the source systems. This is extraordinarily unlikely, unless your organization wrote those systems.
- • Buy and implement a master data management (MDM) system that manages the process of defining and maintaining hierarchies. This is the best solution, though MDM is expensive in terms of software license but especially management commitment and attention.
- • Write an applet to manage a specific user hierarchy. Keep your design simple, solving only the problem in front of you – for example, the product hierarchy. If you get carried away, you'll find yourself developing what amounts to a MDM solution.
A true hierarchy has referential integrity between each of its levels. Remember that this is fundamentally a data quality issue that is enforced in the back room or source systems; it's typically not carried into the presentation area as separate tables or snowflakes of tables. When a dimension has a true hierarchy, you gain two huge benefits:
- • You will be able to define and maintain precomputed aggregations at intermediate levels of the hierarchy. In other words, you can pre-compute and store an aggregate at the Month level or the Product Brand level. Precomputed aggregations are one of the most important tools for improving query performance in the DW/BI system.
- • You will be able to integrate data at different levels of granularity. Sometimes data naturally exists at an aggregate level. For example, our store might develop a long-term sales forecast by month and category. We can create a subset dimension at the category level to associate with the forecast facts, and then join together actual and forecast sales, if and only if the product hierarchy is a true hierarchy.
Make it Perform
Those with large data warehouses, especially those with large dimensions, need to worry about dimension hierarchies. The performance benefits of precomputed aggregations are tremendous, and they will make or break the usability of the BI/DW system. To realize these benefits, you must implement procedures to maintain hierarchical information correctly in the source system or a master data management system.
In the meantime, users can benefit from navigation paths that look like hierarchies but really aren't. Business users have legitimate reasons for wanting to group information together, and it's our job to make that not just possible, but also easy and well-performing. Just make sure that your project has the resources to ensure success!