Kimball University: Maintaining Dimension Hierarchies
Dimension hierarchies and pre-computed aggregations can make or break your data warehouse. Here's how to design, load and maintain true hierarchies while working around bad data sources and optimizing for usability and performance.
Dimensions are key to navigating the data warehouse / business intelligence system, and hierarchies are the key to navigating dimensions. Any time a business user talks about wanting to drill up, down or into the data, they are implicitly referring to a dimension hierarchy. In order for those drill paths to work properly, and for a large DW/BI system to perform well, those hierarchies must be correctly designed, cleaned, and maintained.
Hierarchies are important not just for usability. They play a huge role in query performance for a modern DW/BI system: aggregations are often precomputed and stored for intermediate hierarchy levels and transparently used in queries. Precomputed aggregations are one of the most valuable tools to improve query performance, but in order for them to work, your hierarchies have to be clean.
Start with the Design
The solution to the problem of maintaining hierarchies begins during the design phase. For every substantial dimension, spend time thinking through the hierarchical relationships. Business-user input is absolutely imperative, as is time spent exploring the data.
The first question to resolve: what are the drill-down paths, or hierarchies, in each dimension? Most dimensions have a hierarchy, even if it's not coded in the transaction system. A core dimension such as customer, product, account, or even date may have many hierarchies. Date provides a good example that we all understand.
The date dimension often has three or more hierarchies. Novice dimensional modelers will try to create a single hierarchy that goes from day to week, month, quarter, and year. But that just doesn't work! Weeks do not roll up smoothly to months or even years. There is usually a separate fiscal calendar, and sometimes several others.
Display the hierarchies graphically to review them with the business users. The diagram below shows clearly the different hierarchies and levels that will be available. Notice the attributes that apply at different levels. This picture is a graphical display suitable for communicating with users and among the DW/BI team; it does not represent the table's physical structure. Get user buy-in on the hierarchies, levels, and names. Equally important, test how much transformation you need to apply to the actual data in order to populate these hierarchical structures.
The familiar date dimension contains lessons that are applicable to the administration of all dimensions:
• You can have multiple hierarchies. Most interesting dimensions have several alternative hierarchies. Work with business users to name columns and hierarchies so that the meaning of each is clear.
• You must have many-to-one referential integrity between adjacent levels: a day rolls up to one and only one month, a month to one quarter, and a quarter to one year.
• If the data warehouse environment (as opposed to the original source) maintains referential integrity with explicit physical tables for each level, then a unique primary key must be identified at each level. If these keys are artificial surrogate keys, then they should be hidden from the business users in the final single, flat denormalized dimension table in the presentation layer of the data warehouse. A common error is to treat month name (January) or month number as the key for the month level; the correct primary key is the combination of year and month. We encounter this mistake in many kinds of dimensions. In a geography dimension, for example, city name alone is not an identifier; the key needs to be some combination of city, state, and perhaps country.
• Think carefully during the design phase about whether columns can be reused between hierarchies. You might think that the week hierarchy could share the year column with the calendar hierarchy, but what about the first and last weeks of the year? If our business rule is to have week 1 for a new year start on the first Monday of the year, Week 1 of 2009 starts on January 5. January 1-4 will fall in 2008 for the week hierarchy. You need a separate year-of-week column. Sometimes you do want hierarchies to intersect, but you must be certain that the data will support that intersection.
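The week-versus-calendar-year point is easy to check with a small calculation. The following is a minimal sketch, assuming the business rule described above (week 1 starts on the first Monday of the year). The function names and column names are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta


def first_monday(year):
    """Return the first Monday of the given calendar year."""
    jan1 = date(year, 1, 1)
    # date.weekday() returns 0 for Monday, so this offset is 0..6 days.
    return jan1 + timedelta(days=(7 - jan1.weekday()) % 7)


def date_dimension_row(d):
    """Build the hierarchy columns for one day (column names are hypothetical)."""
    week_year = d.year
    week_start = first_monday(week_year)
    if d < week_start:
        # Days before the first Monday belong to the last week of the prior year.
        week_year -= 1
        week_start = first_monday(week_year)
    return {
        "date": d,
        "calendar_year": d.year,
        # Month-level key combines year and month; month number alone is not unique.
        "year_month": (d.year, d.month),
        "year_of_week": week_year,
        "week_of_year": (d - week_start).days // 7 + 1,
    }


row = date_dimension_row(date(2009, 1, 1))
print(row["calendar_year"], row["year_of_week"], row["week_of_year"])  # 2009 2008 52
```

January 1, 2009 carries calendar year 2009 but year-of-week 2008, which is why the two hierarchies cannot share a single year column.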
Load Normalized Data
The date dimension hierarchies are easy to load and maintain. Nothing is more predictable than the calendar, and no user intervention is required. Other dimensions are often populated from imperfect source systems, including the least perfect of all: the spreadsheet.
If your source systems are imperfect, managing the hierarchies over time is painful. Optimally, hierarchies should be maintained before the data warehouse — in the transaction system or a master data management (MDM) system. With good normalized source data, the data warehouse will never see malformed data. In the real world, we're not always so lucky. Data warehouse teams have been managing master data for decades and in many organizations will continue to do so.
Consider a product dimension for a retail store, with a hierarchy that goes from product to brand, category, and department. In this example, the product hierarchy isn't officially part of the transaction systems, but instead is managed by business users in the Marketing department. When we initially load the data warehouse, our incoming data is as illustrated in the table below:
The scenario described here is not ideal: this product dimension is not well maintained by the source systems. Most of it is fine, but notice the last row of data: we have a typo in the category, which breaks referential integrity. The "Ice Creamy" brand in one row rolls up to Frozen Desserts, and in another row to Frozen. This is forbidden.
You should find and fix problems like these early on, before you even start building the ETL system. Your ETL system must implement checks to confirm that each category rolls up to exactly one department, and each brand to exactly one category. But by the time you're actually loading the historical data, you should have worked with the source systems and business users to fix the data errors.
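A rollup check of this kind can be quite small. The sketch below is one possible way to write it, not a prescribed implementation: it groups rows by a child level and reports any child value that maps to more than one parent. The sample rows are hypothetical. Only the "Ice Creamy" brand and its two conflicting categories come from the example above; the product and department names are invented.

```python
from collections import defaultdict


def find_rollup_violations(rows, child, parent):
    """Return {child value: sorted parent values} for children with more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


# Hypothetical incoming rows; the last row has the category typo.
incoming = [
    {"product": "Vanilla Pint", "brand": "Ice Creamy",
     "category": "Frozen Desserts", "department": "Grocery"},
    {"product": "Chocolate Pint", "brand": "Ice Creamy",
     "category": "Frozen Desserts", "department": "Grocery"},
    {"product": "Strawberry Pint", "brand": "Ice Creamy",
     "category": "Frozen", "department": "Grocery"},
]

# Check each adjacent pair of levels in the hierarchy.
for child, parent in [("brand", "category"), ("category", "department")]:
    print(child, "->", parent, find_rollup_violations(incoming, child, parent))
```

Here the brand-to-category check reports "Ice Creamy" with two parents, while the category-to-department check passes, so the report points directly at the level where the data needs fixing.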
The real challenge lies with ongoing updates of the dimension table. We don't have time during nightly processing to have a person examine a suspect row and make an intelligent determination about what to do. If the data arriving at the ETL system's door is suspect, the ETL system can't distinguish between bad data and intentional changes. This is one of the hazards of developing a prototype or proof of concept. It's easy to fix up the data on a one-time basis; keeping it clean over time is hard.
Maintain True Hierarchies
Clean source data is essential. True hierarchies are often maintained in normalized tables, as illustrated below. Optimally, this maintenance occurs before the data warehouse proper, either in the source transaction system or a master data management system.
You can write an ETL process to move this nicely structured data into the dimension table; it's a two-step process. Start at the top of the hierarchy (department), and perform inserts and updates into normalized tables in the staging area. Work down to the leaf level (product). Your staging tables will look similar to the structures in the sample product hierarchy table presented earlier. Once you've performed the extract step and have staged all the hierarchical data, write a query to join these tables together and perform standard dimension processing from the staging area into the data warehouse dimension.
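The two steps can be sketched with Python's built-in sqlite3 module standing in for the staging database. This is an illustration under stated assumptions rather than a production design: the table names, column names, and sample rows are all hypothetical, and the upsert syntax requires SQLite 3.24 or later.

```python
import sqlite3

DDL = """
CREATE TABLE stg_department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE stg_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES stg_department (department_id)
);
CREATE TABLE stg_brand (
    brand_id    INTEGER PRIMARY KEY,
    brand_name  TEXT NOT NULL,
    category_id INTEGER NOT NULL REFERENCES stg_category (category_id)
);
CREATE TABLE stg_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    brand_id     INTEGER NOT NULL REFERENCES stg_brand (brand_id)
);
"""

# Step 2 query: join the normalized levels into one flat row per product.
FLATTEN_SQL = """
SELECT p.product_id, p.product_name, b.brand_name,
       c.category_name, d.department_name
FROM stg_product p
JOIN stg_brand b      ON b.brand_id = p.brand_id
JOIN stg_category c   ON c.category_id = b.category_id
JOIN stg_department d ON d.department_id = c.department_id
ORDER BY p.product_id
"""


def upsert(conn, table, columns, rows):
    """Insert rows, or update non-key columns when the key already exists.

    The first entry in `columns` is treated as the primary key.
    """
    key, others = columns[0], columns[1:]
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({', '.join('?' for _ in columns)}) "
        f"ON CONFLICT ({key}) DO UPDATE SET "
        + ", ".join(f"{c} = excluded.{c}" for c in others)
    )
    conn.executemany(sql, rows)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce many-to-one integrity
conn.executescript(DDL)

# Step 1: stage each level, starting at the top and working down to the leaf.
upsert(conn, "stg_department", ["department_id", "department_name"],
       [(1, "Grocery")])
upsert(conn, "stg_category", ["category_id", "category_name", "department_id"],
       [(10, "Frozen Desserts", 1)])
upsert(conn, "stg_brand", ["brand_id", "brand_name", "category_id"],
       [(100, "Ice Creamy", 10)])
upsert(conn, "stg_product", ["product_id", "product_name", "brand_id"],
       [(1000, "Vanilla Pint", 100), (1001, "Chocolate Pint", 100)])
conn.commit()

# Step 2: flatten the staged hierarchy for standard dimension processing.
flat_rows = conn.execute(FLATTEN_SQL).fetchall()
for row in flat_rows:
    print(row)
```

Loading from the top down means every parent row exists before its children arrive, so the foreign keys reject any child that points to a missing parent. The flattened rows are what would feed surrogate key assignment and change handling for the dimension table.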
The product dimension in the data warehouse should be denormalized into a single flattened dimension table. The normalization illustrated above is the design pattern for the source system and staging areas, not the actual dimension table that users query.
Address Dirty Sources
Not everyone has a well-designed source system with normalized hierarchies as illustrated above. It's common in the DW/BI world for hierarchies to be managed by business users. Transaction systems tend to have only enough information to do their job, and business users often have a legitimate need for alternative, richer rollups and attributes. What can you do?
• Modify the source systems. This is extraordinarily unlikely, unless your organization wrote those systems.
• Buy and implement a master data management (MDM) system that manages the process of defining and maintaining hierarchies. This is the best solution, though MDM is expensive, both in software licenses and especially in the management commitment and attention it requires.
• Write an applet to manage a specific user hierarchy. Keep your design simple, solving only the problem in front of you, for example the product hierarchy. If you get carried away, you'll find yourself developing what amounts to an MDM solution.
A true hierarchy has referential integrity between each of its levels. Remember that this is fundamentally a data quality issue that is enforced in the back room or source systems; it's typically not carried into the presentation area as separate tables or snowflakes of tables. When a dimension has a true hierarchy, you gain two huge benefits:
• You will be able to define and maintain precomputed aggregations at intermediate levels of the hierarchy. In other words, you can pre-compute and store an aggregate at the Month level or the Product Brand level. Precomputed aggregations are one of the most important tools for improving query performance in the DW/BI system.
• You will be able to integrate data at different levels of granularity. Sometimes data naturally exists at an aggregate level. For example, our store might develop a long-term sales forecast by month and category. We can create a subset dimension at the category level to associate with the forecast facts, and then join together actual and forecast sales, if and only if the product hierarchy is a true hierarchy.
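Both benefits can be illustrated in a few lines. The sketch below uses plain Python dictionaries in place of dimension and fact tables. All names, keys, and amounts are hypothetical. It relies on each product mapping to exactly one brand and one category, which is what makes the rolled-up totals trustworthy.

```python
from collections import defaultdict

# Flattened product dimension: product key -> hierarchy attributes.
product_dim = {
    1: {"brand": "Ice Creamy", "category": "Frozen Desserts"},
    2: {"brand": "Ice Creamy", "category": "Frozen Desserts"},
    3: {"brand": "Crunchy Os", "category": "Cereal"},
}

# Actual sales at product grain: (product key, month, amount).
sales_facts = [(1, "2009-01", 120), (2, "2009-01", 80), (3, "2009-01", 50)]

# Forecast exists only at category grain: (category, month) -> amount.
forecast_facts = {("Frozen Desserts", "2009-01"): 210, ("Cereal", "2009-01"): 45}


def aggregate_to_level(facts, dim, level):
    """Precompute totals at an intermediate hierarchy level, by month."""
    totals = defaultdict(int)
    for product_key, month, amount in facts:
        totals[(dim[product_key][level], month)] += amount
    return dict(totals)


def actual_vs_forecast(facts, dim, forecast):
    """Pair actual and forecast amounts at the category level."""
    actual = aggregate_to_level(facts, dim, "category")
    return {key: (actual.get(key, 0), planned) for key, planned in forecast.items()}


brand_aggregate = aggregate_to_level(sales_facts, product_dim, "brand")
comparison = actual_vs_forecast(sales_facts, product_dim, forecast_facts)
print(brand_aggregate)
print(comparison)
```

If a brand rolled up to two categories, as in the "Ice Creamy" typo earlier, the category totals would split across the two spellings and no longer line up with the forecast.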
Make it Perform
Those with large data warehouses, especially those with large dimensions, need to worry about dimension hierarchies. The performance benefits of precomputed aggregations are tremendous, and they will make or break the usability of the DW/BI system. To realize these benefits, you must implement procedures to maintain hierarchical information correctly in the source system or a master data management system.
In the meantime, users can benefit from navigation paths that look like hierarchies but really aren't. Business users have legitimate reasons for wanting to group information together, and it's our job to make that not just possible, but also easy and well-performing. Just make sure that your project has the resources to ensure success!