Software // Information Management
09:02 AM
Application Firewalls (WAFs) vs. Web Behavior Analytics
Jun 09, 2016
This webinar discusses how the biggest problems to your web site aren't cross-site scripting and S ...Read More>>

Kimball University: Maintaining Dimension Hierarchies

Dimension hierarchies and pre-computed aggregations can make or break your data warehouse. Here's how to design, load and maintain true hierarchies while working around bad data sources and optimizing for usability and performance.

Load Normalized Data

The date dimension hierarchies are easy to load and maintain. Nothing is more predictable than the calendar, and no user intervention is required. Other dimensions are often populated from imperfect source systems, including the least perfect of all: the spreadsheet.

If your source systems are imperfect, managing the hierarchies over time is painful. Optimally, hierarchies should be maintained before the data warehouse — in the transaction system or a master data management (MDM) system. With good normalized source data, the data warehouse will never see malformed data. In the real world, we're not always so lucky. Data warehouse teams have been managing master data for decades and in many organizations will continue to do so.

Consider a product dimension for a retail store, with a hierarchy that goes from product to brand, category, and department. In this example, the product hierarchy isn't officially part of the transaction systems, but instead is managed by business users in the Marketing department. When we initially load the data warehouse, our incoming data is as illustrated in table below:

Sample Source Data

The scenario described here is not ideal: this product dimension is not well maintained by the source systems. Most of it is fine, but notice the last row of data: we have a typo in the category, which breaks referential integrity. The "Ice Creamy" brand in one row rolls up to Frozen Desserts, and in another row to Frozen. This is forbidden.

You should find and fix problems like these early on, before you even start building the ETL system. Your ETL system must implement checks, to confirm that each category rolls to one department, and each brand to one category. But by the time you're actually loading the historical data, you should have worked with the source systems and business users to fix the data errors.

The real challenge lies with ongoing updates of the dimension table. We don't have time during nightly processing to have a person examine a suspect row and make an intelligent determination about what to do. If the data arriving at the ETL system's door is suspect, the ETL system can't distinguish between bad data and intentional changes. This is one of the hazards of developing a prototype or proof of concept. It's easy to fix up the data on a one-time basis; keeping it clean over time is hard.

2 of 3
Comment  | 
Print  | 
More Insights
Threaded  |  Newest First  |  Oldest First
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
2016 InformationWeek Elite 100
Our 28th annual ranking of the leading US users of business technology.
Twitter Feed
InformationWeek Radio
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.