Dimension hierarchies and pre-computed aggregations can make or break your data warehouse. Here's how to design, load and maintain true hierarchies while working around bad data sources and optimizing for usability and performance.
Load Normalized Data
The date dimension hierarchies are easy to load and maintain. Nothing is more predictable than the calendar, and no user intervention is required. Other dimensions are often populated from imperfect source systems, including the least perfect of all: the spreadsheet.
If your source systems are imperfect, managing the hierarchies over time is painful. Optimally, hierarchies should be maintained before the data warehouse — in the transaction system or a master data management (MDM) system. With good normalized source data, the data warehouse will never see malformed data. In the real world, we're not always so lucky. Data warehouse teams have been managing master data for decades and in many organizations will continue to do so.
Consider a product dimension for a retail store, with a hierarchy that goes from product to brand, category, and department. In this example, the product hierarchy isn't officially part of the transaction systems, but instead is managed by business users in the Marketing department. When we initially load the data warehouse, our incoming data is as illustrated in table below:
The scenario described here is not ideal: this product dimension is not well maintained by the source systems. Most of it is fine, but notice the last row of data: we have a typo in the category, which breaks referential integrity. The "Ice Creamy" brand in one row rolls up to Frozen Desserts, and in another row to Frozen. This is forbidden.
You should find and fix problems like these early on, before you even start building the ETL system. Your ETL system must implement checks, to confirm that each category rolls to one department, and each brand to one category. But by the time you're actually loading the historical data, you should have worked with the source systems and business users to fix the data errors.
The real challenge lies with ongoing updates of the dimension table. We don't have time during nightly processing to have a person examine a suspect row and make an intelligent determination about what to do. If the data arriving at the ETL system's door is suspect, the ETL system can't distinguish between bad data and intentional changes. This is one of the hazards of developing a prototype or proof of concept. It's easy to fix up the data on a one-time basis; keeping it clean over time is hard.
How Enterprises Are Attacking the IT Security EnterpriseTo learn more about what organizations are doing to tackle attacks and threats we surveyed a group of 300 IT and infosec professionals to find out what their biggest IT security challenges are and what they're doing to defend against today's threats. Download the report to see what they're saying.
IT Strategies to Conquer the CloudChances are your organization is adopting cloud computing in one way or another -- or in multiple ways. Understanding the skills you need and how cloud affects IT operations and networking will help you adapt.