Kimball University: Five Alternatives for Better Employee Dimension ModelingThe employee dimension presents one of the trickier challenges in data warehouse modeling. These five approaches ease the complication of designing and maintaining a 'Reports To' hierarchy for ever-changing reporting relationships and organizational structures.
The employee dimension presents one of the trickier challenges in data warehouse modeling. These five approaches ease the complication of designing and maintaining a 'Reports To' hierarchy for ever-changing reporting relationships and organizational structures.
Most enterprise data warehouses will eventually include an Employee dimension. This dimension can be richly decorated, including not only name and contact information, but also job-related attributes such as job title, departmental cost codes, hire dates, even salary-related information. One very important attribute of an employee is the identity of the employee's manager. For any manager, we'd like to work down the Reports To hierarchy, finding activity for her direct reports or her entire organization. For any employee, we'd like to work up the hierarchy, identifying his entire management chain. This Reports To hierarchy presents significant design and management challenges to the unwary. This article describes approaches for including this relationship in a dimensional model. The Employee Dimension
The basic structure of the Employee dimension is shown in Figure 1. The unique feature of a Reports To hierarchy is that a manager is also an employee, so Employee has a foreign key reference to itself, from Manager Key to Employee Key.
However, if you want to query this table in the relational environment, you'd have to use a CONNECT BY syntax. This is very unattractive and probably unworkable:
- Not every SQL engine supports CONNECT BY.
- Even SQL engines that support CONNECT BY may not support a GROUP BY in the same query.
- Not every ad hoc query tool supports CONNECT BY.
Alternative 1: Bridge Table using Surrogate Keys
The classic solution to the Reports To or variable-depth hierarchy problem is a bridge table technique described in The Date Warehouse Toolkit (Wiley 2002), p.162-168 and illustrated by Figure 2. The same Employee dimension table as above relates to the fact table through a bridge table.
The Reports To Bridge table contains one row for each pathway from a person to any person below him in the hierarchy, both direct and indirect reports, plus an additional row for his relationship to himself. This structure can be used to report on each person's activity; the activity of their entire organization; or activity down a specified number of levels from the manager.
There are several minor disadvantages to this design:
- The bridge table is somewhat challenging to build.
- The bridge table has many rows in it, so query performance can suffer.
- The user experience is somewhat complicated for ad hoc use, though we've seen many analysts use it effectively.
- In order to drill up -- to aggregate information up rather than down a management chain -- the join paths have to be reversed.
The major challenge comes when we want to manage Employee and the Reports To hierarchy as a Type 2 dimension -- a dimension for which we are tracking history rather than updating in place. This bridge table would still work in theory; the problem is the explosion of Employee and Reports To Bridge records to track the changes.
To understand the problem, look back at Figure 1 and think about it as a Type 2 dimension for a medium-sized company with 20,000 employees. Imagine that the CEO -- the top of the hierarchy -- has 10 senior VPs reporting to her. Let's give her a Type 2 change that generates a new row and hence a new Employee Key. Now, how many employees are pointing to her as their manager? It's a brand new row, so of course no existing rows point to it; we need to propagate 10 new Type 2 rows for each of the senior VPs. The change ripples through the entire table. We end up replicating the complete Employee table because of one attribute change in one row. Even aside from the obvious implication of data volume explosion, simply teasing apart the logic of which rows need to be propagated is an ETL nightmare.