Kimball University: Five Alternatives for Better Employee Dimension Modeling
The employee dimension presents one of the trickier challenges in data warehouse modeling. These five approaches ease the complication of designing and maintaining a 'Reports To' hierarchy for ever-changing reporting relationships and organizational structures.
The employee dimension presents one of the trickier challenges in data warehouse modeling. These five approaches ease the complication of designing and maintaining a 'Reports To' hierarchy for ever-changing reporting relationships and organizational structures.
Most enterprise data warehouses will eventually include an Employee dimension. This dimension can be richly decorated, including not only name and contact information, but also job-related attributes such as job title, departmental cost codes, hire dates, even salary-related information. One very important attribute of an employee is the identity of the employee's manager. For any manager, we'd like to work down the Reports To hierarchy, finding activity for her direct reports or her entire organization. For any employee, we'd like to work up the hierarchy, identifying his entire management chain. This Reports To hierarchy presents significant design and management challenges to the unwary. This article describes approaches for including this relationship in a dimensional model. The Employee Dimension
The basic structure of the Employee dimension is shown in Figure 1. The unique feature of a Reports To hierarchy is that a manager is also an employee, so Employee has a foreign key reference to itself, from Manager Key to Employee Key.
Someone new to dimensional modeling might leave the table as it is currently designed as the Manager/Employee relationship is fully described. Assuming you can populate the table, this design will work if an OLAP environment is used to query the data. Popular OLAP tools contain a Parent-Child hierarchy structure that works smoothly and elegantly against a variable-depth hierarchy modeled as shown here. This is one of the strengths of an OLAP tool.
However, if you want to query this table in the relational environment, you'd have to use a CONNECT BY syntax. This is very unattractive and probably unworkable:
Not every SQL engine supports CONNECT BY.
Even SQL engines that support CONNECT BY may not support a GROUP BY in the same query.
Not every ad hoc query tool supports CONNECT BY.
Alternative 1: Bridge Table using Surrogate Keys
The classic solution to the Reports To or variable-depth hierarchy problem is a bridge table technique described in The Date Warehouse Toolkit (Wiley 2002), p.162-168 and illustrated by Figure 2. The same Employee dimension table as above relates to the fact table through a bridge table.
The Reports To Bridge table contains one row for each pathway from a person to any person below him in the hierarchy, both direct and indirect reports, plus an additional row for his relationship to himself. This structure can be used to report on each person's activity; the activity of their entire organization; or activity down a specified number of levels from the manager.
There are several minor disadvantages to this design:
The bridge table is somewhat challenging to build.
The bridge table has many rows in it, so query performance can suffer.
The user experience is somewhat complicated for ad hoc use, though we've seen many analysts use it effectively.
In order to drill up -- to aggregate information up rather than down a management chain -- the join paths have to be reversed.
The major challenge comes when we want to manage Employee and the Reports To hierarchy as a Type 2 dimension -- a dimension for which we are tracking history rather than updating in place. This bridge table would still work in theory; the problem is the explosion of Employee and Reports To Bridge records to track the changes.
To understand the problem, look back at Figure 1 and think about it as a Type 2 dimension for a medium-sized company with 20,000 employees. Imagine that the CEO -- the top of the hierarchy -- has 10 senior VPs reporting to her. Let's give her a Type 2 change that generates a new row and hence a new Employee Key. Now, how many employees are pointing to her as their manager? It's a brand new row, so of course no existing rows point to it; we need to propagate 10 new Type 2 rows for each of the senior VPs. The change ripples through the entire table. We end up replicating the complete Employee table because of one attribute change in one row. Even aside from the obvious implication of data volume explosion, simply teasing apart the logic of which rows need to be propagated is an ETL nightmare. Alternative 2: Bridge Table with Separate Reports To Dimension
Tracking the history of changes in a variable depth hierarchy such as an employee Reports To hierarchy is especially challenging when the hierarchy changes are intermingled with other Type 2 changes in the dimension. An obvious solution is to separate the Employee dimension from the Reports To relationship. Simplify Employee by removing the self-referencing relationship, and create a new Reports To dimension, as illustrated in Figure 3.
The key elements that distinguish this design from the classic structure are:
Eliminate the surrogate key for manager from the Employee dimension, and hence the recursive foreign key relationship.
The Reports To dimension has very few columns: surrogate keys, personnel numbers, and names. The only Type 2 attribute is possibly the Manager Position Number.
If you're exclusively using OLAP to query the schema, the bridge table is unnecessary.
If the business users don't need to track changes in the full Reports To hierarchy, this solution works neatly. Employee is a Type 2 dimension. We see the name of each employee's manager. If Employee.ManagerName is managed as Type 2, we can easily see the names of all past bosses from the Employee dimension. If Reports To is managed as Type 1 -- we're not tracking changes in the reporting structure -- it is no more difficult to populate and maintain than in the classic dimension structure.
If the business users absolutely must see the history of the reporting relationship, this solution will be challenging. We've simplified the management problem by separating out the Reports To and Employee dimensions, but if we get a major organizational change we're still going to have to propagate a lot of new rows in both Reports To and the bridge table.
Alternative 3: Bridge Table with Natural Keys
In order to track changes in a Reports To hierarchy for anything other than trivial data volumes, we need a solution that does not use surrogate keys. The classic structure described in Figure 2 works fine at query time, but it's a maintenance challenge. Our natural key alternative is illustrated in Figure 4.
The key elements of this design relative to the classic structure of Alternative 1 are:
Eliminate the surrogate key for manager from the Employee dimension, and hence the recursive foreign key relationship.
Include the Employee dimension twice in the schema, once as the employee (linked directly to the fact table), and once as the manager (linked via the bridge table). The Manager dimension table is simply a database view of the Employee dimension.
The bridge table is built on employee numbers -- the natural key carried in the source systems -- rather than the data warehouse surrogate keys. It's like the classic bridge table except that we need start and end dates to uniquely identify each row.
The propagation of new rows in the bridge table is substantially fewer than before since new rows are added when reporting relationships change, not when any Type 2 employee attribute is modified (as in Figure 2). A bridge table built on natural keys is an order of magnitude easier to manage -- though still quite challenging.
A primary design goal is to be able to find all the fact rows associated with a manager and her entire organization, as the organization was structured at the time of the event measured in the fact table. This is a complicated query:
From the Manager view of the Employee dimension, find the manager we're interested in.
Join to the bridge table to find the personnel numbers and row dates for the employees in her organization.
Join again to the Employee dimension to find the surrogate Employee Key for the people in the organization.
Finally, join to the fact table to pick up all facts associated with these employees.
The joins to the Bridge table and Manager view of Employee must constrain to pick up only the one row that's in effect as of the time of the fact transaction.
SELECT Manager.ManagerName, Employee.EmployeeName, SUM(FactTable.SomeFact) AS
OrganizationalSum
FROM FactTable
INNER JOIN Employee -- standard dimensional join
ON (FactTable.EmployeeKey = Employee.EmployeeKey) INNER JOIN NKBridge -- needs a date constraint ON (Employee.PersonnelNum = Bridge.PersonnelNum
AND Fact.DateKey BETWEEN Bridge.RowStartDate and Bridge.RowEndDate) INNER JOIN Manager -- needs a date constraint ON (Bridge.MgrPersonnelNum = Manager.MgrPersonnelNum
AND Fact.DateKey BETWEEN Manager.RowStartDate AND Manager.RowEndDate) WHERE Manager.ManagerName = 'Name of specific person'
GROUP BY Manager.ManagerName, Employee.EmployeeName
The natural key bridge table approach is unwieldy. Its main advantage is that it's feasible to maintain. It also avoids breaking out the reporting relationship into a separate dimension, as in Alternative2. Any queries that don't involve the Reports To structure can drop the bridge table and Manager dimension view. Disadvantages include:
Query performance is a concern as the queries are complex and the bridge table will grow quite large over time.
The technique is not appropriate for broad ad hoc use. Only a tiny percentage of power users could ever hope to master the complex query structure.
The technique relies on dynamic "date-bracketed" joins between the tables, and hence cannot be implemented in OLAP technology.
Alternative 4: Forced Fixed Depth Hierarchy Technique It is tempting to force the structure into a fixed depth hierarchy. Even a very large company probably has fewer than 15-20 layers of management, which would be modeled as 15-20 additional columns in the Employee dimension. You'll need to implement a method of handling the inevitable future exceptions. A fixed depth employee dimension table is illustrated in Figure 5.
The Employee Org Level Number tells us what level from the top of the hierarchy we'll find this employee. Usually we fill in the lower levels with the employee's name.
At query time, the forced fixed depth hierarchy approach will work smoothly with both relational and OLAP data access. The biggest awkwardness is to train the users to query the Org Level Number first to find out the level where the employee is located -- for example Level 5 -- and then constrain on that column (Level05 Manager Name). A design that uses this approach must very carefully evaluate whether this two step query procedure is actually workable with a particular query tool and consider the training costs for the business users. Query performance should be substantially better than designs that include a bridge table.
The forced fixed depth approach is maintainable, but you will see a lot of propagation of Type 2 rows. If the entire fixed depth hierarchy is managed as Type2, then a new CEO (Level01 Manager) would result in a new row for every employee. Some organizations compromise by managing the top several levels as Type1.
Alternative 5: The PathString Attribute
By now you're probably desperate for a recommendation. Two years ago, a clever student in a Kimball University modeling class described an approach that allows complex ragged hierarchies to be modeled without using a bridge table. Furthermore, this approach avoids the Type 2 SCD explosion described in Alternative #1, and it works equally well in both OLAP and ROLAP environments.
The PathString attribute is a field in the Employee dimension that contains an encoding of the path from the supreme top level manager down to the specific employee. At each level of the hierarchy, the nodes are labeled left to right as A, B, C, D, etc. and the entire path from the supreme parent is encoded in the PathString attribute. Every employee has a PathString attribute. The supreme top level manager has a PathString value of "A". The "A" indicates that this employee is the left most (and only) employee at that level. Two additional columns would hold the level number, and an indicator of whether the employee is a manager or an individual contributor. Figure 6 shows a sample organization chart with PathString values for each node.
Users query the tree by creating a filter condition on the PathString column in the Employee dimension. For example, we can find all the people who report (directly or indirectly) to the employee with PathString ACB by filtering WHERE PathString LIKE 'ACB%'. We can find direct reports by adding a clause AND OrgLevel = 4.
The advantage of the PathString approach is its maintainability. Because of this clever structure, you will see substantially fewer Type 2 rows cascading through the dimension. An organizational change high in the tree -- such as creating a new VP organization and moving many people from one node to another -- will result in a substantial restatement of the tree. If you're tracking the organizational structure itself as Type 2, this would mean many new rows in the employee dimension. But it's still fewer rows than the alternative approaches.
The main disadvantage of the PathString approach is the awkwardness of the business user query experience. This solution will require substantial marketing and education of the user community for it to be palatable.
Recommendation
Hopefully when you study these alternatives, you will see one that meets your needs. A Type2 Reports To or variable-depth hierarchy is a challenging beast to include in your DW/BI design. This is particularly true if you want to support ad hoc use of the structure as you'll need to balance ease of use and query performance against some very difficult maintenance problems. The decision matrix is complicated by the different capabilities of alternative storage engines, especially the differences between relational and OLAP.
The sad conclusion is that there is no universally great solution to the problem. To make the best choice, you need to have a deep understanding of both your data and your business users' requirements. It's always best to strive for that understanding, but in this case, it's imperative.
About the Author
You May Also Like