Kimball University: Handling Arbitrary Restatements of History
How do you cope with an executive's request to "bring back a time series of activity for all subscribers who were in platinum status as of X date," or "show me a time series of orders by sales region according to the sales organization as of Y"? Here's how data warehouse pros can cope with the common requirement to look back in time.
Most business users of the data warehouse/business intelligence system are content with looking at their information in one of two ways: by the current state of affairs, or by tracking history. For example, a sales manager developing a sales forecast wants to see sales for his or her region as that region is defined today. But that same sales manager confirming the compensation plan must correctly associate all sales with each salesperson, even if they used to report into a different region. Today's best-practice BI systems are designed for these alternative approaches.
But sometimes it's more complicated than that. Do our existing design techniques enable a business user to restate history as of an arbitrary date? As we describe in this article, the answer is "yes, but." Yes we can, but it's a challenging query for business users to construct. Luckily, a modest design change in the data model can improve the situation.
Manage Changes to Dimension Attributes
The Kimball Method advocates three main techniques for managing changes to dimension attributes:
Type 1: Restate history by updating the dimension attribute in place.
Type 2: Track history by adding a new row to the dimension table that contains the new view of the dimension member.
Type 3: Snapshot history by adding new columns to the dimension table that holds the attribute's values at a specific date, often year-end.
These techniques have stood the test of time. They meet the vast majority of business users' requirements for reporting and analysis. Dimension tables are joined to the fact table by a single column key, usually an integer, that's managed by the data warehouse ETL process. This simplicity pays us back at query time when the database engine has a relatively simple task — a task for which most database vendors have invested in tuning their engines. Dimensions with Type 1 and Type 2 attributes are smoothly incorporated into OLAP cubes as well, for high-performance querying and analysis. Sometimes, though, users need to report using dimension attributes as of an arbitrary date. For example, "Bring back a time series of activity for all subscribers who were in platinum status as of December 12," or "Give me a time series of orders by sales region according to the sales organization as of July 15."
This requirement isn't directly met by the three classic dimension attribute change types. Type 1 gives us the dimension attributes as of today. Type 3 is not sufficiently flexible; we need to modify the data model, adding columns to the dimension table to encode the attributes for the alternative history. Type 2 is our best bet; the Type 2 attributes actually contain the information we need to reconstruct the dimension as of a past date.
Looking at an Example
The diagram below illustrates a stylized star schema for subscribers to a service, perhaps a TV set-top box, with a Type 2 dimension for subscriber. Subscribers' status — bronze, silver, platinum — changes over time.
We're keeping track of those changes as illustrated by the sample data below. Our subscriber, Mari Tono, has changed her service several times. With this dimension table it's easy to report on her activity either by her current level — platinum — or by her service level at the time a transaction occurred.
Subscriber Key | Acct Number | Subscriber Name | Acct Level (Type 2) | Current Acct Level (Type 1) | Device Type | Row Start Date | Row End Date |
157 | 111Tono33 | Mari Tono | Bronze | Platinum | Eniac | 6/1/1999 | 4/17/2002 |
348 | 111Tono33 | Mari Tono | Silver | Platinum | Eniac | 4/18/2002 | 12/14/2007 |
423 | 111Tono33 | Mari Tono | Platinum | Platinum | Hal | 12/15/2007 | 12/31/9999 |
375 | 222Gatt44 | Stu Gatts | Bronze | Bronze | Hal | 1/7/2007 | 12/31/9999 |
But let's go back to our business question. Perhaps the business users want to evaluate the effects of a change to the structure of the account service levels that occurred on December 12. We want to see Mari's activity in the months before and after December 12, to see if she changed her behavior. We're not just querying for Mari; we want to analyze all subscribers who were Silver on December 12.
You may be tempted to write a simple query that filters on Subscriber.Acct_Level='Silver' and joins back to the fact table to pick up the sales transactions. But this approach will not return all the results the analyst wants; you won't pick up any of Mari's transactions that occurred while she was Bronze or Platinum. We want all transaction history, but we want it categorized by each subscriber's account level as it was on December 12.
We know we're close, though. Looking at the sample data, we can see that we want to select data for Mari; we want fact rows for subscriber keys 157, 348 and 423. We don't want any data for Stu, because he was not Silver on 12 December (for more on "Handling Time, see "The Soul of the Data Warehouse, Part 3").
Type 2 Dimensions Can Do It!
If we are wizards at writing SQL, we can construct a query to answer this question, and a thousand related analyses. To do it, we need to create a subquery that brings back all the subscriber keys and other subscriber attributes of interest. For example:
SELECT S.Subscriber_Name, F.Order_Date_Key, SQ.Device_Type, SUM(F.Sales_Amt) AS
Sales_Amt
FROM (
SELECT Acct_Number, Device_Type from Subscriber
WHERE Acct_Level='Silver'
AND '12/12/2007' BETWEEN Row_Start_Datetime and Row_End_Datetime
) SQ
INNER JOIN Subscriber S ON S.Acct_Number =SQ.Acct_Number
INNER JOIN Fact_Table F ON S.Subscriber_Key=F.Subscriber_Key
GROUP BY S.Subscriber_Name, F.Order_Date_Key, S1.Device_Type
This query returns the following rowset of the complete history of sales for Mari Tono since she became a subscriber in 1999:
Subscriber Name | Order Date Key | Device Type | Sales Amt |
Mari Tono | 19990615 | Eniac | 100 |
Mari Tono | 20010101 | Eniac | 200 |
Mari Tono | 20020430 | Eniac | 300 |
Mari Tono | 20071212 | Eniac | 400 |
Mari Tono | 20071220 | Eniac | 500 |
A real-world example would bring back more columns, but this is enough to provide the basic structure. Expand the query to join in the other dimension tables in a standard star query to get the full set of data including counts, amounts, dates, products, and other attributes. If you have fine-grained datetime stamps of the changed rows in the dimension table, you may need to modify the BETWEEN logic slightly, as we'll discuss in an upcoming article.
There are several drawbacks to this approach. The first is complexity: we don't like the subquery that joins the subscriber dimension to itself. There are few query and report definition tools that will support the development of this kind of SQL. We can't expect a business user to generate this query; it's something that a member of the BI team will have to construct.
Related to the problem of complexity is performance. We're asking the database engine to do more work than the normal and predictable pattern of the standard star query. This wouldn't be a big deal for a small or medium-sized dimension, but may be problematic if we have tens of millions of subscribers.
If this is a rare analysis, these problems are not insurmountable. As long as the BI team has good relationships with the business, they can help construct the query. A parameterized report in which the user can choose the date of analysis may meet a wide range of requirements.
An Alternative Approach
If this kind of analysis is common, we want to make it easier on both the business user and the database engine. The simplest approach is to add the operational natural key to the fact table, Acct Number in our example, as illustrated below. The inclusion of Acct Number in the fact table is the only difference between the two data models.
In this case we can remove the subqueries and perform a simple equijoin to the fact table. The "subscriber" portion of the query would not require the self-join to the subscriber table. Join the Subscriber dimension to the fact table by Acct_Number rather than by Subscriber_Key, in a query such as:
SELECT S.Subscriber_Name, F.Order_Date_Key, S.Device_Type, SUM(F.Sales_Amt) AS
Sales_Amt
FROM Subscriber S
INNER JOIN Fact_Table2 F ON (S.Acct_Number=F.Acct_Number
AND S.Acct_Level='Silver'
AND '12/12/2007' BETWEEN Row_Start_Datetime and Row_End_Datetime )
GROUP BY S.Subscriber_Name, F.Order_Date_Key, S.Device_Type
This query returns the same rowset as previously. It will probably perform better, especially for a large Subscriber dimension. It may be easier to train the business user in how to construct the query for truly ad hoc analysis. Most query and reporting tools will provide a more intuitive interface for constructing the query.
In some cases, especially if you are integrating your dimension from multiple sources, you may not use the operational natural key, as we did here with the social security number. Instead, you may have an unchanging surrogate key that you manage inside the ETL application, in addition to the Type 2 surrogate key that's the primary key for the Subscriber dimension table.
We have spoken to some consultants and DW designers who use this kind of structure for all of their dimensions. In other words, instead of using a single column integer primary key for the dimension table, they use the operational natural key plus the Row Start Date. Although this approach can address most user queries, it does so at great complexity to both the business user and the database engine, and is most definitely not recommended.
Type 2 Dimensions Win Again
Most users' queries want to see dimension attributes either as they are currently or as they are at the time the fact table transaction occurred. The classic Type 2 dimension structure meets these needs simply and efficiently. Sometimes business users want to view a history of transactions according to the dimension structure as of some arbitrary point in the past. If we knew in advance the date for which they want to restate history — often year-end — we could build out the dimension table to make that analysis simple and easy. But if it's truly ad hoc, it's nice to know that the query can be accommodated without modifying the data model at all. During your business requirements gathering process, you may learn that this kind of analysis is fairly common. If so, especially if the dimension in question is quite large, you may consider adding the natural operational key to the fact table in order to make this analysis easier.
One downside of these analyses is that they're not readily compatible with an OLAP database, which likes one and only one dimension table row to be associated with a fact table row. Both of the approaches presented here illustrate the more flexible nature of the relational model to "join and go" unexpected selections of dimension members. Even if your BI system uses only the relational engine, you should expect this sort of query to take longer than a standard one. You're asking a harder question, and it's unlikely that a precomputed aggregation will be available.
Plan to Meet Users' Requirements
For most organizations, few analyses need to restate history as of an arbitrary date. The good news is that even if you didn't capture this business requirement at design time, your dimensional data model with Type 2 attributes should support the analysis — though perhaps not as easily as your users would like. If it's a common need, it's something that should be uncovered during requirements gathering. You can plan your architecture, analysis tools and even data model to support those requirements rather than try to put bandages on a system that's already in production.
About the Author
You May Also Like
2024 InformationWeek US IT Salary Report
Aug 15, 20242024 InformationWeek US IT Salary Report
May 29, 20242022 State of ITOps and SecOps
Jun 21, 2022