informa
/
News

Kimball University: Handling Arbitrary Restatements of History

How do you cope with an executive's request to "bring back a time series of activity for all subscribers who were in platinum status as of X date," or "show me a time series of orders by sales region according to the sales organization as of Y"? Here's how data warehouse pros can cope with the common requirement to look back in time.
An Alternative Approach

If this kind of analysis is common, we want to make it easier on both the business user and the database engine. The simplest approach is to add the operational natural key to the fact table, Acct Number in our example, as illustrated below. The inclusion of Acct Number in the fact table is the only difference between the two data models.

Modified Star Schema

In this case we can remove the subqueries and perform a simple equijoin to the fact table. The "subscriber" portion of the query would not require the self-join to the subscriber table. Join the Subscriber dimension to the fact table by Acct_Number rather than by Subscriber_Key, in a query such as:

SELECT S.Subscriber_Name, F.Order_Date_Key, S.Device_Type, SUM(F.Sales_Amt) AS
Sales_Amt
FROM Subscriber S
INNER JOIN Fact_Table2 F ON (S.Acct_Number=F.Acct_Number
     AND S.Acct_Level='Silver'
     AND '12/12/2007' BETWEEN Row_Start_Datetime and Row_End_Datetime )
GROUP BY S.Subscriber_Name, F.Order_Date_Key, S.Device_Type

This query returns the same rowset as previously. It will probably perform better, especially for a large Subscriber dimension. It may be easier to train the business user in how to construct the query for truly ad hoc analysis. Most query and reporting tools will provide a more intuitive interface for constructing the query.

In some cases, especially if you are integrating your dimension from multiple sources, you may not use the operational natural key, as we did here with the social security number. Instead, you may have an unchanging surrogate key that you manage inside the ETL application, in addition to the Type 2 surrogate key that's the primary key for the Subscriber dimension table.

We have spoken to some consultants and DW designers who use this kind of structure for all of their dimensions. In other words, instead of using a single column integer primary key for the dimension table, they use the operational natural key plus the Row Start Date. Although this approach can address most user queries, it does so at great complexity to both the business user and the database engine, and is most definitely not recommended.

Type 2 Dimensions Win Again

Most users' queries want to see dimension attributes either as they are currently or as they are at the time the fact table transaction occurred. The classic Type 2 dimension structure meets these needs simply and efficiently. Sometimes business users want to view a history of transactions according to the dimension structure as of some arbitrary point in the past. If we knew in advance the date for which they want to restate history — often year-end — we could build out the dimension table to make that analysis simple and easy. But if it's truly ad hoc, it's nice to know that the query can be accommodated without modifying the data model at all. During your business requirements gathering process, you may learn that this kind of analysis is fairly common. If so, especially if the dimension in question is quite large, you may consider adding the natural operational key to the fact table in order to make this analysis easier.

One downside of these analyses is that they're not readily compatible with an OLAP database, which likes one and only one dimension table row to be associated with a fact table row. Both of the approaches presented here illustrate the more flexible nature of the relational model to "join and go" unexpected selections of dimension members. Even if your BI system uses only the relational engine, you should expect this sort of query to take longer than a standard one. You're asking a harder question, and it's unlikely that a precomputed aggregation will be available.

Plan to Meet Users' Requirements

For most organizations, few analyses need to restate history as of an arbitrary date. The good news is that even if you didn't capture this business requirement at design time, your dimensional data model with Type 2 attributes should support the analysis — though perhaps not as easily as your users would like. If it's a common need, it's something that should be uncovered during requirements gathering. You can plan your architecture, analysis tools and even data model to support those requirements rather than try to put bandages on a system that's already in production.