Coping With Growing Pains

Your dimensional data mart is popular: Will you be ready once users clamor for more?

Plan Before You Build

Not surprisingly, the first thing you need to do to deploy the call-center data is develop a plan. I hope you already have policies for managing the development cycle: checking in code to source control; using separate development, test, and production environments; having specifications and test plans; deploying SLAs for system availability; and so on. If not, now is a good opportunity to put that infrastructure in place.

The call center schema design specifies the new tables and modifications to the existing tables, such as customer. The designers should also provide a reasonable specification for the ETL process. You need to drill down on those specifications to a finer detail, discussing what the designs mean within the context of modifications to a system in production. In other words, you need to think through, at a high level, how you're going to modify the existing ETL system to incorporate the new workload. You also need to consider whether you're going to alter the existing customer table or rebuild it. In either case, you must think about any implications on the SLA.

One of the most important issues to resolve during the planning phase is how to handle the modification of the customer dimension to track history. By far the easiest approach is to start tracking history now. All existing historical data gets associated with the customer's attributes as they are today. New facts get correctly associated with future changes in customer attributes. Often it's the only feasible approach because the history of customer attributes — address in our example — is simply unavailable. Note that even with this simple approach, you might need to modify the ETL process for retail sales so that you're propagating the correct customer key into the fact table. In practice, the retail sales ETL process likely picks up the most current customer key, but you'd be foolish to exclude verification of this point from your test plan.

A more challenging approach is to re-create history for a customer. You may be able to build a version of the customer dimension that includes the history of customer address changes. This is seldom easy, but may be possible. If it's possible, it's the job of the data warehouse team leader to present management with a cost-benefit analysis of the effort. If you decide to take on this challenge, you have a second decision — whether or not to "re-create" history for the retail sales schema. The process of re-creating history in the fact table is conceptually simple because you know the sales transaction dates and the dates for which a customer's address is valid. However, performing this simple action requires updates on a significant number of fact table rows, which is an expensive operation. A second cost-benefit analysis is required. And don't forget to develop a test plan to ensure you didn't mess things up!

Finally, your plan should discuss how to handle the proliferation of these changes to aggregate tables and OLAP cubes. If these structures are affected at all, they probably need to be rebuilt from scratch. Adjust your schedule accordingly, and don't forget your test plan.

Develop, Test, Deploy

The development of the new database elements and ETL processes is typically done against a small subset of data. This approach is only sensible. But developers need to remain constantly aware that they're adding new elements to an existing production system with potentially significant data volumes. A technique such as dropping and re-creating a table might work just fine in the development environment, but could be a complete disaster in the real world.

The test environment is the place to verify that the database can be upgraded, new data sources added, and new transformations processed without adversely affecting the existing operations. The first thing you need to do in the test environment is back up the entire system: data structures, data, and ETL processes. You need to test these main things: the new database structures, the ETL processes, and the process of migrating to the new structures and processes. Each test iteration should start from a clean system that hasn't had any of the migration changes applied to it (in other words, a system just restored from backup). Although early testing can be performed against a subset of data, your final set of tests should operate against a comprehensive environment, ideally an exact copy of the production system that you'll be modifying.

It's difficult to determine how to test the migration process. People often use this time as an opportunity to migrate the data warehouse to new hardware, which they've been testing as a shadow system. If you're running your data warehouse on commodity hardware, it's not terribly difficult to justify this approach. But if you have huge data volumes in production, you probably have no choice but to invest some serious planning and testing time to "doing it right."

Some Final Thoughts

Much of the pain discussed in this column can be avoided if the initial design for retail sales thinks a few steps ahead to anticipate the need for address-change tracking. One of the elements of "art" in data warehouse design is to think just broadly enough without being sucked into getting all requirements from all users. Realistically, it's a characteristic of a successful data warehouse to change over time, and the data warehouse team should be prepared to accommodate those changes.

My recommendations are: Plan, plan, plan. Communicate. Develop. Test, test, test.

JOY MUNDY is a member of the Kimball Group. She focuses on dimensional data warehouse consulting and education.