Successful warehouses grow — get ready for the opportunities and obstacles
So you've deployed your first dimensional data mart. Fortunately, the business loves it and is clamoring for more! They want to query and analyze integrated data from additional enterprise data sources. For some of you, this demand is a dream come true. For others, it may be the start of a slow-motion nightmare.
In this column, I'll use a case study to discuss the challenges you'll likely encounter when it's time to roll your next business process subject area or data mart into production. Sadly, there's no magic. Putting your next subject area into production is a difficult, though not intractable, problem. It's a good place to focus any fastidious tendencies you might have.
Let's start at the end-state goal. Your target enterprise information architecture consists of multiple interconnecting business process subject-area schemas (or data marts) that are dimensional in format. By dimensional, I mean there's a detailed fact table connected to multiple dimension tables. The dimension tables allow different subject areas to interconnect. You have to share dimensions between subject areas for this to work. In other words, if two schemas use a customer dimension, they have to use the same customer dimension. Previous Data Warehouse Designer columns have advocated the enterprise data warehouse bus-architecture approach, providing clear instructions for designing in an incremental and iterative fashion.
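This end state can be sketched concretely. The following is a minimal illustration, using SQLite and invented table and column names (nothing here comes from the schemas in the column itself): two subject-area fact tables that both reference the same physical customer dimension table.

```python
import sqlite3

# A minimal sketch of the bus architecture: two subject areas sharing one
# conformed customer dimension. All names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY,  -- surrogate key
        account_number TEXT,                 -- natural (loyalty) key
        postal_code    TEXT
    );
    -- Retail sales subject area: transaction-grain fact table
    CREATE TABLE fact_retail_sales (
        date_key     INTEGER,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        sales_amount REAL
    );
    -- Call center subject area reuses the SAME dimension table
    CREATE TABLE fact_call_center (
        date_key     INTEGER,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        call_minutes REAL
    );
""")
```

Because both fact tables carry keys drawn from the one dimension table, any customer attribute is guaranteed to mean the same thing in both subject areas.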
Using shared or "conformed" dimensions has several important implications. Dimension surrogate keys must be identical across subject areas; how else can you join across them? Additionally, you must have corporate agreement about which attributes are tracked through history and which are updated in place. Tracking an attribute's history spawns new surrogate keys each time it changes, so without that agreement the two subject areas' keys would quickly diverge.
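The payoff of identical surrogate keys is the classic drill-across query. Here's a small self-contained sketch with invented data: each subject area is summarized separately, then the summaries are joined through the shared customer keys.

```python
import sqlite3

# Drill-across sketch: combine retail sales and call-center facts through
# the conformed customer dimension. Schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    CREATE TABLE fact_calls (customer_key INTEGER, minutes REAL);
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 25.0);
    INSERT INTO fact_calls VALUES (1, 12.0), (2, 3.0);
""")
rows = conn.execute("""
    SELECT c.name, s.total_sales, k.total_minutes
    FROM dim_customer c
    JOIN (SELECT customer_key, SUM(amount) AS total_sales
          FROM fact_sales GROUP BY customer_key) s
      ON s.customer_key = c.customer_key
    JOIN (SELECT customer_key, SUM(minutes) AS total_minutes
          FROM fact_calls GROUP BY customer_key) k
      ON k.customer_key = c.customer_key
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ann', 150.0, 12.0), ('Bob', 25.0, 3.0)]
```

If the two marts had assigned different keys to the same customer, this join would silently produce wrong answers, which is why conformance is non-negotiable.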
Let's imagine you launched your enterprise information infrastructure with a successful retail sales data mart. The retail sales information is used daily by store management, as well as by many corporate users in marketing and finance. The dimensional design for retail sales is outlined in Figure 1. It captures data at the correct transactional grain, its daily update process is smooth, and system downtime is well within the service-level agreement (SLA). The retail sales data mart is a thing of grace and beauty.
The success of the retail sales project has emboldened management to extend the enterprise warehouse to support additional CRM requirements, namely customers' calls into the call center. I won't detail this new schema, but it obviously includes the customer dimension. In theory, this decision is no problem: You'll reuse the date and customer dimensions from the first mart, add some new dimensions and new facts, and you're done, right? In practice, there will be a few wrinkles.
Obviously, you'll need to figure out how to modify the data extract, transform, and load (ETL) process to also populate the new tables in the call center schema. But there may be some more complex implications.
Let's look a bit more closely at the customer dimension in the original retail sales schema. First, note that the definition of customer is actually for a subset of all customers; the only customers you know anything about are those with loyalty cards and corresponding account numbers. When you add the call center information, you can expect to get information about a whole bunch of customers you've never seen before. The customer dimension ETL process is going to have to change to accommodate the new source, as I'll discuss later. Adding new customers to the customer dimension shouldn't affect the existing data mart. The new customer rows won't join to any existing retail sales fact rows because there's no way to link a "nonloyalty" caller to a sales transaction. The existing sales schema should behave the same, or almost the same, as it always has.
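The claim that new customers are harmless to the old mart is easy to demonstrate. In this sketch (invented names again), the call-center load adds non-loyalty callers to the dimension; the existing sales facts join to exactly the same rows as before.

```python
import sqlite3

# Sketch: the call-center ETL adds customers the retail mart has never
# seen. New dimension rows get fresh surrogate keys; existing sales fact
# rows still join only to the customers they always did.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        source_id    TEXT,
        source       TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    INSERT INTO dim_customer (source_id, source) VALUES ('L-001', 'loyalty');
    INSERT INTO fact_sales VALUES (1, 99.0);
""")
# The ETL change: load callers who have no loyalty account
new_callers = [("CC-777", "call_center"), ("CC-778", "call_center")]
conn.executemany(
    "INSERT INTO dim_customer (source_id, source) VALUES (?, ?)", new_callers)
# The existing sales schema behaves exactly as before: one joined row
joined = conn.execute("""
    SELECT COUNT(*) FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
""").fetchone()[0]
print(joined)  # 1
```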
What if the business users want to track more information for call-center customers than was necessary or available for retail sales? As a simple example, what if the caller's phone number is important? You need to add that new attribute to the customer dimension, again modifying the customer-dimension ETL process. As was the case with the new customers, adding the phone number shouldn't disrupt the original data mart; existing reports and analyses should continue to work as before.
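Adding an attribute is usually the gentlest kind of dimension change. A sketch of why (names and the phone number are invented): an ALTER TABLE ADD COLUMN leaves queries written before the change untouched.

```python
import sqlite3

# Sketch: adding a phone_number attribute to an in-production dimension.
# Reports written before the change keep working unchanged because they
# never reference the new column. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ann')")
old_report = "SELECT name FROM dim_customer"  # an existing report's query

# The ETL change: add the attribute, then populate it from the new source
conn.execute("ALTER TABLE dim_customer ADD COLUMN phone_number TEXT")
conn.execute("UPDATE dim_customer SET phone_number = '555-0100' WHERE customer_key = 1")

rows = conn.execute(old_report).fetchall()
print(rows)  # [('Ann',)] -- the old report is unaffected
```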
Finally, let's tackle a harder problem. As shown in Figure 1, you have attributes about the customer's address, starting with postal code. What if the business users of the call-center data have a compelling need to track history for the customer's address? In other words, they want to be able to know that when a customer called last year she lived in Montana, but now lives in Hawaii. Dimensional modeling can handle this requirement with what's known as a type 2 slowly changing dimension technique. The issue is that you're changing a fundamental characteristic of the customer dimension in a way that could affect the existing retail sales subject area. I'll return to this dilemma in a bit.
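The mechanics of the type 2 response are worth seeing concretely. In this minimal sketch (column names, dates, and postal codes are all invented), an address change expires the current dimension row and inserts a new row with a new surrogate key:

```python
import sqlite3

# Minimal type 2 slowly changing dimension sketch: when the tracked
# attribute (postal_code) changes, expire the current row and insert a
# new row with a new surrogate key. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY AUTOINCREMENT,
        account_number TEXT,
        postal_code    TEXT,
        valid_from     TEXT,
        valid_to       TEXT,      -- far-future date marks the current row
        is_current     INTEGER
    );
    INSERT INTO dim_customer
        (account_number, postal_code, valid_from, valid_to, is_current)
    VALUES ('L-001', '59601', '2002-01-01', '9999-12-31', 1);  -- Montana
""")

def apply_scd2_change(conn, account, new_postal, change_date):
    """Type 2 response: expire the current row, then add a new version."""
    conn.execute("""UPDATE dim_customer
                    SET valid_to = ?, is_current = 0
                    WHERE account_number = ? AND is_current = 1""",
                 (change_date, account))
    conn.execute("""INSERT INTO dim_customer
                    (account_number, postal_code, valid_from, valid_to, is_current)
                    VALUES (?, ?, ?, '9999-12-31', 1)""",
                 (account, new_postal, change_date))

apply_scd2_change(conn, 'L-001', '96813', '2003-06-15')  # moved to Hawaii
rows = conn.execute("""SELECT customer_key, postal_code, is_current
                       FROM dim_customer ORDER BY customer_key""").fetchall()
print(rows)  # [(1, '59601', 0), (2, '96813', 1)]
```

Note that the customer now has two surrogate keys, one per address; facts recorded while she lived in Montana can point at key 1, and later facts at key 2. That multiplication of keys is exactly what threatens the existing retail sales schema.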
Not surprisingly, the first thing you need to do to deploy the call-center data is develop a plan. I hope you already have policies for managing the development cycle: checking in code to source control; using separate development, test, and production environments; having specifications and test plans; deploying SLAs for system availability; and so on. If not, now is a good opportunity to put that infrastructure in place.
The call center schema design specifies the new tables and modifications to the existing tables, such as customer. The designers should also provide a reasonable specification for the ETL process. You need to drill down on those specifications to a finer level of detail, discussing what the designs mean in the context of modifying a system that's in production. In other words, you need to think through, at a high level, how you're going to modify the existing ETL system to incorporate the new workload. You also need to consider whether you're going to alter the existing customer table or rebuild it. In either case, you must think through any implications for the SLA.
One of the most important issues to resolve during the planning phase is how to handle the modification of the customer dimension to track history. By far the easiest approach is to start tracking history now. All existing historical data gets associated with the customer's attributes as they are today. New facts get correctly associated with future changes in customer attributes. Often it's the only feasible approach because the history of customer attributes — address in our example — is simply unavailable. Note that even with this simple approach, you might need to modify the ETL process for retail sales so that you're propagating the correct customer key into the fact table. In practice, the retail sales ETL process likely picks up the most current customer key, but you'd be foolish to exclude verification of this point from your test plan.
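That verification point, the surrogate-key lookup in the fact ETL, can be sketched as follows (invented names throughout). Once the dimension is type 2, the lookup must filter to the current dimension row, or new facts will land on expired keys:

```python
import sqlite3

# Sketch of the verification point: after the dimension becomes type 2,
# the retail sales ETL must resolve each incoming account number to the
# CURRENT surrogate key, not an expired one. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        account_number TEXT,
        is_current INTEGER);
    INSERT INTO dim_customer VALUES (1, 'L-001', 0),  -- expired version
                                    (2, 'L-001', 1);  -- current version
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
""")

def load_sale(conn, account_number, amount):
    # The surrogate-key lookup must filter to the current dimension row
    key = conn.execute(
        """SELECT customer_key FROM dim_customer
           WHERE account_number = ? AND is_current = 1""",
        (account_number,)).fetchone()[0]
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (key, amount))

load_sale(conn, 'L-001', 42.0)
fact_key = conn.execute("SELECT customer_key FROM fact_sales").fetchone()
print(fact_key)  # (2,)
```

A test that loads a sale for a customer with both expired and current dimension rows, then checks which key landed in the fact table, belongs in your test plan.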
A more challenging approach is to re-create history for a customer. You may be able to build a version of the customer dimension that includes the history of customer address changes. This is seldom easy, but it may be possible. If it is, it's the job of the data warehouse team leader to present management with a cost-benefit analysis of the effort. If you decide to take on this challenge, you face a second decision: whether to re-create history in the retail sales schema as well. The process of re-creating history in the fact table is conceptually simple because you know the sales transaction dates and the dates for which a customer's address is valid. However, this conceptually simple action requires updates on a significant number of fact table rows, which is an expensive operation. A second cost-benefit analysis is required. And don't forget to develop a test plan to ensure you didn't mess things up!
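The conceptually simple part can be sketched in a few lines (names, dates, and keys are invented). Each fact row's customer key is re-pointed at the dimension row whose validity window covers the transaction date; on production data volumes this UPDATE is exactly the expensive operation described above.

```python
import sqlite3

# Sketch of re-creating fact history: re-point each sale's customer_key
# at the dimension row whose validity window covers the sale date.
# All names, dates, and keys are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, account_number TEXT,
        postal_code TEXT, valid_from TEXT, valid_to TEXT);
    INSERT INTO dim_customer VALUES
        (1, 'L-001', '59601', '2002-01-01', '2003-06-15'),  -- Montana
        (2, 'L-001', '96813', '2003-06-15', '9999-12-31');  -- Hawaii
    CREATE TABLE fact_sales (
        sale_date TEXT, account_number TEXT, customer_key INTEGER, amount REAL);
    -- Both facts initially carry the current key (2), as they would after
    -- the easy "start tracking history now" approach
    INSERT INTO fact_sales VALUES ('2002-08-01', 'L-001', 2, 10.0),
                                  ('2004-01-10', 'L-001', 2, 20.0);
""")
conn.execute("""
    UPDATE fact_sales SET customer_key = (
        SELECT d.customer_key FROM dim_customer d
        WHERE d.account_number = fact_sales.account_number
          AND fact_sales.sale_date >= d.valid_from
          AND fact_sales.sale_date <  d.valid_to)
""")
restated = conn.execute(
    "SELECT sale_date, customer_key FROM fact_sales ORDER BY sale_date"
).fetchall()
print(restated)  # [('2002-08-01', 1), ('2004-01-10', 2)]
```

The correlated-subquery UPDATE touches every fact row for every affected customer, which is why the cost-benefit analysis matters on a large fact table.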
Finally, your plan should discuss how to handle the proliferation of these changes to aggregate tables and OLAP cubes. If these structures are affected at all, they probably need to be rebuilt from scratch. Adjust your schedule accordingly, and don't forget your test plan.
The development of the new database elements and ETL processes is typically done against a small subset of data. This approach is only sensible. But developers need to remain constantly aware that they're adding new elements to an existing production system with potentially significant data volumes. A technique such as dropping and re-creating a table might work just fine in the development environment, but could be a complete disaster in the real world.
The test environment is the place to verify that the database can be upgraded, new data sources added, and new transformations processed without adversely affecting the existing operations. The first thing you need to do in the test environment is back up the entire system: data structures, data, and ETL processes. You need to test these main things: the new database structures, the ETL processes, and the process of migrating to the new structures and processes. Each test iteration should start from a clean system that hasn't had any of the migration changes applied to it (in other words, a system just restored from backup). Although early testing can be performed against a subset of data, your final set of tests should operate against a comprehensive environment, ideally an exact copy of the production system that you'll be modifying.
It's difficult to determine how to test the migration process. People often use this migration as an opportunity to move the data warehouse to new hardware, which they've been testing as a shadow system. If you're running your data warehouse on commodity hardware, it's not terribly difficult to justify this approach. But if you have huge data volumes in production, you probably have no choice but to invest some serious planning and testing time in "doing it right."
Much of the pain discussed in this column can be avoided if the team designing the initial retail sales schema thinks a few steps ahead and anticipates the need for address-change tracking. One of the elements of "art" in data warehouse design is to think just broadly enough without being sucked into gathering all requirements from all users. Realistically, it's a characteristic of a successful data warehouse that it changes over time, and the data warehouse team should be prepared to accommodate those changes.
My recommendations are: Plan, plan, plan. Communicate. Develop. Test, test, test.
JOY MUNDY is a member of the Kimball Group. She focuses on dimensional data warehouse consulting and education.