Hubs, Spokes and Buses: How to Get to a Better Data Warehouse
Are you frustrated by the inefficiency, rigidity and latency of a first-generation, hub-and-spoke-architecture data warehouse? Here's a six-step guide to evolving to a streamlined, robust Kimball Dimensional Bus Architecture that will reduce time to reporting, lower data latency, and deliver more detailed, analytically useful information.
Now that data warehousing has become ubiquitous in the corporate world, "clean slate" designs of entirely new decision support systems are becoming somewhat rare. Instead, designs and roadmaps are more often intended to improve, modernize or otherwise evolve existing production data warehouses that may be showing signs of age.
Hub-and-Spoke Architecture
(click image for larger view)
Many first-generation data warehouses have a hub-and-spoke architecture, an approach that presents aggregated and departmentally-aligned information to end users in dimensional form. These "departmental data marts" can be either centralized or distributed, but a foundational tenet of this approach is that they must be sourced from a centralized and normalized repository — often referred to as a Third Normal Form Data Warehouse (3NF DW), as depicted in the diagram at right — that contains the most detailed information available. In this architecture, Sales, Inventory and Payment information might appear in several departmental data marts, aggregated as needed by each department and reflecting each department's preferences for metric calculations or dimensional attribution.
Kimball Bus Architecture
(click image for larger view)
The Kimball Dimensional Bus Architecture, in contrast, makes the most detailed data directly available to end users (also in dimensional form) but in a business-process-aligned (rather than departmentally aligned) manner. Thus, Sales, Inventory and Payment information is stored only once — each in a single Business Process Dimensional Model, as pictured at right — rather than appearing in various incarnations within several hub-and-spoke departmental data marts. Dimensions are conformed in an organization-wide manner, tied to atomically-grained and single-business-process facts, allowing users to creatively sum the data any which way, using any combination of the attributes from the (conformed) dimensions. Special departmental needs are handled by adding department-specific attributes to dimensions, or by creating department-specific measures in the facts, which retains the benefits of a unified cross-department perspective.
Notice that the hub-and-spoke architecture's normalized data warehouse (3NF DW) and its associated ETL are completely eliminated, simplifying and streamlining the solution. (A more thorough discussion of these two data warehousing approaches can be found in the article "Differences of Opinion," by Margy Ross and Ralph Kimball.) WHY TAKE THE BUS?
Some organizations find that the inherent inefficiencies of the hub-and-spoke approach become problematic as the data warehouse and its community mature, pressuring data warehouse management and administration staff, and compromising warehouse robustness. Some of the more common hub-and-spoke drawbacks include:
Multiple Versions of the Truth. The departmental data marts of the hub-and-spoke may present separate and differing versions of facts and metrics needed by multiple departments. There is no explicit architectural support for a user-accessible, cross-departmental view; this can become a source of confusion and frustration to executives viewing inconsistent reports from different departmental data marts.
Reduced Dimensional History. It's difficult to design a 3NF DW to simultaneously (A) support capturing point-in-time images of dimensions (Customer, Product, etc.) that change over time, and (B) exhibit acceptable query performance against facts. Thus, hub-and-spoke data warehouses sometimes eschew capturing dimensional history altogether, presenting historical facts only through the lens of today's dimensional images. Again, businesses sometimes outgrow this limitation.
Brittle Data Marts. The aggregates within the hub-and-spoke departmental data marts are, by definition, based on assumptions of business user information requirements. If and when these requirements change or evolve, the aggregates become obsolete, forcing structural changes to both the department data mart and the ETL that feeds it.
High Latency. The hub-and-spoke approach requires additional layers of storage (3NF DW) and data manipulation (two layers of ETL) before information becomes available to end-users. This extra processing consumes resources and time, and it slows publication.
Slow Adaptation to New or Improved Data Sources. The additional storage and ETL functionality needed by the hub-and-spoke approach may lengthen the data warehouse team's development time when adapting to new or improved data sources.
THE CONVERSION PLEDGE
For the reasons detailed above and others, many organizations are evolving their hub-and-spoke data warehouses into dimensional bus architectures. The section that follows presents a roadmap for evolving your warehouse along with suggestions for a smooth transition. But before you embark on the actual conversion, keep these five goals in mind:
Do No Harm. Avoid undue disruption to the existing hub-and-spoke data warehouse. The transition should be as transparent as possible to business users.
Leverage Existing Features. Use existing features of the hub-and-spoke architecture that are common to the dimensional bus architecture.
Deliver Business Value Quickly. Deploy a new business process dimensional model as the first conversion deliverable.
Retrofit As Needed. Driven by business need, retrofit existing business process areas over time.
Stay on Message. Communicate consistently that this effort is a logical extension of the first-generation hub-and-spoke data warehouse — in response to new and more profound business intelligence needs — not a clean-slate reengineering effort nor a correction of prior design shortcomings.
THE CONVERSION ROADMAP
The recipe for the conversion is actually quite simple: select a new business process area that presents a new data source, extend the hub-and-spoke's dimension conformance components as needed, and add high-speed ETL for the loading of atomic facts into a new Kimball Business Process Dimensional Model, as shown in the diagram at right.
Extending Hub-and-Spoke Architecture
(click image for larger view)
Step 1: Identify a new business process area and data source. The goal is to quickly address some area of urgent business pain by rapidly building a new Kimball Business Process Dimensional Model. Choose wisely here, selecting a business process that requires a single, well-defined source of data that will become a new fact that can leverage existing dimensions and that will be a highly visible win for the data warehouse team. Once you've identified a suitable candidate process, propose it to the business as a proof of concept (POC) for a second-generation modernization of the data warehouse.
Be explicit about the business benefits that the POC is expected to deliver, including reduced time to reporting against new business processes, reduced data latency, more-detailed and analytically useful information for end users, and improved flexibility for changing business requirements. Try to collect stats on the existing hub-and-spoke data warehouse's baseline time to new data sources or departmental data marts, refresh latencies and system performance. The goal is to meet or improve all three measures in the new Kimball Business Process Dimensional Model.
Step 2: Reuse/extend hub-and-spoke ETL. Dimensional data warehousing demands distinctly different and specialized ETL processing of facts and dimensions. Try to reuse and extend the dimension conformance pieces of the hub-and-spoke ETL layers to ensure that they can publish conformed dimensions to the new Business Process Dimensional Model. This may require new functionality for creating and maintaining surrogate keys, deduplication, generating and publishing source-system key cross-references, and enforcing slowly changing dimension policies. (The article "Four Fixes Refurbish Legacy Data Warehouses," by Margy Ross, describes challenges and solutions to modernizing and improving dimension conformance systems.)
You will likely also need to extend the hub-and-spoke ETL to support high-speed fact processing — performing in-memory surrogate key lookups, metrics calculations (if needed) and using bulk data loading and aggregate refresh techniques — because the Kimball Business Process Dimensional Model will be capturing atomically-grained (rather than aggregate) facts. (See Bob Becker's "Subsystems of ETL Revisited" for more detail about the functionality needed in the second-generation Dimensional Bus ETL, and consider referencing the "The Data Warehouse ETL Toolkit" book for in-depth solutions to these challenges.)
Step 3: Reuse/Extend Technical Architecture. The conversion team is going to need a repository to house the new Business Process Dimensional Model. It's best to create entirely new database instances (on separate platforms from the hub-and-spoke system if possible) to support development, test, and production ETL and database server functions without impacting the existing production hub-and-spoke platforms.
You also must ensure that dimensionally friendly features of the DBMS, such as aggregate navigation/query rewrite and special Star schema optimizations, are enabled, tested and fully functional. The goal is to have the new Business Process Dimensional Model perform as well or better than the legacy hub-and-spoke departmental data marts without exposing aggregates to reports or analytic applications.
If you happen to be working in an Oracle environment, Bert Scalzo's "Oracle DBA Guide to Data Warehousing and Star Schemas" offers a great advice on how to get these Oracle DBMS features working reliably.
Step 4: Publish new Business Process Dimensional Model and reports. Once the new ETL, Business Process Dimensional Model, and reports are tested and tuned, migrate these to the new production ETL, database, and report servers. Validate, once again, that data quality and accuracy, completeness, and performance are all exemplary and reliable in the new production environments.
Step 5: Promote the success. Business advocates should announce and celebrate the victory. Statistics on the time and cost savings and performance improvements can be shared as well — presented as a logical next step in the evolution of the first-generation data warehouse to deliver more sophisticated analytics and improved responsiveness to change.
Step 6: Repeat steps 1-5. Partner with the business to extend the Kimball Dimensional Bus to new business process areas or to retrofit older business process areas, based on business needs and priorities. When retrofitting business processes, try to preserve key features of existing hub-and-spoke ETL such as the source definition meta data, changed data capture (CDC) logic and so on. When enough business process areas have been converted to fully support a department's reporting needs, the hub-and-spoke departmental data mart can be retired and replaced with cross-business area reports that exploit powerful drill-everywhere features of the Kimball Dimensional Bus.
FINAL THOUGHTS
By applying these techniques, we can improve the flexibility, legibility, usability, robustness, and performance of first-generation data warehouses while leveraging features common to both architectures and minimizing disruption to the end-user community. As the departmental data marts of the hub-and-spoke are gradually retired, unused pieces of the 3NF DW and it's associated ETL layer can be retired as well. In this manner, over time, the hub-and-spoke can be fully or partially transformed into the Kimball Bus Architecture (as shown in the second diagram in this article) in a win/win for both end users and the data warehouse support team.
Jim Stagnitto is an information architect specializing in data warehousing and master data management and the director of the Information Services practice for IT consultancy A2C. He's a contributing author of The Data Warehouse ETL Toolkit . Write him at [email protected].
About the Author
You May Also Like