It has become fashionable to regard the final deployment step of building an extract-transform-load (ETL) system as a mere tactical choice between delivering dimensional relational tables or OLAP cubes into the end-user environment. But is this choice quite so superficial? Should we defer thinking about this choice until just before roll-out? In this article we take a hard look at this final deployment conundrum and urge you to resolve this question very early in the design process Business intelligence (BI) developers have largely accepted the premise that data is most user-friendly when delivered in a dimensional format. If you define the data warehouse as the platform for all forms of BI, then the last step of the ETL process in the data warehouse is exposing the data in a dimensional format. Many BI developers have recognized that a properly designed set of dimensional relational tables can be transformed into an OLAP cube in a virtual one-to-one mapping. For various reasons explained below, I recommend that all OLAP cubes be built directly from dimensional models. The dimension tables in such a relational schema become the OLAP cube dimensions, often referred to as the edges of the cube. The fact tables from the relational schema provide the content of the OLAP cube cells.
While there can be some small semantic differences between the relational dimensions and the OLAP cube dimensions, the profound overlap between these two approaches has made it tempting to regard the final deployment choice as a tactical maneuver executed at the very end of the data warehouse development. Worse yet, the argument is sometimes made that BI applications can be switched between relational and OLAP implementations because of this similarity.
Not so fast! Under the right circumstances, the ETL pipeline can be mostly insulated from the final deployment choice, but the relational vs. OLAP choice is multi-faceted decision with lots of issues to consider. Let’s look at the advantages and disadvantages of both choices before we jump on either bandwagon.
- Dimensional Relational Advantages
- Relational database structures are largely vendor independent and especially dimensional structures are pretty easily ported. However, ETL command line scripts and proprietary code like PL/SQL may not be very portable.
- All the main relational DBMS’s have high-volume bulk data loaders, and they're especially effective if you turn off transaction logging.
- A wide variety of SQL-generating tools from many vendors are able to directly access the data. Usually these tools can be repurposed to point at a new relational DBMS.
- SQL expertise is widely available in the marketplace, since the main features of SQL have long been standardized, and SQL is routinely taught at the college level.
- Hand-coded SQL is generally readable, although the SQL emitted by high-end BI tools is overwhelming and cannot reasonably be changed by the BI applications development staff.
- There are many different ways to control DBMS performance, including schema designs, indexes, aggregates and materialized views. Dimensional relational structures, because of their predictable characteristics, have well-understood performance tuning techniques, although these techniques vary somewhat by vendor.
- Relational databases are extremely stable and are suitable for serious archiving and backup.
- Relational database structures are not vulnerable to catastrophic invalidation, unlike OLAP cubes that may involuntarily rebuild themselves if a SCD Type 1 change is made to a dimension.
- Database sizes are pretty much unlimited, and single dimensional fact tables in the multi-terabyte range are increasingly common.
- High-end relational databases can join many large tables.
- Hybrid (mixed workload) applications involving both querying and updating are usually easy to construct.
- SQL is a truly horrible language for powerful analysis and complex applications.
- SQL is severely asymmetric: you can perform complex constraints and calculations generally only WITHIN records, not across records.
- In spite of performance tuning capabilities, it is still too easy to lose control of performance.
- Generally much better performance than relational when the cubes are designed correctly, with less need to play complex performance tuning games compared to relational.
- Much more powerful analytic capabilities than relational. MDX, for example, has more powerful semantics for traversing ragged unbalanced hierarchies, such as organization charts, than does SQL.
- Vendor-supplied OLAP tools for reporting and querying have historically been superior to relational tools, although I'm impressed by the steady investment relational tool vendors have made in improving ease of use and feature sets.
- OLAP doesn't suffer from the symmetry problem that limits SQL.
- The decision to load OLAP directly from dimensional relational tables doesn’t affect the ETL backroom very much: the deployment step occurs at the very end.
- Certain loading scenarios can be very fast.
- Certain vertical industries, especially financial services, have developed awesome OLAP solutions.
- OLAP encourages more complex security scenarios, especially for ad hoc access. By comparison, it’s difficult to set up a relational database to protect detailed data (sales by sales rep), but provide more open access to summarized data (sales by region). This is especially true for ad hoc access on the relational side. Security is significantly more powerful on OLAP because of the semantics about parents and children inherent in the access languages.
Comment |Print |More InsightsWebcastsMore WebcastsWhite PapersMore White Papers