The SQL of OLAP
Don't overlook the core strength of your OLAP technology solution: SQL.
When considering online analytic processing (OLAP), architects often focus on issues such as which dimensions to include, what facts are relevant, how often to refresh the data contents, and so on. Among these issues, the OLAP language is often overlooked. And of all the OLAP-centric languages, the most often ignored is SQL itself. To overlook the language of your OLAP technology solution is to ignore its real strength, or weakness, because this language dictates your applications' flexibility and complexity.
Many aspects of OLAP are already integrated with the relational database engine. This blending of technology blurs the distinction between an RDBMS and OLAP data management technology, effectively challenging the passive role often relegated to relational databases with regard to dimensional data. The more your RDBMS can address the needs of both normalized (traditional relational data and related techniques and technology) and denormalized (dimensional/OLAP-centric techniques and technology above and beyond star schemas) data, then the more you can realize the genuine value-add for OLAP-only technology: providing an environment to perform end-user interrogation of a data set instead of the heavy lifting associated with OLAP data management. Leveraging your investment in RDBMS technology, skills, and resources to fill dual roles not only makes for more efficient administration, but also minimizes the needed investment in OLAP-only technology.
OLAP Functionality
OLAP is a critical BI technology that enables analysts and executives alike to gain informational insight through fast, reliable, and interactive access to a wide variety of views. The target information is transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user and defined by the business. While OLAP systems have the ability to answer "who" and "what" questions, it's their ability to answer "what if" that sets them apart from other BI tools.
Leading RDBMS products, such as DB2 and Oracle, currently offer core, OLAP-centric SQL functions, including categories such as ranking, numbering, and grouping.
Ranking is performed with two functions: RANK and DENSE_RANK.
RANK assigns a sequential rank of a row within a window. The RANK of a row is defined as one plus the number of rows that strictly precede the row. Rows that aren't distinct within the ordering of the window are assigned equal ranks. If two or more rows aren't distinct with respect to the ordering, then the sequential rank numbering will have one or more gaps. That is, the results of RANK may have gaps in the numbers resulting from duplicate values.
DENSE_RANK also assigns a sequential rank to a row in a window. However, a row's DENSE_RANK is one plus the number of rows preceding it that are distinct with respect to the ordering. Therefore, the sequential rank numbering will have no gaps, with ties being assigned the same rank.
Numbering uniquely identifies rows in a resultant set with ROW_NUMBER. This function computes the sequential row number of the row within the window defined by an ordering clause (if one is specified), starting with 1 for the first row and continuing sequentially to the last row in the window. If an ordering clause, ORDER BY, isn't specified in the window, the row numbers are assigned to the rows in arbitrary order as returned by the subselect.
Grouping capabilities are critical to OLAP functionality. ROLLUP and CUBE are extensions of the GROUP BY clause. The functionalities of ROLLUP and CUBE are often referred to as supergroups.
A ROLLUP group is an extension to the GROUP BY clause that produces a result set that contains subtotal rows in addition to the "regular" grouped rows. Subtotal rows are superaggregate rows that contain further aggregates whose values are derived by applying the same column functions that were used to obtain the grouped rows. A ROLLUP grouping is a series of grouping-sets:
GROUP BY ROLLUP (a,b,c) is equivalent to:
GROUP BY GROUPING SETS
(
(a,b,c)
(a,b)
(a)
()
)
Notice that the n elements of the ROLLUP translate to an n+1 grouping set. Another point to remember is that the order in which the grouping expressions are specified is significant for ROLLUP.
The CUBE supergroup is the other extension to the GROUP BY clause that produces a result set that contains all the subtotal rows of a ROLLUP aggregation and, in addition, contains "cross-tabulation" rows. Cross-tabulation rows are additional superaggregate rows. They are, as the name implies, summaries across columns as if the data were represented as a spreadsheet. Like ROLLUP, a CUBE group can also be thought of as a series of grouping-sets. In the case of a CUBE, all permutations of the cubed grouping expression are computed along with the grand total. Therefore, the n elements of a CUBE translate to 2n grouping-sets.
GROUP BY CUBE (a,b,c) is equivalent to:
GROUP BY GROUPING SETS
(
(a,b,c)
(a,b)
(a,c)
(b,c)
(a)
(b)
(c)
()
)
Notice that the three elements of the CUBE translate to eight grouping sets. Unlike ROLLUP, the order of specification of elements doesn't matter for CUBE: CUBE (DayOfYear, Sales_Person) is the same as CUBE (Sales_Person, DayOfYear).
CUBE is an extension of the Rollup function. The CUBE function not only provides the column summaries we saw in rollup but also calculates the row summaries and grand totals for the various dimensions.
Aside from these functions, the ability to define a window is equally important to SQL's OLAP functionality. You use windows to define a set of rows over which a function is applied and the sequence in which it occurs. Another way to view the concept of a window is to equate it with the concept of a slice. In other words, a window is simply a slice of the overall data domain.
Moreover, when you use an OLAP function with a column function, such as AVG, SUM, or MAX, the target rows can be further refined, relative to the current row, as either a range or a number of rows preceding and following the current row. For example, within a window partitioned by month, a moving average can be calculated over the previous three-month period.
The value of having these terms available in leading product lines is that you can call upon the entire SQL vocabulary to combine in any of your OLAP-centric SQL statements.
OLAP Examples of SQL
The following example illustrates an advanced OLAP function used in combination with traditional SQL. The result is a valuable SQL statement that epitomizes the power and relevance of BI at the database engine level.
This example shows a rollup function of sales by region and city:
SELECT b.region_type_id,
a.city_id,
SUM(a.sales) AS
TOTAL_SALES
FROM fact_table a,
lookup_market b
WHERE YEAR(transdate)
=1999
AND a.city_id=b.city_id
AND b.region_type_id=6
GROUP BY ROLLUP
(b.region_type_id,
a.city_id)
ORDER BY
b.region_type_id,
a.city_id
The resultant set is reduced by explicitly querying region 6 and the year 1999. A sample result of the SQL is shown in Table 1, below. The result shows ROLLUP of two groupings (region, city) returning three totals, including region, city, and grand total.
Enterprisewide Dimensional Layer
The traditional data warehouse architecture includes an atomic layer of granular data, often normalized, that serves as the only source of data for subsequent, subject-specific data marts. Generally, the data marts are implemented as star schemas, proprietary MOLAP cubes, or both. Establishing a layer of data marts provides an excellent foundation from which to serve up consistent, multidimensional data on an enterprise scale. But when you couple the current notion of data marts with OLAP-centric SQL functions, it's important that BI architects confirm the value added from proprietary, OLAP-only technology, specifically proprietary multidimensional database servers.
This is especially true when you consider the entire scope of relational technology currently focused on multidimensional data management, including:
Database kernel support optimized to address multidimensional queries
RDBMS technology such as Materialized Query/View Tables used to improve performance
Metadata capture and management of multidimensional structures (for example, dimensions) supported in the relational environment
Expanded OLAP-centric SQL vocabulary standardized for consistent application.
Database-resident OLAP functions, coupled with these multidimensional solutions offer the possibility of a single point of truth and efficient management of enterprisewide, traditional relational and multidimensional data requirements. We don't need to completely eliminate OLAP-only technology, but certainly minimize the needed investment to only true value-add.
Vendors Taking Note
The OLAP-centric words I've illustrated are consistent in RDBMS products such as DB2, Oracle, and Microsoft SQL (2005). The significance should be obvious: OLAP has been integral to the leading database vendors and continues to influence subsequent product releases. As such, these RDBMS products are capable of serving up data for multiple purposes, including OLAP.
For BI architects, this flexibility translates into significant architectural opportunities that go beyond building and propagating proprietary OLAP structures. Instead, architects can commission their relational database of choice as the frontline of dimensional data. In that capacity, the RDBMS controls the data space and its language, SQL, quietly begins to establish itself as the underlying language of choice.
Michael L. Gonzales is the president of The Focus Group Ltd., a consulting firm specializing in data warehousing. He has written several books, including IBM Data Warehousing (Wiley, 2003). He speaks frequently at industry user conferences and conducts data warehouse courses internationally.
About the Author
You May Also Like