Kimball University: Microsoft SQL Server Comes of Age for Data Warehousing
With new compression, partitioning and star schema optimization features, Microsoft's SQL Server 2008 is catching up with the state of the industry in data warehousing. Here's why these three capabilities are crucial for scalability and performance on any platform.
Warren Thornthwaite
Microsoft's upcoming SQL Server 2008 release includes several new features and enhancements that are extremely valuable in data warehousing and business intelligence systems — as those already leveraging these capabilities on other database platforms will attest. The key performance enhancements include database compression, partitioning, and star schema optimization. In this article I briefly describe the benefits of these three features in any DW/BI deployment.
The Kimball Group offers courses and a Toolkit book that show how to apply our data warehouse design principles and techniques specifically on the Microsoft platform, but it should be noted that we have always been, and remain, vendor independent. Whether you're using Microsoft SQL Server or another platform, the Kimball Group encourages you to explore the potential of these three broadly available features in your environment.
Speed the Queries With Database Compression
Database compression is exactly what it sounds like: compressing the data before it is stored on disk. While this may sound mundane, since we've been using zip utilities to reduce file sizes for decades, moving compression into the database can make a significant difference in both storage and performance.
The storage advantages of compression are obvious: smaller data size means you need less disk space. In SQL Server 2008, you should expect roughly a 50 percent reduction in storage requirements from compression. This advantage passes through to the backup process, which should complete much faster with less data to back up. It's also possible to compress the backup without compressing the data in the database.
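For example, a compressed backup in SQL Server 2008 looks something like this; the database name and backup path are placeholders for your own:

```sql
-- Back up with compression even if the data in the database is not compressed.
-- SalesDW and the file path are hypothetical; substitute your own names.
BACKUP DATABASE SalesDW
TO DISK = N'E:\Backups\SalesDW.bak'
WITH COMPRESSION, INIT;
```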
On the performance side, the DW/BI system is often constrained by its disks, which are the slowest-performing components in the environment. By compressing data you reduce the amount of disk space used and therefore the number of disk reads required, in some cases by 50 percent or more. The tradeoff is that more CPU cycles are needed to decompress the data before it is delivered to the application, but as CPU speeds have increased, that penalty has shrunk. As a result, compression can cut query time significantly for certain kinds of queries. Because compression happens at the data-definition level, database components such as the storage manager handle the compression process without requiring any changes to your BI applications.
There are many different types of compression, with the general rule being that the greater the compression, the more CPU time required. Compression techniques can be applied within a row, across all rows on a page, or at the table, file, or database level. One form of row-level compression applies the variable-length concept we already have in the VARCHAR data type to all fields where it might make a difference. For example, we might define a column like OrderDollars as Decimal(20,5), which takes 13 bytes in SQL Server 2005. That scale and precision might be necessary to support the few very large orders you get every year, but it leaves a lot of unused space in most rows of the fact table; storing the column in a variable-length format uses only the bytes each value actually needs.
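In SQL Server 2008, this is what row compression does. As a minimal sketch, you enable it with a table rebuild; the table name dbo.FactOrders is purely illustrative:

```sql
-- Rebuild a hypothetical fact table with row compression, which stores
-- fixed-width types such as DECIMAL in a variable-length format.
ALTER TABLE dbo.FactOrders
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = ROW);
```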
There are several forms of compression that can be applied at the page level. One technique used in SQL Server 2008 is called dictionary compression. This involves looking for repeated values anywhere on the page and storing them in a special compression information structure kept on each page after the page header. For example, a customer dimension may have a region attribute that only has ten possible values. Storing a distinct list of these values in the compression information structure and only keeping small pointers to the structure at the row level can reduce the overall size of the data dramatically. Note that this technique is conceptually the same as normalizing the data into a Region table at the page level behind the scenes. You get the benefits of the dimensional model and reduced storage requirements all in one!
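You can estimate the savings before committing and then apply page compression with a rebuild; the customer dimension name below is illustrative:

```sql
-- Estimate how much space page compression would save on a hypothetical customer dimension.
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'DimCustomer',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Apply page compression (row compression plus prefix and dictionary compression).
ALTER TABLE dbo.DimCustomer
REBUILD WITH (DATA_COMPRESSION = PAGE);
```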
The biggest benefit of compression is realized on queries that require reading many pages from disk. Queries that do distinct counts of top N customers or that require full table scans, such as creating data mining sets, are good examples. To give you a sense of the potential impact of compression, I did a simple test with a customer dimension with more than 3 million rows and 44 columns. The table below compares data size and query results before and after compression.
Impact of Compression on Data Size and Query Performance
|  | Base Table | Page Compressed | Percent Reduction |
|---|---|---|---|
| Data size | 1,115,408 KB | 465,352 KB | 58.30% |
| Total rows | 3,068,505 | 3,068,505 | -- |
| Query times (min:sec) |  |  |  |
| Select Count(*) | 04:35.9 | 00:43.4 | 84.30% |
| Select CustType, AVG(CustValueScore) | 04:26.3 | 00:05.1 | 98.10% |
| Select Distinct Region | 04:20.1 | 00:04.9 | 98.10% |
The table also shows how compression leverages another DW/BI system performance principle: more memory is better. In this case, SQL Server 2008 keeps the compressed structure after it is read from disk. As a result, the compressed table now fits completely in memory, so the second and third queries were dramatically faster than the first. This advantage of having more available memory is a major driver of adoption of 64-bit machines.
Note that the queries used for this table were designed to demonstrate the best case. Overall, you should expect more modest performance gains of 15 percent to 20 percent, as long as you have available CPU cycles. If your system is already CPU bound, compression will worsen performance.
The CPU cost of compression comes at both ends of the data management process. The CPU must do more work to compress the data before it is written to disk. This raises the question: how does compression impact data load times? The answer depends on the type of compression used, but I tried a simple insert of 30,205 new rows into compressed and uncompressed versions of the customer dimension and found that the compressed insert required about 25 percent more CPU time. Compressed index builds also require more CPU time, but the resulting indexes are faster to traverse.
If the time it takes to load and index data in your ETL process is already substantial and your server is CPU bound, this increase could hurt. If the load and index times are relatively small, then the increase is bearable. Before you commit to compression, test its impact on your disk usage and backups, as well as your query and load times.
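One simple way to run such a test is to capture elapsed time and disk reads for a representative query before and after compression, along the lines of this sketch (the table and column names are illustrative):

```sql
-- Report CPU time, elapsed time, and logical/physical reads for the query that follows.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT DISTINCT Region
FROM dbo.DimCustomer;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```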
Divide and Conquer With Table Partitioning
Table partitioning involves dividing a large table into multiple sub-tables that can be managed independently while retaining, from a query perspective, the appearance of a single table. Table partitioning isn't a new concept; what is new in recent years is that it's available on most relational database platforms (although it often requires an "enterprise" license). Partitioning enables you to handle much larger tables by allowing you to segment the workload and distribute its processing. Implementing table partitioning requires additional management work in the ETL process, but should not require any changes in your BI applications.
In data warehousing, you typically partition your fact tables so that current data can be loaded into a single partition that is set up (perhaps empty, or without indexes) for the fastest possible load. Usually this means partitioning at some level of the transaction date hierarchy, such as month, week, or even day.
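A minimal monthly partitioning sketch in SQL Server might look like the following; the partition function, scheme, table, and boundary values are all illustrative, and a real fact table would have many more boundaries and columns:

```sql
-- Partition a hypothetical fact table by month on an integer date key (yyyymmdd).
CREATE PARTITION FUNCTION pfOrderDate (INT)
AS RANGE RIGHT FOR VALUES (20080101, 20080201, 20080301);

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactOrders (
    OrderDateKey  INT           NOT NULL,
    CustomerKey   INT           NOT NULL,
    OrderDollars  DECIMAL(20,5) NOT NULL
) ON psOrderDate (OrderDateKey);
```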
While you're at it, you can set up a separate load partition in a parallel structure so users can query yesterday's fact table while you load into this "shadow" partition. When the load is finished, simply swap the shadow partition into the main table with an ALTER TABLE command. This lets you provide 24-hour by seven-day availability by duplicating only a small portion of the fact table.
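In SQL Server the swap is a metadata-only SWITCH operation; the shadow table must match the fact table's structure, sit on the same filegroup, and carry a check constraint on the partitioning key. A sketch, with illustrative names:

```sql
-- Swap the freshly loaded shadow table into partition 4 of the live fact table.
ALTER TABLE dbo.FactOrders_Load
SWITCH TO dbo.FactOrders PARTITION 4;
```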
The shadow table concept also makes it easier to prune old data from a table. For example, if you have a table with the last five years of data partitioned by month, you can swap out the oldest month with an empty shadow partition. You can then archive or drop this separate partition as needed.
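The reverse operation swaps the oldest partition out into an empty table of identical structure; again, the names here are illustrative:

```sql
-- Move the oldest month out of the fact table so it can be archived or dropped,
-- then remove the now-empty boundary from the partition function.
ALTER TABLE dbo.FactOrders
SWITCH PARTITION 1 TO dbo.FactOrders_Archive;

ALTER PARTITION FUNCTION pfOrderDate()
MERGE RANGE (20080101);
```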
Partitioning will impact your indexing strategies. In general, DW/BI systems work best with indexes limited to the partition level. An index that spans partitions will require significantly more maintenance time when inserting new rows or dropping historical rows.
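In SQL Server, an index created on the table's partition scheme (or simply created without an ON clause) stays aligned with the table's partitions. A sketch, with illustrative names:

```sql
-- A partition-aligned nonclustered index: it partitions on the same key as the table,
-- so partition switches do not require maintaining an index that spans partitions.
CREATE NONCLUSTERED INDEX IX_FactOrders_CustomerKey
ON dbo.FactOrders (CustomerKey)
ON psOrderDate (OrderDateKey);
```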
If you are having data load or management problems with large tables, you need to investigate table partitioning.
Go Dimensional With Star Schema Optimization
The basic structure of the dimensional model is different from a normalized transaction model. The database optimizer uses a detailed set of statistics about row counts, indexes, and cardinality to determine the most efficient starting point for a query. It then follows the join paths, creating temporary intermediate result sets that ultimately lead to the desired result set. If the optimizer applies a traditional strategy to a dimensional model, it usually picks a small or tightly constrained dimension as the starting point, but the second table it joins, the fact table, is always the largest one in the query. This can lead to highly inefficient queries with large temporary result sets.
Star schema optimization, also known as star join optimization, takes advantage of the unique symmetry of the dimensional model: one fact table joining to many dimension tables. Leveraging this symmetry, the optimizer generally identifies the set of primary keys from each dimension involved in a given query, then, as a last step, queries the fact table with this key list.
Different database products approach this challenge in different ways, but all the major players leverage the dimensional model in generating a better query strategy. Microsoft SQL Server 2008 has improved its star schema optimization by building a hash table and bitmap filter for each participating dimension (search "Bloom filter" on the Internet for more details on this bitmap filter). These bitmap filters are applied to the fact table, filtering out almost all of the rows that don't apply to the query. The remaining rows are then joined to the dimension hash tables.
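The query shape the optimizer is looking for is the classic star join: one large fact table joined to small, selectively filtered dimension tables. A representative example, with illustrative table and column names:

```sql
-- A typical star-join query: selective dimension filters, aggregation over the fact table.
SELECT d.CalendarMonth,
       c.Region,
       SUM(f.OrderDollars) AS TotalDollars
FROM dbo.FactOrders AS f
JOIN dbo.DimDate     AS d ON f.OrderDateKey = d.DateKey
JOIN dbo.DimCustomer AS c ON f.CustomerKey  = c.CustomerKey
WHERE d.CalendarYear = 2007
  AND c.Region       = N'Northwest'
GROUP BY d.CalendarMonth, c.Region;
```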
Early usage of SQL Server 2008 indicates it will generate star schema performance improvements of 15 percent to 25 percent across the entire DW/BI relational query load. Oracle uses bitmapped indexes on each of the foreign keys in the fact table as the core of its star join strategy. Make sure you have all the prerequisites and conditions in place to allow your database to generate the best query plans for your DW/BI system.
Up-and-Coming Opportunities
If you have already pushed your relational database to its limit, there are other technologies you might consider. Column-oriented databases are not new to the market, with commercial products available since at least 1995; however, they are making inroads in the DW/BI space. True to their name, these databases store data in a column-oriented rather than row-oriented form. This naturally supports compression techniques and fits well with the selective nature of analytic queries. The column-oriented approach is not well suited to transactions, but that's not the problem we're trying to solve.
Compression, partitioning and star schema optimization are some of the core features you can use to improve your DW/BI system's query performance. If you are not already using them, we encourage you to explore these capabilities on your platform. If you are using them and you still need better performance, you may need to explore alternative database platforms or bigger, more balanced hardware systems.
For more information about using the Microsoft SQL Server platform for your data warehouse/business intelligence system, attend the Microsoft Data Warehouse In-Depth class or read The Microsoft Data Warehouse Toolkit (Wiley 2006), by Joy Mundy and Warren Thornthwaite of the Kimball Group.