InformationWeek: The Business Value of Technology

InformationWeek: The Business Value of Technology
InformationWeek - Our New iPad App



InformationWeek Labs
Home
News
NewsFlash
News In Review
Financials
IW Community
AuthorITies
Shop Talk
Careers
Secret CIO
Columnists
Feedback
Business
Resource Center
Labs
Date Book

Services
Contact IW
IW Daily
Subscriptions
Media Kit/Ed Cal.
IW Marketplace
IW International
Site Map
June 30, 1997
Who's Minding The Cache?

The right storage selection is critical for high performance of your data warehouse


By Neil Raden and Michael Peterson

T he most rapidly improving and most-often overlooked area for increasing the performance of your data warehouse is storage. Storage arrays no longer consist of just disk and tape drives; they've evolved into complete, intelligent input/output channels and subsystems-every bit as sophisticated, complex, and expensive as the other components in the data warehouse.

But although data warehousing routinely sets and breaks the high-water mark for large databases, discussion of the relative strengths and weaknesses of storage options for data warehouses has been strangely absent in the industry. From a practitioner's perspective, choosing the right storage method for a data warehouse is a real enigma.

We see storage as key to achieving performance. There are many issues in selecting the right architecture for data warehousing. Do write-back caches or read-ahead caches improve or degrade data warehouse performance? Are higher-capacity drives a benefit, or is the decrease in the number of separate devices a drawback? Is it necessary to have RAID (redundant arrays of inexpensive di sks)? If so, at what levels? Can a data warehouse be backed up? Is the movement to centralized, multihost, shared storage, which lets a single array service mainframes, midrange servers, and PC servers, a boon or a bust for warehousing? And when it comes to cost, should capacity be added well in advance of the need or just-in-time?

If there were only a limited number of options, or if the technology were relatively static, selecting storage for the data warehouse would be uncomplicated-but none of those conditions applies.

The prices of storage options can span a surprisingly wide range, and are both dropping and climbing at a dizzying rate. Current prices range from $2 per megabyte at the high end to 20 cents per megabyte at the low end-and will probably drop 50% in the next year or two, especially at the low end.

At the high end, prices of many specialized disk-array storage systems and complex multiprocessor computers, with specialized storage hardware and software, are climbing as features ar e piled on. In many cases, storage is the largest single line item in the data warehousing budget. The wrong choice can damage or even doom the entire data warehousing project.

Despite their alluring potential benefits, high-end disk arrays are very costly compared with single disk drives in an enclosure. Yet much more than just out-of-pocket cash is involved. In information economics, you don't pay for the level of functionality over time, but rather the rate of change of the functionality. With a high-end disk array, you pay for the ability to grow and scale the solution without redesigning it every time you need to add incremental performance or capacity. Long-term success in the data warehouse arena is achieved through high-performance, scalable, intelligent, and modular storage solutions, not merely "bit buckets."

Is Storage Important?
Storage not only holds the data, it is also a key component of the data warehouse architecture, enabling enhanced performance, uptime, and access to in formation. Here's why:

  • Data warehouses are increasingly becoming vital to business. Downtime can cause you to lose revenue, productivity, and even customers. The most frequent causes of data warehouse outages are related to storage: device failure, controller failures, and lengthy load times and backups. Decreasing these problems directly improves the numerator of your return on investment calculation.
  • The ability to run file transfers and back up data while the system is running maximizes activities that generate and enhance revenue.
  • Centralizing physical data management, even in a logically decentralized environment, makes more efficient use of networks, people, and hardware.
  • Unbundling storage management from other processing can relieve pressure on servers or mainframes that are upgraded only in large, expensive increments, thereby extending the useful life of the devices and further reducing downtime and disruptions.
  • Being able to efficiently archive detailed historical da ta without redundancy, error, and data loss increases the organization's ability to leverage information assets.

Database performance, especially in data warehousing, is poorly understood. The only working benchmark is the Transaction Processing Performance Council's TPC-D test, and there is plenty of controversy about whether that test is representative of most types of data warehouse applications. Furthermore, few vendors offer storage solutions optimized for data warehousing. Instead, the market is full of general-purpose disk arrays. The tuning characteristics of warehouses, taken together with the connected components of the data warehouse (servers, databases, networks, clients, and middle tiers), present a complex problem without broadly known rules of thumb.

Where do we start? The first step is to examine the peculiar requirements of data warehousing and to develop a list of critical considerations. Next, we need to frame the context of performance-issues of availability, data protection, and cost run in parallel to performance.

Warehouse Requirements
Data warehouses can exhibit performance and usage characteristics across a wide spectrum. Some are read-only and serve only as a staging area for further distributing data to other applications, such as data marts. Others are read-only but are designed for fast online query processing, often very complex in nature. Though purists will deny this is the case, some data warehouses contain elements of online transactional processing systems-for example, planning and forecasting systems on which scenarios are created interactively and saved, or catalog retailing systems that capture customer profiles.

Whether or not you label these systems data warehouses, they exhibit performance characteristics different from pure OLTP systems and demand designs optimized to their purpose. In some cases, all of these elements are present. The most common situation is a single database that's designed to support canned queries, which allows for a certain amount of tuning, but is combined with open, ad hoc access by a smaller number of data surfers.

What we can generalize about data warehouses is that they are large, that they typically deal with data in very large chunks, that usage patterns are only partially predictable, and that everyone involved with them is greatly concerned about performance. Beyond these ambiguous characteristics, two features of data warehouses define operational performance: query performance and load performance. Two other elements that influence the purchase decision are cost and system cache.

Query Performance
Data warehouse query activity follows certain patterns. Obviously, activity is dependent on the concentration of users within time zones: if there is a large concentration in a few time zones, there will be peaks at certain times of the day. Also, usage is greatest immediately following an update. Beyond that, certain time cyclicities are generated by other factors: financial databases have their heaviest use just before or after the close cycle, or both. Promotional planning is an annual process, and usage rates are highest in the planning stages. New product launches can also ratchet up usage a few notches.

Data warehouse queries are complex. Instead of searching for a single record, these queries can range over the entire database, often requiring multiple passes and assembling temporary sets that are joined or merged for comparison. Derived values are calculated for thousands or even millions of records per query. The demands placed on the database server stress the CPU, memory, bus, and I/O system. If one piece is overworked, the rest perform suboptimally.

Understanding the performance requirements dictated by the unique usage patterns and complexity of data warehousing is critical in order to select and tune the I/O system.

First, let's review what we mean by the I/O system. The I/O system consists of many elements, including the bus, memory channel, drivers, adapters, interfaces, co ntrollers, and disk subsystems. When talking about storage, it's incorrect to assume that the disk subsystem is all-inclusive. The architectures of high-performance disk arrays include all these components outboard of the server bus. If there is adequate CPU power and memory available to process queries, the disk array and the I/O channel bandwidth hold the keys to improved performance.

Data warehouses make extensive use of temporary disk space to perform sorts and aggregations during parallel queries, particularly when multiple queries are executing concurrently. This workspace is evenly split between read and write activity, and performance in this area is critical. The good news is that the disk space allocated for this does not need to be protected; if a device fails, that space can be dropped from the database and re-created. RAID 1 and 5 devices for this space are poor choices for two reasons: high cost and the overhead associated with the extra writes needed to protect data from loss.

Sort and aggregation performance is critical and a good fit for adaptive RAID and volume-management techniques. The idea is to automatically allocate temporary space to a fast-write configuration such as RAID 0 (no redundancy is required because this is temporary data), and protecting the actual data by redundancy (RAID 1 or 5).

Load Performance
Data warehouses run all the time. Even if the people who use them are asleep, their queries are busy. This isn't the way it was supposed to be. Data warehouses were supposed to be static during the day (meaning that no updating was going on), and refreshed at night, when business was closed.

In general, this is no longer true. Decision-support systems based on data warehouses are dynamic tools whose volume sizes have increased dramatically. Two years ago, a large data warehouse housed about 10 to 20 Gbytes of raw data. Today, large projects involve a few hundred Gbytes (multiterabyte data warehouses are far less common than most people think).

But the a mount of raw data is only the beginning. Expect the total size of the database to grow four to eight times the size of the raw data. Add to that the working space for complex queries, development and test environments, staging areas for data in transit, and much more preprocessing than there was even a year ago to accommodate the large load volumes, and it becomes obvious that even 100 Gbytes of data can require 10 times that much storage-without redundancy.

In many cases, updates can involve tens or even hundreds of gigabytes in a single load. Massive replacement of database records is usually an inefficient process, so partial or even full reloads are often performed. These amounts of data moving from one machine to another across a network can soak up all of the available bandwidth, causing other, concurrent processes to slow. Load performance is crucial in these large environments. The new enterprise storage systems with their dedicated storage area networks offer some relief to this problem.

C ost Considerations
Costs per unit of storage are at an all-time low, and market forces are driving the prices down at an increasing rate. It is not uncommon to find prices around 25 cents per megabyte, though high-end storage subsystems such as those made by EMC Corp. in Hopkinton, Mass., and Clariion, a unit of Data General Corp. in Southboro, Mass., are selling for about $1 to $2 per megabyte.

Two factors are forcing down the price of storage even further. First, continuing advances in storage technology are increasing the density and speed of devices without increases in cost. More important, however, is the second effect of economies of scale: the increased number and size of databases is driving up the demand for storage, leading to an overall increase in manufacturing capacity and lowered unit costs. Based on these factors, many industry experts are predicting that costs will drop to 5 cents per megabyte by the year 2000. That represents a decrease of roughly 50% each year.

It's clear that having excess disk capacity now is a very expensive resource, especially when viewed over time rather than at current prices. As a data warehouse matures, the data and the number of uses rapidly increase. It follows, therefore, that the incremental disk space required in two years will be much greater than the total amount needed now.

A fully mirrored database that grows from 40 Gbytes of raw data to 200 Gbytes in three years-while sustaining the storage capacity and throughput requirements-will require an I/O subsystem with roughly 700 Gbytes of space. With disk prices dropping at a rate of 50% per year, staging the increments over the whole period can save more than half the cost. Given the declining prices of storage devices, deferring the purchase of 50 Gbytes of disk space today (at $2 per megabyte) for one year could save $50,000 in the capital budget.

That's the simple view, ignoring associated network and management costs. The actual cost savings are greater. Remember that you don't just buy disk drives. The proper original purchase is a scalable, modular system that can grow in small increments with your requirements-one that will provide for increasing bandwidth, as well as capacity.

Caching
Many of the advanced systems employ a series of caches to reduce the number of slow mechanical seeks of the disk-drive heads, instead of reading directly from volatile memory at bus speeds. The most common of these caches is the read-ahead cache, where the storage system gathers contiguous data when reading from the drives.

Although this is fine for serial processing, it's not useful for the random reading of most data warehouse queries. It is an excellent approach, however, for serial loads and unloads of the data warehouse. The troubling aspect of read-ahead disk caching is that a data warehouse can have caches in several places, many of which are redundant and work against one another.

The figure on page 50 depicts a typical application consisting of a client workstation conn ected to a middle-tier application server that draws data from a database server, which is connected to an I/O device that physically houses the data. Here are some possible different scenarios, describing how cache works for a data warehouse query:

  • The requested data is found in a cache on the application server; therefore, the request is not passed to the database server.
  • The requested data is not cached on the application server, requiring a query of the database server. In many cases, typical queries can be satisfied by the database cache.
  • The query cannot be satisfied by the database cache; therefore, a request is dispatched to the storage system. The requested data is found in the cache, requiring no physical I/O to the disk drives.
  • Failure to satisfy the query from the caches of the application server, the database server, and storage system results in a physical fetch-a read from the disk.

There are, however, some alarming drawbacks to this whole process:

  • The expensive cache of the storage system may indeed hold the requested data, but it isn't used because the query is satisfied by lower-level caches.
  • Each fetch causes a cascading refreshing of the lower-level caches. The more devices employing caching, the more inappropriate the process becomes.
  • The more transactions that occur in a time period, the less likely that large amounts of data in a complex query can be satisfied by the cache. When sharing a single centralized storage system (a "hotel") with multiple servers or applications (hosts), higher-volume OLTP systems will continuously wash out the caches of large queries, degrading data warehouse performance.

Achieving Performance
A basic principle is that there are three crucial issues you must consider: performance, availability, and data protection. All three are required to achieve a highly accessible, productive application.

Consider the case of a system with 200 Gbytes of raw data and a 700-Gbyte total storage re quirement. It uses on the order of 350 2-Gbyte disk drives. In a system with such a high component count, device failures occur biweekly, on average. This type of system requires a fault-tolerant disk array capable of hot-swapping all components, as well as a strong service relationship with your supplier.

Fault-tolerant arrays have another benefit. It's particularly unfortunate that many device failures occur in the "infant" period, which usually happens on system rollout, resulting in schedule slippage, customer frustration, and a poor initial impression of the system. An advantage of purchasing a fault-tolerant system with a good service arrangement is that you can ignore the inherent problem in new systems. Because failures are detected and automatically swapped out to hot spares, you don't have to burn in the system, which lets you get into production faster. Saving time is quite valuable at this stage of application development.

The chart below summarizes the key features to look for in selectin g a storage system for a data warehouse. Buy a fault-tolerant-class system based on a modular architecture that scales I/O bandwidth, as well as disk capacity, in small increments. Purchase only as much disk as you need, because prices are falling so rapidly. Select fast (Ultra SCSI, FC-AL, or SSA) disks for maximum I/O performance. Invest in RAID for availability and data protection, as well as read performance. Use a volume manager or adaptive RAID technology to load-balance different database activities.

Make sure there are multiple disk controllers as well as room for many network connections using high-speed interconnects such as Ultra SCSI, FDDI, Fibre Channel, SSA, or HIPPI. If you need a transfer rate of 1 Gbps and you are using Ultra SCSI drives, you will need about 30 Ultra SCSI host interfaces operating in parallel. Set up multiple disk controllers for high availability and to mirror-write cache for protection.

Further, don't mix database applications on a single multihost storage server. K eep them on separate storage servers, if possible, because of the different requirements of write-intensive vs. read-intensive operations.

You should also invest in management tools. Fault-tolerant, enterprise-class disk arrays should come with centrally administered device-management tools that allow remote monitoring and administration of performance, availability, utilization, and errors.

Which RAID levels should you use? The chart on page 56 compares how RAID levels and other array features apply to a traditional OLTP application vs. a data warehouse. RAID is appropriate for a data warehouse for availability, data protection, and performance. A combination of strategies can insulate the system from the failure of an individual device. The read-mostly nature of data warehousing is compatible with all levels of RAID, and the writing overhead can be mitigated by a sufficient amount of write cache.

In disk mirroring, the only real penalty is cost, and in some controllers, mirroring enables double the read throughput when operating both disks in parallel.

Conclusion
Data warehouses no longer exist in isolation, or at the end of a unidirectional flow. Although the tuning characteristics of the relational databases and the parallel servers were originally developed around OLTP applications, most enterprise storage array systems have the same roots. Don't assume that gains achieved with storage for OLTP applications can be duplicated with data warehouses.

Also, be especially careful about consolidating data storage into a single hotel-which can have a beneficial effect for the entire organization-at the expense of a single application, particularly the data warehouse.

Storage performance is also a network problem, not just a disk problem. Thus, we suggest investing in high-end, fault-tolerant storage systems because they can be optimized for data warehousing applications. If you want storage performance, build sufficient I/O bandwidth to get the data to the compute resources using enough disk drives and controllers in parallel to supply the data.

Neil Raden is president and founder of Archer Decision Sciences Inc., a consulting company with offices in New York and Santa Barbara, Calif., that specializes in implementing data warehouses and business intelligence systems. He can be reached at nraden@ archer-decision.com or 805-966-5145.

Michael Peterson is president of Strategic Research Corp ., a market-research and consulting firm in Santa Barbara that specializes in network storage and storage management. He can be reached at 805-569-5610.

What To Look For In A Disk Subsystem For Data Warehousing

Main performance objectives are high sustained bandwidth and high input/output performance to disk. Look for the following:
Servers
  • Many high-speed storage interfaces (Ultra SCSI, Fibre Channel, FDDI, SSA, HIPPI)
  • Device- and volume-management tools

Multiple controllers

  • Read cache
  • Write-back cache
  • Cache mirror

Disk arrays

  • RAID (0, 1, 5) and fast disk drives for performance
  • Redundant components and hot-swappable components for fault tolerance

Data: Archer decision sciences and Strategic Research


  Back to Labs

  Send Us Your Feedback

  Top of the Page

bottom navbar





Sign up for the InformationWeek Daily email newsletter

*Required field

Privacy Statement



This Week's Issue

Supplemental Issue

Related Whitepapers

Related Reports

Related Webcasts






Video