Managing Data Warehouse Growth
One of the biggest challenges in business intelligence and data warehousing initiatives is managing growth. Conflicting demands to support more users, handle increased query and data complexity and deliver more "right-time" information have many organizations at a crossroads. We explore technological changes that will make it easier to scale and offer advice on guiding that growth.
Data warehouses are growing rapidly--and not just in sheer size. Conflicting demands to support more users, increased query and data complexity and more "right-time" information have organizations coming to a crossroads. Stakeholders need to know: Do we have the right architecture for growth? Can we scale up without breaking the bank? And do we have the personnel to manage this beast?
Companies fear that if they can't answer these questions, the competition could leave them behind. Although the data-warehousing concept dates back to the 1980s, it wasn't until the mid-1990s that the notion of a separate store specifically designed to support data access and analysis became a widespread phenomenon. Now, with nearly every large business organization managing or at least planning for a data warehouse, the race is on to see which company can leverage the most information value. Dealing with scale is a critical part of that effort.
As companies come to rely on BI (business intelligence) for strategic, tactical and now operational decision-making, data warehouses are arguably the centerpiece of most data-management functions. We explore what's happening with data-warehouse scalability based in part on results from the Winter TopTen Survey of the largest databases. Then, after looking at significant technology moves vendors are making to meet scalability needs, we offer some advice to guide your data warehouse growth.
From Back Office To Front Lines
Data warehousing hasn't stayed the same over all these years. When the discipline broke out in the 1990s, it was enough to give business strategists, financial managers and marketing specialists accurate and timely reporting. Most received data through homegrown tools or emerging independent BI products and reporting functions embedded in leading-edge applications. In recent years, the vision has evolved beyond reporting on what happened yesterday to forecasting what needs to be done tomorrow--if not initiating action today. Data warehousing is moving out of the back room to play an important role in what front-line employees do with suppliers, customers and partners, as well as what suppliers, customers and partners do in a self-service environment.
In the early days, it was a major triumph to report accurately which customers had defected to the competition. Companies knew it took time and money to acquire a customer. For firms where business performance was extremely sensitive to the dynamics of customer retention, it made sense to amass as much data as possible to understand both individual customers and patterns of behavior.
With better information about which customers are likely to defect to the competition and which are the most valuable to retain, the next step has been to connect such knowledge to action. Data warehousing is moving into a more time-sensitive realm as businesses want to turn information immediately into actions that underlie goals of putting customers first, improving service and delivering special offers while customers are online or in the store. With timing being everything, companies are urgently trying to move data warehousing from a solely passive role to one that better fulfills the BI goal of delivering information at the right time and to the right people.
Raising the bar of business value is welcome, but with greater relevance comes the pressure to deliver. Along with accommodating the growth in size, user population and workload, data warehouses also must handle more complex queries. And data latency--the time between receipt of the data and its availability for query--must become shorter and shorter.
In 1995, the first Winter Corp. survey of database size reported that the largest system in the world contained a terabyte of data. Ten years later, the biggest we found was 100 times larger. Even more astounding is the growth curve; "Growth of Database Size," right, shows a tripling of the largest database identified (and thus, publicly acknowledged; some organizations do not want such publicity) in each of the most recent two-year periods. If this trend continues, we will see a 300 TB data warehouse in 2007--and that's real data, not just disk space allotted for the data. With 300 TB, the disk measure--often erroneously reported as the raw data figure--would likely reach 1,000 TB, or a petabyte.
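The arithmetic behind that projection is simple compound growth. As a rough illustration--a sketch built only from the survey figures quoted above, with the data-to-disk ratio assumed from the 300 TB/1,000 TB comparison--the following Python snippet projects the curve forward:

```python
# Sketch of the growth curve described above: the largest reported data
# warehouse roughly triples every two years. Figures are illustrative,
# taken from the survey numbers quoted in the text (1 TB in 1995,
# roughly 100 TB ten years later).

def project_size(start_tb: float, start_year: int, end_year: int,
                 factor: float = 3.0, period_years: int = 2) -> float:
    """Compound growth: multiply by `factor` every `period_years`."""
    periods = (end_year - start_year) / period_years
    return start_tb * factor ** periods

if __name__ == "__main__":
    # Starting from ~100 TB in 2005 and tripling every two years...
    data_2007 = project_size(100, 2005, 2007)
    print(f"2007 projection: {data_2007:.0f} TB")          # ~300 TB of real data
    # An assumed raw-data-to-disk ratio of roughly 3:1 puts the disk
    # footprint near a petabyte, as noted in the text.
    print(f"Approximate disk footprint: {data_2007 * 3.3:.0f} TB")
```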
Of course, there is much more to the story than terabytes. As the "Multidimensional Scalability" figure shows, scalability is also about user populations and workloads. The complexity of both queries and schemas is growing, and latency is continuously under pressure as data timeliness becomes paramount. Thus, overall scalability requirements are rising faster than the first figure's exponential curve for sheer size.
Business Pressures Tip The Scales
To understand better what's driving scalability challenges, consider a large financial services company that must analyze and data-mine payments to uncover fraud, define customer service needs and spot market trends for profit opportunities. To handle these challenges, the company has a requirement on the table to put about 500 TB of data online within the next few years. However, the company knows that even this estimate may be low, given worldwide adoption of new payment devices, such as cell phones.
Along with requiring an extraordinary amount of data, the company's fraud-detection plans demand low latency. With several large database tables that do not share a common join key, the schema will be complex. So will queries, which must express aggregations and other statistical functions involving multiple tables that do not share partitioning keys. Finally, the company is planning for a user community well beyond its tens of thousands of employees to include merchants, partners and ultimately, millions of account holders.
Although the financial services sector is often the one pushing the envelope, transportation and retail concerns are not far behind. Supply-chain management and logistics analysis is on the threshold of vastly increased data granularity with RFID (radio frequency identification) adoption. Telecommunications firms also want finer granularity than billing records so they can look for clues to reducing churn and improving customer service. Billing data captures one record per call; control data ranges from 10 to potentially hundreds of records per call, depending on the type--voice or data--and the communication technology--cellular, VoIP or other.
Companies across many industries want a 360-degree view of customer relationships. To accomplish this, they need to integrate operational data from all service offerings and sales channels, including online transactions and Web site interactions. Once the data is joined, companies must anticipate complex queries given the variety of ways that users--ranging from line of business managers to globally dispersed call-center operators--would want to work with the data. And the queries could come at any time, putting pressure on update procedures that may happen in batch mode during what have been off-peak hours.
More Data, Less Money
The explosion in data volume, query complexity and low-latency demands has come at a time when another trend is at work in data warehousing: tighter IT budgets and pressure to reduce costs are placing a strong emphasis on justifying expenditures through demonstrable returns on investment. How can companies balance these seemingly conflicting demands? Now more savvy about the total costs of data-warehouse operations and aware of the management and database administration challenges, IT buyers are demanding lower prices from vendors and more attention to their specific needs.
Budget pressure is endemic in an age of global competition, but the BI and data-warehousing community has been under special scrutiny from business-side executives. In our experience, as many as 75 percent of data warehouses have fallen short of producing business benefits commensurate with the expenditures. Although vendors spotlight the smaller percentage that has experienced fabulous success, the vast majority are still looking for the big business upside.
Vendors, therefore, face a tall order: solve more difficult problems even as they respond to greater pressure on pricing. The most established data warehouse platform vendors--IBM, Microsoft, Oracle and NCR's Teradata Division--share a focus on four principal objectives:
• Increase the capability to handle scale and complexity
• Simplify operation, management and use
• Incrementally adjust software pricing downward
• Pass along hardware savings
Let's take a closer look at what's most interesting about how each of these vendors is attacking these objectives.
IBM. With DB2 version 9, the company is simplifying the buying process by introducing a preconfigured unit of data warehouse capacity: the balanced configuration unit. The BCU sets out what a customer would need in DB2 software, servers, storage and interconnect technology to meet specific objectives. Design wizards advise users on physical implementation techniques. To improve performance, scalability and manageability, version 9 also offers range partitioning and data compression; these technical features work in concert with existing multidimensional clustering to help customers reduce I/O workload and overall latency.
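To see why range partitioning reduces I/O, consider partition pruning: a query that filters on the partitioning key only needs to touch the partitions whose ranges overlap the filter. The Python snippet below is a conceptual sketch of that idea, not DB2 (or SQL Server) syntax; the partition names and date ranges are hypothetical.

```python
from datetime import date

# Conceptual sketch of range partitioning and partition pruning -- the idea
# behind the partitioning features described above, not any vendor's syntax.
# Rows are assigned to partitions by a date range; a query with a date filter
# only scans the partitions whose ranges overlap that filter.

PARTITIONS = [
    ("q1_2006", date(2006, 1, 1), date(2006, 3, 31)),
    ("q2_2006", date(2006, 4, 1), date(2006, 6, 30)),
    ("q3_2006", date(2006, 7, 1), date(2006, 9, 30)),
]

def partitions_to_scan(filter_start: date, filter_end: date) -> list:
    """Return only the partitions whose date ranges overlap the query filter."""
    return [name for name, lo, hi in PARTITIONS
            if not (filter_end < lo or filter_start > hi)]

# A query restricted to May 2006 touches one partition instead of three,
# which is how range partitioning cuts I/O and helps meet latency targets.
print(partitions_to_scan(date(2006, 5, 1), date(2006, 5, 31)))  # ['q2_2006']
```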
Microsoft. SQL Server 2005 was a major advance for Microsoft. The largest SQL Server data warehouse identified in the 2005 TopTen Survey contained 19.5 TB of data, more than 10 times larger than the biggest SQL Server data warehouse in 2003. The 2005 version introduced range partitioning for both tables and indexes. Microsoft also offers 64-bit support to improve data caching and reduce I/O, plus a new set of data integration services for extract, transform and load (ETL) processing.
Oracle. In the latest version of its database, Oracle 10gR2, Oracle offers a grid architecture that uses high-speed interconnect networks to pool storage and server resources. Oracle promises that applications, databases and workloads can be distributed and managed transparently: that is, without users needing to know about the underlying platforms. Oracle's long-term vision is that it should be possible and practical to expand, reconfigure and adapt the grid dynamically to meet changing requirements and ebbs and flows in traffic; respond to equipment, power and network failures; enable maintenance; and expedite or slow down work in response to changing priorities. While Oracle plans a series of releases to fully realize these goals, 10g's Real Application Clusters (RAC) technology is already in widespread use for data warehouses.
NCR/Teradata. As an extension to its pre-existing toolset, the Teradata Active System Management (TASM) facility is an important advance for Teradata customers trying to gain the upper hand in managing large workloads. In coming releases, the company's goal is to enable users to specify service levels and define priorities and policies--and let the system do almost everything else by itself. This includes supporting both the short, rapid queries necessary for "active," operational data warehousing as well as conventional, long-running, strategically oriented queries and large batch jobs. Advances in parallelism, query processing and optimization are also on Teradata's agenda.
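The underlying idea of policy-driven workload management can be sketched simply: administrators declare priorities per workload class, and the system dispatches work accordingly so short operational queries aren't stuck behind long strategic ones. The Python snippet below is a conceptual illustration of that approach, not Teradata's actual TASM interface; the workload classes and queries are hypothetical.

```python
import heapq

# Conceptual sketch of priority-based dispatching for mixed workloads -- the
# general idea behind workload-management facilities, not a vendor's API.

PRIORITY = {"tactical": 0, "strategic": 1, "batch": 2}  # lower number = runs sooner

class Scheduler:
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker preserves arrival order within a class

    def submit(self, workload_class: str, query: str) -> None:
        heapq.heappush(self._queue, (PRIORITY[workload_class], self._seq, query))
        self._seq += 1

    def next_query(self) -> str:
        _, _, query = heapq.heappop(self._queue)
        return query

sched = Scheduler()
sched.submit("batch", "nightly aggregate build")
sched.submit("strategic", "quarterly churn analysis")
sched.submit("tactical", "lookup: is this payment likely fraud?")
print(sched.next_query())  # the short, operational query is dispatched first
```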
Teradata, which did much to define data warehousing originally, remains the farthest ahead in addressing enterprise data-warehouse requirements: that is, multiple scalability challenges shown in the second figure that are driving growth toward monster systems. Microsoft has the furthest to go. However, all will be major players in the enterprise data warehouse game in the coming years.
Appliances: The Next Wave?
Bizgres, Datallegro, Kognitio and Netezza are leading a new charge aimed at disrupting the data-warehousing market. These vendors offer appliances that package various technologies into a platform that fits the 80/20 rule: they offer 80 percent of the data-warehousing capability for 20 percent (or, really, 20 percent to 50 percent) of the cost. Most use open-source database technology for their engines, as well as low-cost hardware. They aim for high performance on a select class of fundamentally important database operations. And by offering few options, they lower complexity for their customers. Performance innovations in new releases are selected carefully with an eye toward keeping prices low.
Beyond the shared goals, however, lie differences. Netezza employs proprietary "intelligent" storage, while Kognitio focuses on a proprietary database engine (not open source). However, both companies' technology, as well as that of the others, can deliver surprisingly rapid data and table scans at prices lower than those of the established players.
Data warehouse appliance vendors are innovators and are achieving some noteworthy success in the market, but IT buyers need to exercise caution. The appliances cannot handle all the scalability dimensions depicted in the second figure. Their price-performance advantages over the established players appear only when you look at a limited set of data-warehouse operations.
Complex queries are one area where the appliance vendors fall short. What happens, for example, when joins cannot be partitioned on the same key, as they generally are in simpler designs? Such joins are common when a data warehouse expands to cover multiple subjects. A financial services company might have billions of rows of transactions, payment devices and so on, and tens to hundreds of millions of accounts, customers and households. When the full complexity of an enterprise is represented in the data, the number of such large tables grows steadily. It becomes absurdly clumsy, if not logically impossible, to organize such complexity around a simplifying construct, such as the star schema.
As a result, data-warehouse engines have to repartition large join sets efficiently and in a scalable fashion to optimize complex, multi-table queries effectively. This sort of demand pushes beyond the current limits of the data warehouse appliances we've examined.
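For readers who want a concrete picture, here is a minimal sketch of a repartitioned ("shuffle") join, the kind of operation described above: each row is re-routed to the node that owns its join-key hash value, so matching rows meet on the same node. It is illustrative Python, not any vendor's implementation; the tables, node count and keys are hypothetical.

```python
from collections import defaultdict

# Minimal sketch of a repartitioned (shuffle) join: when two large tables are
# not already partitioned on a common join key, both sides are re-hashed on
# that key so each node can perform an ordinary local join.

NUM_NODES = 4

def repartition(rows, key_index):
    """Route rows to nodes by hashing the join key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key_index]) % NUM_NODES].append(row)
    return nodes

def local_join(left_rows, right_rows, lkey, rkey):
    """Ordinary hash join performed independently on each node."""
    index = defaultdict(list)
    for row in left_rows:
        index[row[lkey]].append(row)
    return [l + r for r in right_rows for l in index[r[rkey]]]

# Hypothetical tables: transactions keyed by account id, accounts keyed by id.
transactions = [("txn1", "acct7", 120.0), ("txn2", "acct9", 35.5)]
accounts = [("acct7", "gold"), ("acct9", "standard")]

txn_parts = repartition(transactions, key_index=1)   # shuffle on account id
acct_parts = repartition(accounts, key_index=0)
result = []
for node in range(NUM_NODES):
    result.extend(local_join(txn_parts[node], acct_parts[node], lkey=1, rkey=0))
print(result)
```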
Appliances are best-suited for data-mart or "sandbox" applications that require little or no integration with the enterprise data warehouse. However, in such settings, the appliances can offer dramatic cost savings and performance advantages.
No doubt, appliance vendors are going to put additional pressure on the established vendors and drive down their profit margins. Although this is clearly a net plus for IT buyers, the creation of a two-tier data warehouse market--full service and economy--could increase the very data fragmentation companies are trying so hard to clear up right now. We could see more, not fewer, hub-and-spoke architectures, which have unfortunately led to isolated data islands and difficulties in performing wide-ranging, what-if analysis against all the necessary data.
Data integration priorities will likely force the appliance vendors to extend their products' capabilities for indexing and range partitioning, not to mention better security and availability. Customers will also push the appliance vendors to address the other 20 percent of the problems, which will inevitably increase the complexity and costs of their products. Greater choice in the market is welcome, but it demands more knowledge and sophistication in making data-warehouse platform decisions than a focus on price points alone.
Getting A Handle On Growth
What should you do to prevent data warehouse requirements from surpassing your ability to manage them? You don't have to be implementing a 500 TB warehouse to encounter challenges in performance, scalability or availability. The complexity of the problems often stumps organizations more than sheer scale does. A company might have less than a terabyte of user data but face business rules so complex that no database system can handle the resulting workload with ease. With just 1 TB of data and a sufficiently complex schema and workload, you'll find yourself up against some tough problems.
The first step is to understand the business drivers. Make sure your business-level requirements and schedules reflect those drivers accurately. Then be sure to define those requirements by year over a strategic time frame (typically three to five years). Third, develop a set of concrete usage scenarios that are characteristic of the workload the business requirements will demand.
With the business drivers and requirements clear, you can focus more closely on the data-warehouse issues, including database size, structure, workload and service-level agreements. Build a margin of safety into your requirements; you'll never have complete certainty about how the system will be used and what might ultimately expand those requirements (such as success!). Then it's time to evaluate the long-term and near-term trade-offs:
• In the long term, the ability to leverage data from across the enterprise pays huge dividends, and is the key to most of the success stories you hear about. Clearly, the core of your enterprise data must be integrated to provide a platform for rapid and inexpensive implementation of analytical solutions.
• In the near term, you want to identify where you can gain significant cost or performance advantages in a specific application or sandbox with a data-mart or appliance approach. Remember that some benefits may come at the price of fragmenting decision support data and incurring greater overhead for replication, ETL and other data movement and integration.
Testing your intended solution against your detailed requirements is critical before committing to it. This advice applies to everything that will be important to handling scale: the platform, configuration, database design and more.
Remember, previous experience is not always a good predictor of future success. And don't simply wait and hope that scalability problems will work themselves out. Take the initiative to prove to yourself--not to mention all your stakeholders within and outside the enterprise--that the solution you've chosen will scale and be strong enough to withstand the challenges of many kinds of growth.
Richard Winter, president of Winter Corp., is an expert in large database technology, architecture and implementation. He is executive editor of Scaling Up, a quarterly newsletter, and can be reached at [email protected].
Rick Burns is vice president of engineering at Winter Corp. and conducts benchmarks, platform evaluations and architecture studies for large data warehouses. He can be reached at [email protected].