Data warehouses aren't just exploding in size; they're also supporting more users and increasingly complex queries, all in shorter time frames. Here's how to make sure yours is ready to scale.
MULTIPLE DIMENSIONS OF SCALABILITY
The convergence of three key trends is driving the ever-expanding scalability challenges facing data warehouse managers. The first is well known: Data volumes are increasing rapidly. The largest data warehouses are tripling every two years, according to WinterCorp surveys.
That's about how fast LGR's data warehouse is growing; it will approach 3 PB in 2012. Hundreds of other data warehouses, including those run by retail, health care, and financial services companies, also will reach petabyte scale in the next few years, and thousands will surpass 100 TB. In many cases, competitive pressures are driving businesses to capture more data in the hope of better analyzing and understanding their most valuable customers, and of acquiring and retaining more of them.
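To put the tripling-every-two-years figure in annual terms, a quick compounding calculation helps. The Python sketch below is illustrative only: the function names are hypothetical, the 100 TB starting point is simply the threshold mentioned above, and it assumes growth continues at a steady compound rate, which the surveys do not promise.

```python
import math


def annual_growth_factor(multiple: float, period_years: float) -> float:
    """Convert 'grows by `multiple` every `period_years` years' to a per-year factor."""
    return multiple ** (1.0 / period_years)


def years_to_reach(current: float, target: float, annual_factor: float) -> float:
    """Years of steady compound growth needed to go from `current` to `target`."""
    if current <= 0 or target <= 0 or annual_factor <= 1:
        raise ValueError("sizes must be positive and the factor must exceed 1")
    return math.log(target / current) / math.log(annual_factor)


if __name__ == "__main__":
    # Tripling every two years, as reported for the largest warehouses.
    factor = annual_growth_factor(3, 2)
    print(f"annual growth: {(factor - 1) * 100:.0f}%")

    # Assumption: 1 PB is treated as 1,000 TB.
    years = years_to_reach(current=100, target=1000, annual_factor=factor)
    print(f"100 TB to 1 PB: {years:.1f} years")
```

Under that assumption, tripling every two years works out to roughly 73% growth a year, and a 100 TB warehouse would need a little over four years to reach a petabyte.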
Data warehouses also are getting more time sensitive. The extraordinary velocity of the data in LGR's warehouse is a case in point: Billions of records pour in throughout the day, are loaded into the database within minutes, and are acted on almost immediately. If a mobile phone customer calls in after a bad experience, "we want to see exactly what happened, what calls were dropped, what tower was involved, and so forth, while they're still on the phone," van Rooyen says. "At the same time, you want the customer service person to know the customer's history." Problems are resolved faster, customers get better service, and "the business works better all around," he says.
High-velocity use of data--also called "operational business intelligence"--isn't a new concept. Teradata identified it several years ago as "tactical data warehousing," and IBM's Dynamic Data Warehousing pushes a similar notion of "right time" data. But the business pressure to provide such capability is rising.
Tactical data warehousing facilitates the moment-by-moment decisions employees must make. Many of these decisions are similar and repetitive: What should I offer this customer? How do I treat this unexpected shipment that just turned up at the factory? Businesses that can make such decisions in a systematic way, informed by up-to-the-minute data, find they produce significantly better results.
Operational BI has big implications for data warehouse scalability. It results in larger user populations; more frequent, time-sensitive interactions; a need for fresher data; and support of business processes that can't tolerate downtime.
The third trend is rising complexity in data, queries, workloads, and analysis, all of which amplifies scale. When data warehouses are doing only simple things, such as predictable updating and straightforward reporting, they can grow without creating fundamentally new problems. But when they have to respond interactively to complex and unpredictable queries--perhaps performing large, complex joins, aggregations, sorts, and calculations on trillions of records--the requirements have truly escalated.
Many modern data warehouses perform complex queries, analyses, and reports. They also operate on more complex schemas than in the past, with thousands of tables, hundreds of thousands of columns, and a complex web of data relationships.
EXTREME MULTIDIMENSIONAL GROWTH
There are few better illustrations of the multidimensional growth phenomenon than eBay. About 85% of the queries run on the company's data warehouse are "exploratory in nature," says Oliver Ratzesberger, eBay's senior director of architecture and operations. They come from end users, with no opportunity for a database administrator to apply a tuning tool to them. "The queries hit the engine, and it has to handle them," Ratzesberger says.
EBay's data warehouse contains about 5 PB of disk storage distributed over primary and secondary systems, both running Teradata. The secondary system, used for disaster recovery, is located about 1,000 miles from the primary one. Each system has a complete copy of the company's core data, organized as an enterprise data warehouse. Both copies are updated every 15 minutes, around the clock, and both are continuously active, servicing queries.
There are more than 5,000 users and about 10 million queries each day. The daily update volume ranges from 10 billion to 15 billion records. Thousands of tables are involved, and queries range from simple lookups to complex analyses that run for hours. The system is constantly managing a mixed workload with different service-level objectives for each of the various classes of work.
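Daily totals like these are easier to reason about as sustained rates. The following sketch converts the figures quoted above into per-second and per-update-cycle averages. It assumes the work is spread evenly across the day, which real workloads rarely are, so actual peaks would be higher. The helper names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def per_second(daily_total: float) -> float:
    """Average rate per second, assuming an evenly spread daily total."""
    return daily_total / SECONDS_PER_DAY


def per_batch(daily_total: float, batch_minutes: float) -> float:
    """Average volume per update cycle for a cycle of `batch_minutes` minutes."""
    batches_per_day = MINUTES_PER_DAY / batch_minutes
    return daily_total / batches_per_day


if __name__ == "__main__":
    queries_per_day = 10_000_000            # "about 10 million queries each day"
    low, high = 10e9, 15e9                  # daily update volume, in records

    print(f"queries/sec: {per_second(queries_per_day):,.0f}")
    print(f"records/sec: {per_second(low):,.0f} to {per_second(high):,.0f}")
    # The article says both copies are updated every 15 minutes.
    print(f"records/cycle: {per_batch(low, 15):,.0f} to {per_batch(high, 15):,.0f}")
```

On these averages, 10 million queries a day is about 116 queries per second, and 10 billion records a day is roughly 104 million records per 15-minute cycle.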
Given the scale of the system, the growth rates are even more remarkable: The number of users grew 25% last year, the number of queries doubled, and the size of the system has at least doubled each of the last four years.
EBay's experience shows that data warehouses don't just grow in the quantity of stored data. They expand in several dimensions at once, including data volume, number of users, query volume, data latency, and data and query complexity. Decisions on architecture and spending must take into account the likely growth of all these dimensions.
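One way to take several dimensions into account at once is to project each one separately. The sketch below does so using the eBay figures quoted above as a starting point. It assumes, purely for illustration, that last year's growth rates hold steady, and it treats "at least doubled" as exactly doubling, so the storage figure is a lower bound. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    current: float        # today's value, in the unit implied by the name
    annual_factor: float  # assumed steady year-over-year multiplier


def project(dimensions, years: int) -> dict:
    """Compound each dimension independently for `years` years."""
    return {d.name: d.current * d.annual_factor ** years for d in dimensions}


# Starting values and growth rates taken from the article's eBay figures.
EBAY_LIKE = [
    Dimension("users", 5_000, 1.25),                  # users grew 25% last year
    Dimension("queries_per_day", 10_000_000, 2.0),    # queries doubled
    Dimension("storage_pb", 5.0, 2.0),                # size at least doubled
]

if __name__ == "__main__":
    for name, value in project(EBAY_LIKE, years=3).items():
        print(f"{name}: {value:,.0f}")
```

With those assumptions, three years out the same warehouse would be serving nearly 10,000 users and 80 million queries a day against at least 40 PB of storage, which is why planning for storage alone falls short.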