All the research points in the same direction: data volumes are growing rapidly within most enterprises, with estimates ranging from 20 percent per year on the conservative side to 50 percent per year among the largest organizations. "Within a few years, traditional row-oriented relational databases are really going to be pushing their limits," predicts Boris Evelson of Forrester Research. "There are a lot of DBAs who don't want to hear it because they're so entrenched in Oracle and IBM DB2 and Microsoft SQL Server, but I think a new era is coming."
Enter data warehouse appliances and column-store databases. While the technologies are not new, the last few years have seen the appliance market take off, and in the last few months, column-store database options have multiplied. Venture capitalists are placing multi-million-dollar bets on both categories, and they are safe bets, too, given exploding data volumes and the corporate desire to do fast-yet-in-depth analyses of all available data.
“One of the major reasons that Wal-Mart, Staples and Amazon.com have been so successful is that they analyze their data, and they have it at the fingertips of the entire enterprise,” says Foster Hinshaw, CEO of upstart data warehouse appliance vendor Dataupia and a cofounder of Netezza. “You have to be able to drill down on all your data and understand what it means in terms of where you site a new store, what SKUs are moving, what product you put in different locations and what customer programs you offer tomorrow.”
While the appliances have a head start in wooing business away from the leading data warehouse vendors, in the right applications, column-store databases can offer a higher-performance and more cost-effective alternative. What's more, column-store database vendors are beginning to offer appliance configurations that offer speed- and ease-of-deployment advantages. This article looks at the promise and progress in column-store databases, the limitations of the technology and when appliances built on conventional row-based databases make a better choice. We also offer five dos and don'ts you should apply in any appliance/column-store database buying process.
Consider Columns Over Rows
What exactly is a column-store database? Unlike a conventional database that stores data in rows, with say, one complete customer record per row, the column-store looks at data vertically. In the case of a customer database, for example, it stores down the columns: all the first names, all the last names, all the cities, all the zips and so on. While the row approach is well suited to OLTP (online transaction processing) with lots of writes for each new customer transaction, the column-store approach is ideal for OLAP (online analytical processing) with lots of reads against particular attributes.
The first advantage of a column-store database is that it can focus only on those columns that figure in the query. If you want to explore, say, sales by zip code and product, your query will only interrogate the zip code and product SKU columns rather than wading through the names, addresses and all the other irrelevant information in each row. The second advantage of the column-store approach is that it can apply optimized compression (upwards of ten-to-one compression, and some claim higher) because the data in each column is consistent (all names, all zip codes, all product SKU numbers, etc.). For the right type of analytic query, the amount of data coming back from a column store is typically far less than from a conventional database, so there's less I/O and, therefore, better performance.
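The two advantages can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's implementation: the same customer table held row-wise and column-wise, a query that touches only two columns, and a simple run-length encoding of a repetitive column to show why columnar data compresses so well.

```python
# Toy illustration: the same customer table stored row-wise and column-wise.
rows = [
    {"first": "Ann", "last": "Lee",  "zip": "02139", "sku": "A12"},
    {"first": "Bo",  "last": "Chan", "zip": "02139", "sku": "A12"},
    {"first": "Cy",  "last": "Diaz", "zip": "10001", "sku": "B07"},
]

# Column-store layout: one list per attribute.
columns = {
    "first": ["Ann", "Bo", "Cy"],
    "last":  ["Lee", "Chan", "Diaz"],
    "zip":   ["02139", "02139", "10001"],
    "sku":   ["A12", "A12", "B07"],
}

# "Sales by zip and SKU" touches only two of the four columns here...
pairs = list(zip(columns["zip"], columns["sku"]))

# ...whereas the row store must read every field of every row to get
# the same answer.
pairs_from_rows = [(r["zip"], r["sku"]) for r in rows]
assert pairs == pairs_from_rows

# Columns compress well because adjacent values are alike: a naive
# run-length encoding of the zip column collapses repeated values.
def rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(rle(columns["zip"]))  # [['02139', 2], ['10001', 1]]
```

Real engines use far more sophisticated encodings (dictionary, delta, bit-packing), but the principle is the same: homogeneous data in each column is what makes ten-to-one compression plausible.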
"If you're bringing back all the columns, a column-store database isn't going to perform any better than a row-store DBMS, but analytic applications are typically looking at all rows and only a few columns," says Gartner analyst Donald Feinberg. "When you put that type of application on a column-store DBMS, it outperforms anything that doesn't take a column-store approach."
Consider the Queries
There are plenty of examples of queries that are ideal for column-oriented databases, and the column-store vendors have all gravitated toward the low-hanging fruit. Telecommunications companies, for example, have billions of call data records (CDRs). Those records need to be in the enterprise data warehouse for billing and historical purposes, but columnar extracts can be placed in a data mart for marketing analysis. Big retailers can do the same to look at purchase patterns and up-sell and cross-sell opportunities. Banks can look for transaction patterns and do anti-fraud analysis. Insurance companies set up marts to let actuaries quickly explore age, blood pressure and other characteristics of all policy holders without impacting the performance of the enterprise data warehouse.
In government, the Internal Revenue Service launched one of the first column-oriented deployments in 1996 using Sybase IQ in what started out as a two-terabyte warehouse. "Back in 1996, even two terabytes was considered a very large database, and it brought our query times down from days to a matter of hours, which was just a phenomenal boost in productivity," says Jeff Butler, director of research databases in the IRS Office of Research. "Today we're at 158 terabytes, and the queries that once took hours now take seconds."
The IRS analytic database stores population-level data (meaning no names or social security numbers) including every single individual and business tax return in the US over the last 10 years. The warehouse is used by internal IRS business units as well as by the General Accounting Office, Congress and the Joint Committee on Taxation to explore filing behavior, compliance behavior, delinquency and impacts of potential tax code changes.
"We don't know what data elements we're going to need to query from which tables from one question to the next, which is the key reason we invested in a database technology like Sybase IQ," says Butler. "If you were building a static reporting application, you could choose just about any database because you can do a lot of custom optimization behind the scenes around those pre-determined data elements, but that's not our situation."
The IRS explores its database with a combination of tools including SAS (for statistical and predictive modeling), Hyperion Intelligence (for reporting), and the Sybase IQ SQL client (for query). Some of the totals, averages or basic statistics developed "could be the result of matching across many tables or sorting and aggregating across those tables, some of which are going to be in the hundreds of millions or even billions of rows," says Butler. "Our largest table is somewhere around 3 billion rows."
"If you put 3 billion rows in any DBMS and it has to read them all, you have a problem," comments Feinberg. "When you put that in a column-oriented database and you want to look at a single column or a few columns, it doesn't have to do 3 billion I/Os to do the analysis."
Price the Performance
What's confusing to many these days is the fact that most data warehouse appliance vendors are going after many of the same analytic applications that are targeted by column-oriented database vendors. And compounding the confusion, several column-oriented database vendors, including Sybase (with IQ), Vertica and ParAccel, have appliance-style offerings, either as suggested software/hardware configurations or as actual pre-configured bundles.
What almost all data warehouse appliances have in common, whether built on row-oriented or column-oriented databases, is a massively parallel processing (MPP), shared-nothing architecture. MPP means that the processing is spread across many nodes, usually on commodity hardware running Linux. Shared nothing means that each node is independent, with its own memory and storage, so it doesn't compete for resources. The result is high performance without the expense of the high-powered, symmetric multiprocessor (SMP) servers that typically run conventional data warehouses.
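The shared-nothing idea can be sketched with a hypothetical toy in Python, with worker processes standing in for independent nodes. Each "node" owns its own partition of the data and aggregates it alone; only the small per-node totals cross node boundaries at merge time. The data and function names are invented for illustration.

```python
from multiprocessing import Pool
from collections import Counter

# Hypothetical shared-nothing sketch: the table is split into partitions,
# one per node; no node ever touches another node's data.
partitions = [
    [("02139", 100), ("10001", 50)],   # node 1's (zip, sales) records
    [("02139", 25),  ("60601", 75)],   # node 2's records
    [("10001", 40),  ("60601", 10)],   # node 3's records
]

def local_sum(partition):
    """Each node aggregates only its own partition, using only its own memory."""
    totals = Counter()
    for zip_code, amount in partition:
        totals[zip_code] += amount
    return totals

def sales_by_zip(parts):
    # Only the small per-node totals are shipped back and merged.
    with Pool(len(parts)) as pool:
        merged = Counter()
        for local in pool.map(local_sum, parts):
            merged.update(local)
    return dict(merged)

if __name__ == "__main__":
    print(sales_by_zip(partitions))
    # {'02139': 125, '10001': 90, '60601': 85}
```

The design point is that the work scales by adding nodes: each new node brings its own CPU, memory and disk, so there is no shared resource to contend for, which is what distinguishes this from an SMP box where all processors fight over the same memory bus and storage.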
From a purely technical perspective, the leading relational databases used in data warehousing can continue to scale (IBM claims into the petabytes). But the question is, how much will it cost and how difficult will it be to manage? Appliances are gaining ground precisely because they are faster and easier to deploy and maintain than conventional warehouses, which have to be tuned, optimized and, in the latest option, clustered to perform in large-scale deployments. In the case of Teradata, which pioneered the MPP, shared-nothing architecture decades ago, the upstarts are competing primarily on price.
"When a Teradata client is scaling up and they see performance falling down, they can always buy more nodes, but the list price for five nodes is roughly $7 million," points out Feinberg of Gartner. "You can put in a Sybase IQ, Vertica, Netezza or any of the others for a lot less money. The analytic applications are going to get great performance, the enterprise data warehouse is going to go back to having good performance, and I just spent a couple of million instead of $7 million." This may explain why the word on the street is that Teradata is getting much more aggressive about pricing.
Examine the Alternatives
Few organizations are replacing their incumbent enterprise data warehouses (EDW) with appliances or column-store database implementations. Rather, they're usually deploying data-mart-style applications to relieve data overloads or avoid new burdens on the EDW, thus delaying expensive upgrades. Some upstarts are hoping these deployment patterns will change. HP, for one, is hoping to bite into Teradata's enviable, decades-old grip on the largest data warehouse deployments. HP bills its Neoview appliance as ready for simultaneous loading and querying, mixed workloads, thousands of users and sub-second retrieval times. The IBM Balanced Warehouse appliances, too, are suitable for enterprise data warehouse use, says Feinberg of Gartner, and the Greenplum and DATAllegro offerings, too, are technically suitable for EDW use given that they (like the Teradata, IBM and HP offerings) are built on row-store databases.
The Netezza Performance Server, too, is built on a row-store database, but it actually has more in common with the column-store databases, says Feinberg. "It's much more of a specialized appliance that handles complex queries with lots of analytics," he says. In contrast to column-store databases, Netezza reads all the rows, "but it does it with so many processors close to the data that it's still going to be very fast."
Given all the fuzzy lines in the market and the hard-to-verify claims made by appliance purveyors and column-store database vendors, here are five dos and don'ts analysts offer to help you make the right choice:
Do look beyond price/performance. "Everybody tries to break it down to price per usable terabyte, but this is not a commodity market," warns James Kobielus of Forrester Research. "You're usually deploying in the context of both a BI deployment plus the back-end data integration and data quality technology you need to put together a complete application. Some vendors can provide soup to nuts, while others provide only a component and they will refer you to partners for the other pieces."
Don't be fooled by scalability claims. Some vendors dwell on ultimate scalability stats that are more theoretical than referenceable. Also keep in mind that column-store databases are inherently smaller. "Sybase recently ran a test storing a petabyte of data on Sybase IQ, but it took up less than 260 terabytes of physical storage because of the compression achievable in column-store databases," says Feinberg.
Do take compatibility issues into account. While appliances promise ease of deployment, recognize that disparate databases breed complexity. "Your BI apps will be hitting on this new database," says Kobielus, "and that raises the question, 'does the appliance require that we rewrite BI and query applications or rearchitect the data model?'" Some appliance and column-store vendors, notably Dataupia and ParAccel, are compatible with Oracle, DB2 and MS SQL Server and can serve as a "drop-in-accelerator" role.
Don't assume column-store is always the best choice. Column-store databases are the performance champs when aggregating many rows and examining a few key attributes, "but they're not optimal for situations where you're looking at many attributes in a given record," warns Kobielus. "It's also not the best choice if you're looking up particular rows or limited sets of records."
Do insist on a proof-of-concept test with your application and your data. "Don't trust the FUD or the TPC-H benchmark stats or anything else," says Feinberg. "You need to put your data out there and run your queries and see which alternative is best in your application."