Columns Better Than Rows For Data Warehouses, Says Vertica Systems
It's better for performance reasons to focus on the columns, says Michael Stonebraker, CTO of Vertica Systems.
When are columns better than rows as the way to organize your database?
The standard business database is a relational system that organizes data into tables, built with vertical columns and horizontal rows. Oracle, DB2, SQL Server, Sybase, and MySQL all are organized around the rows. But when it comes to data warehouses, said Michael Stonebraker, the former Berkeley database guru, it's better for performance reasons to focus on the columns.
He should know. He was the original architect of row-based Ingres and pioneer of relational databases used for the past 25 years in business. The row-based system is great for executing transactions, but a column-oriented system is a natural fit for data warehouses, he said.
"You get a factor of 50 times better performance on all classes of queries used to run most data warehouses," he says. Stonebraker is now CTO of Vertica Systems, supplier of the column-based Vertica database system. Vertica 2.0 was unveiled at the Data Warehousing Institute's World Conference in Las Vegas Tuesday.
Columns cut across individual transactions, and store an element of information that is standard to each transaction, such as the customer name or address or amount of purchase.
That's because a data warehouse frequently stores information on a company's transactions, with each transaction having many parts. A row may hold the 20 to 200 different elements of the typical transaction, such as customer name, address, and amount of purchase. Columns, on the other hand, cut across transactions and represent one piece of information from all transactions.
Data warehouse queries typically want to ask questions that that draw information out of all transactions, or large sets of them, rather than look at individual transactions. For example, a typical query might ask: What was the average sale during the month of February? Retrieving the Sales column gets all the data the query needs to determine the average.
A standard relational database, however, would retrieve all the rows reflecting individual sales for the month, load them into system memory, then find all sales records and generate an average from them. So being able to focus on the Sales column, the way Stonebraker and Vertica 2.0 does, leads to data warehouse query performance improvement.
Stonebraker said a second performance benefit flows from taking a column approach. Because columns contain similar information from each transaction, it's possible to derive a compression scheme for the data type, then apply it uniformly throughout the column. Rows can't be compressed as easily because the nature of the data varies widely from record to record in the row, such as name, ZIP code, and balance owed. Each would require a different compression scheme.
Compressing data in columns makes them quicker to store and retrieve and reduces the amount of disk required for storage. "In every data warehouse I see, compression is a good thing," said Stonebraker. And he claimed Vertica's compression approach is "2-3 times more effective. Columns compress better than row stores," he said.
Vertica CEO Ralph Breslauer says 20 customers have been equipped with Vertica since the first version emerged in September. One of them is Sonian Networks, a Boston firm that archives e-mail for enterprises. Greg Arnette, CTO of Sonia, said he considered PostgreSQL, MySQL, and Oracle as the database for pulling business intelligence out of its large e-mail archive, but opted for Vertica instead, partly from "knowing Dr. Stonebraker's reputation."
Arnette said he expects Sonian's e-mail archive to grow rapidly from a few terabytes to 10 petabytes (10,000 TB) sometime in 2009.
"We needed a system that was optimized for high-volume reporting," said Arnette. In Sonian tests, "it was a lot faster than MySQL and requires less CPU power," he claimed. He didn't test Vertica versus Oracle because of Oracle's "license fees and cost of administration."
The 10-employee company uses Amazon's "cloud" Simple Storage Service and computing service for its e-mail archive and Vertica systems, with both Sonian and Sonian customers using Vertica to get reports out of the archive.
Stonebraker said the performance advantage is most evident with large data warehouse systems. "I expect the data warehouse market to become completely column-store based, just because it is inherently higher performance," he said.
An academic experimental column-store database is a joint project of MIT, Yale, Brandeis, Brown, and the University of Massachusetts-Boston. An open source project based on a columnar database approach is MonetDB.
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.