Of 'Elephants,' Column-Store Databases and the Von Neumann Architecture

Listening to Dr. Michael Stonebraker extol the virtues of column-store databases... it's becoming clear that a new data storage architecture is the need of the day... Stonebraker also seemed to imply that column-store databases are wonderful not just for data warehouses, they are pretty good for conventional (transactional) uses as well. That, of course, doesn't seem right...

Rajan Chandras, Contributor

November 5, 2008

Listening in to Dr. Michael Stonebraker decry "elephants" and extol the virtues of column-store databases in general and Vertica in particular, it's becoming clear that a totally new data storage architecture is the need of the day.

Dr. Stonebraker is, of course, a venerable figure in the world of databases, best known for his pioneering work on Ingres at UC Berkeley more than a quarter century ago. These days, however, in his role as CTO of Vertica, he is constrained to speak more or less unilaterally on the topic. In a recent presentation on Vertica, Dr. Stonebraker didn't actually call the leading relational database vendors - Oracle, IBM, Microsoft - "large, lumbering and slow." He did, however, repeatedly refer to them as "elephants." Very clever.

You probably know of column-store databases and about Vertica, so I won't go into too many details here - IntelligentEnterprise.com has plenty of information to offer (check this update, this trend article and this blog).

Here's what's interesting. Towards the end of the presentation, I thought I heard Dr. Stonebraker state, or at least imply, that column-store databases are wonderful not just for data warehouses but pretty good for conventional (transactional) uses as well. That, of course, doesn't seem right. The central premise of all conventional relational databases is to store the entire row on a single database "page," as far as possible. This makes for efficient storage and retrieval of a single row of data (i.e. a single tuple or entity instance), and so suits systems that read or write transactions, since one transaction typically deals with a single entity instance: one customer order, one invoice, or for that matter, a single customer. Hence, careful planning around row size and page size is a key component of database design optimization.

This strength of conventional databases turns into a weakness when they are used for large, star-join sorts of queries, since the typical data warehouse query only needs to look at a few columns (specifically, the columns in the SELECT and WHERE clauses) and not the entire row of data. That's where column-store databases get their strength: because they store data by the column, each page now holds values from a single column, organized in (whatever) sorting order. Queries now need to read fewer pages to get all the values, and sorting and matching are faster.
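To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the 8 KB page, the one-million-row table, the 25 fixed-width 8-byte columns, and the function names. It ignores compression, indexes and caching, and it is not a description of how Vertica or any other product actually lays out data.

```python
import math

PAGE_BYTES = 8192  # assumed page size, for illustration only


def pages_row_store(num_rows, row_bytes):
    """Pages a full scan touches when whole rows are packed onto pages."""
    rows_per_page = PAGE_BYTES // row_bytes
    return math.ceil(num_rows / rows_per_page)


def pages_column_store(num_rows, column_widths):
    """Pages a scan touches when each needed column lives on its own pages."""
    total = 0
    for width in column_widths:
        values_per_page = PAGE_BYTES // width
        total += math.ceil(num_rows / values_per_page)
    return total


num_rows = 1_000_000   # hypothetical fact table
row_bytes = 25 * 8     # 25 columns of 8 bytes each (assumed)

# The query names only 3 of the 25 columns in its SELECT and WHERE clauses.
row_pages = pages_row_store(num_rows, row_bytes)
col_pages = pages_column_store(num_rows, [8, 8, 8])

print(row_pages, col_pages)  # 25000 2931
```

Under these assumptions the row layout reads every page of the table, while the column layout reads only the pages of the three columns the query asks for, roughly a ninth as many.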

Consider what happens when we use a column-store database and read a single transaction - say, that customer master record or the customer order. This data is now spread across many pages, and reading the transaction suddenly becomes much less efficient. Now imagine a large-scale OLTP system. It's not clear how column-store databases will cater to this need. Conventional or column-stored representation - there's no getting away from the Yin and Yang of database organization.
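The single-record case can be sketched the same way. The Python below is a toy model, not any vendor's storage engine: the page capacities, the column names and the two helper functions are all hypothetical, and it assumes every column is stored in the same row order so that a record's values can be found by position.

```python
ROWS_PER_PAGE = 4      # assumed: whole rows that fit on one row-store page
VALUES_PER_PAGE = 16   # assumed: values that fit on one column-store page


def row_store_pages(row_id):
    """Pages touched to fetch one complete record from a row layout."""
    return {("rows", row_id // ROWS_PER_PAGE)}


def column_store_pages(row_id, columns):
    """Pages touched to fetch one complete record from a column layout."""
    return {(col, row_id // VALUES_PER_PAGE) for col in columns}


# Hypothetical customer master record.
columns = ["customer_id", "name", "street", "city", "postcode", "phone"]

row_touch = row_store_pages(1234)
col_touch = column_store_pages(1234, columns)

print(len(row_touch), len(col_touch))  # 1 6
```

In this model the row layout answers the lookup from one page, while the column layout touches one page per column, so the cost of reading or writing a single record grows with the width of the table.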

This reminded me (rather laterally) of the Von Neumann single-instruction-single-data (SISD) bottleneck. How fast can you process data if you are constrained to operate each instruction on a single piece of data sequentially? Subsequent architectures, such as vector processing (SIMD) and parallel processing (MIMD, whether small-scale clustering or large-scale parallelism) got around the bottleneck by a fundamental shift in paradigm.

Similarly, we need an equally fundamental shift in database storage architecture that will take us past two critical bottlenecks in database organization and performance that exist today:

Differences in performance between, say, row-order and column-order databases
The need to physically replicate data merely in order to use it in two different situations

This is interesting and highly pertinent stuff. Stay tuned for more in the future. Your own insight is also invited.

About the Author(s)

Rajan Chandras

Contributor

Rajan Chandras has over 20 years of experience and thought leadership in IT with a focus on enterprise data management. He is currently with a leading healthcare firm in New Jersey, where his responsibilities have included delivering complex programs in master data management, data warehousing, business intelligence, ICD-10 as well as providing architectural guidance to enterprise initiatives in healthcare reform (HCM/HCR), including care coordination programs (ACO/PCMH/EOC) and healthcare analytics (provider performance/PQR, HEDIS etc.), and customer relationship management analytics (CRM).
