Of 'Elephants,' Column-Store Databases and the Von Neumann Architecture
Listening to Dr. Michael Stonebraker extol the virtues of column-store databases... it's becoming clear that a new data storage architecture is the need of the day... Stonebraker also seemed to imply that column-store databases are wonderful not just for data warehouses, they are pretty good for conventional (transactional) uses as well. That, of course, doesn't seem right...
Listening to Dr. Michael Stonebraker decry "elephants" and extol the virtues of column-store databases in general and Vertica in particular, I am becoming convinced that a totally new data storage architecture is the need of the day.
Dr. Stonebraker is, of course, a venerable figure in the world of databases, best known for his pioneering work on Ingres at UC Berkeley more than a quarter century ago. These days, however, in his role as CTO of Vertica, he is constrained to speak more or less one-sidedly on the topic. In a recent presentation on Vertica, Dr. Stonebraker didn't actually call the leading relational database vendors - Oracle, IBM, Microsoft - "large, lumbering and slow." He did, however, repeatedly refer to them as "elephants." Very clever.
You probably know of column-store databases and about Vertica, so I won't go into too many details here - IntelligentEnterprise.com has plenty of information to offer (check this update, this trend article and this blog).
Here's what's interesting. Towards the end of the presentation, I thought I heard Dr. Stonebraker state, or at least imply, that column-store databases are wonderful not just for data warehouses but are pretty good for conventional (transactional) uses as well. That, of course, doesn't seem right. The central premise of all conventional relational databases is to store the entire row on a single database "page," as far as possible. This makes for efficient storage and retrieval of a single row of data (i.e. a single tuple or entity instance), which in turn suits systems that read or write transactions, since one transaction typically deals with a single entity instance - for example, one customer order, one invoice, or for that matter, a single customer. Hence, careful planning around row size and page size is a key component of database design optimization.
This strength of conventional databases turns into a weakness when they are used for large, star-join sorts of queries, since the typical data warehouse query only needs to look at a few columns (specifically, the columns in the SELECT and WHERE clauses) and not the entire row of data. That's where column-store databases get their strength: because they store data by the column, each page now holds values from a single column, organized in whatever sort order is chosen. Queries now need to read fewer pages to get all the values they need, and sorting and matching are faster.
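To put rough numbers on that claim, here is a back-of-the-envelope sketch in Python. Every figure in it is an assumption made up for illustration (an 8 KB page, a one-million-row table, 20 fixed-width columns of 8 bytes each, and a query that touches only 2 of them), and it deliberately ignores compression, indexes and page headers, so treat it as a toy model rather than a description of how Vertica or any other product works.

```python
import math

# All figures below are illustrative assumptions, not measurements.
PAGE_BYTES = 8192    # assumed page size
ROWS = 1_000_000     # assumed number of rows in the fact table
COLUMNS = 20         # assumed number of columns per row
VALUE_BYTES = 8      # assumed fixed width of every value


def pages_scanned_row_store() -> int:
    """Pages read by a full scan when whole rows are packed into pages.

    The count does not depend on how many columns the query needs,
    because every page carries all columns of the rows it holds.
    """
    row_bytes = COLUMNS * VALUE_BYTES
    rows_per_page = PAGE_BYTES // row_bytes
    return math.ceil(ROWS / rows_per_page)


def pages_scanned_column_store(columns_needed: int) -> int:
    """Pages read when each column is stored on its own run of pages.

    Only the pages of the columns named in the query are read.
    """
    values_per_page = PAGE_BYTES // VALUE_BYTES
    pages_per_column = math.ceil(ROWS / values_per_page)
    return columns_needed * pages_per_column


if __name__ == "__main__":
    print("row store pages:   ", pages_scanned_row_store())
    print("column store pages:", pages_scanned_column_store(columns_needed=2))
```

Under these assumptions the column layout reads roughly a tenth of the pages, simply because 2 of the 20 columns are needed. The gap narrows as a query touches more columns.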
Consider what happens when we use a column-store database and read a single transaction - say, that customer master record or the customer order. This data is now spread across many pages, and reading the transaction suddenly becomes much less efficient. Now imagine a large-scale OLTP system. It's not clear how column-store databases will cater to this need. Conventional or column-stored representation - there's no getting away from the Yin and Yang of database organization.
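The single-record case can be sketched the same way. The toy model below is a hypothetical illustration in Python, not any vendor's storage format: it assumes tiny pages that hold 8 values each, a table with 8 columns, and a column layout in which each column occupies its own contiguous run of pages. It then lists which pages must be touched to read every column of one row.

```python
PAGE_SLOTS = 8  # assumed: each page holds 8 values (tiny, for illustration)


def row_store_pages_for_row(row_index: int, num_columns: int) -> set[int]:
    """Pages touched to read one full row when values are laid out row by row."""
    first_slot = row_index * num_columns
    return {(first_slot + c) // PAGE_SLOTS for c in range(num_columns)}


def column_store_pages_for_row(row_index: int, num_rows: int,
                               num_columns: int) -> set[int]:
    """Pages touched to read one full row when each column is stored contiguously."""
    pages_per_column = -(-num_rows // PAGE_SLOTS)  # ceiling division
    page_within_column = row_index // PAGE_SLOTS
    return {c * pages_per_column + page_within_column
            for c in range(num_columns)}


if __name__ == "__main__":
    # Read one "customer record" (row 10) from a 1,000-row, 8-column table.
    print("row store pages:   ", sorted(row_store_pages_for_row(10, 8)))
    print("column store pages:", sorted(column_store_pages_for_row(10, 1000, 8)))
```

In this model the row layout touches a single page for the record, while the column layout touches one page per column. Multiply that by the transaction rate of a busy OLTP system and the concern above becomes concrete.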
This reminded me (rather laterally) of the Von Neumann single-instruction-single-data (SISD) bottleneck. How fast can you process data if each instruction is constrained to operate on a single piece of data, one after another? Subsequent architectures, such as vector processing (SIMD) and parallel processing (MIMD, whether small-scale clustering or large-scale parallelism), got around the bottleneck through a fundamental shift in paradigm.
Similarly, we need an equally fundamental shift in database storage architecture that will take us past two critical bottlenecks in database organization and performance that exist today:
Differences in performance between, say, row-order and column-order databases
The need to physically replicate data merely in order to use it in two different situations
This is interesting and highly pertinent stuff. Stay tuned for more in the future. Your own insight is also invited.