Start-Ups Bring Google's Parallel Processing To Data Warehousing
Aster and Greenplum have made Google's MapReduce compatible with SQL for use in the parallel data warehousing systems based on open source PostgreSQL.
Two startups, Greenplum and Aster Data Systems, announced this week that they are building MapReduce, a cluster computing framework originally used by Google to analyze and rank Web pages, into their enterprise data warehouse products to make it easier for businesses to quickly analyze huge amounts of data.
MapReduce is a software framework that lets users split a job into many tasks that run simultaneously across many computers, often commodity servers. It may also prove useful in multicore programming. While Google first implemented MapReduce, Greenplum and Aster have written their own implementations for use in their products.
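As a rough illustration of the programming model described above, here is a minimal single-machine sketch of a MapReduce-style word count in Python. It is not Google's, Aster's, or Greenplum's implementation: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example, and a thread pool stands in for the farm of servers that a real deployment would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key, as a framework would between steps."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents, workers=4):
    # Each document is mapped independently of the others, which is what
    # lets the map calls run in parallel (threads here, servers in practice).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, documents))
    groups = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both documents. The point of the design is that the programmer writes only the map and reduce functions; distributing the work and grouping the intermediate results is left to the framework.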
"We took the model that Google used, farms of commodity servers, and we offer a database that can be deployed on massive farms of clusters and can transform hundreds of individual servers into a single database where people can store data and do complex analytics on it," Aster CTO and co-founder Tasso Argyros explained in an interview.
Both Aster and Greenplum have made MapReduce compatible with SQL for use in the parallel data warehousing systems based on open source PostgreSQL that the two companies already sell. With the two combined, Greenplum Database and Aster's nCluster gain a new ability to mine massive amounts of data, according to both companies.
"Before this, the idea of analyzing that much data at that much scale was just too expensive and too hard," Greenplum co-founder and president Scott Yara said in an interview, noting that many of the uses of Greenplum will likely be for data sets in the tens of terabytes.
Database pioneers David DeWitt and Michael Stonebraker criticized MapReduce earlier this year as being "a giant step backward in the programming paradigm for large-scale data intensive applications" and cited similar massively parallel capabilities in Teradata systems 20 years ago. They wrote in a blog that MapReduce is missing features routinely found in most current database systems and may not be as scalable as some other systems. However, by making MapReduce SQL-compatible, Aster and Greenplum should be able to address some of the concerns DeWitt and Stonebraker raised, such as MapReduce's lack of business intelligence, data mining, replication, and reporting tools, and its inability to do things like join data sets.
Possible applications for MapReduce in the enterprise include text analysis, graph analysis, machine learning, and new types of data transformation. With Greenplum's and Aster's implementations, MapReduce, which relies on typical programming languages like C or Perl, can deduce structure from a large amount of unstructured data and then be plugged into SQL for further analysis. According to Argyros, that allows Aster to overcome some of SQL's own shortcomings, such as its inability to handle complex algorithmic analysis.
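The idea of deducing structure with MapReduce and then handing the result to SQL can be sketched as follows. This is only an illustration under stated assumptions, not either vendor's interface: the log format, the `LINE_PATTERN` regular expression, the `page_hits` table, and all function names are invented for the example, and Python's built-in `sqlite3` module stands in for a parallel PostgreSQL-based warehouse.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical free-text log format, invented for this example.
LINE_PATTERN = re.compile(r"user=(\w+).*?page=(\S+)")


def map_line(line):
    """Map step: pull a (visitor, page) key out of one raw line, if present."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return []  # lines without recognizable structure are skipped
    return [((match.group(1), match.group(2)), 1)]


def reduce_hits(values):
    """Reduce step: total the hits recorded for one (visitor, page) key."""
    return sum(values)


def build_table(lines):
    """Run map and reduce over raw lines, then load the result into SQL."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_hits (visitor TEXT, page TEXT, hits INTEGER)")
    conn.executemany(
        "INSERT INTO page_hits VALUES (?, ?, ?)",
        [(visitor, page, reduce_hits(vals))
         for (visitor, page), vals in groups.items()],
    )
    return conn


def hits_per_visitor(conn):
    """Further analysis in plain SQL on the now-structured data."""
    rows = conn.execute(
        "SELECT visitor, SUM(hits) FROM page_hits "
        "GROUP BY visitor ORDER BY visitor"
    )
    return rows.fetchall()
```

Here the procedural code handles the part SQL is poorly suited to (parsing irregular text), while grouping, joining, and reporting stay in SQL, where existing tools can reach them.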
Aster and Greenplum have each received some big-name backing. Some of Google's earliest investors supported Aster as angel investors, and the company is now backed by venture capital firm Sequoia Capital. Greenplum brought in a $27 million round of funding earlier this year from a list of investors that included Sun Microsystems and SAP.
Aster and Greenplum aren't the only ones taking advantage of MapReduce. Apache has also developed an open source implementation of MapReduce known as Hadoop. Yahoo uses Hadoop for Web search and advertising, and The New York Times has used Hadoop in combination with Amazon Web Services to convert millions of old articles, each stored as several separate scanned TIFF images, into PDF format. Microsoft Research has developed a similar parallel computing framework known as Dryad.