On the one hand there's structured data that fits neatly into the columns and rows of relational databases. That data has been mastered by relational databases, and even when it gets big (meaning north of about 10 terabytes), there are options such as massively parallel processing supported by products such as EMC's Greenplum database.
On the other hand there's the array of semi-structured, unstructured, and inconsistent data types like server log files, sensor data, social-network comments, and other forms of text-centric information. For that world the Hadoop open-source project has emerged as the leading platform for making such information computable. (Hadoop also handles highly structured data, but mostly as a high-capacity, low-cost data store.)
With Wednesday's release of the EMC Greenplum Modular Data Computing Appliance (DCA), EMC says it has unified these heretofore separate domains. It's a follow-up to the company's announcement last May of Greenplum HD Community and Enterprise distributions of Hadoop software and a promise to deliver a Hadoop appliance.
Greenplum's Community edition includes Hadoop MapReduce, the HDFS distributed file system, the Apache Hive query tool, the HBase column-oriented data store, and the ZooKeeper tool for configuring clusters. The Enterprise edition adds proprietary features for snapshotting and replication of Hadoop clusters as well as system management capabilities.
The Modular DCA is one box that supports multiple quarter-rack deployments, which can be mixed, matched, and scaled. You can start with a standard Greenplum Database Module for scalable SQL analysis and add a quarter-rack Greenplum HD module for running EMC's Hadoop release.
Other quarter-rack options include the Greenplum Database High Capacity Module, which combines more storage and less compute capacity than a standard module for high-scale, long-term archival storage at a lower cost per terabyte. There's also a Greenplum Data Integration Accelerator (DIA) module designed to host partner applications, like predictive analytics capabilities from SAS, data-integration software from Informatica, and other options said to be in review.
EMC's modular approach lets you scale standard SQL, Hadoop, archival, or analytic application capacity in quarter-rack increments up to a total of six full racks. EMC says its approach will not only save money by eliminating the need for separate hardware platforms but will also speed insight and minimize storage demands by streaming Hadoop analyses directly into the Greenplum database. In this approach, data doesn't have to be created and stored in one environment and then copied and moved into another.
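To make the quarter-rack arithmetic concrete: six full racks at four quarter-rack modules each gives a ceiling of 24 modules. The Python sketch below is purely illustrative and is not an EMC tool. The module labels, the `plan_summary` function, and the assumption that any mix of module types is allowed up to the ceiling are all hypothetical; actual configuration rules may differ.

```python
from collections import Counter

QUARTERS_PER_RACK = 4
MAX_RACKS = 6  # ceiling stated in the article
MAX_QUARTER_RACKS = QUARTERS_PER_RACK * MAX_RACKS  # 24 modules

# Hypothetical labels for the four module types named in the article.
MODULE_TYPES = {"database", "hd", "high_capacity", "dia"}


def plan_summary(modules):
    """Summarize a proposed mix of quarter-rack modules.

    Assumes any combination is valid up to the six-rack ceiling;
    real appliance rules may be stricter.
    """
    unknown = set(modules) - MODULE_TYPES
    if unknown:
        raise ValueError(f"unknown module types: {sorted(unknown)}")
    if len(modules) > MAX_QUARTER_RACKS:
        raise ValueError(
            f"{len(modules)} modules exceeds the {MAX_QUARTER_RACKS}-module ceiling"
        )
    return {
        "quarter_racks": len(modules),
        "full_racks": len(modules) / QUARTERS_PER_RACK,
        "by_type": dict(Counter(modules)),
    }


if __name__ == "__main__":
    # Start with one SQL module and one Hadoop module: half a rack in total.
    print(plan_summary(["database", "hd"]))
```

A starting configuration of one Database module and one HD module occupies half a rack, which leaves room for 22 more quarter-rack modules under this assumed ceiling.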
EMC used the words "coprocessing" and "marriage" to describe the blend of SQL and Hadoop within the modular appliance, but it's not quite that harmonious just yet.