EMC Greenplum Modular Data Computing Appliance puts SQL and Hadoop in the same box, but is it a truly cohesive platform?
Two separate worlds have emerged in big data analytics, but on Wednesday EMC announced a Greenplum appliance that aims to bring them together.
On the one hand there's structured data that fits neatly into the columns and rows of relational databases. Relational systems have mastered that data, and even when it gets big (meaning north of about 10 terabytes), there are massively parallel processing options such as EMC's Greenplum database.
On the other hand there's the array of semi-structured, unstructured, and inconsistent data types like server log files, sensor data, social-network comments, and other forms of text-centric information. For that world the Hadoop open-source project has emerged as the leading platform for making such information computable. (Hadoop also handles highly structured data, but mostly as a high-capacity, low-cost data store.)
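To make "computable" concrete: Hadoop jobs typically break such text-centric processing into map and reduce steps. The sketch below counts HTTP status codes in server log lines in that style; the log format and field positions are illustrative assumptions, not details from EMC's announcement.

```python
# MapReduce-style sketch: counting HTTP status codes in server logs.
# Assumption: combined-log-format lines with the status code in the
# ninth whitespace-separated field (an illustration, not an EMC spec).
from collections import Counter

def map_status(line):
    """Map step: emit the status code from one log line, or None."""
    fields = line.split()
    return fields[8] if len(fields) > 8 else None

def reduce_counts(statuses):
    """Reduce step: aggregate emitted status codes into counts."""
    return Counter(s for s in statuses if s is not None)

logs = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2011:13:55:37 -0700] "GET /missing HTTP/1.1" 404 209',
    '10.0.0.1 - - [10/Oct/2011:13:55:38 -0700] "GET /index.html HTTP/1.1" 200 2326',
]
counts = reduce_counts(map_status(line) for line in logs)
print(counts)  # Counter({'200': 2, '404': 1})
```

In a real cluster the map and reduce functions would run in parallel across HDFS blocks; the logic is the same.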
With Wednesday's release of the EMC Greenplum Modular Data Computing Appliance (DCA), EMC says it has unified these heretofore separate domains. It's a follow-up to the company's announcement last May of Greenplum HD Community and Enterprise distributions of Hadoop software and a promise to deliver a Hadoop appliance.
Greenplum's Community edition includes Hadoop MapReduce, the HDFS distributed file system, the Apache Hive query tool, the HBase column-oriented data store, and ZooKeeper tool for configuring clusters. The Enterprise edition adds proprietary features for snapshotting and replication of Hadoop clusters as well as system management capabilities.
The Modular DCA is one box that can support multiple quarter-rack deployments that can be mixed, matched, and scaled. You can start with a standard Greenplum Database Module for scalable SQL analysis and add a quarter-rack Greenplum HD module for running EMC's Hadoop release.
Other quarter-rack options include the Greenplum Database High Capacity Module, which pairs more storage with less compute capacity than a standard module for high-scale, long-term archival storage at a lower cost per terabyte. There's also a Greenplum Data Integration Accelerator (DIA) module designed to host partner applications, like predictive analytics capabilities from SAS, data-integration software from Informatica, and other options said to be in review.
EMC's modular approach lets you scale standard SQL, Hadoop, archival, or analytic application capacity in quarter-rack increments up to a total of six full racks. EMC says its approach will not only save money by eliminating the need for separate hardware platforms, it will also speed insight and minimize storage demands by streaming Hadoop analyses directly into the Greenplum database. In this approach, data doesn't have to be created and stored in one environment and then copied and moved into another.
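One way Greenplum exposes Hadoop results to SQL without a copy step is through external tables that read directly from HDFS (Greenplum's gphdfs protocol). The sketch below only assembles such statements; the cluster address, path, and column layout are hypothetical, chosen for illustration rather than taken from EMC's announcement.

```python
# Sketch: exposing Hadoop job output to Greenplum SQL via an external
# table over HDFS, so the data is queried in place rather than copied.
# Host, path, and schema below are illustrative assumptions.
ddl = """
CREATE EXTERNAL TABLE page_hits (status int, hits bigint)
LOCATION ('gphdfs://namenode:8020/output/status_counts/part-*')
FORMAT 'TEXT' (DELIMITER '\t');
""".strip()

query = "SELECT status, hits FROM page_hits ORDER BY hits DESC;"

# In practice these statements would be submitted through a SQL client
# against the Greenplum Database module; here we only build the strings.
print(ddl)
print(query)
```

Because the external table points at HDFS paths, the Hadoop module and the database module can share one dataset inside the appliance.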
EMC used the words "coprocessing" and "marriage" to describe the blend of SQL and Hadoop within the modular appliance, but it's not quite that harmonious just yet.