The advantage of the Greenplum HD Data Computing Appliance will be its ability to run a relational database and Hadoop on a single appliance. That will not only save money by eliminating the need for separate hardware platforms, it will also speed insight and minimize storage demands by streaming Hadoop analyses directly into the Greenplum database. In this approach, data doesn't have to be created and stored in one environment and then copied and moved into another.
"One team can handle both environments, and the streaming capabilities will drastically reduce the amount of duplicate data," Lonergan said.
EMC's Enterprise Edition software features a proprietary replacement for the Hadoop Distributed File System (HDFS). EMC says this HDFS alternative, developed by EMC partner MapR, delivers performance two to five times faster than standard HDFS.
The MapR file system also provides capabilities not supported in standard HDFS, including snapshots and replication for backup and recovery, simple data loading and access via a network file system, automatic failure detection and notification for improved reliability, and multi-site management features.
As noted, EMC's planned appliance is not the first to support mixed structured and unstructured data analysis. Greenplum itself can already query HDFS from within the Greenplum database. And Aster Data, which was recently acquired by Teradata, has won most of its customers on the strength of its SQL-MapReduce capabilities, which let developers handle many types of unstructured-data query and processing jobs, though not quite to the degree supported by Hadoop.
Given the fast-moving state of Hadoop developments, there will undoubtedly be more novel Hadoop combinations aimed at blended data-analysis capabilities. In fact, Lonergan acknowledged DataStax's achievement and predicted that within two to three years, single platforms -- including EMC's -- will handle all three of the following domains: unstructured data with Hadoop-style analysis; structured data query with SQL analysis and data mining; and real-time, low-latency in-memory analysis of high volumes of information.
EMC has the first two covered and is "working aggressively" to cover the third, Lonergan said. SAP is tackling the second and third domains with its in-memory strategy, and SAP's BusinessObjects analytics initiatives could lead to interest in unstructured-data analysis.
DataStax has addressed unstructured and real-time analysis with Brisk, and it could add other open-source software for SQL-relational analysis.
Oracle has talked up the blend of transactional and analytics support, but it's currently an either-or proposition when it comes to configuring Exadata. Real-time, in-memory loading and analysis are also not yet part of the picture with Exadata.
If innovators pioneer all-purpose analytic databases, it's easy to see that IBM, Oracle, Teradata and all other database contenders will have to respond.