Stepping up its pursuit of big-data analysis, EMC on Monday announced that it will release its own distributions of open-source Apache Hadoop distributed processing software, along with a related appliance that will analyze both structured and unstructured data on a single platform.
In a similar announcement, startup company DataStax on Monday released Brisk, a product that combines Apache Cassandra open-source software for large-scale transaction processing with a Hadoop distribution. The product provides a single platform combining a low-latency database for super high-volume Web and real-time applications with tightly coupled Hadoop analytics.
Throw SAP's well-publicized in-memory ambitions in with these new products, and a vision of the future emerges in which many leading IT vendors address mixed-data analysis on unified platforms. More on that later.
Hadoop is quickly gaining adoption due to its ability to analyze massive volumes of unstructured information, a category that includes textual information, such as social-network comments and email messages, and machine-generated data, such as network logs, security logs, application logs and sensor data, that doesn't fit neatly into consistent columns and rows.
EMC says it will release EMC Greenplum HD Community Edition and Enterprise Edition distributions of Hadoop in the third quarter, along with a Greenplum HD Data Computing Appliance. The latter will combine the Greenplum database and the Enterprise Edition distribution of Hadoop on a single appliance.
This isn't the first effort to analyze structured and unstructured data on a single platform, but it is the first appliance to run a relational database and the Hadoop stack on a single hardware platform. The combination should appeal to customers because it promises to improve performance while eliminating redundant hardware.
Unstructured data can't be analyzed in conventional relational databases, so organizations swamped with tens or hundreds of terabytes or more rely on Hadoop, which can spread processing across tens, hundreds, or thousands of compute nodes on commodity servers, depending on the scale of the deployment. Hadoop also provides a MapReduce engine, which helps split up workloads when handling particularly large sets of unstructured data.
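For readers unfamiliar with the model, the way MapReduce splits up a workload can be sketched in a few lines of plain Python. This is only an illustration of the general map, shuffle and reduce pattern, not Hadoop's actual API or anything specific to the EMC or DataStax products; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, and everything runs in a single process rather than across compute nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: turn one raw log line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(splits):
    """Run the three steps over a list of input splits.

    Each split stands in for the chunk of data one node would process.
    Here the splits are handled sequentially in one process.
    """
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical application-log lines, divided into two splits.
    splits = [
        ["ERROR disk full", "INFO job started"],
        ["ERROR network timeout", "INFO job finished"],
    ]
    counts = run_job(splits)
    print(counts["error"], counts["info"])
```

Because each split is mapped independently and each key is reduced independently, both steps could in principle be handed to separate machines, which is the property that lets this style of processing scale out across many commodity servers.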
To date, Hadoop deployments and conventional relational data warehouses have run on separate hardware platforms, yet companies usually need to do SQL-style analysis of the data sets that emerge from Hadoop analyses. Thus, plenty of data-integration and data-warehouse-appliance vendors have partnered with Cloudera, which has a popular Hadoop distribution and is the leading provider of enterprise-grade Hadoop services and support.
HP Vertica and Teradata, for example, integrate with Cloudera Hadoop deployments so data sets can be moved onto their platforms for further SQL analysis.
EMC Greenplum has also partnered with Cloudera, but with Monday's announcement it will effectively become a competitor by offering its own Hadoop software distributions, service and support, albeit with an emphasis on deployments on EMC appliances.
"With the amount of innovation that we see that's possible, it just makes much more sense for us to own the Hadoop distribution as part of our stack," said Luke Lonergan, a co-founder of Greenplum and chief technology officer of EMC's Data Computing Division.