EMC Tries To Unify Big Data Analytics

EMC Greenplum Modular Data Computing Appliance puts SQL and Hadoop in the same box, but is it a truly cohesive platform?
For now, what SQL and Hadoop deployments share within the Modular DCA is systems monitoring, management, and provisioning. So you can use the same software to provision capacity for either environment, track hardware resource utilization and disk faults, and phone home if there are hardware management problems.

What's not yet unified within the appliance is the actual management of data and workflow. Most would-be customers will have SQL-savvy experts who can run the Greenplum database modules and any SQL-based analytical applications. But any Hadoop deployment will require experts who are familiar with Hadoop data-management tools.

EMC's long-term vision is to further unify data-management and data-integration workflows across its SQL and Hadoop environments, but it's unclear how soon that might happen and how seamless the experience might be.

Many SQL-based data warehousing platforms now have integrations to Hadoop. And it's common for the results of MapReduce and data-transformation jobs carried out in Hadoop to be passed along to SQL databases for further analysis using tools that are familiar to a much broader base of data-management professionals.

The upshot is that putting SQL and Hadoop in the same box will be a hollow victory unless EMC can somehow blur the boundaries between SQL and Hadoop processing and analysis. That's something competitor Aster Data, acquired last year by Teradata, has done with its SQL-MapReduce capabilities.

MapReduce is one of the most popular capabilities of Hadoop, used for pattern-detection, graph analysis, and time-series analysis on clickstreams and social-media data. With SQL-MapReduce, Aster supports these sorts of e-commerce and digital-marketing oriented analyses within the familiar confines of a SQL database.
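
To make the MapReduce pattern concrete, here is a minimal sketch in plain Python of the kind of clickstream counting job described above. It does not use Hadoop or Aster's SQL-MapReduce. The sample records and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical and only illustrate the map, group, and reduce steps that a real framework would distribute across many nodes.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, page). Not real data.
CLICKS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/cart"),
    ("u3", "/home"),
    ("u1", "/checkout"),
    ("u2", "/cart"),
]


def map_phase(records):
    """Emit a (page, 1) pair for every click."""
    for _user, page in records:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each page."""
    return {key: sum(values) for key, values in groups.items()}


page_views = reduce_phase(shuffle(map_phase(CLICKS)))
print(page_views)
```

In SQL terms, this particular job is equivalent to a `GROUP BY page` with a `COUNT(*)`, which is one reason results of Hadoop jobs are so often handed to SQL tools for further analysis.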

But the real competition for EMC will be stand-alone Hadoop deployments. The key question for customers looking to use Hadoop will be cost. Hadoop is typically scaled out on commodity hardware at a far lower cost than EMC Greenplum is used to charging. The company did not disclose the cost of the Modular DCA, but when EMC introduced its appliance last October, prices worked out to about $14,000 per terabyte for a standard SQL database.

In June, IBM Netezza reset the benchmark price for high-capacity archival appliances at about $2,500 per terabyte, and EMC no doubt matched that pricing. But according to Cloudera, the leading commercial support provider for Hadoop, deployments of the open-source platform can cost as little as $250 per terabyte. EMC will have to make a strong case for the advantages of its blended platform.
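
The size of that gap is easier to see with a quick calculation. The sketch below applies the three per-terabyte figures quoted above to a hypothetical 100-terabyte deployment. The capacity, the dictionary keys, and the `deployment_cost` helper are assumptions made for illustration, and the figures cover only the quoted per-terabyte prices, not staffing, support, or operating costs.

```python
# Per-terabyte prices quoted in the article, in US dollars.
PRICE_PER_TB = {
    "emc_sql_appliance": 14_000,  # EMC appliance pricing from last October
    "netezza_archival": 2_500,    # IBM Netezza high-capacity archival appliance
    "hadoop_commodity": 250,      # Cloudera's low-end figure for Hadoop
}

CAPACITY_TB = 100  # hypothetical deployment size, chosen only for illustration


def deployment_cost(capacity_tb, price_per_tb):
    """Return the cost of a deployment at a flat per-terabyte price."""
    return capacity_tb * price_per_tb


costs = {
    name: deployment_cost(CAPACITY_TB, price)
    for name, price in PRICE_PER_TB.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:,}")

# How many times more the October EMC pricing is than the Hadoop figure.
emc_vs_hadoop = PRICE_PER_TB["emc_sql_appliance"] / PRICE_PER_TB["hadoop_commodity"]
print(f"EMC SQL appliance vs. Hadoop: {emc_vs_hadoop:.0f}x")
```

At these prices the same 100 terabytes would run $1.4 million, $250,000, or $25,000, which is the spread a blended platform has to justify.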

In a second EMC announcement on Wednesday, the company said it has built a super-high-end Greenplum Analytics Workbench, a huge 1,000-node test and development platform running the vendor's enterprise Hadoop distribution. EMC says it will host the Workbench and make it available early next year at no charge to Hadoop developers, EMC partners, and academic researchers.

Packed with the most advanced components available from EMC and partners including Intel, Seagate, and Mellanox Technologies, the Workbench is aimed at giving back to the Hadoop community by enabling developers to test large-scale applications.

"Our partners that do advanced work with data--companies like comScore, Equifax, Acxiom, and others--have expressed serious interest in this platform as a way to demonstrate their data products in new applications," said Luke Lonergan, co-founder of Greenplum and chief technology officer of EMC's Data Computing Division. comScore, for example, offers a composite Twitter sentiment-analysis data product that requires extensive Hadoop processing.
