Will Spark, Google Dataflow Steal Hadoop's Thunder?
Apache Spark and Google's Cloud Dataflow service won't kill Hadoop, but they're aimed at the high-value role in big data analysis.
Google captured the big data community's attention last week by announcing Google Cloud Dataflow, a service that replaces MapReduce processing. Apache Spark will grab the spotlight at Spark Summit 2014 in San Francisco this week, and Databricks, the company behind Spark, will make more announcements that will shake up the big data world.
Dataflow and Spark are making waves because they're putting MapReduce, a core component of Hadoop, on the endangered-species list. The Hadoop community was already moving away from MapReduce because its slow, batchy nature and obscure programming approaches weren't compatible with enterprise adoption. Last year's Hadoop 2.0 release incorporated YARN (Yet Another Resource Negotiator) to support a much wider range of data-analysis approaches. YARN notwithstanding, Spark in particular might just replace much more than MapReduce, even if there's no intention to kill Hadoop.
Dataflow and Spark are similar in that both offer batch, iterative, and streaming analysis, a far broader range than MapReduce's batch processing. Dataflow is a service designed strictly for Google Cloud Platform, so it's no direct threat to Hadoop. But Google wrote the whitepapers that inspired the development of Hadoop (way back in 2004). Google remains hugely influential in big data circles today, so developers are likely to follow its lead.
Spark has been around much longer than Dataflow. It was developed at UC Berkeley's AMPLab in 2009, was open sourced in 2010, and became an Apache project in 2013. Spark is best known for in-memory machine learning, but it also supports SQL analysis and streaming analysis, and work is underway to bring the popular R analytics language and graph analysis into the framework.
Follow the theme: Apache Spark addresses MapReduce, machine learning, SQL analysis, graph analysis, streaming analysis, and R analytics.
Cloudera also touts machine learning and stream processing (through Spark), but Impala is its SQL tool, and Hadoop's MapReduce is for batch processing.
"One of the great things about Apache Spark is that it's a single environment and you have a single API from which you can call machine learning algorithms, or you can do graph processing or SQL," said Ion Stoica, CEO of Databricks, in a phone interview with InformationWeek. "In terms of development, Spark supports SQL, Python, Java, Scala, and, soon, R, as input languages, but then it's a single system -- so it's not separate tools -- they're libraries that can all be called with one system."
The appeal of Spark hasn't been lost on Hadoop providers, all of which have partnered with Databricks to bring Spark into their software distributions. First among these partners was Cloudera, which partnered with Databricks in February and is now shipping Spark software and supporting production-ready deployments. MapR, IBM, Pivotal, and just last week, Hortonworks, have since joined the list.
Spark can't replace Hadoop outright because it's strictly for data analysis. It needs a high-scale storage layer upon which to operate. For that it uses
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...