Google captured the big data community's attention last week by announcing Google Cloud Dataflow, a service that replaces MapReduce processing. Apache Spark will grab the spotlight at Spark Summit 2014 in San Francisco this week, and Databricks, the company behind Spark, will make more announcements that will shake up the big data world.
Dataflow and Spark are making waves because they're putting MapReduce, a core component of Hadoop, on the endangered-species list. The Hadoop community was already moving away from MapReduce because its slow, batchy nature and obscure programming approaches weren't compatible with enterprise adoption. Last year's Hadoop 2.0 release incorporated YARN (Yet Another Resource Negotiator) to support a much wider range of data-analysis approaches. YARN notwithstanding, Spark in particular might just replace much more than MapReduce, even if there's no intention to kill Hadoop.
Dataflow and Spark are similar in that they offer batch as well as iterative and streaming analysis approaches that are far broader and more capable than MapReduce batch processing. Dataflow is a service designed strictly for the Google Compute Cloud, so it's no direct threat to Hadoop. But Google wrote the whitepapers that inspired the development of Hadoop (way back in 2004). Google remains hugely influential in big data circles today, so developers are likely to follow its lead.
Spark has been around much longer than Dataflow. It was developed at UC Berkeley's AMPLab in 2009, was open-sourced in 2010, and became an Apache project in 2013. Spark is best known for in-memory machine learning, but it also supports SQL analysis and streaming analysis, and work is underway to bring the popular R analytics language and graph analysis into the framework.
"One of the great things about Apache Spark is that it's a single environment and you have a single API from which you can call machine learning algorithms, or you can do graph processing or SQL," said Ion Stoica, CEO of Databricks, in a phone interview with InformationWeek. "In terms of development, Spark supports SQL, Python, Java, Scala, and, soon, R, as input languages, but then it's a single system -- so it's not separate tools -- they're libraries that can all be called with one system."
The appeal of Spark hasn't been lost on Hadoop providers, all of which have partnered with Databricks to bring Spark into their software distributions. First among these partners was Cloudera, which partnered with Databricks in February and is now shipping Spark software and supporting production-ready deployments. MapR, IBM, Pivotal, and just last week, Hortonworks, have since joined the list.
Spark can't replace Hadoop outright because it's strictly for data analysis. It needs a high-scale storage layer upon which to operate. For that it uses the basics of Hadoop -- the storage layer, management capabilities, and high availability and redundancy features -- as a data platform (just as Dataflow operates on top of the Google Cloud Datastore). But Hadoop vendors are counting on YARN to help them offer all sorts of analysis options on top of that same platform.
With Spark, Databricks is set to argue this week, organizations will be able to replace many components of Hadoop, not just MapReduce. Machine learning and stream processing are the obvious use cases, but Databricks will also highlight Spark's SQL capabilities -- a threat to Hive, Impala, and Drill -- as well as its aspirations for graph analysis and R-based data mining. So what's left to do in other software?
More of the Databricks announcements will be revealed Monday afternoon, but Hadoop vendors were already downplaying the potential impact of Google Cloud Dataflow and Spark last week, both in public forums and in response to questions InformationWeek submitted by email.
"Traditional Hadoop's demise started in 2008, when Arun Murthy and the team at Yahoo saw the need for Hadoop to move beyond its MapReduce-only roots," said Shaun Connolly, Hortonworks' VP for corporate strategy, citing the work by now-Hortonworks-executive Murthy to lead the development of YARN. "The arrival of new engines such as Spark is a great thing, and by YARN-enabling them, we help ensure that Hadoop takes advantage of these new innovations in a way that enterprises can count on and consume."
Spark has lots of momentum, acknowledged MapR CMO Jack Norris, but he characterized it as a "very early" technology. "Yes, it can do a range of processing, but there are many issues in the framework that limit the use cases," Norris said. "One example is that it is dependent on available memory; any large dataset that exceeds that will hit a huge performance wall."
It's certainly true that it's very early days for Spark, but its ambitions to be the choice for many forms of data analysis should sound familiar. Teradata and Pivotal, for example, have attempted to stake out much of the same high ground of data analysis with their commercial tools, leaving Hadoop marginalized as just a high-scale, low-cost data-storage platform.
With Teradata, Hadoop is the big storage lake, but the analysis platform is its Aster database, which supports SQL as well as SQL-based MapReduce processing, graph analysis, time-series analysis, and (as of last week) R-based analysis across its distributed cluster.
Pivotal has its own Hadoop distribution, Pivotal HD, and that's where it handles batch workloads. But for interactive analysis it's touting Greenplum database and the derivative HAWQ SQL-on-Hadoop option. For real-time processing it offers GemFire, SQLFire, and the derivative combination of the two, GemFire XD, which it describes as an in-memory alternative to Spark.
Spark's advantage is that it's broad, open source, and widely supported, including on the Cassandra NoSQL database and Amazon Web Services S3, on which it can also run. Spark's disadvantage is that it's very new and little known in the enterprise community. The promise of a simpler, more cohesive alternative to the menagerie of data analysis tools used with Hadoop is certainly compelling. But it has yet to be proven in broad production use that Spark tools are simpler and more cohesive than the better-known options used today, and at least as performant.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise.