Commentary
6/30/2014
09:36 AM
Doug Henschen

Will Spark, Google Dataflow Steal Hadoop's Thunder?

Apache Spark and Google's Cloud Dataflow service won't kill Hadoop, but they're aimed at the high-value role in big data analysis.

Google captured the big data community's attention last week by announcing Google Cloud Dataflow, a service that replaces MapReduce processing. Apache Spark will grab the spotlight at Spark Summit 2014 in San Francisco this week, and Databricks, the company behind Spark, will make more announcements that will shake up the big data world.

Dataflow and Spark are making waves because they're putting MapReduce, a core component of Hadoop, on the endangered-species list. The Hadoop community was already moving away from MapReduce because its slow, batchy nature and obscure programming approaches weren't compatible with enterprise adoption. Last year's Hadoop 2.0 release incorporated YARN (Yet Another Resource Negotiator) to support a much wider range of data-analysis approaches. YARN notwithstanding, Spark in particular might just replace much more than MapReduce, even if there's no intention to kill Hadoop.
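For readers who haven't worked with it, the MapReduce pattern itself can be sketched in a few lines. What follows is a minimal, single-machine illustration in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example. It shows the rigid map, shuffle, reduce sequence that every job has to be squeezed into, which is part of why the approach can feel awkward for anything iterative or interactive.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark and hadoop", "hadoop and yarn"]
    print(word_count(sample))
```

In a real cluster each phase runs as a distributed batch step with intermediate results written to disk between stages; this sketch keeps everything in memory purely to show the shape of the programming model.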


Dataflow and Spark are similar in that they offer batch as well as iterative and streaming analysis approaches that are far broader and more capable than MapReduce batch processing. Dataflow is a service designed strictly for the Google Compute Cloud, so it's no direct threat to Hadoop. But Google wrote the whitepapers that inspired the development of Hadoop (way back in 2004). Google remains hugely influential in big data circles today, so developers are likely to follow its lead.

Spark has been around much longer than Dataflow. It was developed at UC Berkeley's AMPLab in 2009, was open-sourced in 2010, and later moved to the Apache Software Foundation. Spark is best known for in-memory machine learning, but it also supports SQL analysis and streaming analysis, and work is underway to bring the popular R analytics language and graph analysis into the framework.

Follow the theme: Apache Spark addresses MapReduce, machine learning, SQL analysis, graph analysis, streaming analysis, and R analytics.

Cloudera also touts machine learning and stream processing (through Spark), but Impala is its SQL tool, and Hadoop's MapReduce is for batch processing.

"One of the great things about Apache Spark is that it's a single environment and you have a single API from which you can call machine learning algorithms, or you can do graph processing or SQL," said Ion Stoica, CEO of Databricks, in a phone interview with InformationWeek. "In terms of development, Spark supports SQL, Python, Java, Scala, and, soon, R, as input languages, but then it's a single system -- so it's not separate tools -- they're libraries that can all be called with one system."

The appeal of Spark hasn't been lost on Hadoop providers, all of which have partnered with Databricks to bring Spark into their software distributions. First among these partners was Cloudera, which partnered with Databricks in February and is now shipping Spark software and supporting production-ready deployments. MapR, IBM, Pivotal, and, just last week, Hortonworks have since joined the list.

Spark can't replace Hadoop outright because it's strictly for data analysis. It needs a high-scale storage layer upon which to operate. For that it uses


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...
Comments
BigDataMercs
User Rank: Apprentice
8/23/2014 | 2:22:53 AM
Re: Great context
Be glad to. In short.... 

just my thoughts... 

It's the punch-card analogy to today's in-memory, high-availability expectations from the business world. They don't give a shit how "cute" it is under the hood... tactical answers... Can DoucheHoop produce? Not really. The paradigm has shifted already...

 

GodSpeed. 
souravtri
User Rank: Apprentice
7/18/2014 | 7:49:29 AM
Yes, Spark et al is the way forward!
While Hadoop's HDFS is great for (virtually infinite) distributed storage, Hadoop's MapReduce sucks in terms of processing performance and support for easy access to data.

Spark happens to be a great step forward in mitigating the above issues, with significantly improved performance, polyglot language support, and strong SQL support (with Shark). It would also be interesting to see hardware advancements (DRAM) that can retain much more data in memory.

My belief is that HDFS will continue to be used for storage, and incremental improvements in the processing layer (like Spark) will strengthen real-time, fast access to data and analytics.

 

 
Li Tan
User Rank: Ninja
7/1/2014 | 9:35:03 AM
Re: Life at Big Data's pinnacle is getting hazardous
I see Spark as a kind of enhancement to Hadoop at a higher level of big data analysis. It will not kill Hadoop, but some changes will certainly happen. Some things in the Hadoop framework will get deprecated, but the foundation will remain.
D. Henschen
User Rank: Author
7/1/2014 | 9:14:39 AM
Re: Life at Big Data's pinnacle is getting hazardous
Don't read the emergence of Spark as a death sentence for Hadoop. Spark needs a data platform like Hadoop (or Cassandra or a durable cloud storage option like S3) to run on top of. What it might replace is the menagerie of data-analysis and processing tools -- Hadoop MapReduce, Hive, Impala, Mahout, etc. -- that run on top of Hadoop. HDFS, with redundancy, high availability, management, and security features, is what remains.
Charlie Babcock
User Rank: Author
6/30/2014 | 8:54:24 PM
Life at Big Data's pinnacle is getting hazardous
Oracle launched version 1 in 1979, with IBM's DB2 soon to follow, and the relational database has reigned supreme for 30-32 years. Has life at the pinnacle for a data management system, such as Hadoop, shrunk to 10-12 years? I don't believe it. Still, you can see the timeline compression going on, with intense interest followed by thought leaders producing new systems in rapid succession.
brunoaziza
User Rank: Apprentice
6/30/2014 | 3:11:14 PM
Re: Great context
Great article, Doug. Laurianne - I think the most obvious answer might be speed.

MapReduce is a great framework, but companies that want to do analysis at scale struggle to get answers at the 'speed of business'.

Using Spark, our algorithm ran about 100X faster to give you an idea (we ran about 50M rows in less than 50 seconds).

If you want to know what Spark is or how you can run Machine Learning at scale using Spark, please feel free to read a blog post we authored here

Analytically Yours,

Bruno
D. Henschen
User Rank: Author
6/30/2014 | 1:40:36 PM
Re: Great context
MapReduce is clearly being reinvented. The tougher question is whether Spark can usurp Hadoop-distributor-favored tools such as Hive, Impala, and other frameworks established and yet to come. There's a danger for Hadoop distributors in not having a piece of the high-value-analytics action.
Laurianne
User Rank: Author
6/30/2014 | 1:07:57 PM
Great context
Great context on Spark, Doug. Anyone weighing the MapReduce shortcomings want to chime in here?