Will Spark, Google Dataflow Steal Hadoop's Thunder? - InformationWeek

Commentary
6/30/2014
09:36 AM
Doug Henschen

Will Spark, Google Dataflow Steal Hadoop's Thunder?

Apache Spark and Google's Cloud Dataflow service won't kill Hadoop, but they're aimed at the high-value role in big data analysis.

the basics of Hadoop -- the storage layer, management capabilities, and high availability and redundancy features -- as a data platform (just as Dataflow operates on top of the Google Cloud Datastore). But Hadoop vendors are counting on YARN to help them offer all sorts of analysis options on top of that same platform.

With Spark, Databricks is set to argue this week, organizations will be able to replace many components of Hadoop, not just MapReduce. Machine learning and stream processing are the obvious use cases, but Databricks will also highlight its SQL capabilities -- a threat to Hive, Impala, and Drill -- as well as its aspirations for graph analysis and R-based data mining. So what's left to do in other software?

There's more to the Databricks announcements to be revealed Monday afternoon, but Hadoop vendors were already downplaying the potential impact of Google Cloud Dataflow and Spark last week in public forums and in response to questions submitted in email by InformationWeek.

"Traditional Hadoop's demise started in 2008, when Arun Murthy and the team at Yahoo saw the need for Hadoop to move beyond its MapReduce-only roots," said Shaun Connolly, Hortonworks' VP for corporate strategy, citing the work by now-Hortonworks-executive Murthy to lead the development of YARN. "The arrival of new engines such as Spark is a great thing, and by YARN-enabling them, we help ensure that Hadoop takes advantage of these new innovations in a way that enterprises can count on and consume."

Spark has lots of momentum, acknowledged MapR CMO Jack Norris, but he characterized it as a "very early" technology. "Yes, it can do a range of processing, but there are many issues in the framework that limit the use cases," Norris said. "One example is that it is dependent on available memory; any large dataset that exceeds that will hit a huge performance wall."

Teradata does MapReduce, SQL, Graph, Time-Series, and R-based analyses all on its commercial Aster database, connecting to Hadoop as a data source.

Pivotal's Hadoop platform supports batch processing (meaning MapReduce). Commercial Greenplum and HAWQ support SQL, while GemFire XD supports streaming and iterative, in-memory analysis like machine learning.

It's certainly true that it's very early days for Spark, but its ambitions to be the choice for many forms of data analysis should sound familiar. Teradata and Pivotal, for example, have attempted to stake out much of the same high ground of data analysis with their commercial tools, leaving Hadoop marginalized as just a high-scale, low-cost data-storage platform.

With Teradata, Hadoop is the big storage lake, but the analysis platform is its Aster database, which supports SQL as well as SQL-based MapReduce processing, Graph analysis, time-series analysis, and (as of last week) R-based analysis across its distributed cluster.

Pivotal has its own Hadoop distribution, Pivotal HD, and that's where it handles batch workloads. But for interactive analysis it's touting Greenplum database and the derivative HAWQ SQL-on-Hadoop option. For real-time processing it offers GemFire, SQLFire, and the derivative combination of the two, GemFire XD, which it describes as an in-memory alternative to Spark.

Spark's advantage is that it's broad, open source, and widely supported; beyond Hadoop, it can also run on the Cassandra NoSQL database and Amazon Web Services S3. Spark's disadvantage is that it's very new and little known in the enterprise community. The promise of a simpler, more cohesive alternative to the menagerie of data analysis tools used with Hadoop is certainly compelling. But it has yet to be proven in broad production use that Spark tools are simpler, more cohesive, and as performant as (or more performant than) the better-known options used today.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...
Comments
Laurianne
User Rank: Author
6/30/2014 | 1:07:57 PM
Great context
Great context on Spark, Doug. Anyone weighing the MapReduce shortcomings want to chime in here?
D. Henschen
User Rank: Author
6/30/2014 | 1:40:36 PM
Re: Great context
MapReduce is clearly being reinvented. The tougher question is whether Spark can usurp Hadoop-distributor-favored tools such as Hive, Impala, and other frameworks established and yet to come. There's a danger for Hadoop distributors in not having a piece of the high-value-analytics action.
brunoaziza
User Rank: Apprentice
6/30/2014 | 3:11:14 PM
Re: Great context
Great article, Doug. Laurianne - I think the most obvious answer might be speed.

MapReduce is a great framework, but companies that want to do analysis at scale struggle to get answers at the 'speed of business'.

To give you an idea, our algorithm ran about 100X faster using Spark (we ran about 50M rows in less than 50 seconds).

If you want to know what Spark is or how you can run Machine Learning at scale using Spark, please feel free to read a blog post we authored here

Analytically Yours,

Bruno
Charlie Babcock
User Rank: Author
6/30/2014 | 8:54:24 PM
Life at Big Data's pinnacle is getting hazardous
Oracle launched version 1 in 1979, with IBM's DB2 soon to follow, and the relational database has reigned supreme for 30-32 years. Has life at the pinnacle for a data management system, such as Hadoop, shrunk to 10-12 years? I don't believe it. Still, you can see the timeline compression going on, with intense interest followed by thought leaders producing new systems in rapid succession.
D. Henschen
User Rank: Author
7/1/2014 | 9:14:39 AM
Re: Life at Big Data's pinnacle is getting hazardous
Don't read the emergence of Spark as a death sentence for Hadoop. Spark needs a data platform like Hadoop (or Cassandra or a durable cloud storage option like S3) to run on top of. What it might replace is the menagerie of data-analysis and processing tools -- Hadoop MapReduce, Hive, Impala, Mahout, etc. -- that run on top of Hadoop. HDFS, with redundancy, high availability, management, and security features, is what remains.
Li Tan
User Rank: Ninja
7/1/2014 | 9:35:03 AM
Re: Life at Big Data's pinnacle is getting hazardous
I see Spark as a kind of enhancement to Hadoop at the higher, big data analysis level. It will not kill Hadoop, but for sure some changes will happen. Something in the Hadoop framework will get deprecated, but the foundation will remain.
souravtri
User Rank: Apprentice
7/18/2014 | 7:49:29 AM
Yes, Spark et al is the way forward!
While Hadoop's HDFS is great for (virtually infinite) distributed storage, Hadoop's MapReduce sucks in terms of processing performance and support for easy access to data.

Spark happens to be a great step forward in mitigating the above issues, with significantly improved performance, polyglot support, and strong SQL capabilities (with Shark). It would also be interesting to see hardware advancements (DRAM) that can retain much more data in memory.

My belief is that HDFS will continue to be used for storage, and incremental improvements in the processing layer (like Spark) will strengthen real-time, fast access to data and analytics.
BigDataMercs
User Rank: Apprentice
8/23/2014 | 2:22:53 AM
Re: Great context
Be glad to. In short.... 

just my thoughts... 

It's the punchcard analogy to today's in-memory, high-availability expectations from the business world. They don't give a shit how "cute" it is under the hood... tactical answers... Can DoucheHoop produce? Not really. The paradigm has shifted already...

GodSpeed. 
LarsF931
User Rank: Apprentice
11/25/2014 | 9:29:15 PM
Re: Great context
I think the issue with many dataflow libraries, Spark and Google Dataflow included, is the lack of tooling and collaboration aspects. Newcomers like dataflowanalytics.com are making a splash, allowing users to make performant dataflow apps quickly by leveraging other people's components.