Cloudera Sees Spark Emerging As Hadoop Engine - InformationWeek
9/9/2015
09:20 AM


Cloudera launches the One Platform Initiative to advance Spark as the data processing successor to MapReduce inside Hadoop.


As the Internet of Things starts producing streams of data, there will need to be a processor capable of handling them. Cloudera chief technologist Eli Collins says open source Spark will be that engine, and that means it's destined to become the default data processor inside of Hadoop.

That belief is behind Cloudera's launch of the One Platform Initiative on Wednesday, Sept. 9. Spark is currently the most active project inside the Apache Software Foundation.

Spark may be considered the rightful successor to MapReduce, which was born 15 years ago inside Google as part of the operations behind the world's leading search engine. Indeed, Collins said, there's now 50% more developer activity behind the Spark project than there is backing Hadoop itself.

In Collins's and Cloudera's view, that means Spark is eventually destined to replace MapReduce. It's not so much that MapReduce is deficient. On the contrary, many Spark algorithms implement the same ideas as MapReduce. But it's time for a more up-to-date design for data distribution on a cluster, Collins said in an interview.

"Spark builds on research and work done for MapReduce. It's a successor," Collins said on the eve of Cloudera's announcement of the One Platform Initiative.

[Want to learn more about what's behind Hadoop? See Big Data Moves Toward Real Time Analysis.]

The One Platform Initiative has a lot of work to do before Spark becomes that replacement. Hadoop itself was born as an extension of MapReduce in 2005, when Doug Cutting at Yahoo built a distributed file system (HDFS) to work with it.

Collins said some of the needed work on Spark has been underway for the last two years. "We thought 18 months ago that Spark was ready to be put in as the engine in place of MapReduce," and Cloudera committed developers to see that additional, enterprise-oriented work got done. Cloudera employs five of the committers on the Spark Project, or five times more than any of its competitors, Collins noted.

(Image: Mikhail Tolstoy/iStockphoto)

Cloudera has contributed over 370 patches and 43,000 lines of code to the project. It's worked closely with Intel, which chose Cloudera as a partner to further Spark development. From this work, Cloudera has gained insight into the challenges of running Spark in production environments, and it has observed how analytics teams want to use Hadoop, Collins said.

Nevertheless, getting Spark into Hadoop as a replacement engine is still a community effort. There are over 200 developers involved in Spark's ongoing development.

Between Cloudera, Hortonworks, and MapR, there are at least 2,000 companies making use of the current Hadoop based on MapReduce. For a version based on Spark to replace it, Spark will need to do more than just match the scale and performance of MapReduce jobs running today. Those jobs can involve hundreds of terabytes of data daily.

"Spark is well on its way to replace MapReduce to enable jobs with hundreds of executors each, running simultaneously on large multi-tenant clusters ... but there is still some heavy lifting to do," noted Mike Olson, Cloudera chief strategy officer, in the announcement.

Collins said Spark will need to be able to exceed MapReduce's capabilities. Spark is becoming a superset of MapReduce, able to provide all its functions and then some. Spark "can be an order of magnitude faster," said Collins. Its APIs "are a lot nicer for writing a data pipeline" to work with Hadoop, and you can create data applications in a number of programming languages. MapReduce prefers that you write those programs in Java.
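The pipeline style Collins describes can be sketched in plain Python. This is a toy stand-in, not Spark's actual API (the real methods are `flatMap` and `reduceByKey` on an RDD, and the class, data, and names here are illustrative): a word count written as one chain of transformations, the shape Spark encourages, versus MapReduce's separate map, shuffle, and reduce phases.

```python
class MiniRDD:
    """A toy imitation of Spark-style chained transformations (not the real API)."""

    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        # One input item can yield many output items (e.g. a line -> its words).
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        # Stand-in for MapReduce's shuffle + reduce: combine values sharing a key.
        groups = {}
        for key, value in self.data:
            groups[key] = f(groups[key], value) if key in groups else value
        return MiniRDD(groups.items())


lines = ["spark builds on mapreduce", "spark succeeds mapreduce"]

# The whole job reads as one chained pipeline: split lines into words,
# pair each word with a count of 1, then sum the counts per word.
counts = (MiniRDD(lines)
          .flat_map(str.split)
          .map(lambda word: (word, 1))
          .reduce_by_key(lambda a, b: a + b))

word_counts = dict(counts.data)
```

In classic MapReduce, each of those three steps would be a separate mapper or reducer class written in Java; here they read as one expression over the data set.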

The One Platform Initiative is focused on getting Spark ready to work with Hadoop in four key areas: security, scale, management, and streaming. The Internet of Things will generate streams of data, and Spark already has a data streaming capability; now that capability needs to be integrated with Hadoop operations. In addition, Spark comes with an existing library of machine learning algorithms, a natural complement to its ability to absorb and use data streamed off the Internet of Things.
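Spark's streaming capability at the time worked by chopping a live stream into small time-sliced batches and running the batch engine on each one. A minimal sketch of that micro-batch idea in plain Python, where the event tuples, the five-second interval, and the per-device count are illustrative choices rather than anything from Spark's API:

```python
from collections import Counter


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time windows,
    mimicking how a micro-batch streaming engine slices a live stream."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]


# Simulated sensor readings: (seconds since start, device id).
events = [(0, "a"), (1, "b"), (2, "a"), (5, "c"), (6, "a"), (11, "b")]

# The same batch computation (a count per device) runs on each micro-batch,
# which is what lets a batch engine serve near-real-time workloads.
per_batch_counts = [Counter(batch) for batch in micro_batches(events, interval=5)]
```

The point of the sketch is the reuse: once stream input is reduced to a sequence of small batches, any existing batch job can be applied to each slice unchanged.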

At the same time, Spark needs to be integrated with other components of the Apache software library, including Solr, the open source search engine; Pig, a language for building Hadoop applications; HBase, a Java-based NoSQL system that runs on top of HDFS; and Hive, a data warehouse that works with Hadoop.

Collins noted that regulated industries need more security, including access control, encryption for data both in motion and at rest, and auditability, in a Spark-based Hadoop before they can use it. One goal of the One Platform Initiative is to make Spark ready for those industries.

Cloudera plans to incorporate an updated version of Spark as the core engine of its open source version of Hadoop, Cloudera CDH. That is expected to happen sometime in 2016 as it moves CDH to version 6.0, Collins said. CDH is currently on release 5.4.5.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ... View Full Bio

Comments
Charlie Babcock, 9/9/2015, 1:34:24 PM
Apache Spark designed to process streams, do real-time tasks
The HDFS file system spread data objects out across a cluster's distributed storage. MapReduce spread processing of the data out over the cluster, based on which data was located close to which processor. Running in parallel, those jobs crunched through huge amounts of data in record time. But MapReduce was designed as a batch processor, and the newer generation of applications have real-time tasks. That's where Spark will come in.