As the Internet of Things starts producing streams of data, something will need to be capable of processing them. Cloudera chief technologist Eli Collins says open source Spark will be that engine, and that means it's destined to become the default data processor inside Hadoop.
That belief is behind Cloudera's launch of the One Platform Initiative on Wednesday, Sept. 9. Spark is currently the most active project inside the Apache Software Foundation.
Spark may be considered the rightful successor to MapReduce, which was born more than a decade ago inside Google as part of the operations behind the world's leading search engine. Indeed, Collins said, there's now 50% more developer activity behind the Spark project than there is backing Hadoop itself.
In Collins's and Cloudera's view, that means Spark is eventually destined to replace MapReduce. It's not so much that MapReduce is deficient. On the contrary, many Spark algorithms implement the same ideas as MapReduce. But it's time for a more up-to-date design for data distribution on a cluster, Collins said in an interview.
"Spark builds on research and work done for MapReduce. It's a successor," Collins said on the eve of Cloudera's announcement of the One Platform Initiative.
[Want to learn more about what's behind Hadoop? See Big Data Moves Toward Real Time Analysis.]
The One Platform Initiative has a lot of work to do before Spark becomes that replacement. Hadoop itself was born as an extension of MapReduce in 2005, when Doug Cutting, who brought the project to Yahoo, paired an open source MapReduce implementation with a distributed file system (HDFS).
Collins said some of the needed work on Spark has been underway for the last two years. "We thought 18 months ago that Spark was ready to be put in as the engine in place of MapReduce," and Cloudera committed developers to see that the additional, enterprise-oriented work got done. Cloudera employs five of the committers on the Spark project, five times as many as any of its competitors, Collins noted.
Cloudera has contributed over 370 patches and 43,000 lines of code to the project. It's worked closely with Intel, which chose Cloudera as a partner to further Spark development. From this work, Cloudera has gained insight into the challenges of running Spark in production environments, and it has observed how analytics teams want to use Hadoop, Collins said.
Nevertheless, getting Spark into Hadoop as a replacement engine is still a community effort. There are over 200 developers involved in Spark's ongoing development.
Between Cloudera, Hortonworks, and MapR, there are at least 2,000 companies making use of the current Hadoop based on MapReduce. For a version based on Spark to replace it, Spark will need to do more than just match the scale and performance of MapReduce jobs running today. Those jobs can involve hundreds of terabytes of data daily.
"Spark is well on its way to replace MapReduce to enable jobs with hundreds of executors each, running simultaneously on large multi-tenant clusters ... but there is still some heavy lifting to do," noted Mike Olson, Cloudera chief strategy officer, in the announcement.
Collins said Spark will need to be able to exceed MapReduce's capabilities. Spark is becoming a superset of MapReduce, able to provide all its functions and then some. Spark "can be an order of magnitude faster," said Collins. Its APIs "are a lot nicer for writing a data pipeline" to work with Hadoop, and you can create data applications in a number of programming languages. MapReduce prefers that you write those programs in Java.
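That difference in API style is easiest to see with the canonical word-count job. The sketch below is a toy, single-process illustration: the `MiniRDD` class is a hypothetical stand-in for Spark's distributed RDD abstraction (a real PySpark job would start from `SparkContext.parallelize`), but the chained `flatMap`/`map`/`reduceByKey` pipeline mirrors the actual Spark API, and the reduce-by-key step is the same idea MapReduce expresses with separate mapper and reducer programs.

```python
from collections import defaultdict

class MiniRDD:
    """Toy, in-memory stand-in for a Spark RDD (illustrative only;
    real RDDs are partitioned across a cluster)."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, f):
        # One input record can yield many output records.
        return MiniRDD(x for item in self.items for x in f(item))

    def map(self, f):
        return MiniRDD(f(item) for item in self.items)

    def reduceByKey(self, f):
        # Group values by key (MapReduce's shuffle), then fold each group.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        out = []
        for key, values in groups.items():
            acc = values[0]
            for value in values[1:]:
                acc = f(acc, value)
            out.append((key, acc))
        return MiniRDD(out)

    def collect(self):
        return self.items

docs = ["spark builds on mapreduce", "spark streams data"]

# The whole word-count job reads as one chained pipeline.
counts = (MiniRDD(docs)
          .flatMap(str.split)               # one record per word
          .map(lambda w: (w, 1))            # key-value pairs, as in MapReduce
          .reduceByKey(lambda a, b: a + b)  # the "reduce" phase
          .collect())
```

In classic MapReduce, each of those three chained calls would be a separately configured mapper or reducer class, typically in Java; compressing the job into a few composable lines is what Collins means by APIs that are "a lot nicer for writing a data pipeline."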
The One Platform Initiative is focused on getting Spark ready to work with Hadoop in four key areas: security, scale, management, and streaming. The Internet of Things will generate streams of data. Spark already has a data streaming capability. Now it needs that capability to be integrated with Hadoop operations. In addition, Spark comes with an existing library of machine learning algorithms, a match for its ability to absorb and use data streamed off the Internet of Things.
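The core idea behind Spark's streaming capability is to discretize a live feed into small batches and run ordinary batch logic on each one. The plain-Python sketch below illustrates that micro-batch model with hypothetical sensor readings standing in for IoT data; a real job would use Spark Streaming's own context and stream operations rather than these illustrative helpers.

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop a (potentially unbounded) event stream into fixed-size
    micro-batches, the way Spark Streaming discretizes a live feed
    into a sequence of small batch jobs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical IoT sensor readings standing in for a live stream.
events = ["temp", "humidity", "temp", "temp", "humidity", "pressure", "temp"]

running_totals = Counter()  # state carried forward across batches
for batch in micro_batches(events, batch_size=3):
    # Each micro-batch is processed with ordinary batch logic,
    # and its results are folded into the running state.
    running_totals.update(batch)
```

The appeal of this design is that the same batch-oriented code (and, by extension, Spark's machine learning library) can be reused on streaming data; the One Platform work is about making that mode of operation a first-class citizen inside Hadoop.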
At the same time, Spark needs to be integrated with other components of the Apache software library, including open source search Solr; open source Pig, a language for building Hadoop applications; HBase, a Java-based NoSQL system that runs on top of HDFS; and Hive, a data warehouse that works with Hadoop.
Collins noted that regulated industries need stronger security -- access control, encryption for data both in motion and at rest, and auditability -- in a Spark-based Hadoop before they can use it. One goal of the One Platform Initiative will be to make it fit for those industries.
Cloudera plans to incorporate an updated version of Spark as the core engine of its open source version of Hadoop, Cloudera CDH. That is expected to happen sometime in 2016 as it moves CDH to version 6.0, Collins said. It's currently on release 5.4.5.