Hortonworks announced Thursday that Apache Spark, a technology quickly gaining interest for in-memory-accelerated machine learning and other analyses on high-scale data, has been certified to run on Apache YARN, the resource-management layer introduced last year with Apache Hadoop 2.0.
With this milestone, Spark is ready to run as a technology preview on the Hortonworks Data Platform (HDP), which is Hortonworks' Hadoop software distribution. A production-certified release is expected by this fall.
This is not the first appearance of Spark on Hadoop. In February, Cloudera introduced support for Spark, using its commercial Cloudera Manager software to deploy, manage, and monitor it. MapR introduced its own Spark deployment in April. Hortonworks stressed that its approach is 100% open source, using YARN (Yet Another Resource Negotiator) to manage and monitor Spark components and workloads alongside other systems and analyses running on Hadoop.
"Spark is now natively integrated into Hadoop, so its resources -- CPU, memory, and so on -- can be managed along with the other workloads running on a Hadoop cluster," explained Shaun Connolly, Hortonworks' VP corporate strategy, in an interview with InformationWeek. "That's important to get right because Spark is memory- and CPU-intensive, and you don't want to have to have siloed clusters dedicated to running those workloads."
The whole point of Hadoop 2.0 and YARN is to be able to run multiple workloads -- including Accumulo, Hive, MapReduce, Pig, Storm, Solr, and now, Spark -- against the same data sets, Connolly added.
Asked for comment on Hortonworks' announcement, Cloudera sent InformationWeek the following statement:
"Cloudera developers were the key drivers on YARN support for Spark, leveraging our expertise in YARN as well our developer group on Spark. Cloudera Manager is not orthogonal to YARN support and in fact, Cloudera Manager supports Spark on YARN. Additionally, almost all our customer deployments of Spark today are on top of the YARN framework and we have many customers who are running Spark through us."
Concurrent with Hortonworks' announcement, Spark developer and support provider Databricks announced that Hortonworks is an inaugural member of its Certified Spark Distribution program.
"We're committed to ensuring all Spark users have a terrific experience -- and we're thrilled that Hortonworks shares this vision," said Databricks business development executive Arsalan Tavakoli-Shiraji in a statement. "With the designation of Apache Spark as YARN Ready, enterprises can rest assured that Spark can run simultaneously and effectively with other mission-critical applications."
Customers are now free to download and install the HDP 2.1 Tech Preview Component of Apache Spark on the current HDP 2.0 distribution. Hortonworks expects the HDP 2.1 release, which will include Spark, to be certified for production use "within a handful of months," said Connolly. Hortonworks will support Spark along with the other software included in the distribution.