Apache Spark momentum continued this week as IBM's SystemML machine learning engine for Spark won acceptance as a project by the open source Apache Incubator. Separately, Databricks, the company whose founders were the original architects of Spark at the University of California, Berkeley's AMPLab, started showing off a preview of Spark 1.6 in the company's software-as-a-service implementation of the platform.
Spark 1.6 is due for general release in mid-December.
The announcements from these two influential companies in the Spark community build on already fast-growing support for the big data platform. According to a Spark user survey and information about Spark momentum that Databricks released last month, Spark is the most active open source project in all of big data, with over 600 contributors in the last 12 months, up from 314 in the previous 12 to 24 months. What's driving the widespread support for this year-old platform?
Rob Thomas, IBM Analytics vice president of product, compared Spark's expected utility and ultimate impact to that of the Linux operating system.
"The fact that Spark had a single programming model and the ability to analyze all types of data from all sources of data positioned it to have the impact in the industry that something like Linux did at the turn of the century," Thomas told InformationWeek in an interview. "Linux is an operating system for systems and computers. Spark will be the operating system around analytics and how data will be accessed." That's how big and important IBM thinks Spark will become.
That's also why IBM this week was celebrating the acceptance of its SystemML into the Apache Incubator.
IBM first announced plans to donate SystemML to the Spark ecosystem in June as part of a big commitment the company made to the open source project. IBM developed SystemML to complement Spark's MLlib, a set of libraries or algorithms in Spark that can be used to analyze a set of data. But Thomas has said that Spark's machine learning is the weakest part of the platform. IBM built and then donated SystemML to improve that component of Spark.
"Our contribution of SystemML was about making Spark better," he said. "It made sense for us to go down that route. We've been a huge supporter of Spark to date."
IBM last month added Spark to its Bluemix cloud platform.
Separately this week, Databricks began previewing the newest version of Spark, 1.6.0, which customers can try now on the company's SaaS Spark platform. The general release of Spark 1.6.0 is expected in mid-December.
The new version improves two things that are hugely important to Spark users -- performance and optimization -- Databricks cofounder Patrick Wendell told InformationWeek in an interview.
"Performance is a major theme of this release," Wendell said. Previous versions offered users a series of recommendations and settings to use to get the best performance out of memory. Such fine tuning for high performance will no longer be necessary.
"For most users, if they upgrade to this release they will experience a huge performance increase," Wendell said. "Users don't need to tune this at all."
The new version also includes a new API called the Dataset API. This is an extension of Spark's DataFrame API that enables developers to write programs much more concisely with far less code, Wendell said.
Another big new feature is optimized state storage in Spark Streaming, Wendell said. That can translate to a "10x performance gain for many workloads," Wendell wrote in a blog post about the new release this week.
Databricks has further details of the new release in the blog post and has scheduled a December 1 webinar to go over the improvements and changes.