Apache Spark is the hottest open-source project going in big data analytics. But will this very young, and some say still green, technology flower into a broadly used platform?
Databricks, the commercial company developing and promoting Spark, is not only counting on the success of the open source software; it's also rabidly promoting a commercial, cloud-based service, Databricks Cloud, that's based on the same technology. The question is whether Databricks' commercial ambitions might corrode goodwill and support for Apache Spark before it can really get off the ground.
Spark has so much promise it's hard to see anything getting in its way. Developed at U.C. Berkeley's AMPLab in 2009, open-sourced in 2010, and contributed to Apache in 2013, Spark is best known for in-memory machine learning through its MLlib component, but it also supports graph, SQL, and streaming analysis with GraphX, Spark SQL, and Spark Streaming, respectively. In the works is SparkR for statistical analysis using the popular R language.
In its role as Spark developer and promoter, Databricks appears to be doing everything it can to help Spark integrate with leading data sources, including relational databases, NoSQL databases, and Hadoop components. It's also bringing tools and libraries to big-data developers and data scientists so they can use Spark with favorite languages, including Java, Python, R, or Scala.
Working with commercial and open source vendors, Databricks has certified 11 distributions of Spark software and 35 applications, such as BI and analytics tools, to work with the platform (more than doubling the number of partnerships from 2013 to 2014). In short, Databricks is aiming to support all major styles of big data analysis working with all the leading tools, platforms, and languages.
It's no wonder "everybody is talking about Spark, and [VC Firm] Andreessen Horowitz has invested into every company that lists Spark somewhere on its website," as one big data insider told me late last year.
So everything is looking rosy for Spark. At this week's Spark Summit East in New York, Databricks was able to report that more than 500 companies are now using the technology. Some, like Alibaba Group and Tencent in China, are reportedly using Spark on a massive scale. Tencent is running an 8,000-node Spark cluster while Alibaba is analyzing as much as 1 petabyte of data per week on the platform, according to Databricks.
This brings us to Databricks and its commercial promotion of Databricks Cloud. This is a still-in-preview hosted service based on Spark, but it's aimed at a broader market. Where Spark is a platform for developers and data scientists working in heterogeneous, on-premises environments, Databricks Cloud is presented as a quick-to-deploy, easy-to-use option that will render unnecessary what Databricks described as "hard-to-deploy" and "slow-to-pay-off" on-premises systems like Hadoop.
In his Wednesday keynote at Spark Summit, Databricks CEO Ion Stoica ran through the laundry list of drawbacks and delays encountered by those building out big data infrastructure and trying to get to valuable insights on-premises. He contrasted this with Databricks Cloud, which goes beyond Spark to include data-analyst- and business-analyst-friendly tools that create a much more broadly usable data-analysis environment.
Databricks Cloud Notebooks and Dashboards, for example, make it easy to create, save, share, and collaborate around analyses and reports, and a new Jobs feature, introduced this week, turns Notebooks into repeatable analyses, or data pipelines, that can be scheduled, resource-managed, tracked, and reused.
Between avoiding all the messiness of running various distributed platforms and disparate tools and all the advantages of providing a "unified platform with one API for batch, streaming, and interactive queries," Stoica said Databricks Cloud "obviates the need to use multiple systems and engines."
Needless to say, Databricks' "You don't need anything but Databricks Cloud" message doesn't strengthen the alliances and partnerships it's also trying to foster around Spark. Hadoop vendors, in particular, are already threatened by Spark because its success would diminish their analytical role and prospects for components such as Cloudera Impala.
Databricks is clearly stepping on toes, and I've been contacted by multiple big data vendors in recent months volunteering negative perspectives on Spark performance and market readiness. I've been offered interviews with luminaries ready to tout rival open-source projects said to outperform Spark on streaming performance, for example.
The point here is certainly not that Databricks should rein in its ambition or avoid stepping on toes. If Spark is better, more practical, more versatile, and more valuable technology, it will win over the user community. My concern is that Databricks' rabid promotion of Databricks Cloud and all the leaps in usability it has packed into that offering may stunt the growth of Spark.
In Databricks' defense, it tells me that its commercial ambitions are limited to the cloud, which will naturally lead to a wide open space for various distributions of Spark software and Spark-compatible tools to run on-premises. But if it's really interested in doing everything it can to ensure Spark's success, why not share those nice Notebook, Dashboard, and Jobs features of Databricks Cloud with the rest of the community?
"Those features are tailored and tuned by Databricks in the cloud, and the cluster-configuration and custom-management that we do is not something that we open source," Databricks' Ali Ghodsi, head of engineering, told InformationWeek in a phone interview Wednesday. "That's all highly tailored for the Databricks Cloud environment because we want to make sure that people get the best possible experience they can get in the cloud."
We've all seen this before. The community edition has A, B, and C, but the commercial offering also has X, Y, and Z. It just seems like early days for Spark to be putting such an emphasis and an overtly commercial push behind a young beta offering that won't even become generally available until later this year. Let Spark truly shine and catch on before you rush to cash in.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...