Spark Promoter Databricks Should Let Software Shine - InformationWeek

Commentary
3/19/2015
01:05 PM
Doug Henschen

Can Databricks effectively shepherd the most promising open source project for big data analytics while also promoting its commercial cloud service?

Apache Spark is the hottest open-source project going in big data analytics. But will this very young, and some say still green, technology flower into a broadly used platform?

Databricks, the commercial company developing and promoting Spark, is not only counting on the success of the open source software, it's also rabidly promoting a commercial, cloud-based service, Databricks Cloud, that's based on the same technology. The question is whether Databricks' commercial ambitions might corrode good will and support for Apache Spark before it can really get off the ground.

Spark has so much promise it's hard to see anything getting in its way. Developed at U.C. Berkeley's AMPLab in 2009, open sourced in 2010, and donated to the Apache Software Foundation in 2013, Spark is best known for in-memory machine learning through its MLlib component, but it also supports graph, SQL, and streaming analysis with GraphX, Spark SQL, and Spark Streaming, respectively. In the works is SparkR for statistical analysis using the popular R language.

In its role as Spark developer and promoter, Databricks appears to be doing everything it can to help Spark integrate with leading data sources, including relational databases, NoSQL databases, and Hadoop components. It's also bringing tools and libraries to big-data developers and data scientists so they can use Spark with favorite languages, including Java, Python, R, or Scala.

Working with commercial and open source vendors, Databricks has certified 11 distributions of Spark software and 35 applications, such as BI and analytics tools, to work with the platform (more than doubling the number of partnerships from 2013 to 2014). In short, Databricks is aiming to support all major styles of big data analysis working with all the leading tools, platforms, and languages.

It's no wonder "everybody is talking about Spark, and [VC Firm] Andreessen Horowitz has invested into every company that lists Spark somewhere on its website," as one big data insider told me late last year.

Databricks' stewardship of Spark has led to certifications and partnerships with a host of big data vendors.

So everything is looking rosy for Spark. At this week's Spark Summit East in New York, Databricks was able to report that more than 500 companies are now using the technology. Some, like Alibaba Group and Tencent in China, are reportedly using Spark on a massive scale. Tencent is running an 8,000-node Spark cluster while Alibaba is analyzing as much as 1 petabyte of data per week on the platform, according to Databricks.

This brings us to Databricks and its commercial promotion of Databricks Cloud. This is a still-in-preview hosted service based on Spark, but it's aimed at a broader market. Where Spark is a platform for developers and data scientists working in heterogeneous, on-premises environments, Databricks Cloud is presented as a quick-to-deploy, easy-to-use option that will render what were described by Databricks as "hard-to-deploy" and "slow-to-pay-off" on-premises systems like Hadoop unnecessary.

In his Wednesday keynote at Spark Summit, Databricks CEO Ion Stoica ran through the laundry list of drawbacks and delays encountered by those building out big data infrastructure and trying to get to valuable insights on-premises. He contrasted this with Databricks Cloud, which goes beyond Spark to include data-analyst- and business-analyst-friendly tools that create a much more broadly usable data-analysis environment.

Databricks Cloud Notebooks and Dashboards, for example, make it easy to create, save, share, and collaborate around analyses and reports, and a new Jobs feature, introduced this week, turns Notebooks into repeatable analyses, or data pipelines, that can be scheduled, resource-managed, tracked, and reused.

Between avoiding all the messiness of running various distributed platforms and disparate tools and all the advantages of providing a "unified platform with one API for batch, streaming, and interactive queries," Stoica said Databricks Cloud "obviates the need to use multiple systems and engines."

Databricks CEO Ion Stoica focused almost entirely on promoting Databricks Cloud, the vendor's commercial offering, during this week's Spark Summit East event in New York.

Needless to say, Databricks' "You don't need anything but Databricks Cloud" message doesn't strengthen the alliances and partnerships it's also trying to foster around Spark. Hadoop vendors, in particular, are already threatened by Spark because its success would diminish their analytical role and prospects for components such as Cloudera Impala.

Databricks is clearly stepping on toes, and I've been contacted by multiple big data vendors in recent months volunteering negative perspectives on Spark performance and market readiness. I've been offered interviews with luminaries ready to tout rival open-source projects said to outperform Spark on streaming performance, for example.

The point here is certainly not that Databricks should rein in its ambition or avoid stepping on toes. If Spark is better, more practical, more versatile, and more valuable technology, it will win over the user community. My concern is that Databricks' rabid promotion of Databricks Cloud and all the leaps in usability it has packed into that offering may stunt the growth of Spark.

In Databricks' defense, it tells me that its commercial ambitions are limited to the cloud, which will naturally lead to a wide open space for various distributions of Spark software and Spark-compatible tools to run on-premises. But if it's really interested in doing everything it can to ensure Spark's success, why not share those nice Notebook, Dashboard, and Jobs features of Databricks Cloud with the rest of the community?

"Those features are tailored and tuned by Databricks in the cloud, and the cluster-configuration and custom-management that we do is not something that we open source," Databricks' Ali Ghodsi, head of engineering, told InformationWeek in a phone interview Wednesday. "That's all highly tailored for the Databricks Cloud environment because we want to make sure that people get the best possible experience they can get in the cloud."

We've all seen this before. The community edition has A, B, and C, but the commercial offering also has X, Y, and Z. It just seems like early days for Spark to be putting such an emphasis and an overtly commercial push behind a young beta offering that won't even become generally available until later this year. Let Spark truly shine and catch on before you rush to cash in.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...
Comments
Charlie Babcock,
User Rank: Author
3/19/2015 | 10:08:44 PM
Good luck on that, Databricks
That's pretty good advice to Databricks, Doug. More power to them if they can promote an open source based product into profitability at an early stage. But most successful open source companies -- Red Hat, Cloudera, Docker -- let open source code downloads and word of mouth spread the word that will help get their product established. Once they're established as the technical experts in the field, they can charge for training, technical support, testing and certification of compatibility. It's a slow way to get rich, but Red Hat now has a very broad customer base regularly renewing subscriptions without being dunned. Some expenses, like marketing and sales, go away when the model works.
D. Henschen,
User Rank: Author
3/19/2015 | 1:52:54 PM
More examples of Databricks Cloud Centrism
As noted, Databricks has cultivated many partnerships and has many certified distributions and apps, but the ones favored with front-and-center attention during the keynotes were Databricks Cloud Partners. ZoomData, for example, demonstrated a very cool BI suite that runs on top of Databricks Cloud. Partner PanTera showed off an equally cool mapping and data-visualization app that runs on Databricks Cloud. Spark partners are all well and good, but here's another example of Databricks putting all the spotlights on Databricks Cloud.