Databricks Spark Plans: Big Data Q&A - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Data Management // Big Data Analytics
01:36 PM
Connect Directly

Databricks Spark Plans: Big Data Q&A

Databricks has a bold vision -- based on Apache Spark -- to become big data's epicenter of analysis. Executives Ion Stoica and Arsalan Tavakoli discuss the details and how Google Cloud Dataflow compares.

If Databricks has its way, Apache Spark will become a pervasive choice for big data analysis, whether the work is done on the premises or in the cloud.

Spark was developed in UC Berkeley's AMPLab in 2009, and the project was committed to Apache open source in 2010. Databricks was set up as a commercial company with a mission to promote Spark and ensure a great user experience and support.

Ion Stoica, a UC Berkeley professor since 2000, took leave to serve as CEO of Databricks. In this interview with InformationWeek he's joined by Arsalan Tavakoli, a McKinsey & Co. and UC Berkeley PhD program veteran, who leads Databricks' business development.

[Want more on Spark's potential impact? Read Will Spark, Google Dataflow Steal Hadoop's Thunder?]

What are Databricks' ambitions, and how does it compare to Google Cloud Dataflow? Read on for a look inside one of the hottest open source projects in the world of big data.

InformationWeek: So tell us about Databricks the company. What's the mission?

Ion Stoica: When we founded Databricks, the key goal was to drive the adoption of the Apache Spark ecosystem. We've done better than expected. Today Spark is part of every major Hadoop distribution: Cloudera, Hortonworks, IBM, MapR, and Pivotal. We're even happier about the fact that Spark is now packaged in non-Hadoop distributions, such as DataStax's distribution of Apache Cassandra. It's also available on Amazon Web Services (AWS), so you can use Spark to process data in S3 (AWS Simple Storage Service).

So we've been successful in driving adoption of Spark, but more importantly, we want every user to have a great experience. To achieve that, we've partnered with several companies that are shipping Spark, and we're working closely with them to make sure that Spark customers are satisfied. Those partners are Cloudera, DataStax, and MapR, and we'll add new partners soon.

To ensure a strong application ecosystem, in February we announced an application certification program for apps that run on Apache Spark. The response has been great, and we have already certified a dozen applications. In addition, we've introduced a certification for Spark distributions, and we have five companies in that program: DataStax, Hortonworks, IBM, MapR, and Pivotal.

IW: What's the difference between partners and certified distributors?

Arsalan Tavakoli: Cloudera, MapR, and DataStax all provide Level 1 enterprise support to customers around Spark. We provide Level 2 and Level 3 support, and we train their pre-sales and post-sales support staff. In the case of distributors, like Hortonworks and Pivotal, they package certified Spark software, but they don't provide an enterprise support option as yet. Hortonworks has certified Spark on YARN, and they've introduced a technical preview before going [general availability].

[Author's Note: On July 1, SAP announced a certification of Spark on SAP Hana with Databricks through which it will distribute Spark software, but it's not yet providing Spark support.]

IW: We hear Databricks doesn't want to be a first-level support provider for software. Is that correct?

Stoica: That's correct. We're not the first line of support. That's why we partner with other companies who can distribute Spark and provide the first line of support. In fact, we do not have our own software distribution. Our strategy is quite simple: We want to build value around Apache Spark. That makes sense because, the more Spark users there are, the more customers we're going to have for our products and services.

IW: So what's the problem that Spark solves?

Stoica: If you look around today, [you'll see] companies like Google, Facebook, and Amazon that have built huge businesses by mining their data. Almost every other company is collecting data with the goal of using it to improve revenue, reduce their cost, or optimize their business in some way. However, doing that is hard. Look no farther than the fact that companies like Google, Facebook, Microsoft, and so on spend billions of dollars every year to develop data-analysis tools, systems, and big-data products.

Depending on who you are in your organization, you're tasked with one of the following: build a Hadoop cluster and manage it if you are part of the IT organization; build a data pipeline on top of Hadoop if you are a software engineer or data scientist; or use the data pipeline to turn information into value by building big-data products if you are a software engineer, data scientist, or data analyst.

Every one of these tasks is hard. Clusters are hard to set up and maintain, and it might take six to nine months. To build a data pipeline you need to stitch together a hodgepodge of tools -- MapReduce, Hive, Impala, Drill, Mahout, Giraph... When you look at this entire data pipeline, it's very complex. It requires you to integrate disparate sets of clunky tools, and even after you do that, you still have to navigate data and, even harder, develop and maintain applications. So extracting value from the data remains a struggle.

IW: So that's Apache Spark, but now you're introducing Databricks Cloud. How do the two relate to each other?

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ... View Full Bio

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
1 of 2
Comment  | 
Print  | 
More Insights
Threaded  |  Newest First  |  Oldest First
D. Henschen
D. Henschen,
User Rank: Author
7/1/2014 | 5:04:02 PM
Big interest, big money at stake
The Spark vision is certainly captivating and getting lots of attention. I'm anxious to talk to users about ease of use, consistency across the framework and library-by-library capabilities. Can it be best of breed in each category of analysis? I have my doubts. Expect commercial vendors, in particular, to start poking holes and leveling criticisms. How broad, standards-based and up to data is the SQL? How many algos? Is R highly scalable? I have no idea what the answers are, but a lot of money is at stake, so expect to hear lots of questions and assertions design to sow seeds of fear, uncertainty and doubt.
InformationWeek Is Getting an Upgrade!

Find out more about our plans to improve the look, functionality, and performance of the InformationWeek site in the coming months.

Becoming a Self-Taught Cybersecurity Pro
Jessica Davis, Senior Editor, Enterprise Apps,  6/9/2021
Ancestry's DevOps Strategy to Control Its CI/CD Pipeline
Joao-Pierre S. Ruth, Senior Writer,  6/4/2021
IT Leadership: 10 Ways to Unleash Enterprise Innovation
Lisa Morgan, Freelance Writer,  6/8/2021
White Papers
Register for InformationWeek Newsletters
Current Issue
Planning Your Digital Transformation Roadmap
Download this report to learn about the latest technologies and best practices or ensuring a successful transition from outdated business transformation tactics.
Flash Poll