Databricks has a bold vision -- based on Apache Spark -- to become big data's epicenter of analysis. Executives Ion Stoica and Arsalan Tavakoli discuss the details and how Google Cloud Dataflow compares.
If Databricks has its way, Apache Spark will become a pervasive choice for big data analysis, whether the work is done on the premises or in the cloud.
Spark was developed in UC Berkeley's AMPLab in 2009 and open sourced in 2010; the project later moved to the Apache Software Foundation. Databricks was set up as a commercial company with a mission to promote Spark and ensure a great user experience and support.
Ion Stoica, a UC Berkeley professor since 2000, took leave to serve as CEO of Databricks. In this interview with InformationWeek, he's joined by Arsalan Tavakoli, a veteran of McKinsey & Co. and the UC Berkeley PhD program, who leads Databricks' business development.
What are Databricks' ambitions, and how does it compare to Google Cloud Dataflow? Read on for a look inside one of the hottest open source projects in the world of big data.
InformationWeek: So tell us about Databricks the company. What's the mission?
Ion Stoica: When we founded Databricks, the key goal was to drive the adoption of the Apache Spark ecosystem. We've done better than expected. Today Spark is part of every major Hadoop distribution: Cloudera, Hortonworks, IBM, MapR, and Pivotal. We're even happier about the fact that Spark is now packaged in non-Hadoop distributions, such as DataStax's distribution of Apache Cassandra. It's also available on Amazon Web Services (AWS), so you can use Spark to process data in S3 (AWS Simple Storage Service).
So we've been successful in driving adoption of Spark, but more importantly, we want every user to have a great experience. To achieve that, we've partnered with several companies that are shipping Spark, and we're working closely with them to make sure that Spark customers are satisfied. Those partners are Cloudera, DataStax, and MapR, and we'll add new partners soon.
To ensure a strong application ecosystem, in February we announced an application certification program for apps that run on Apache Spark. The response has been great, and we have already certified a dozen applications. In addition, we've introduced a certification for Spark distributions, and we have five companies in that program: DataStax, Hortonworks, IBM, MapR, and Pivotal.
IW: What's the difference between partners and certified distributors?
Arsalan Tavakoli: Cloudera, MapR, and DataStax all provide Level 1 enterprise support to customers around Spark. We provide Level 2 and Level 3 support, and we train their pre-sales and post-sales support staff. In the case of distributors, like Hortonworks and Pivotal, they package certified Spark software, but they don't provide an enterprise support option as yet. Hortonworks has certified Spark on YARN, and they've introduced a technical preview before going [general availability].
[Author's Note: On July 1, SAP announced a certification of Spark on SAP Hana with Databricks, through which it will distribute Spark software, but it's not yet providing Spark support.]
IW: We hear Databricks doesn't want to be a first-level support provider for software. Is that correct?
Stoica: That's correct. We're not the first line of support. That's why we partner with other companies who can distribute Spark and provide the first line of support. In fact, we do not have our own software distribution. Our strategy is quite simple: We want to build value around Apache Spark. That makes sense because, the more Spark users there are, the more customers we're going to have for our products and services.
IW: So what's the problem that Spark solves?
Stoica: If you look around today, [you'll see] companies like Google, Facebook, and Amazon that have built huge businesses by mining their data. Almost every other company is collecting data with the goal of using it to improve revenue, reduce costs, or optimize their business in some way. However, doing that is hard. Look no further than the fact that companies like Google, Facebook, Microsoft, and so on spend billions of dollars every year to develop data-analysis tools, systems, and big-data products.
Depending on who you are in your organization, you're tasked with one of the following: build a Hadoop cluster and manage it if you are part of the IT organization; build a data pipeline on top of Hadoop if you are a software engineer or data scientist; or use the data pipeline to turn information into value by building big-data products if you are a software engineer, data scientist, or data analyst.
Every one of these tasks is hard. Clusters are hard to set up and maintain, and that might take six to nine months. To build a data pipeline you need to stitch together a hodgepodge of tools -- MapReduce, Hive, Impala, Drill, Mahout, Giraph... When you look at this entire data pipeline, it's very complex. It requires you to integrate disparate sets of clunky tools, and even after you do that, you still have to navigate data and, even harder, develop and maintain applications. So extracting value from the data remains a struggle.
IW: So that's Apache Spark, but now you're introducing Databricks Cloud. How do the two relate to each other?
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...