
Databricks Spark Plans: Big Data Q&A

Databricks has a bold vision -- based on Apache Spark -- to become big data's epicenter of analysis. Executives Ion Stoica and Arsalan Tavakoli discuss the details and how Google Cloud Dataflow compares.

Stoica: Our goal with Databricks Cloud is to dramatically simplify data analysis and processing. In particular, we alleviate the need to set up and manage a cluster. For that, we provide a hosted solution, which makes it very easy to instantiate and manage clusters. To obviate the need to deal with a zoo of tools, we are leveraging Apache Spark, which integrates the functionality of many of the leading big data tools and systems.

To simplify data analysis, we are introducing Databricks Workspace, which includes three components: Notebooks, Dashboards, and Job Launcher. Notebooks support interactive query processing and visualization, as well as collaboration, so multiple users can explore data together. Once you create one or more Notebooks, you can take the most interesting results and publish them as Dashboards. Finally, the Job Launcher lets you run arbitrary Spark jobs, either periodically, based on triggers, or in production on a regular schedule. Databricks Cloud can also take inputs from other storage systems, and you can use your favorite BI tools through an ODBC connector.


IW: What's the timeline for general availability, and can you say anything about pricing?

Tavakoli: At this point, it's limited availability, but we'll be ramping up capacity, and we expect GA to be in the fall. As for pricing, we aren't really talking about it yet, but it will be in a tiered model to give people predictability based on usage capacity. It will start at a couple of hundred dollars per user, per month.


Databricks' depiction of Databricks Cloud replacing the many components of Hadoop used in today's big-data analyses.

IW: There are many data-processing engines and frameworks out there, but it sounds like you've tried to cover the key bases with Spark.

Stoica: We do believe that the vast majority of data-analysis jobs can be done on top of Spark. We have Spark Streaming for streaming analysis. Spark SQL is a new component for SQL, where before we had Shark. We have powerful libraries for machine learning with MLlib and for graph processing with GraphX. We are also announcing SparkR to bind to the R language. This is the strength of our platform.

IW: Have you considered running on a standalone, generic clustered server platform?

Tavakoli: Spark acts as a processing and computation engine, but we have deferred the data management. We can run on multiple storage systems, including HDFS, Cassandra with support from DataStax, or Amazon S3. With Databricks Cloud, we use S3 because that's more prevalent in the cloud, but we're just as happy to work with a customer who has their data in HDFS. Hadoop handles the data management, but Spark provides the processing engine. In the on-premises world, most organizations are doing [big] data management in Hadoop clusters. We will continue to invest to make sure that Spark plays well with YARN and Hadoop clusters.

IW: How usable are Spark and Databricks Cloud to someone with SQL experience?

Tavakoli: We support SQL, Python, Java, Scala, and, very soon, R as input languages, and from there it's a single system. So think of Spark capabilities not as separate tools, but as libraries that can be called from within one system.

IW: But you still need MapReduce expertise, machine-learning expertise, SQL expertise, and so on, right?

Stoica: We think it's a better and more consistent experience. Instead of having separate schedulers, task management, and so on for each engine, in Spark the level of abstraction is much higher. It's a single execution engine, and all the libraries can share the same data. There's no data transformation required, and furthermore, the data is all in memory. The analogy might be the Microsoft tools in Office. You cut and paste from Excel, and you can bring it into PowerPoint. That's the level of integration within Spark.

IW: How would you compare Databricks Cloud and Google Cloud Dataflow? Are you on parallel tracks?

Stoica: Absolutely, but there are a few differences. One difference is that our API is open, because it's built on Apache Spark. You can build your application on top of Databricks Cloud, and then you can take that application and run it on premises on Cloudera's Hadoop distribution.

Second, Google Cloud Dataflow is more targeted toward developers. Through our Workspace and applications like Notebooks and Dashboards, we're trying to provide a much higher level of abstraction and make it easier to use for data scientists and data analysts. A last point is that I think we provide a more complete, end-to-end data pipeline than what Google Cloud Dataflow provides because of all the analysis components we offer.
