Databricks Spark Plans: Big Data Q&A - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

IoT
IoT
Data Management // Big Data Analytics
News
7/1/2014
01:36 PM
Connect Directly
LinkedIn
Twitter
RSS
E-Mail
50%
50%

Databricks Spark Plans: Big Data Q&A

Databricks has a bold vision -- based on Apache Spark -- to become big data's epicenter of analysis. Executives Ion Stoica and Arsalan Tavakoli discuss the details and how Google Cloud Dataflow compares.

Stoica: Our goal with Databricks Cloud is to dramatically simplify data analysis and processing. In particular, we alleviate the need to set up and manage a cluster. For that, we provide a hosted solution, which makes it very easy to instantiate and manage clusters. To obviate the need to deal with a zoo of tools, we are leveraging Apache Spark, which integrates the functionality of many of the leading big data tools and systems.

To simplify data analysis, we are introducing Databricks Workplace, which includes three components: Notebooks, Dashboards, and Job Launcher. Notebooks support interactive query processing and visualization, as well as collaboration, so multiple users can do joint data exploration. Once you create one or more Notebooks, you can take the most interesting results and create and publish dashboards. Finally, the Job Launcher lets you run arbitrary Spark jobs, either periodically, based on triggers, or in production on a regular schedule. Databricks Cloud can also take inputs from other storage systems, and you can use your favorite BI tools through an ODBC connector.

[Want more on Databricks' product? Read Databricks Cloud: Next Step For Spark.]

IW: What's the timeline for general availability, and can you say anything about pricing?

Tavakoli: At this point, it's limited availability, but we'll be ramping up capacity, and we expect GA to be in the fall. As for pricing, we aren't really talking about it yet, but it will be in a tiered model to give people predictability based on usage capacity. It will start at a couple of hundred dollars per user, per month.

Databricks' depiction of Databricks Cloud replacing the many components of Hadoop used in today's big-data analyses.
Databricks' depiction of Databricks Cloud replacing the many components of Hadoop used in today's big-data analyses.

IW: There are many data-processing engines and frameworks out there, but it sounds like you've tried to cover the key bases with Spark.

Stoica: We do believe that the vast majority of data-analysis jobs can be done on top of Spark. We have Spark Streaming for streaming analysis. Spark SQL is a new component for SQL where before we had Shark. We have powerful libraries for machine learning with MLLib and for Graph processing GraphX. We are also announcing Spark R to bind to the R language. This is the strength of our platform.

IW: Have you considered running on a standalone, generic clustered server platform?

Tavakoli: Spark acts as a processing and computation engine, but we have deferred the data management. We can run on multiple storage systems, including HDFS, Cassandra with support from DataStax, or Amazon S3. With Databricks Cloud, we use S3 because that's more prevalent in the cloud, but we're just as happy to work with a customer who has their data in HDFS. Hadoop handles the data management, but Spark provides the processing engine. In the on-premises world, most organizations are doing [big] data management in Hadoop clusters. We will continue to invest to make sure that Spark plays well with YARN and Hadoop clusters.

IW: How usable are Spark and Databricks Cloud to someone with SQL experience?

Tavakoli: We support SQL, Python, Java, Scala, and, very soon, R as input languages, and from there it's a single system. So think of Spark capabilities not as separate tools, but as libraries that can be called from within one system.

IW: But you still need MapReduce expertise, machine-learning expertise, SQL expertise, and so on, right?

Stoica: We think it's a better and more consistent experience. Instead of having separate schedulers, task management, and so on for each engine, in Spark the level of abstraction is much higher. It's a single execution engine, and all the libraries can share the same data. There's no data transformation required, and furthermore, the data is all in memory. The analogy might be the Microsoft tools in Office. You cut and paste from Excel, and you can bring it into PowerPoint. That's the level of integration within Spark.

IW: How would you compare Databricks Cloud and Google Cloud Dataflow? Are you on parallel tracks?

Stoica: Absolutely, but there are a few differences. One difference is that our API is open, because it's built on Apache Spark. You can build your application on top of Databricks Cloud, and then you can take your application and you can run on premises on Cloudera's Hadoop distribution.

Second, Google Cloud Dataflow is more targeted toward developers. Through our Workspace and applications like Notebooks and Dashboards, we're trying to provide a much higher level of abstraction and make it easier to use for data scientists and data analysts. A last point is that I think we provide a more complete, end-to-end data pipeline than what Google Cloud Dataflow provides because of all the analysis components we offer.

InformationWeek's June Must Reads is a compendium of our best recent coverage of big data. Find out one CIO's take on what's driving big data, key points on platform considerations, why a recent White House report on the topic has earned praise and skepticism, and much more (registration required).

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ... View Full Bio

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Previous
2 of 2
Next
Comment  | 
Print  | 
More Insights
Comments
Newest First  |  Oldest First  |  Threaded View
D. Henschen
0%
100%
D. Henschen,
User Rank: Author
7/1/2014 | 5:04:02 PM
Big interest, big money at stake
The Spark vision is certainly captivating and getting lots of attention. I'm anxious to talk to users about ease of use, consistency across the framework and library-by-library capabilities. Can it be best of breed in each category of analysis? I have my doubts. Expect commercial vendors, in particular, to start poking holes and leveling criticisms. How broad, standards-based and up to data is the SQL? How many algos? Is R highly scalable? I have no idea what the answers are, but a lot of money is at stake, so expect to hear lots of questions and assertions design to sow seeds of fear, uncertainty and doubt.
InformationWeek Is Getting an Upgrade!

Find out more about our plans to improve the look, functionality, and performance of the InformationWeek site in the coming months.

News
Becoming a Self-Taught Cybersecurity Pro
Jessica Davis, Senior Editor, Enterprise Apps,  6/9/2021
News
Ancestry's DevOps Strategy to Control Its CI/CD Pipeline
Joao-Pierre S. Ruth, Senior Writer,  6/4/2021
Slideshows
IT Leadership: 10 Ways to Unleash Enterprise Innovation
Lisa Morgan, Freelance Writer,  6/8/2021
White Papers
Register for InformationWeek Newsletters
Video
Current Issue
Planning Your Digital Transformation Roadmap
Download this report to learn about the latest technologies and best practices or ensuring a successful transition from outdated business transformation tactics.
Slideshows
Flash Poll