6/30/2014
01:27 PM

Databricks Cloud: Next Step For Spark

Databricks, the company behind the hot Apache Spark project, announced new tools, new partners, and new funding to power a cloud service on Amazon Web Services.

Apache Spark is already one of the most active open source projects in the big data world, but announcements made on Monday by Spark promoter and support firm Databricks could really heat things up.

The announcements, made at the sold-out Spark Summit 2014 in San Francisco, include the launch of the Databricks Cloud service on Amazon Web Services, set for general availability this fall, and the close of $33 million in new venture capital funding. This news comes on the heels of fresh partnership announcements with Hadoop distributor Hortonworks and Cassandra NoSQL database developer DataStax. And Databricks had already struck up partnerships with all other Hadoop distributors, including Cloudera, MapR, IBM, and Pivotal.

Spark is already deployable on AWS, but Databricks Cloud is a managed service based on Spark that will be supported directly by Databricks. Spark is best known for in-memory machine learning, but it also supports streaming analysis and SQL analysis, and work is underway on adding support for the popular R analytics library and graph analysis. All of these capabilities are exposed through a new Databricks Workspace component of Databricks Cloud, with Notebook, Dashboard, and Job-Launcher apps that the vendor said make it easy to build data-analysis pipelines, from storage and ETL to dashboarding and reporting, and on to advanced analytics and collaboration.

Databricks is not presenting Spark or Databricks Cloud as a replacement for Hadoop -- the platform needs to run on top of a data platform such as Hadoop, Cassandra, or S3. But it is saying that Spark can replace many of the familiar data-analysis components that run on top of Hadoop, including MapReduce, Pig, Hive, Impala, Drill, and more.

Databricks' depiction of Databricks Cloud replacing the many components of Hadoop used in today's big-data analyses. Apache Spark, the foundation of Databricks Cloud, runs on Hadoop, Cassandra, or Amazon Web Services S3.

"We believe that the vast majority of jobs that organizations do in existing data pipelines can be done on top of Spark," said Ion Stoica, Databricks CEO, in a phone interview with InformationWeek.

Databricks depicts a "Typical Data Pipeline" being replaced outright by Databricks Cloud, with ODBC connections making it possible to use conventional BI tools including MicroStrategy, QlikView, or Tableau Software (see images).

For now, Databricks Cloud runs exclusively on AWS S3, but Stoica said the company will also explore options to run on other clouds, including Google Compute Engine and Microsoft Azure. If customers want or need to deploy on premises, Hadoop and Cassandra are both options to provide the basics of a data platform, including storage, high availability, and redundancy. But with this week's announcements, Databricks is stepping up and saying that Spark can handle the vast majority of high-value analytical work, whether that's MapReduce, machine learning, graph processing, streaming analysis, R-based data mining, or the SQL role that Hadoop vendors have been squabbling about for the last year or more.

Can Spark really supplant such an array of data-analysis tools? That's the subject of our analysis, "Will Spark, Google Dataflow Steal Hadoop's Thunder?," which includes reactions from Hadoop vendors. Our first take is that Spark has a lot to prove in real-world production deployments before it can reshape big data analysis as we know it.

With Monday's announcement, Databricks Cloud enters limited beta release. Stoica said it will be generally available on Amazon by this fall, starting at "a couple of hundred dollars" per user, per month.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Comments
D. Henschen,
User Rank: Author
7/1/2014 | 11:58:43 AM
Re: Black Widow of big data technology?
Again, it's not a case of Spark vs. Hadoop. Spark runs on Hadoop, but it uses only its most basic storage functionality. My "bite their heads off" comment hints at the value in all the data-analysis stuff on top of Hadoop. Cloudera, for example, has Impala, a partially commercial offering, that Spark could replace. The data analysis done on top of Hadoop is the valuable high ground -- read, profit driver for a Hadoop distributor and support provider.
Li Tan,
User Rank: Ninja
7/1/2014 | 9:42:59 AM
Re: Black Widow of big data technology?
At the current stage I will keep my fingers crossed and see what happens next with Spark. From the information available so far, Spark is trying to improve on big data analytics by providing a neat and straightforward data pipeline. But how does it perform compared to Hadoop? Is there any live case available so far?
D. Henschen,
User Rank: Author
6/30/2014 | 3:27:16 PM
Black Widow of big data technology?
Databricks has partnered with every single Hadoop distributor, all of which have sung Spark's praises. But they also tend to describe Spark as something just for in-memory machine learning. I'm sure they're not thrilled to hear Databricks suggesting that Spark can replace the likes of Hive, Impala, and quite a few other analysis options that run on top of Hadoop. Is it a case of biting the head off your mate after consummating a partnership?