Databricks, the company behind the hot Apache Spark project, announced new tools, new partners, and new funding to power a cloud service on Amazon Web Services.
Apache Spark is already one of the most active open source projects in the big data world, but announcements made on Monday by Spark promoter and support firm Databricks could really heat things up.
The announcements, made at the sold-out Spark Summit 2014 in San Francisco, include the launch of the Databricks Cloud service on Amazon Web Services, set for general availability this fall, and the close of $33 million in new venture capital funding. This news comes on the heels of fresh partnership announcements with Hadoop distributor Hortonworks and Cassandra NoSQL database developer DataStax. And Databricks had already struck up partnerships with all other Hadoop distributors, including Cloudera, MapR, IBM, and Pivotal.
Spark is already deployable on AWS, but Databricks Cloud is a managed service based on Spark that will be supported directly by Databricks. Spark is best known for in-memory machine learning, but it also supports streaming analysis and SQL analysis, and work is underway on adding support for graph analysis and the popular R analytics language. All of these capabilities are exposed through a new Databricks Workspace component of Databricks Cloud, with Notebook, Dashboard, and Job-Launcher apps that the vendor said make it easy to build data-analysis pipelines from storage and ETL, to dashboarding and reporting, and on to advanced analytics and collaboration.
Databricks is not presenting Spark or Databricks Cloud as a replacement for Hadoop -- the platform needs to run on top of a data platform such as Hadoop, Cassandra, or S3. But it is saying that Spark can replace many of the familiar data-analysis components that run on top of Hadoop, including MapReduce, Pig, Hive, Impala, Drill, and more.
Databricks' depiction of Databricks Cloud replacing the many components of Hadoop used in today's big-data analyses. Apache Spark, the foundation of Databricks Cloud, runs on Hadoop, Cassandra, or Amazon Web Services S3.
"We believe that the vast majority of jobs that organizations do in existing data pipelines can be done on top of Spark," said Ion Stoica, Databricks CEO, in a phone interview with InformationWeek.
Databricks depicts a "Typical Data Pipeline" being replaced outright by Databricks Cloud, with ODBC connections making it possible to use conventional BI tools including MicroStrategy, QlikView, or Tableau Software (see images).
For now, Databricks Cloud runs exclusively on AWS S3, but Stoica said the company will also explore options to run on other clouds, including Google Compute Cloud and Microsoft Azure. If customers want or need to deploy on premises, Hadoop and Cassandra are both options to provide the basics of a data platform, including storage, high availability, and redundancy. But with this week's announcements, Databricks is stepping up and saying that Spark can handle the vast majority of high-value analytical work, whether that's MapReduce, machine learning, graph processing, streaming analysis, R-based data mining, or the SQL role that Hadoop vendors have been squabbling about for the last year or more.
Can Spark really supplant such an array of data-analysis tools? That's the subject of our analysis, "Will Spark, Google Dataflow Steal Hadoop's Thunder?," which includes reactions from Hadoop vendors. Our first take is that Spark has a lot to prove in real-world production deployments before it can reshape big data analysis as we know it.
With Monday's announcement, Databricks Cloud enters limited beta release. Stoica said it will be ready on Amazon by this fall, starting at "a couple of hundred dollars" per user, per month.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...