DataStax Brings Spark To Cassandra - InformationWeek

DataStax Brings Spark To Cassandra

DataStax promises real-time analysis for big-data apps on Cassandra through an integration with Databricks' Spark in-memory framework.

Apache Spark has become the darling of the big data world, with vendors seemingly lined up around the block to add the in-memory analysis framework to their platforms. DataStax became the latest company to join that club last week when it announced plans to integrate Spark and the Apache Cassandra NoSQL database management system (DBMS) that it develops and supports.

DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer. The benefit will be in-memory analysis for applications such as online recommendations and personalization, fraud detection, event messaging, and Internet of Things/sensor-data detection in manufacturing and IT settings, according to Martin Van Ryswyk, executive VP, engineering, at DataStax.

"Analytics on real-time data is important because people want to look at what the customer is looking at or buying right now, do a quick analysis against historical or location data, and offer them something different," says Van Ryswyk in a phone interview with InformationWeek. "This also happens in fraud scenarios when you need to stop a transaction right now, not two hours from now once you've seen all that data in batch mode on Hadoop or a data warehouse."

Databricks has partnerships for Spark integration with both Cloudera and MapR in the Hadoop world and with analytics vendor Alpine Data Labs. The link with Cassandra brings Spark into online transactional environments. Cassandra user and DataStax customer Ooyala, a video analytics platform company, built an integration between Cassandra and Spark on its own. According to Kelvin Chu, compute and data team lead at Ooyala, it's a powerful combination.

DataStax says it will integrate Cassandra with the Spark Core Engine, so it will take advantage of all types of analysis on the framework.

"With Cassandra as the data store and Spark for data crunching, these new analytic capabilities are making the processing of large data volumes a breeze," said Chu in a statement. "Spark on Cassandra is giving us the power to act on things in real time, which means faster decisions and faster results."

Not everyone agrees Cassandra and Spark are up to "real-time" standards. In-memory data grid vendor ScaleOut Software, for one, asserts that Spark doesn't handle real-time state changes and that Cassandra's eventual consistency approach will limit the performance potential of the combination.

"Spark does not handle real-time state changes to individual data items in a resilient distributed data set; it can only stream data and change the whole data set," says Bill Bain, CEO of ScaleOut, in a phone interview with InformationWeek. "HDFS has a similar limitation because you can't update data in HDFS; all you can do is append to HDFS files." [Editor's note: this quote was revised at the request of Bill Bain. The original quote addressed shortcomings of Cassandra rather than HDFS.]
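The distinction Bain draws can be illustrated without Spark itself. The following is a minimal Python sketch, not Spark code: the hypothetical `ImmutableDataset` class stands in for an RDD-style collection whose transformations yield a new data set, while a plain `dict` stands in for a store that updates a single item in place. The account names and balances are made up for illustration.

```python
from typing import Callable, Dict, Tuple


class ImmutableDataset:
    """Hypothetical stand-in for an RDD-style collection (not the Spark API)."""

    def __init__(self, records: Tuple[Tuple[str, float], ...]):
        self._records = tuple(records)

    def map_values(self, fn: Callable[[str, float], float]) -> "ImmutableDataset":
        # A transformation visits every record and returns a NEW dataset;
        # the original dataset is left untouched.
        return ImmutableDataset(tuple((k, fn(k, v)) for k, v in self._records))

    def collect(self) -> Dict[str, float]:
        # Materialize the records as an ordinary dictionary.
        return dict(self._records)


# Illustrative account balances (made-up data).
balances = ImmutableDataset((("alice", 100.0), ("bob", 50.0)))

# "Updating" one account means deriving a whole new dataset.
updated = balances.map_values(lambda k, v: v - 30.0 if k == "alice" else v)

# A mutable key-value store, by contrast, changes one item in place.
grid = {"alice": 100.0, "bob": 50.0}
grid["alice"] -= 30.0

print(balances.collect())
print(updated.collect())
print(grid)
```

In this sketch the single-record change on the immutable side still walks every record and produces a second data set, whereas the dictionary update touches only one key. That is the trade-off the quote describes, reduced to a toy scale.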

ScaleOut says in-memory data grids have been used for years to support airline reservation systems, e-commerce shopping carts, and financial-trading applications, and will make the difference between real-time and near-real-time performance.

For companies already deployed on Cassandra, the difference between real-time and near-real-time isn't going to diminish the value of Spark integration. But it's something to consider for those hoping to get sub-second performance out of a big data platform.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

D. Henschen,
User Rank: Author
5/13/2014 | 10:42:07 AM
Re: A plus-plus combination of open source code projects
ScaleOut would disagree with your "online" characterization of Spark. They almost make it sound positively batchy in terms of processing data sets, not individual data points like an ACID-compliant, CRUD-capable database or other OLTP system.
Charlie Babcock,
User Rank: Author
5/12/2014 | 8:18:21 PM
A plus-plus combination of open source code projects
Spark is not just an in-memory system but an enhancement to a distributed data management system like Cassandra. It's already changed Hadoop from a batch system into an online-processing system on a well-managed, distributed cluster. This is an interesting combination of open source code projects.

D. Henschen,
User Rank: Author
5/12/2014 | 2:28:44 PM
On CQL and cluster deployment
Forgot to note that Van Ryswyk said the integration will give you the option of running separate Cassandra and Spark clusters or running both on the same cluster -- though you'd have to be careful that the latter approach would not impact transactional performance (either by beefing up the hardware or being careful about analysis loads). Van Ryswyk also insisted that Spark, which is looking to support SQL analysis, will not overlap with the Cassandra CQL query language since the latter is mostly for setting up predefined analyses and does not support ad-hoc querying.