DataStax Brings Spark To Cassandra - InformationWeek

5/12/2014
01:08 PM

DataStax Brings Spark To Cassandra

DataStax promises real-time analysis for big-data apps on Cassandra through an integration with Databricks' Spark in-memory framework.


Apache Spark has become the darling of the big data world, with vendors seemingly lined up around the block to add the in-memory analysis framework to their platforms. DataStax became the latest company to join that club last week when it announced plans to integrate Spark and the Apache Cassandra NoSQL database management system (DBMS) that it develops and supports.

DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer. The benefit will be in-memory analysis for applications such as online recommendations and personalization, fraud detection, event messaging, and Internet of Things/sensor-data detection in manufacturing and IT settings, according to Martin Van Ryswyk, executive VP, engineering, at DataStax.

[Want more on Spark? Read MapR Brings Spark In-Memory Analysis To Hadoop.]

"Analytics on real-time data is important because people want to look at what the customer is looking at or buying right now, do a quick analysis against historical or location data, and offer them something different," says Van Ryswyk in a phone interview with InformationWeek. "This also happens in fraud scenarios when you need to stop a transaction right now, not two hours from now once you've seen all that data in batch mode on Hadoop or a data warehouse."

Databricks has partnerships for Spark integration with both Cloudera and MapR in the Hadoop world and with analytics vendor Alpine Data Labs. The link with Cassandra brings Spark into online transactional environments. Cassandra user and DataStax customer Ooyala, a video analytics platform company, built an integration between Cassandra and Spark on its own. According to Kelvin Chu, compute and data team lead at Ooyala, it's a powerful combination.

DataStax says it will integrate Cassandra with the Spark Core Engine, so it will take advantage of all types of analysis on the framework.

"With Cassandra as the data store and Spark for data crunching, these new analytic capabilities are making the processing of large data volumes a breeze," said Chu in a statement. "Spark on Cassandra is giving us the power to act on things in real time, which means faster decisions and faster results."

Not everyone agrees Cassandra and Spark are up to "real-time" standards. In-memory data grid vendor ScaleOut Software, for one, asserts that Spark doesn't handle real-time state changes and that Cassandra's eventual consistency approach will limit the performance potential of the combination.

"Spark does not handle real-time state changes to individual data items in a resilient distributed data set; it can only stream data and change the whole data set," says Bill Bain, CEO of ScaleOut, in a phone interview with InformationWeek. "HDFS has a similar limitation because you can't update data in HDFS; all you can do is append to HDFS files." [Editor's note: this quote was revised at the request of Bill Bain. The original quote addressed shortcomings of Cassandra rather than HDFS.]
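The distinction Bain draws, between transforming a whole data set and updating one item in place, can be illustrated with a toy sketch. This is plain Python, not the Spark or ScaleOut API. The names `transform_whole`, `update_in_place`, `balances`, and `grid` are hypothetical and exist only for illustration, and the tuple merely stands in for an immutable distributed data set.

```python
from typing import Dict, Tuple

# Toy stand-in for an immutable data set (an assumption, NOT the Spark API).
Dataset = Tuple[Tuple[str, int], ...]


def transform_whole(dataset: Dataset, key: str, delta: int) -> Dataset:
    """Return a NEW data set with `key` adjusted; the input is left untouched."""
    return tuple((k, v + delta if k == key else v) for k, v in dataset)


def update_in_place(store: Dict[str, int], key: str, delta: int) -> None:
    """Mutate a single item directly, as a mutable key-value store would."""
    store[key] += delta


# Immutable style: every change walks the data set and yields a new one.
balances: Dataset = (("alice", 100), ("bob", 50))
new_balances = transform_whole(balances, "bob", -20)

# Mutable style: only the affected item is touched.
grid = dict(balances)
update_in_place(grid, "bob", -20)
```

In the first style the original `balances` tuple is unchanged and a second data set exists alongside it. In the second, the single entry for `bob` is modified where it lives.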

ScaleOut says in-memory data grids have been used for years to support airline reservation systems, e-commerce shopping carts, and financial-trading applications, and will make the difference between real-time and near-real-time performance.

For companies already deployed on Cassandra, the difference between real-time and near-real-time isn't going to diminish the value of Spark integration. But it's something to consider for those hoping to get sub-second performance out of a big data platform.


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

D. Henschen,
User Rank: Author
5/13/2014 | 10:42:07 AM
Re: A plus-plus combination of open source code projects
ScaleOut would disagree with your "online" characterization of Spark. They almost make it sound positively batchy: it processes whole data sets, not individual data points the way an ACID-compliant, CRUD-capable database or other OLTP system does.
Charlie Babcock,
User Rank: Author
5/12/2014 | 8:18:21 PM
A plus-plus combination of open source code projects
Spark is not just an in-memory system but an enhancement to a distributed data management system like Cassandra. It's already changed Hadoop from a batch system into an online-processing system on a well-managed, distributed cluster. This is an interesting combination of open source code projects.

 
D. Henschen,
User Rank: Author
5/12/2014 | 2:28:44 PM
On CQL and cluster deployment
Forgot to note that Van Ryswyk said the integration will give you the option of running separate Cassandra and Spark clusters or running both on the same cluster, though with the latter approach you'd have to make sure analysis doesn't hurt transactional performance (either by beefing up the hardware or by limiting analysis loads). Van Ryswyk also insisted that Spark, which is looking to support SQL analysis, will not overlap with the Cassandra Query Language (CQL), since the latter is mostly for setting up predefined analyses and does not support ad-hoc querying.