5/12/2014
01:08 PM

DataStax Brings Spark To Cassandra

DataStax promises real-time analysis for big-data apps on Cassandra through an integration with Databricks' Spark in-memory framework.


Apache Spark has become the darling of the big data world, with vendors seemingly lined up around the block to add the in-memory analysis framework to their platforms. DataStax became the latest company to join that club last week when it announced plans to integrate Spark and the Apache Cassandra NoSQL database management system (DBMS) that it develops and supports.

DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer. The benefit will be in-memory analysis for applications such as online recommendations and personalization, fraud detection, event messaging, and Internet of Things/sensor-data detection in manufacturing and IT settings, according to Martin Van Ryswyk, executive VP, engineering, at DataStax.


"Analytics on real-time data is important because people want to look at what the customer is looking at or buying right now, do a quick analysis against historical or location data, and offer them something different," says Van Ryswyk in a phone interview with InformationWeek. "This also happens in fraud scenarios when you need to stop a transaction right now, not two hours from now once you've seen all that data in batch mode on Hadoop or a data warehouse."

Databricks has partnerships for Spark integration with both Cloudera and MapR in the Hadoop world and with analytics vendor Alpine Data Labs. The link with Cassandra brings Spark into online transactional environments. Cassandra user and DataStax customer Ooyala, a video analytics platform company, built an integration between Cassandra and Spark on its own. According to Kelvin Chu, compute and data team lead at Ooyala, it's a powerful combination.

DataStax says it will integrate Cassandra with the Spark Core Engine, so it will take advantage of all types of analysis on the framework.

"With Cassandra as the data store and Spark for data crunching, these new analytic capabilities are making the processing of large data volumes a breeze," said Chu in a statement. "Spark on Cassandra is giving us the power to act on things in real time, which means faster decisions and faster results."

Not everyone agrees Cassandra and Spark are up to "real-time" standards. In-memory data grid vendor ScaleOut Software, for one, asserts that Spark doesn't handle real-time state changes and that Cassandra's eventual consistency approach will limit the performance potential of the combination.

"Spark does not handle real-time state changes to individual data items in a resilient distributed data set; it can only stream data and change the whole data set," says Bill Bain, CEO of ScaleOut, in a phone interview with InformationWeek. "HDFS has a similar limitation because you can't update data in HDFS; all you can do is append to HDFS files." [Editor's note: this quote was revised at the request of Bill Bain. The original quote addressed shortcomings of Cassandra rather than HDFS.]

ScaleOut says in-memory data grids have been used for years to support airline reservation systems, e-commerce shopping carts, and financial-trading applications, and that such grids make the difference between real-time and near-real-time performance.

For companies that have already deployed Cassandra, the difference between real-time and near-real-time performance isn't going to diminish the value of Spark integration. But it's something to consider for those hoping to get sub-second performance out of a big data platform.


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Comments
D. Henschen,
User Rank: Author
5/13/2014 | 10:42:07 AM
Re: A plus-plus combination of open source code projects
ScaleOut would disagree with your "online" characterization of Spark. They almost make it sound positively batchy, in that it processes whole data sets rather than individual data points the way an ACID-compliant, CRUD-capable database or other OLTP system does.
Charlie Babcock,
User Rank: Author
5/12/2014 | 8:18:21 PM
A plus-plus combination of open source code projects
Spark is not just an in-memory system but an enhancement to a distributed data management system like Cassandra. It's already changed Hadoop from a batch system into an online-processing system on a well-managed, distributed cluster. This is an interesting combination of open source code projects.

 
D. Henschen,
User Rank: Author
5/12/2014 | 2:28:44 PM
On CQL and cluster deployment
Forgot to note that Van Ryswyk said the integration will give you the option of running separate Cassandra and Spark clusters or running both on the same cluster -- though you'd have to be careful that the latter approach would not impact transactional performance (either by beefing up the hardware or being careful about analysis loads). Van Ryswyk also insisted that Spark, which is looking to support SQL analysis, will not overlap with the Cassandra CQL query language since the latter is mostly for setting up predefined analyses and does not support ad-hoc querying.