Apache Spark has become the darling of the big data world, with vendors seemingly lined up around the block to add the in-memory analysis framework to their platforms. DataStax became the latest company to join that club last week when it announced plans to integrate Spark and the Apache Cassandra NoSQL database management system (DBMS) that it develops and supports.
DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer. The benefit will be in-memory analysis for applications such as online recommendations and personalization, fraud detection, event messaging, and Internet of Things/sensor-data detection in manufacturing and IT settings, according to Martin Van Ryswyk, executive VP, engineering, at DataStax.
"Analytics on real-time data is important because people want to look at what the customer is looking at or buying right now, do a quick analysis against historical or location data, and offer them something different," says Van Ryswyk in a phone interview with InformationWeek. "This also happens in fraud scenarios when you need to stop a transaction right now, not two hours from now once you've seen all that data in batch mode on Hadoop or a data warehouse."
Databricks has partnerships for Spark integration with both Cloudera and MapR in the Hadoop world and with analytics vendor Alpine Data Labs. The link with Cassandra brings Spark into online transactional environments. Cassandra user and DataStax customer Ooyala, a video analytics platform company, built an integration between Cassandra and Spark on its own. According to Kelvin Chu, compute and data team lead at Ooyala, it's a powerful combination.
"With Cassandra as the data store and Spark for data crunching, these new analytic capabilities are making the processing of large data volumes a breeze," said Chu in a statement. "Spark on Cassandra is giving us the power to act on things in real time, which means faster decisions and faster results."
Not everyone agrees Cassandra and Spark are up to "real-time" standards. In-memory data grid vendor ScaleOut Software, for one, asserts that Spark doesn't handle real-time state changes and that Cassandra's eventual consistency approach will limit the performance potential of the combination.
"Spark does not handle real-time state changes to individual data items in a resilient distributed data set; it can only stream data and change the whole data set," says Bill Bain, CEO of ScaleOut, in a phone interview with InformationWeek. "HDFS has a similar limitation because you can't update data in HDFS; all you can do is append to HDFS files." [Editor's note: this quote was revised at the request of Bill Bain. The original quote addressed shortcomings of Cassandra rather than HDFS.]
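Bain's distinction can be illustrated with a minimal Python sketch. This is not Spark or ScaleOut code: `ImmutableDataset` and `MutableGrid` are hypothetical stand-ins invented for illustration. The first mimics a data set that can only be transformed as a whole into a new data set. The second mimics an in-memory store that changes a single item in place.

```python
from typing import Callable, Dict, Tuple

Item = Tuple[str, int]


class ImmutableDataset:
    """Hypothetical stand-in for an RDD-style data set: never modified in place."""

    def __init__(self, items: Tuple[Item, ...]) -> None:
        self._items = tuple(items)

    def map(self, fn: Callable[[Item], Item]) -> "ImmutableDataset":
        # A transformation visits every item and returns a new data set;
        # the original is left untouched.
        return ImmutableDataset(tuple(fn(item) for item in self._items))

    def collect(self) -> Dict[str, int]:
        return dict(self._items)


class MutableGrid:
    """Hypothetical stand-in for an in-memory store with per-item updates."""

    def __init__(self, items: Dict[str, int]) -> None:
        self._items = dict(items)

    def update(self, key: str, value: int) -> None:
        # Only the one entry is touched; nothing else is rewritten.
        self._items[key] = value

    def snapshot(self) -> Dict[str, int]:
        return dict(self._items)


original = ImmutableDataset((("cart-1", 2), ("cart-2", 5)))
# Changing one item means deriving a whole new data set.
derived = original.map(lambda kv: (kv[0], 3) if kv[0] == "cart-1" else kv)

grid = MutableGrid({"cart-1": 2, "cart-2": 5})
# Changing one item is a single in-place write.
grid.update("cart-1", 3)
```

In this sketch both approaches end with the same values. The difference is in how much work one change requires, which is the point of contention between the two camps.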
ScaleOut says in-memory data grids have been used for years to support airline reservation systems, e-commerce shopping carts, and financial-trading applications, and will make the difference between real-time and near-real-time performance.
For companies that already run Cassandra, the gap between real-time and near-real-time performance is unlikely to diminish the value of Spark integration. But it is something to consider for those hoping to get sub-second performance out of a big data platform.