DataStax Cassandra Release Packs More Than Spark - InformationWeek

DataStax Cassandra Release Packs More Than Spark

DataStax Spark support may grab headlines, but a bring-your-own-Hadoop connector in DataStax Enterprise 4.5 deserves equal billing.


Yes, DataStax formally introduced its previously announced integrations with the hot Apache Spark data-analysis framework on Monday, but the wider DataStax Enterprise 4.5 release also brings important new capabilities to Cassandra users.

Before we get to the Spark integrations, which were announced in May, consider the basic choice of where you handle analytics when using the Cassandra NoSQL database for a massively scaled application. DataStax has long included Hadoop software components in its DataStax Enterprise (DSE) distribution to support batch analytics -- via MapReduce -- on Cassandra data. Yet DataStax does not portray itself as a Hadoop vendor.

"We're squarely focused on online, Web, and mobile transactional applications, and we're happy to leave the data warehouse use cases to Cloudera, Hortonworks, and others," explained Robin Schumacher, DataStax' VP of products, in a phone interview with InformationWeek.


DSE 4.5 introduces an integration with external Hadoop clusters that's important in its own right, but it also provides more options for implementing Apache Spark. This new option, dubbed Bring Your Own Hadoop (BYOH), lets users bridge the gap between hot, operational data running in Cassandra and the historical data that they keep in a Hadoop-based data warehouse, said Schumacher. "We haven't been able to do that well in the past."

With BYOH, you can kick off, say, a Hive query that joins data from a Cassandra table with a Hive table that resides in Hadoop. DataStax specifically announced partnerships and certified integrations with Cloudera and Hortonworks.

What's a use case for using Cassandra and Hadoop together? If you're doing fraud detection, for example, you might want to compare online transactional patterns for a given customer (happening in real time on Cassandra) with that customer's historical transaction patterns (as captured in historical data on a Hadoop cluster). If new transactions differ from norms, you could use that insight to trigger a security screen.
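To make the fraud-detection comparison concrete, here is a minimal Python sketch. It is not DataStax's method or any product API: the function name `is_suspicious`, the z-score test, the threshold of 3.0, and the sample amounts are all assumptions chosen for illustration. The historical list stands in for data that might be pulled from Hadoop, and the incoming list stands in for transactions arriving in Cassandra.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, z_threshold=3.0):
    """Return True if `amount` deviates strongly from the historical norm.

    Hypothetical helper: uses a simple z-score against the customer's
    past transaction amounts. The threshold is an assumed value.
    """
    if len(history) < 2:
        return False  # not enough history to estimate a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts identical: any different amount is unusual.
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold


# Made-up historical purchase amounts for one customer.
historical_amounts = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]
# Made-up new transactions arriving in real time.
incoming = [44.0, 950.0]

# One flag per incoming transaction; a True flag could trigger a security screen.
flags = [is_suspicious(historical_amounts, amt) for amt in incoming]
print(flags)
```

In practice the "norm" would likely be richer than a single mean and standard deviation, but the shape is the same: a historical baseline on one side, live transactions on the other, and a rule that decides when the gap is large enough to act on.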

DataStax' Apache Spark support means certified Spark software now ships with DSE 4.5, and it is supported by DataStax (with level-two and level-three support from Spark promoter Databricks). DataStax has upgraded its visual system-management tools to support point-and-click deployment of Spark nodes as well as Cassandra nodes. DSE 4.5 also provides high-availability features for Spark that ensure resilience and failover.

DataStax has contributed the basics of its Cassandra-Spark integration work to the Apache Spark community, including a data-connectivity layer, data-type mappings, and performance optimizations that enable Cassandra and Spark to work better together.

There are multiple ways to deploy Spark and DSE in combination. Any way you do it, you'll gain Spark's rich data-analysis options -- SQL, machine learning, stream processing, MapReduce, graph processing, and R -- as well as its in-memory performance advantage, which DataStax puts at up to 100 times faster than using batch MapReduce methods.

One way to deploy is to run Cassandra and Spark on the same nodes, but that's not the best choice if you're handling time-sensitive or mission-critical workloads. A second option is to dedicate certain nodes to Cassandra and others to Spark. These isolated nodes won't compete for CPU or data resources, yet you can map data from the transactional to the analytical nodes.

With BYOH, there are yet more options for combining the transactional and analytical worlds, but if you're taking advantage of this option, your Hadoop cluster is likely holding more than just historical transactional data from Cassandra. If that's the case, you'll probably want to run Spark on top of Hadoop, because that's where all the analytical data resides. It's an easy choice, too, because every major Hadoop distributor now includes certified Spark software, and two vendors, Cloudera and MapR, also offer Spark support.

"Customers can port their operational Cassandra data over to a Hadoop data lake at some point and analyze it in a data warehouse style using Spark," said Schumacher, noting that some Cassandra users do that today in a rolling time-window fashion, automatically moving older data over to Hadoop. The BYOH option will make that easier.
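The rolling time-window pattern Schumacher describes can be sketched in a few lines of Python. This is an illustration only: the function `split_by_age`, the 30-day window, and the sample rows are assumptions, and the sketch does not use any Cassandra, Hadoop, or DSE API. It only shows the partitioning decision of which rows stay "hot" and which are old enough to move.

```python
from datetime import datetime, timedelta


def split_by_age(rows, now, window):
    """Partition rows into (keep_hot, to_archive) using a rolling window.

    Hypothetical helper: rows with a timestamp older than `now - window`
    are candidates for moving to the historical store.
    """
    cutoff = now - window
    keep_hot = [r for r in rows if r["ts"] >= cutoff]
    to_archive = [r for r in rows if r["ts"] < cutoff]
    return keep_hot, to_archive


# Made-up rows; "ts" is the time the transaction was recorded.
now = datetime(2014, 7, 1)
rows = [
    {"id": 1, "ts": datetime(2014, 6, 30)},
    {"id": 2, "ts": datetime(2014, 5, 15)},
    {"id": 3, "ts": datetime(2014, 3, 1)},
]

# Assumed 30-day window: anything before June 1, 2014 gets archived.
hot, cold = split_by_age(rows, now, timedelta(days=30))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Run on a schedule, a step like this keeps the operational store limited to recent data while older rows accumulate on the analytical side.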

Other improvements in DSE 4.5 include a new diagnostic Performance Service that helps administrators spot problems within a Cassandra cluster, whether that's particular user activities or nodes suffering from possible hardware failures. In addition, DataStax' OpsCenter 5.0, the latest release of DataStax' operations management software, now supports DSE clusters up to 1,000 nodes. A new Best Practice Expert feature in OpsCenter 5.0 provides point-and-click analysis of DSE configuration settings, pointing out possible security vulnerabilities or ways to optimize storage or memory configuration.


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

D. Henschen
User Rank: Author
7/2/2014 | 1:14:57 PM
BYOH creates certified integrations for common practice
DataStax' Hadoop capabilities have been there since DSE 1.0, but I suspect most Hadoop users were already finding ways to port data from Cassandra over to more mainstream distributions like Cloudera and Hortonworks. I suspect they are favoring those two and dissing MapR because the latter does a lot to goose HBase (Hadoop NoSQL database) performance, thereby presenting competition to Cassandra and DataStax.
