7/2/2014
11:40 AM

DataStax Cassandra Release Packs More Than Spark

DataStax Spark support may grab headlines, but a bring-your-own-Hadoop connector in DataStax Enterprise 4.5 deserves equal billing.

Yes, DataStax formally introduced its previously announced integrations with the hot Apache Spark data-analysis framework on Monday, but the wider DataStax Enterprise 4.5 release also brings important new capabilities to Cassandra users.

Before we get to the Spark integrations, which were announced in May, consider the basic choice of where you handle analytics when using the Cassandra NoSQL database for a massively scaled application. DataStax has long included Hadoop software components in its DataStax Enterprise (DSE) distribution to support batch analytics -- via MapReduce -- on Cassandra data. Yet DataStax does not portray itself as a Hadoop vendor.

"We're squarely focused on online, Web, and mobile transactional applications, and we're happy to leave the data warehouse use cases to Cloudera, Hortonworks, and others," explained Robin Schumacher, DataStax' VP of products, in a phone interview with InformationWeek.

DSE 4.5 introduces an integration with external Hadoop clusters that's important in its own right, but it also provides more options for implementing Apache Spark. This new option, dubbed Bring Your Own Hadoop (BYOH), lets users bridge the gap between hot, operational data running in Cassandra and the historical data that they keep in a Hadoop-based data warehouse, said Schumacher. "We haven't been able to do that well in the past."

With BYOH, you can kick off, say, a Hive query that joins data from a Cassandra table with a Hive table stored in Hadoop. DataStax specifically announced partnerships and certified integrations with Cloudera and Hortonworks.

What's a use case for using Cassandra and Hadoop together? If you're doing fraud detection, for example, you might want to compare online transactional patterns for a given customer (happening in real time on Cassandra) with that customer's historical transaction patterns (as captured on a Hadoop cluster). If new transactions differ from norms, you could use that insight to trigger a security screen.
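
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not DataStax's method: the function name `is_suspicious`, the z-score rule, and the threshold of 3.0 are all assumptions, and plain lists stand in for the historical amounts that would come from Hadoop and the live transaction that would come from Cassandra.

```python
from statistics import mean, stdev


def is_suspicious(historical_amounts, new_amount, z_threshold=3.0):
    """Flag a new transaction that deviates sharply from a customer's norm.

    Hypothetical helper: historical_amounts stands in for data read from
    Hadoop, new_amount for a live transaction seen in Cassandra.
    """
    if len(historical_amounts) < 2:
        return False  # not enough history to define a norm

    mu = mean(historical_amounts)
    sigma = stdev(historical_amounts)
    if sigma == 0:
        # Every past amount was identical, so any change is a deviation.
        return new_amount != mu

    # Distance from the historical mean, in standard deviations.
    return abs(new_amount - mu) / sigma > z_threshold


history = [20.0, 25.0, 22.0, 30.0, 24.0]  # illustrative values only
if is_suspicious(history, 500.0):
    print("trigger security screen")
```

A production system would likely use richer features than a single amount, but the shape is the same: a norm built from history, and a real-time check against it.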

DataStax' Apache Spark support means certified Spark software now ships with DSE 4.5, and it is supported by DataStax (with level-two and level-three support from Spark promoter Databricks). DataStax has upgraded its visual system-management tools to support point-and-click deployment of Spark nodes as well as Cassandra nodes. DSE 4.5 also provides high-availability features for Spark that ensure resilience and failover.

DataStax has contributed the basics of its Cassandra-Spark integration work to the Apache Spark community, including a data-connectivity layer, data-type mappings, and performance optimizations that enable Cassandra and Spark to work better together.

There are multiple ways to deploy Spark and DSE in combination. Any way you do it, you'll gain Spark's rich data-analysis options -- SQL, machine learning, stream processing, MapReduce, graph processing, and R -- as well as its in-memory performance advantage, which DataStax puts at up to 100 times faster than using batch MapReduce methods.

One way to deploy is to run Cassandra and Spark on the same nodes, but that's not the best choice if you're handling time-sensitive or mission-critical workloads. A second option is to dedicate certain nodes to Cassandra and others to Spark. These isolated nodes won't compete for CPU or data resources, yet you can map data from the transactional to the analytical nodes.

With BYOH, there are yet more options for combining the transactional and analytical worlds, but if you're taking advantage of this option, your Hadoop cluster is likely holding more than just historical transactional data from Cassandra. If that's the case, you'll probably want to run Spark on top of Hadoop, because that's where all the analytical data resides. It's an easy choice, too, because every major Hadoop distributor now includes certified Spark software, and two vendors, Cloudera and MapR, also offer Spark support.

"Customers can port their operational Cassandra data over to a Hadoop data lake at some point and analyze it in a data warehouse style using Spark," said Schumacher, noting that some Cassandra users do that today in a rolling time-window fashion, automatically moving older data over to Hadoop. The BYOH option will make that easier.

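The rolling time-window approach Schumacher describes amounts to splitting records at a cutoff date. A minimal sketch follows, assuming each record carries a timestamp under a hypothetical `ts` key and that the window is 30 days. Neither detail comes from DataStax, and the actual move between Cassandra and Hadoop is left out.

```python
from datetime import datetime, timedelta


def split_by_age(rows, now, window=timedelta(days=30)):
    """Split rows into (hot, cold) around a rolling cutoff.

    Hypothetical helper: rows inside the window would stay in Cassandra,
    older rows would be candidates for moving to Hadoop.
    """
    cutoff = now - window
    hot = [r for r in rows if r["ts"] >= cutoff]
    cold = [r for r in rows if r["ts"] < cutoff]
    return hot, cold


rows = [  # illustrative records only
    {"id": 1, "ts": datetime(2014, 7, 1)},
    {"id": 2, "ts": datetime(2014, 5, 1)},
]
hot, cold = split_by_age(rows, now=datetime(2014, 7, 2))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the split repeatable when the job is rerun.
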
Other improvements in DSE 4.5 include a new diagnostic Performance Service that helps administrators spot problems within a Cassandra cluster, whether that's particular user activities or nodes suffering from possible hardware failures. In addition, DataStax' OpsCenter 5.0, the latest release of DataStax' operations management software, now supports DSE clusters up to 1,000 nodes. A new Best Practice Expert feature in OpsCenter 5.0 provides point-and-click analysis of DSE configuration settings, pointing out possible security vulnerabilities or ways to optimize storage or memory configuration.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Comments
D. Henschen,
User Rank: Author
7/2/2014 | 1:14:57 PM
BYOH creates certified integrations for common practice
DataStax' Hadoop capabilities have been there since DSE 1.0, but I suspect most Hadoop users were already finding ways to port data from Cassandra over to more mainstream distributions like Cloudera and Hortonworks. I suspect they are favoring those two and dissing MapR because the latter does a lot to goose HBase (Hadoop NoSQL database) performance, thereby presenting competition to Cassandra and DataStax.