Yes, DataStax formally introduced previously announced integrations with the hot Apache Spark data-analysis framework on Monday, but the wider DataStax Enterprise 4.5 release also brings important new capabilities to Cassandra users.
Before we get to the Spark integrations, which were announced in May, consider the basic choice of where you handle analytics when using the Cassandra NoSQL database for a massively scaled application. DataStax has long included Hadoop software components in its DataStax Enterprise (DSE) distribution to support batch analytics -- via MapReduce -- on Cassandra data. Yet DataStax does not portray itself as a Hadoop vendor.
"We're squarely focused on online, Web, and mobile transactional applications, and we're happy to leave the data warehouse use cases to Cloudera, Hortonworks, and others," explained Robin Schumacher, DataStax' VP of products, in a phone interview with InformationWeek.
DSE 4.5 introduces an integration with external Hadoop clusters that's important in its own right, but it also provides more options for implementing Apache Spark. This new option, dubbed Bring Your Own Hadoop (BYOH), lets users bridge the gap between hot, operational data running in Cassandra and the historical data that they keep in a Hadoop-based data warehouse, said Schumacher. "We haven't been able to do that well in the past."
With BYOH, you can kick off, say, a Hive query that joins data from a Cassandra table with a Hive table stored in Hadoop. DataStax specifically announced partnerships and certified integrations with Cloudera and Hortonworks.
What's a use case for using Cassandra and Hadoop together? If you're doing fraud detection, for example, you might want to compare a given customer's online transactional patterns (happening in real time on Cassandra) with that customer's historical transaction patterns (as captured in historical data on a Hadoop cluster). If new transactions differ from norms, you could use that insight to trigger a security screen.
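To make the fraud-detection idea concrete, here is a minimal Python sketch of the comparison step. It is not DataStax's method and uses no Cassandra, Hive, or Spark API. The function name `is_unusual`, the three-standard-deviation threshold, and the sample amounts are all illustrative assumptions, and the two lists stand in for data that would really come from Cassandra (recent) and Hadoop (historical).

```python
from statistics import mean, stdev


def is_unusual(amount, history, threshold=3.0):
    """Return True if `amount` is more than `threshold` standard
    deviations from the mean of `history` (a simple z-score check).

    Hypothetical helper: a real screen would use richer features.
    """
    if len(history) < 2:
        return False  # too little history to define a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # every past amount was identical
    return abs(amount - mu) / sigma > threshold


# Stand-in for one customer's historical amounts (the Hadoop side).
historical_amounts = [42.0, 38.5, 51.0, 45.25, 40.0]

# Stand-in for the same customer's live transactions (the Cassandra side).
recent_amounts = [44.0, 950.0]

# Transactions that would trigger the security screen.
flagged = [a for a in recent_amounts if is_unusual(a, historical_amounts)]
print(flagged)
```

With these made-up numbers, only the 950.0 transaction falls outside the customer's norm. In a BYOH setup the join between live and historical data would more likely be expressed as a Hive or Spark query than in plain Python.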
DataStax' Apache Spark support means certified Spark software now ships with DSE 4.5, and it is supported by DataStax (with level-two and level-three support from Spark promoter Databricks). DataStax has upgraded its visual system-management tools to support point-and-click deployment of Spark nodes as well as Cassandra nodes. DSE 4.5 also provides high-availability features for Spark that ensure resilience and failover.
DataStax has contributed the basics of its Cassandra-Spark integration work to the Apache Spark Community, including a data-connectivity layer, data-type mappings, and performance optimizations that enable Cassandra and Spark to work better together.
There are multiple ways to deploy Spark and DSE in combination. Any way you do it, you'll gain Spark's rich data-analysis options -- SQL, machine learning, stream processing, MapReduce, graph processing, and R -- as well as its in-memory performance advantage, which DataStax puts at up to 100 times faster than using batch MapReduce methods.
One way to deploy is to run Cassandra and Spark on the same nodes, but that's not the best choice if you're handling time-sensitive or mission-critical workloads. A second option is to dedicate certain nodes to Cassandra and others to Spark. These isolated nodes won't compete for CPU or data resources, yet you can map data from the transactional to the analytical nodes.
With BYOH, there are yet more options for combining the transactional and analytical worlds, but if you're taking advantage of this option, your Hadoop cluster is likely holding more than just historical transactional data from Cassandra. If that's the case, you'll probably want to run Spark on top of Hadoop, because that's where all the analytical data resides. It's an easy choice, too, because every major Hadoop distributor now includes certified Spark software, and two vendors, Cloudera and MapR, also offer Spark support.
"Customers can port their operational Cassandra data over to a Hadoop data lake at some point and analyze it in a data warehouse style using Spark," said Schumacher, noting that some Cassandra users do that today in a rolling time-window fashion, automatically moving older data over to Hadoop. The BYOH option will make that easier.
Other improvements in DSE 4.5 include a new diagnostic Performance Service that helps administrators spot problems within a Cassandra cluster, whether that's particular user activities or nodes suffering from possible hardware failures. In addition, DataStax' OpsCenter 5.0, the latest release of DataStax' operations management software, now supports DSE clusters up to 1,000 nodes. A new Best Practice Expert feature in OpsCenter 5.0 provides point-and-click analysis of DSE configuration settings, pointing out possible security vulnerabilities or ways to optimize storage or memory configuration.