Apache's Cassandra Adds Column Data Analysis

Keynote at the Cassandra Summit outlined features in the 0.7 release of the NoSQL database system, notably support for secondary indexes.

Charles Babcock, Editor at Large, Cloud

August 19, 2010

NoSQL database system Cassandra recently launched the beta version of its 0.7 feature improvement release, which adds support for secondary indexes. The support makes it easier to analyze data found in a single column, such as finding a certain age group in a column of birth dates.

Cassandra is an open source project sponsored by the Apache Software Foundation to push forward the development of the key-value store NoSQL system. Jonathan Ellis, who founded the project while working for Rackspace, was the keynote speaker at the Cassandra Summit held at San Francisco's Mission Bay Conference Center Aug. 10. Current users of Cassandra include Facebook, Digg, and Twitter, which stores 15 million tweets a day in Cassandra.

Ellis, in an interview, said the addition of secondary indexes makes it possible to index columns in the tables of a Cassandra database, so data can be looked up by a column's value. Primary indexes, which are already supported, are based on the key of each row in the database.
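The difference between the two kinds of index can be sketched with a toy in-memory table. This is only an illustration of the general idea, not Cassandra's API or storage design: the `ToyColumnStore` class, its method names, and the `birth_year` column are all hypothetical.

```python
from collections import defaultdict


class ToyColumnStore:
    """Hypothetical in-memory table: row key -> {column name: value}."""

    def __init__(self):
        self.rows = {}      # primary lookup, by row key
        self.indexes = {}   # column name -> {value -> set of row keys}

    def create_index(self, column):
        """Build a secondary index over the rows already stored."""
        index = defaultdict(set)
        for key, columns in self.rows.items():
            if column in columns:
                index[columns[column]].add(key)
        self.indexes[column] = index

    def put(self, key, column, value):
        """Write one column and keep any index on that column current."""
        row = self.rows.setdefault(key, {})
        if column in self.indexes:
            index = self.indexes[column]
            if column in row:
                index[row[column]].discard(key)  # drop the stale entry
            index[value].add(key)
        row[column] = value

    def find(self, column, value):
        """Return the sorted row keys whose column equals value."""
        if column in self.indexes:
            # Indexed: one dictionary lookup.
            return sorted(self.indexes[column].get(value, set()))
        # Not indexed: every row has to be scanned.
        return sorted(
            key for key, columns in self.rows.items()
            if column in columns and columns[column] == value
        )


if __name__ == "__main__":
    store = ToyColumnStore()
    store.put("alice", "birth_year", 1975)
    store.put("bob", "birth_year", 1982)
    store.put("carol", "birth_year", 1975)
    store.create_index("birth_year")
    print(store.find("birth_year", 1975))  # ['alice', 'carol']
```

Without the index, `find` has to examine every row; with it, the same question is answered from a map keyed on the column's value, which is the kind of shortcut a secondary index provides.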

In addition, the 0.7 release supports rows that contain more than 2 GB of data in a Cassandra table; previously, 2 GB was the limit for a row. The 0.7 release can also create column families while the database is running. Previously, a node needed to be shut down for a family to be generated from the data stored on it. A query against a column family gets a response more quickly than a query that must review all the columns in the database, he said.

Ellis talked about the features of the 0.7 release in his address, The Present and Future of Cassandra. He was introduced by Bill Boebel, VP of strategy at Rackspace, who said Rackspace wished to contribute to open source projects that lead to more cloud computing software. As more users adopt the code, "we'll get a percentage of them" as users of the company's cloud infrastructure offering, he said. Rackspace is an investor in Riptano, the company Ellis co-founded to supply Cassandra training and support. His co-founder is Matt Pfeil, also a former Rackspace employee.

Ellis said about 200 Cassandra users were at the summit. He asked attendees how many of them were using Cassandra in production systems and about a third indicated they were, he said. Boebel termed the Cassandra event "the biggest individual NoSQL event held thus far." MongoDB, a document-oriented, NoSQL system, held its own user conference in San Francisco May 3 with a similar number of attendees.

A non-scientific poll conducted by Hacker News among startup developers found the open source MySQL database system still the most popular choice for establishing a company database, followed by the PostgreSQL database project. On their heels came the NoSQL systems: MongoDB placed third, while CouchDB and Cassandra tied for sixth. Redis and Microsoft databases took the fourth and fifth spots. Cassandra Summit attendees came from throughout the United States, as well as from Japan, Switzerland, and Australia, Ellis said.

Cassandra was downloaded 14,000 times during the month of July from its distribution server at the Apache Software Foundation. Four thousand people a day visit the project.

Cassandra is a member of the growing set of so-called NoSQL systems, which organize data into rows and tables as relational databases do but dispense with two-phase commits and other transaction guarantees found in relational databases. To handle masses of website data efficiently, a NoSQL system tolerates slight delays in propagating updates that a relational system would have to apply all at once. Thus, two users issuing the same query at the same time might get slightly different answers while Cassandra plays catch-up on data updates.
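A minimal sketch of that behavior, assuming a toy model in which a write lands on one copy immediately and reaches the other copies later. The `LazyReplicas` class and the `followers` key are hypothetical and do not reflect how Cassandra schedules its updates.

```python
from collections import deque


class LazyReplicas:
    """Hypothetical store whose copies are brought up to date lazily."""

    def __init__(self, count=3):
        self.replicas = [{} for _ in range(count)]
        self.pending = deque()  # queued (replica index, key, value) updates

    def write(self, key, value):
        # The first copy is updated at once; the others are only queued.
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica):
        # A reader sees whatever the copy it happens to reach holds.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Apply every queued update so all copies agree again.
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value


if __name__ == "__main__":
    db = LazyReplicas()
    db.write("followers", 10)
    print(db.read("followers", 0), db.read("followers", 2))  # 10 None
    db.propagate()
    print(db.read("followers", 0), db.read("followers", 2))  # 10 10
```

Two readers who reach different copies before `propagate` runs get different answers to the same query; once the queued updates are applied, every copy returns the same value.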

The NoSQL systems try to build in high reliability for operations on a server cluster by creating copies of data on three different nodes. A piece of hardware can fail and there will still be an original and a backup copy. Ellis is working on a Hinted Handoff feature in Cassandra, which allows a node to be temporarily absent from a cluster while processing of data continues anyway. A fourth copy is created, with a hint directed to the missing node telling it to update its data set when it comes back online. When the node reappears, the fourth copy is deleted and the system proceeds with three copies, as before.
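The handoff described above can be sketched as a small simulation. It is a simplified illustration under stated assumptions, not Cassandra's implementation: the `Node` and `ToyCluster` classes, the byte-sum placement rule, and the node names are hypothetical, and the extra copy is modeled as a stored hint on a spare node.

```python
class Node:
    """Hypothetical cluster member."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value held as a regular copy
        self.hints = []   # (absent node name, key, value) held for others


class ToyCluster:
    def __init__(self, names, replicas=3):
        self.nodes = [Node(name) for name in names]
        self.replicas = replicas

    def ring_from(self, key):
        """Order the nodes starting from a position derived from the key."""
        start = sum(key.encode()) % len(self.nodes)  # toy placement rule
        return self.nodes[start:] + self.nodes[:start]

    def write(self, key, value):
        ring = self.ring_from(key)
        owners, spares = ring[:self.replicas], ring[self.replicas:]
        for owner in owners:
            if owner.up:
                owner.data[key] = value
                continue
            # Owner is absent: park the write on a live spare as a hint.
            stand_in = next((node for node in spares if node.up), None)
            if stand_in is not None:
                stand_in.hints.append((owner.name, key, value))

    def recover(self, name):
        """Bring a node back, replay hints meant for it, then drop them."""
        target = next(node for node in self.nodes if node.name == name)
        target.up = True
        for node in self.nodes:
            kept = []
            for owner, key, value in node.hints:
                if owner == name:
                    target.data[key] = value
                else:
                    kept.append((owner, key, value))
            node.hints = kept


if __name__ == "__main__":
    cluster = ToyCluster(["n0", "n1", "n2", "n3"])
    absent = cluster.ring_from("tweet:1")[0]
    absent.up = False
    cluster.write("tweet:1", "hello")
    print(sum("tweet:1" in node.data for node in cluster.nodes))  # 2
    cluster.recover(absent.name)
    print(sum("tweet:1" in node.data for node in cluster.nodes))  # 3
```

While the owner is down, the write still succeeds on the two live owners and a spare holds the hint. After `recover`, the returning node has its copy, the hint is discarded, and the cluster is back to three regular copies.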

"The discussion was pretty technical," said Ellis in a follow-up e-mail message, "but so was the audience."

He said the beta release of Cassandra 0.7 will be tested by users for a month and then, with revisions, become the final release. The target date for Cassandra 1.0 is still undecided, he added.

In Boebel's introduction, he said Ellis organized the original Cassandra project while at Rackspace and built the community around it. When Ellis and fellow Rackspace employee Matt Pfeil left the company to found Riptano, a Cassandra support firm, Rackspace backed the move by investing in the new venture. Riptano supplies consulting and technical support for Cassandra. Ellis is CTO and Pfeil is CEO of Riptano.

"Riptano is doing very well -- we're up to 11 employees now, mostly engineers. So far we've basically been riding the coattails of Cassandra's success," said Ellis.

About the Author(s)

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
