Why We Picked Cassandra For Big Data - InformationWeek


09:06 AM
Juan Valencia

Why We Picked Cassandra For Big Data

Easy scale-outs, high write throughputs, and lower costs were key, but Cassandra does have its limitations.

Like good carpenters, data engineers know that different tasks require different tools. Picking the right tools -- and knowing how to use them -- can be the most important part of any job. Here's how we settled on Cassandra for the core operating database behind ShareThis, and what we learned about big-data modeling along the way.

ShareThis keeps track of who's sharing what with whom online, and which channels they're using to share. Our ecosystem includes 120 social communities, 3 million publisher sites and apps, and nearly 200 million people. We can tell which stories are trending, and we can track content in motion in real time as it's shared across social platforms. On a typical day, we're looking at about a million shares, and including other social signals such as click-backs, we see about 1.4 billion social signals per month.

Originally we were running on MongoDB, but it didn't scale well as the number of writes to our database continued to increase. We considered a few other options, including Cassandra and some memory-based options such as Membase and Couchbase. Cassandra's write throughput stood out: it lets us ingest more data in a given period, which means more data at lower cost. Now that we've been using it for a while, we've found it extremely stable, but we continue to look at alternatives. We recently evaluated it against Aerospike, for example, which is known for supporting ACID transactions and the added safety they provide. We found that Aerospike performs really well on both write throughput (ingestion) and low-latency queries but is limited in column width, while Cassandra allows for really wide columns.


We made a lot of mistakes in data modeling over the course of development, and setting up our data models correctly was tricky. The first few models were not easy to query because we didn't take advantage of Cassandra's wide-column storage format to create wide rows. To make full use of Cassandra's features, you have to store the data in a way that Cassandra understands -- the extra up-front time saves a ton down the line. Eventually, we learned to match our data model to the way Cassandra models its data: to reap the most benefit, you really have to know your data and build your model to conform to the way that Cassandra thinks.
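To make the wide-row idea concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and not our production schema; the class name, the URL partition key, and the integer timestamps are all hypothetical. It only illustrates the shape of the model: one partition per key, with cells kept sorted so that a range read touches a single partition.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class WideRowTable:
    """Toy model of a wide row: one partition per key, cells kept in sorted order."""

    def __init__(self):
        # partition key -> sorted list of (clustering_key, value) cells
        self._partitions = defaultdict(list)

    def write(self, partition_key, clustering_key, value):
        # insort keeps the partition sorted by clustering key on every write
        insort(self._partitions[partition_key], (clustering_key, value))

    def range_query(self, partition_key, start, end):
        """Return cells with start <= clustering_key <= end from one partition."""
        cells = self._partitions.get(partition_key, [])
        keys = [cell[0] for cell in cells]
        return cells[bisect_left(keys, start):bisect_right(keys, end)]


# Hypothetical data: share events keyed by URL, ordered by an integer timestamp.
table = WideRowTable()
table.write("example.com/story", 1003, "twitter")
table.write("example.com/story", 1001, "facebook")
table.write("example.com/story", 1002, "pinterest")
table.write("example.com/other", 1002, "linkedin")

recent = table.range_query("example.com/story", 1002, 1003)
print(recent)  # [(1002, 'pinterest'), (1003, 'twitter')]
```

The point of the layout is that the query ("shares for this URL in this time window") is decided before any data is written, and the storage order is chosen to serve it.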

One of Cassandra's strengths is high write throughput on commodity hardware, which enables us to scale infrastructure very quickly. Because we handle terabytes of data, a high write rate is critical to us. And because it's hard to predict loads, fast scalability translates into a competitive business advantage. The ability to scale up the cluster without changing the code, for example, is a huge asset. In a traditional database, your data is statically partitioned, and when you add data you have to repartition. Cassandra, by contrast, can rebalance automatically so that data stays spread across nodes.
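The difference between static partitioning and a rebalancing ring can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra's implementation: MD5 stands in for a stable hash, the node names and the count of 64 virtual tokens per node are made up, and Cassandra's real partitioner and token allocation differ. The sketch compares how many keys change owner when a fifth node joins a four-node cluster.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value: str) -> int:
    # MD5 is only a stand-in for a stable, well-spread hash function
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_owner(key, nodes):
    """Static partitioning: the owner depends directly on the node count."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Consistent-hash ring with several virtual tokens per node."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, key):
        # The first token clockwise from the key's hash owns the key
        idx = bisect_right(self._tokens, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys):
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
old_nodes = ["node-a", "node-b", "node-c", "node-d"]
new_nodes = old_nodes + ["node-e"]

modulo_moved = moved_fraction(
    lambda k: modulo_owner(k, old_nodes),
    lambda k: modulo_owner(k, new_nodes),
    keys,
)
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"modulo partitioning: {modulo_moved:.0%} of keys moved")
print(f"hash ring:           {ring_moved:.0%} of keys moved")
```

With modulo placement most keys change owner when the node count changes, while on the ring only the keys that land on the new node's tokens move, and they move only to the new node. That property is what lets a cluster grow without application changes.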

Image: Cassandra

Recently, a customer used our Publisher Analytics product to see in real time which social media channels its website readers were using to share content. The analytics also revealed that the channels people were sharing to were different from the channels people were sharing from. People were sharing to Facebook, Twitter, and Google Plus and sharing from Pinterest and LinkedIn. The customer quickly put Pinterest and LinkedIn buttons on its site. From our perspective, data analytics become genuinely valuable when you can use them as those kinds of action triggers. For example, a publisher can see how socially engaged with their content site visitors are in real time, which is extremely attractive to advertisers that want to connect with consumers at relevant moments. I don't want to suggest that we've discovered the Holy Grail of real-time big data analytics, but we have created a practical model that produces tangible business value from social data.

Cassandra does come with trade-offs. Unlike in a traditional database, it's nearly impossible to do a join between two different data sets or to run functions across large swaths of data. This means that you have to know your data, and create the data structures you will need to query, in advance. If you can't do this, Cassandra might never work for you. But if you know exactly how your data will look and can limit your queries to the way the data was originally structured, then Cassandra is a good choice.
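One common way to live without joins is to write each event once per query you need to answer. The Python sketch below illustrates that pattern with plain dictionaries; the class, field names, and sample data are hypothetical and are not our actual tables.

```python
from collections import defaultdict


class ShareStore:
    """Toy store that keeps one pre-built 'table' per query it must answer."""

    def __init__(self):
        # Answers: which channels and users shared this URL?
        self.shares_by_url = defaultdict(list)
        # Answers: which URLs were shared on this channel, and by whom?
        self.shares_by_channel = defaultdict(list)

    def record_share(self, url, channel, user):
        # Write the same event twice, once per query pattern,
        # instead of joining two data sets at read time
        self.shares_by_url[url].append((channel, user))
        self.shares_by_channel[channel].append((url, user))


store = ShareStore()
store.record_share("example.com/story", "pinterest", "alice")
store.record_share("example.com/story", "linkedin", "bob")
store.record_share("example.com/other", "pinterest", "carol")

print(store.shares_by_url["example.com/story"])
print(store.shares_by_channel["pinterest"])
```

The cost is extra writes and duplicated data, which is an acceptable trade when writes are cheap. A query that was not planned for, such as "all shares by one user," has no structure to read from and would need a new table.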

Bottom line: Cassandra gives us the ability to scale out by simply adding a node to a cluster, and letting the cluster rebalance itself, which saves on operational overhead. Maintaining high write throughput with a minimal number of nodes lets us manage infrastructure costs more effectively. When you're running a business that provides real-time big data analytics, keeping things simple and managing infrastructure costs intelligently are critical objectives.


Juan Valencia is principal engineer at ShareThis. He develops and designs software libraries and frameworks for use in distributed web applications; designs and implements the ShareThis real-time and batch analytics pipelines; researches new technologies and methodologies to ...
User Rank: Apprentice
12/22/2014 | 12:32:01 AM
Re: How did you "widen" your columns?
I don't work at the company listed in the article, but the following is how I'd approach the problem.

CREATE TABLE purchases_per_customer (
    customer_id uuid,
    purchase_date timestamp,
    item text,
    price int,
    age int,
    address text,
    -- customer_id is the partition key; the remaining columns are clustering columns
    PRIMARY KEY (customer_id, purchase_date, item, price, age, address)
);

This will allow me to quickly access all recent purchases for a customer.

Importantly, my data will be physically stored in sorted order. The data will be sorted by purchase_date and then by item within any given date.

Also, I pushed all of my columns into the cell (aka physical column) name and therefore I have only 1 cell per metric. Nothing is stored in the cell value since everything has been packed into the cell name. This is a good move since it eliminates the data overhead of repeatedly storing column names in the cell name (which can be large for a high-volume metrics application).

The main benefit this gives me is the ability to get all recent purchases for a customer with a single on-disk access. This data model will also give very fast range scans by purchase date.

Of course, I'd probably also want to query by item, so I would then store that in another table. In Cassandra, writes are super fast and I tend to model my data to become CPU bound, not I/O bound, so I'd probably turn on compression for this table.
Charlie Babcock,
User Rank: Author
12/18/2014 | 7:57:52 PM
How did you "widen" your columns?
That's interesting, the fact that Cassandra handles wide columns well and they can be used to maximize query and write throughput. I wish you had given a concrete example of something you did to combine more data into a single column. For example, if the picture of the customer, age, address and recent purchases were previously in separate columns, could you combine them into a single column for a quick customer view with one query? Or is my relational orientation showing?
User Rank: Apprentice
12/18/2014 | 11:24:36 AM
Tossing your MongoDB Investment is Painful
Full disclosure, I work for Tokutek... Throwing away your investment in MongoDB is painful, but sometimes the best option.

I'm curious. Had you heard of the Tokutek fork of MongoDB (called TokuMX)? If yes, did you consider it as an option?