Like good carpenters, data engineers know that different tasks require different tools. Picking the right tools -- and knowing how to use them -- can be the most important part of any job. Here's how we settled on Cassandra for the core operating database behind ShareThis, and what we learned about big-data modeling along the way.
ShareThis keeps track of who's sharing what with whom online, and which channels they're using to share. Our ecosystem includes 120 social communities, 3 million publisher sites and apps, and nearly 200 million people. We can tell which stories are trending, and we can track content in motion in real time as it's shared across social platforms. On a typical day, we're looking at about a million shares, and including other social signals such as click-backs, we see about 1.4 billion social signals per month.
Originally we were running on MongoDB, but it didn't scale well as the number of writes to our database continued to increase. We considered a few other options, including Cassandra and some memory-based options such as Membase and Couchbase. Cassandra's write throughput stood out: it lets us ingest larger volumes of data in a given period, which means more data at lower cost. Now that we've been using it for a while, we've found it extremely stable, but we do continue to look at new alternatives. We recently evaluated it against Aerospike, for example, which is known for supporting ACID transactions and the safety guarantees that come with them. We found that Aerospike performs really well on both write throughput (ingestion) and low-latency queries, but is limited in column width, while Cassandra allows for really wide columns.
We made a lot of mistakes in data modeling over the course of development, and setting up our data models correctly was tricky. The first few models were not easy to query because we didn't take advantage of Cassandra's wide-column format to create wide rows. To make full use of Cassandra's features, you have to store data in a way that Cassandra understands -- the extra up-front time saves a ton down the line. To reap the most benefit, you really have to know your data and build your data model to conform to the way that Cassandra thinks.
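To make the wide-row idea concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and not our production schema: the `WideRowTable` class, the URL partition key, and the integer timestamps are all hypothetical, chosen only to show why keeping related columns sorted inside one partition turns a time-range query into a single contiguous slice.

```python
import bisect
from collections import defaultdict


class WideRowTable:
    """Toy model of a wide row: one partition key -> many sorted columns.

    Hypothetical illustration only; a real Cassandra table stores this on disk
    and distributes partitions across nodes.
    """

    def __init__(self):
        # partition key -> list of (clustering_key, value), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # Insert in sorted position so reads never need to sort.
        # Values must be mutually comparable (plain strings here).
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def slice(self, partition_key, start, end):
        """Return values whose clustering key falls in [start, end)."""
        row = self._partitions.get(partition_key, [])
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_left(row, (end,))
        return [value for _, value in row[lo:hi]]


# Hypothetical usage: one row per URL, one column per share event.
table = WideRowTable()
table.insert("example.com/story", 30, "facebook")
table.insert("example.com/story", 10, "twitter")
table.insert("example.com/story", 20, "pinterest")
table.insert("example.com/other", 15, "linkedin")

recent = table.slice("example.com/story", 10, 30)
```

In this toy model, asking for the shares of one URL within a time window touches a single partition and reads a contiguous range, which is the access pattern a wide row is built for.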
One of Cassandra's strengths is high write throughput on commodity hardware, which enables us to scale infrastructure very quickly. Because we handle terabytes of data, a high write rate is critical to us. And because it's hard to predict loads, fast scalability translates into a competitive business advantage. The ability to scale up the cluster without changing the code, for example, is a huge asset. In a traditional database your data is statically partitioned, and when you add data you have to repartition. Cassandra, by contrast, can rebalance automatically so that data stays spread across nodes.
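The rebalancing behavior rests on consistent hashing. The sketch below is a simplified illustration and not Cassandra's implementation: the `HashRing` class, the node names, the use of MD5 as a stand-in hash, and the choice of 64 virtual nodes per server are all assumptions made for the example. It shows the property that matters operationally: when a node joins, the only keys that change owner are the ones that move to the new node.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    # MD5 is used only as a convenient, stable stand-in hash function.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes=(), vnodes=64):
        self._vnodes = vnodes
        self._tokens = []  # sorted token positions on the ring
        self._owner = {}   # token -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self._vnodes):
            token = _token(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._tokens, token)

    def node_for(self, key):
        # A key belongs to the first token at or after its own hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, _token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"url-{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # scale out by one node
moved = [key for key in keys if ring.node_for(key) != before[key]]
moved_to = {ring.node_for(key) for key in moved}
```

With a naive `hash(key) % node_count` scheme, adding a fourth node would reassign most keys. On the ring, `moved` contains only the keys picked up by `node-d`, and every other key stays where it was.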
Recently, a customer used our Publisher Analytics product to see in real time which social media channels its website readers were using to share content. The analytics also revealed that the channels people were sharing to were different from the channels people were sharing from. People were sharing to Facebook, Twitter, and Google Plus and sharing from Pinterest and LinkedIn. The customer quickly put Pinterest and LinkedIn buttons on its site. From our perspective, data analytics become genuinely valuable when you can use them as those kinds of action triggers. For example, a publisher can see in real time how socially engaged site visitors are with their content, which is extremely attractive to advertisers that want to connect with consumers at relevant moments. I don't want to suggest that we've discovered the Holy Grail of real-time big data analytics, but we have created a practical model that produces tangible business value from social data.
Cassandra does come with trade-offs. Unlike in traditional databases, it's nearly impossible to do a join between two different data sets, or to run functions across large swaths of data. This means that you have to know your data, and the queries you will run against it, in advance, and create the data structures those queries need ahead of time. If you can't do this, Cassandra might never work for you. But if you know exactly how your data will look and can keep your queries limited to the way your data was originally structured, then Cassandra is a good choice.
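One common way to live without joins is to write the same event into one structure per query you plan to run. The Python sketch below illustrates that pattern in memory; the `Share` record, the `ShareStore` class, and the two lookup tables are hypothetical names invented for the example, not our actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Share:
    url: str
    channel: str
    timestamp: int


class ShareStore:
    """Toy query-first store: one denormalized table per known query."""

    def __init__(self):
        # Each table is keyed by exactly what its query will filter on.
        self.shares_by_url = defaultdict(list)
        self.shares_by_channel = defaultdict(list)

    def record(self, share: Share) -> None:
        # Write the event twice so neither read needs a join or a scan.
        self.shares_by_url[share.url].append(share)
        self.shares_by_channel[share.channel].append(share)

    def for_url(self, url: str) -> list:
        return list(self.shares_by_url.get(url, []))

    def for_channel(self, channel: str) -> list:
        return list(self.shares_by_channel.get(channel, []))


store = ShareStore()
store.record(Share("example.com/story", "twitter", 10))
store.record(Share("example.com/story", "pinterest", 20))
store.record(Share("example.com/other", "twitter", 30))

twitter_shares = store.for_channel("twitter")
story_shares = store.for_url("example.com/story")
```

The cost is extra writes and extra storage, which is a reasonable trade when writes are cheap. The limitation is the one described above: a question nobody planned for, such as shares grouped by hour across all URLs, has no table to answer it until you build one.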
Bottom line: Cassandra gives us the ability to scale out by simply adding a node to a cluster, and letting the cluster rebalance itself, which saves on operational overhead. Maintaining high write throughput with a minimal number of nodes lets us manage infrastructure costs more effectively. When you're running a business that provides real-time big data analytics, keeping things simple and managing infrastructure costs intelligently are critical objectives.