February 26, 2010
Guess what? Cassandra is going to tweet. The open source Cassandra data management system is going to replace the MySQL database system at Twitter, the latest of several MySQL replacements at social networking sites, according to Ryan King, a software engineer at Twitter.
Facebook and Digg, which used to rely on the open source MySQL database system, now part of Oracle, have already made the switch.
Cassandra can be run on large server clusters and is capable of taking in very large amounts of data at a time, performing sorts and calling up relevant data quickly. It's an example of the new types of data handling systems that are powering large Web applications, particularly social networking sites which deal with hundreds of thousands or millions of users.
The implementers of Cassandra and other cloud-based systems, which include Hadoop, Google's Bigtable, MemcacheDB, Voldemort, CouchDB and MongoDB, are often referred to as the NoSQL movement. Their proponents consider traditional relational databases, which use the SQL data access language, unsuitable for the superlarge tasks that confront them.
In an interview posted last week on the MyNoSQL blog, King said Twitter wanted a system that could keep up with its growth, as tweets have gone from 2 million a day to over 50 million in 2009.
"We have a lot of data. The growth factor in that data is huge and the rate of growth is accelerating," he said in the blog posting.
King also said in the interview that Twitter sought a system with no single point of failure, which could execute highly scalable writes, and had a healthy open source community behind it.
Cassandra is an Apache Software Foundation project that originally came out of Facebook, which created it to manage its masses of data. The project recently moved out of first-year, incubator status at Apache to full project status and has an active developer group.
Jonathan Ellis is chair of the Cassandra project management committee at Apache, a role akin to general manager. In an interview Friday, he said an application can load Cassandra with data from a relational database, and Cassandra will work with that data as well as data from other sources. The implementers of "NoSQL" style systems don't necessarily rule out working with Oracle, IBM's DB2, MySQL, Sybase or Microsoft's SQL Server.
At the same time, Ellis wants the rebellious-sounding NoSQL name to keep meaning something distinct from a SQL-based database system. Its hard sound has been explained away by some as meaning "Not only SQL." But Ellis says nothing doing. "It has a combative connotation" and that's appropriate. "It's catchy. People remember it. It's bringing attention to the way you don't have to keep doing things the way relational databases have dictated" for 30 years now.
He went on to explain that where relational systems rely on strictly defined tables with a set number of columns per row, Cassandra can effortlessly expand the number of rows, or data items being grouped together in the system.
Cassandra can likewise expand without application programmer intervention across a server cluster. More hardware can be brought on line and Cassandra can activate itself on a new node, calling out to the cluster load balancer for work to do.
On the other hand, Cassandra doesn't do joins, where related information is brought together from multiple tables into a new table in response to an SQL query. It doesn't guarantee referential integrity, where every reference from one table to another is known to point at a record that actually exists. It also can't process transactions, with a guarantee that the transaction will either be completed or discarded, the way relational systems do, concedes Ellis.
Since big Web systems need to assimilate large masses of data and make it available, frequently in read-only fashion, systems like Cassandra concentrate on more immediate goals than the pristine data handling rules of relational systems. "Most Web applications do more reads than writes," said Ellis, which changes a key priority in a big online data handling system.
Relational databases with big jobs tend to be put on one piece of big, expensive hardware, such as an IBM mainframe or UltraSparc Unix server, because it's hard to run SQL systems on large clusters. Huge amounts of overhead are generated as the data is divided up and systems check constantly with each other on the integrity of the data being used.
Cassandra, on the other hand, takes to clusters like guppies to water. You can add more machines to a cluster running Cassandra without disrupting its operation, and soon the work is spread over a larger base, Ellis said.
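One common way for a clustered store to absorb new machines without disruption is consistent hashing, in which each key is owned by the nearest node on a hash ring. What follows is a minimal Python sketch of that idea, not Cassandra's actual code: the `Ring` class, the node names, the virtual-node count and the `user:` keys are all hypothetical, chosen only for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (SHA-256 is an illustrative choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: a key belongs to the first node clockwise."""

    def __init__(self, nodes=()):
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str, vnodes: int = 64) -> None:
        # Several virtual positions per node help even out the load.
        for i in range(vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owners[t] = node

    def owner(self, key: str) -> str:
        # First ring position at or after the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


def moved_keys(keys, before: Ring, after: Ring):
    """Return the keys whose owning node differs between two ring states."""
    return [k for k in keys if before.owner(k) != after.owner(k)]


# Hypothetical workload: 10,000 user keys on three nodes, then on four.
keys = [f"user:{n}" for n in range(10_000)]
three = Ring(["node-a", "node-b", "node-c"])
four = Ring(["node-a", "node-b", "node-c", "node-d"])

moved = moved_keys(keys, three, four)
fraction = len(moved) / len(keys)
print(f"{fraction:.1%} of keys moved to the new node")
```

In this toy model, the only keys that change hands are the ones the new node takes over, on the order of a quarter of them when a fourth node joins three; every other key stays where it was, which is why the cluster can keep serving requests while the work spreads out.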
That's why systems like Cassandra are gaining currency with Twitter and other social networking sites that deal with millions of users and masses of data. Yahoo, for example, uses a cluster of 4,000 servers to run Hadoop to index the results of a comprehensive Web crawl. The job still takes 73 hours, but it would take a lot longer if done by conventional relational database means.
Eric Evans, a co-worker of Ellis' at Rackspace in San Antonio, Texas, a managed service and cloud service provider, came up with the name "NoSQL," and Ellis hopes it continues to be used for Cassandra-type systems.
Social applications in the cloud may suddenly expand the amount of information they are collecting per user, as Facebook and Twitter frequently do. With a traditional database, that would mean a DBA would have to stop the database system, redefine the tables to add more columns and restart to collect the data.
With Cassandra, "You don't need to worry about a set number of rows. You can add hundreds, thousands or millions of new rows. It lets you have as many rows as you want," said Ellis.
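The contrast between a fixed table definition and a store that grows on demand can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Cassandra's API: `FixedTable`, `FlexibleStore` and the field names are hypothetical.

```python
class FixedTable:
    """Toy relational-style table: the column list is set when it is defined."""

    def __init__(self, columns):
        self.columns = tuple(columns)
        self.rows = []

    def insert(self, **values):
        # Any field outside the definition is rejected until the table
        # itself is redefined.
        unknown = set(values) - set(self.columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.rows.append(tuple(values.get(c) for c in self.columns))


class FlexibleStore:
    """Toy schemaless store: each row key maps to whatever fields it is given."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, **fields):
        # New rows and new fields are accepted without any redefinition.
        self.rows.setdefault(row_key, {}).update(fields)


# Hypothetical usage: the application starts collecting a new field per user.
table = FixedTable(["user_id", "name"])
table.insert(user_id=1, name="ann")
try:
    table.insert(user_id=2, name="bob", location="Austin")
    rejected = False
except ValueError:
    rejected = True   # the fixed table needs a schema change first

store = FlexibleStore()
store.insert("user:1", name="ann")
store.insert("user:2", name="bob", location="Austin")  # accepted as-is
```

The fixed table turns away the extra field until someone changes its definition, while the flexible store simply records it alongside the existing rows.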