The social networking site opts for the open source Cassandra data management system in what's becoming a not-uncommon move.
He went on to explain that where relational systems rely on strictly defined tables with a set number of columns per row, Cassandra can effortlessly expand the number of rows, or data items being grouped together in the system.
Cassandra can likewise expand without application programmer intervention across a server cluster. More hardware can be brought on line and Cassandra can activate itself on a new node, calling out to the cluster load balancer for work to do.
On the other hand, Cassandra doesn't do joins, where related information is brought together from multiple tables into a new table in response to an SQL query. It doesn't guarantee referential integrity, where the user knows the data being used reflects the latest updates. It also can't process transactions, with a guarantee that the transaction will either be completed or discarded, the way relational systems do, concedes Ellis.
Since big Web systems need to assimilate large masses of data and make it available, frequently in read only fashion, systems like Cassandra concentrate on more immediate goals than the pristine data handling rules of relational systems. "Most Web applications do more reads than writes," said Ellis, which changes a key priority in a big online data handling system.
Relational databases with big jobs tend to be put on one piece of big, expensive hardware, such as an IBM mainframe or UltraSparc Unix server, because it's hard to run SQL systems on large clusters. Huge amounts of overhead are generated as the data is divided up and systems check constantly with each other on the integrity of the data being used.
Cassandra, on the other hand, takes to clusters like guppies to water. You can add more machines to a cluster running Cassandra without disrupting its operation, and soon the work is spread over a larger base, Ellis said.
That's why systems like Cassandra are gaining currency with Twitter and other social networking sites that deal with millions of users and masses of data. Yahoo, for example, uses a cluster of 4,000 servers to run Hadoop to index the results of a comprehensive Web crawl. The job still takes 73 hours, but it would take a lot longer if done by conventional relation database means.
Eric Evans, a co-worker of Ellis' at Rackspace in San Antonio, Texas, a managed service and cloud service provider, came up with the name "NoSQL," and Ellis hopes it continues to be used for the Cassandra type systems.
Social applications in the cloud may suddenly expand the amount of information they are collecting per user, as Facebook and Twitter frequently do. With a traditional database, that would mean a DBA would have to stop the database system, redefine the tables to add more columns and restart to collect the data.
With Cassandra, "You don't need to worry about a set number of rows You can add hundreds, thousands or millions of new rows. It lets you have as many rows as you want," said Ellis.