A popular new movement aims to take SQL database management systems out of the stack. But when is this emerging approach right for you?

InformationWeek Staff, Contributor

October 10, 2010

11 Min Read

For more than 20 years, SQL has been the unquestioned ruler of the database world. But now it's facing its greatest challenge.

Many of the largest and fastest-growing databases in the world belong to Web search engines and social networks (such as Google, Facebook and Twitter) or other internet companies (such as Zynga, maker of the games Farmville and Mafia Wars). Those companies routinely reject high-end commercial database management systems (DBMS) from the likes of Oracle or IBM. So do smaller companies in similar industries, who can only dream of running databases that big. Their reasons commonly include:

  • They don't want to pay license fees, and indeed have a strong bias toward open source software.

  • They don't need high-end features.

  • In fact, they don't need most SQL functionality.

  • They aren't that excited about writing SQL anyway (nor generating SQL via, say, an object-relational mapping layer).

  • License fees aside, they believe commercial database architectures and features get in the way of scalability.

In short, some of the largest and most innovative applications in the world are being built by people who don't see much value in DBMS in general, nor in high-end SQL DBMS in particular.

This rejection of proven products may sound like madness at first, but it turns out to make a certain sense. Databases are being built for single applications, and developers are optimizing performance, networking considerations, and/or software license fees at the expense of application extensibility. In such scenarios, DBMS lose their traditional roles as powerful DML (Data Manipulation Language) interpreters; rather, application programmers have to code every bit of data manipulation smarts themselves. Also falling by the wayside are most DBMS performance optimizations for different classes of queries. All that data management subsystems are used for is to read and write small amounts of data in a very rapid manner, and then to back up, replicate, and otherwise manage the already-stored data.

Consider the following use cases:

  • A large Web site exists primarily to accept and serve back photographs, songs, small snippets of text, and the contents of simple database records. Almost everything is keyed on user IDs. Throughput is massive. (Think of social networking or photo sharing sites.)

  • Most of what happens in the database is that counters keep getting incremented, and not for transactions in which real money changes hands. So write locks are terrible bottlenecks, and transaction integrity -- while nice-to-have -- is not essential. (Think of social gaming or article sharing sites.)

  • A central server coordinates application versions and some amount of data across a broad number of occasionally-connected instances, perhaps for a broad variety of applications. (Mobile computing and social gaming sites need this sort of functionality.)

E-commerce aside, these use cases cover a large fraction of what's going on in Internet innovation. And while they all can be satisfied with traditional relational DBMS (which, after all, can be used to do pretty much anything), they all fit the RDBMS-unfriendly template: joins are inessential or secondary, transaction semantics are inessential or secondary, and two-phase commit is an overly restrictive way of replicating data.

Assumptions and Scenarios

The challenges to traditional database management ideas are piling up. For years, world-class applications have been built using MySQL, hardly the most robust of DBMS. These applications haven't used more than a tiny fraction of MySQL's capabilities. Indeed, the biggest systems have relied on "sharding" MySQL -- putting different rows of a MySQL table onto different machines, and relying on application logic to know which machine to access. If those applications use any joins at all, they're only ones that will never cause data to move from one node to another as part of the join resolution. The same applications often also rely on an in-memory key-value store called memcached. (More on the "key-value" data model below.)
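The sharding logic described above can be sketched in a few lines. The host names and hash choice here are purely illustrative assumptions, not any particular site's setup:

```python
import hashlib

# Hypothetical shard map -- four MySQL hosts, names invented for illustration.
SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(user_id: str) -> str:
    """Hash the key and pick a shard, so the application -- not the
    DBMS -- knows which machine holds a given user's rows."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every lookup for the same user ID routes to the same machine.
primary = shard_for("user:1843")
```

Queries that would pull rows from two different shards are simply never written; each statement is scoped to the one machine `shard_for` returns, which is why cross-node joins disappear from these applications.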

And as if all that weren't already far enough from the relational DBMS mainstream for some developers' tastes, it's beginning to be superseded by a popular new movement called "NoSQL," which aspires to get SQL-based DBMS out of the stack entirely.

Before going further, let's clear up one point: "NoSQL database" is not the same as "non-SQL DBMS." True, NoSQL technically stands for "Not Only SQL"; but taking that to an extreme (which some marketers do) is misleading. After all, non-SQL DBMS have flourished literally since the invention of database management systems more than 40 years ago. Some leading pre-relational mainframe DBMS -- notably IMS and Adabas -- survive to this day. Small enterprise databases, built on Microsoft Access or Apple FileMaker, may have nothing to do with SQL. Though medium-sized enterprises usually run on relational DBMS, InterSystems Caché and various "multi-value" systems also have had considerable success. Even large enterprises often use special-purpose systems, such as multidimensional "OLAP" servers, but these don't have much to do with the NoSQL market.

Rather, NoSQL DBMS start from three design premises:

  • Transaction semantics are unimportant, and locking is downright annoying.

  • Joins are also unimportant, especially joins of any complexity.

  • There are some benefits to having a DBMS even so.

NoSQL DBMS further incorporate one or more of three assumptions:

  • The database will be big enough that it should be scaled across multiple servers.

  • The application should run well if the database is replicated across multiple geographically distributed data centers, even if the connection between them is temporarily lost.

  • The database should run well if the database is replicated across a host server and a bunch of occasionally-connected mobile devices.

In addition, NoSQL advocates commonly favor the idea that a database should have no fixed schema, other than whatever emerges as a byproduct of the application-writing process.

"Not Only SQL" is hardly the only terminological problem around NoSQL. Much of the innovation in the NoSQL arena revolves around "consistency," but that word does not mean the same thing as it does in ACID (Atomicity, Consistency, Isolation, Durability). If anything, consistency is closer to "durability," in that it refers to the desirable property of getting a correct answer back from the DBMS even in a condition of (partial) failure. In essence, there are three reasonable approaches to consistency in a replicated data scenario:

1. Traditional/near-perfect consistency, in which processing stops until the system is assured that an update has propagated to all replicas. (This is typically enforced via a two-phase commit protocol.) The downside to this model, of course, is that a single node failure can bring at least part of the system to a halt.
2. Eventual consistency, in which inaccurate reads are permissible just so long as the data is synchronized "eventually." With eventual consistency, the network is rarely a bottleneck at all -- but data accuracy may be less than ideal.
3. Read-your-writes (RYW) consistency, in which data from any single write is guaranteed to be read accurately, even in the face of a small number of network outages or node failures. However, a sequence of errors can conceivably produce inaccurate reads in ways that perfect consistency would forbid.
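The two-phase commit protocol behind option 1 can be sketched with toy in-memory replicas. The `prepare`/`commit`/`abort` method names are illustrative, not any specific product's API:

```python
class Replica:
    """Toy in-memory replica; a real one would write a durable log."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.data = []
    def prepare(self, update):
        return self.healthy          # vote yes only if reachable
    def commit(self, update):
        self.data.append(update)
    def abort(self, update):
        pass

def two_phase_commit(replicas, update):
    # Phase 1: every replica must vote yes before anything is made durable.
    if all(r.prepare(update) for r in replicas):
        for r in replicas:           # Phase 2: apply everywhere
            r.commit(update)
        return True
    for r in replicas:
        r.abort(update)
    return False

# A single unreachable node blocks the whole update -- exactly the
# scalability complaint leveled at this model.
ok = two_phase_commit([Replica(), Replica()], "x=1")
blocked = two_phase_commit([Replica(), Replica(healthy=False)], "x=2")
```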

Some systems allow tuning (via configuration) of which consistency model is supported; others are more locked into a particular choice.
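One common tuning knob, used for example in Dynamo-style systems, is replica-count arithmetic: with N replicas, a write acknowledged by W of them, and a read polling R of them, the read and write sets must overlap whenever R + W > N. A minimal sketch of that rule:

```python
def is_read_your_writes(n: int, r: int, w: int) -> bool:
    """With n replicas, a write acknowledged by w of them and a read
    polling r of them must overlap in at least one replica whenever
    r + w > n -- that overlap is what guarantees reading your write."""
    return r + w > n

# Three replicas: quorum reads and writes (2 each) give RYW consistency...
quorum = is_read_your_writes(3, 2, 2)
# ...while single-replica reads and writes are only eventually consistent.
loose = is_read_your_writes(3, 1, 1)
```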

The theory behind all this is Eric Brewer's CAP Theorem, short for Consistency, Availability, and Partition Tolerance; the point is that you can't have all three in the same system. But be warned -- "Availability" and "Partition" are used in unconventional word-senses too.

If Not SQL, Then What?

A number of strategies have been used to address NoSQL needs, most of which can be roughly divided into four groups:

  • Simple key-value store.

  • Quasi-tabular.

  • Fully SQL/tabular (!).

  • Document/object.

DBMS based on graph data models are also sometimes suggested to be part of NoSQL, as are the file systems that underlie many MapReduce implementations. But as a general rule, those data models are most effective for analytic use cases somewhat apart from the NoSQL mainstream.

A key-value store is like a relational DBMS in which there only can be a single, three-column entity-attribute-value table, and in which you can't do self-joins. (In that analogy, the key part of the key-value pair may be thought of as an entity-attribute composite.) Thus, any conception of "object" has to live in the application logic; the data management software is little more than an intelligent storage system. Key-value stores may have modest performance advantages over the more efficient implementations of other models, but otherwise there's little advantage to using a key-value store. (One exception: You might want to use a persistent data store -- such as Membase from Membase, Inc. (the former NorthScale) -- as the target for porting an existing memcached-based application.) Most key-value store products, Membase included, have, or are soon planned to have, alternative interfaces with at least somewhat richer data models.
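The entity-attribute-value analogy can be made concrete with an in-memory sketch. A real key-value store would persist opaque bytes across processes, but the shape of the interface is the same:

```python
# A key-value store viewed as one entity-attribute-value table:
# the key is an (entity, attribute) composite; the value is opaque.
store = {}

def put(entity, attribute, value):
    store[(entity, attribute)] = value

def get(entity, attribute):
    return store.get((entity, attribute))

put("user:42", "name", "Alice")
put("user:42", "photo", b"...jpeg bytes...")

# Any notion of "the user:42 object" lives in application code,
# not in the store -- the application reassembles it attribute by attribute.
profile = {attr: get("user:42", attr) for attr in ("name", "photo")}
```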

More powerful are the quasi-tabular systems such as Cassandra, HBase, or (the original one) Google BigTable. In these, you can store what are essentially rows without worrying about whether each row has values for the same set of columns. Thus, a quasi-tabular database is like a relational database -- albeit one with lots of NULL values -- but with its schema controlled by the application program rather than a DBA.
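A toy version of the quasi-tabular model shows how the "schema" is just whatever columns each application write happens to supply:

```python
# Quasi-tabular model: rows in the same "table" need not share columns,
# so the effective schema is whatever the application chooses to write.
table = {
    "row1": {"name": "Alice", "city": "Boston"},
    "row2": {"name": "Bob", "followers": 1200},   # no "city" column here
}

def get_cell(row_key, column):
    # Missing columns are simply absent -- there are no NULL placeholders
    # taking up space, unlike a sparse relational table.
    return table.get(row_key, {}).get(column)
```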

The most prominent NoSQL implementations at big-name Web companies are of Cassandra or HBase, with Facebook, Twitter, Digg, StumbleUpon, and many others having joined the bandwagon. Both Cassandra and HBase are open source projects; neither is deemed to yet have reached its 1.0 release. But they have significant production installations even so. The go-to vendors for Cassandra and HBase are Riptano and Hadoop specialist Cloudera, respectively. (HBase is closely tied to the Hadoop MapReduce project.)

There's also a new generation of SQL-based systems that seem to overcome some of the NoSQL community's objections to conventional SQL DBMS, including Schooner, Clustrix, dbShards, VoltDB, and Akiban. These often come in key-value flavors as well, with a performance advantage of less than 2:1 versus the SQL implementations. Schooner somewhat aside, most of these vendors are still in early days in terms of getting actual customers.

Finally, there are the NoSQL document/object stores, most notably CouchDB (which boasts a Lotus Notes-like replication model) and MongoDB (which has a standard NoSQL laundry list of replication options). These directly store JSON (JavaScript Object Notation) objects -- collections of name-value pairs. CouchDB and MongoDB also both have ways of indexing, querying, and/or updating individual "fields" within the document schema. Both have considerable numbers of users, generally for applications that don't seem to demand large data volumes or high throughput. The go-to vendor for CouchDB is CouchOne or, if you have a larger database, Cloudant. The company behind MongoDB is 10gen.

Should You Adopt NoSQL Technology?

Naturally, the answer to the question above depends on a variety of factors. Key considerations include:

  • Immaturity. The very term "NoSQL" has only been around since 2009. Most NoSQL "products" are open source projects backed by companies with fewer than 20 employees.

  • Open source. Many NoSQL adopters are constrained, by money or ideology, to avoid closed-source products. Conversely, it is difficult to deal with NoSQL products' immaturity unless you're comfortable with the rough-and-tumble of open source software development.

  • Internet orientation. A large fraction of initial NoSQL implementations are for Web or other Internet projects, such as mobile applications.

  • Schema mutability. If you'd like to have different schemas for different parts of the same "table," NoSQL may be for you. If you like the database reusability guarantees of the relational model, NoSQL may be a poor fit.

  • Project size. For a large (and suitable) project, the advantages of NoSQL technology may be large enough to outweigh its disadvantages. For a small, ultimately disposable project, the disadvantages of NoSQL may be minor. In between those extremes, you may be better off with SQL.

Remember, too, that the choice of SQL DBMS goes far beyond the "Big Three" or "Big Four" of Oracle, IBM DB2, Microsoft SQL Server, and SAP/Sybase Adaptive Server Anywhere. MySQL, PostgreSQL, and other mid-range SQL DBMS -- open source or otherwise -- might meet your needs. So might some of the scale-out-oriented startups cited above. Or if your needs are more analytic, there's a whole range of powerful and cost-effective specialized products, from vendors such as Aster Data, EMC/Greenplum, Teradata, and others.

The bottom line is that for cutting-edge applications -- and not just Internet-centric ones -- NoSQL technology can make sense. But in many use cases, NoSQL's drawbacks are likely to outweigh advantages.

Curt Monash runs Monash Research, which provides strategic, analysis-based advice to users and vendors of advanced information technology. He also writes the blogs DBMS2, Text Technologies, and Strategic Messaging. Write him at [email protected]
