It's an exciting time in the database world. We're casting aside the relational database management system shackles for NoSQL (and "NewSQL," but I'll just use NoSQL as a term for both) systems that let us achieve better availability and scalability by relaxing data consistency requirements. That is, NoSQL systems are built to scale horizontally -- so you can run lots and lots of different servers, minimizing the impact of any of them going down -- and to handle the complexities that arise when you spread data across lots of servers.
Well, at least that's what the marketing literature says.
The problem is that it's very difficult to validate the claims of NoSQL database vendors with any kind of rigorous testing. Sure, it's easy enough to set up a NoSQL database server, unplug one or more machines from the network, and verify that the vendor's claims are accurate for that use case. But in the real world, the number of ways that machines can be separated from one another across a network (otherwise known as partitioning) means there are far more potential snafus than just a complete disconnect. Think very slow response times (when do you timeout?), servers that can only send packets, servers that can only receive packets, and situations where servers cannot tell whether the network is down or not. Yet, in practice, when selecting a vendor, the only test most IT teams can do to verify that a database server is fault-tolerant for its infrastructure is the "unplug" test. If the system passes, we assume the rest of the marketing claims are correct, and we move forward. Maybe that's a good assumption, maybe not.
Which brings us to Kyle Kingsbury, also known as "Aphyr." In May, Kingsbury embarked on a Carly Rae Jepsen-inspired quest to test more rigorously how various databases handle different types of partitions. Kingsbury's Call Me Maybe project ("Hey I just met you / Our network's crazy / But here's my data / So store it maybe") revolves around Jepsen, an open source project that he built in his spare time over the course of a few hundred hours. Jepsen makes testing different types of partitions much easier. A few points: Kingsbury customizes analysis to each database's documentation and marketing claims about how it's supposed to work, and he runs everything on the same physical machine so nodes have perfectly synchronized clocks (DBs run in virtualized containers so that they look like separate hosts). And he doesn't test really nasty situations, like partially connected networks, where each server can see only some of the other servers.
The focus of Kingsbury's analyses with Jepsen is to evaluate how strongly consistent these databases can be -- that is, if you set up a network partition before data can be written everywhere, and then you have a network recovery sometime later, what's happened to your data? And for each of the database servers he's tested so far (PostgreSQL, Redis, MongoDB and Riak have been written up; Cassandra, Kafka, NuoDB and ZooKeeper are in the works and will be presented at Strange Loop), he's identified behavior that is unexpected, based on documentation and marketing. From finding bugs (with a particular configuration, MongoDB treated network errors as successful acknowledgements, which 10gen quickly fixed) to design flaws (Redis' Sentinel protocol can't lead to strong consistency because it's asynchronous) to the fact that default configurations are rarely a good choice, Kingsbury shows that we can't rely on vendor's marketing statements or product documentation, or even a simple "pull the plug" test.
What that means for IT is that, first and foremost, you need to test the application design and database software you're planning to use before you commit. Draw the topology of your application, and then ask questions about what happens as different types of network partitions come into play. The Jepsen project may help you do much more thorough testing than you could afford to do on your own.
Clearly, we're moving past the period where an application lives on a single database server; instead, the disparate data that drives a given application will live in different databases. That said, I reached out to Kingsbury to ask if he has guidance for organizations looking at different NoSQL options, beyond his blog posts. His response: Decide what objectives you have for your application, and make a database selection based on API, performance and consistency requirements. He also says that, for most applications, the best option is likely to have both strongly consistent data stores and strongly available data stores, and be clear on which is best suited for any particular data set.
Seven more questions Kingsbury recommends IT ask about database architectures:
-- What if we cut the network here?
-- What if we isolate a master?
-- What if there's no connected majority?
-- What if there's asymmetric connectivity?
-- How about intermittent connectivity?
-- What if clocks are unsynchronized?
-- What if a node pauses for a few minutes and then comes back?
He also has some personal preferences: Redis for caching, Cassandra for very large clusters and high-volume writes, Hadoop for extremely large amounts of data where slow query time is OK, Riak as an AP key-value store where conflicting versions of an object can be resolved, ZooKeeper for small pieces of CP state data, and PostgreSQL for relational data. (Here's an explanation of CP vs. AP, based on the CAP theorem.)
I have two other recommendations. First, Kingsbury and Peter Bailis, a graduate student at UC Berkeley, have co-written an excellent and detailed overview of network partitions, titled "The Network Is Reliable," that provides a tremendous amount of evidence that many networks are anything but reliable. And if you'd like a shorter -- but still technical -- overview of Jepsen and Kingsbury's analyses, check out "Jepsen: Testing the Partition Tolerance of PostgreSQL, Redis, MongoDB and Riak."
Ultimately, the problem isn't that databases are fallible and have bugs. We should expect that. It's also not that you're going to lose data. This too is expected; you just need to know how much and why and communicate that to the data owners. The problem is that we're building mission-critical applications using distributed databases and simply assuming that those databases will work as advertised. That is an incorrect and dangerous assumption: Distributed systems don't fail as cleanly as single-node systems, so we need to ask more questions than just, "What happens if this node fails?" Find out, maybe.