Conventional wisdom says distributed databases are the future. You run a bunch of servers in a bunch of locations, scaling horizontally, and if any individual instance fails, it doesn't take your business services with it. But let's face it. Distributed database servers can go south in many multifaceted ways -- all of them horribly complicated. And it's not as if we protect our databases all that well, even when they're not scattered about. Now we're supposed to trust new database products and their cloud-based providers, and maybe even try new approaches like building failover logic into applications?
That's the million-dollar question for IT teams with perfectly functional conventional relational database management systems (RDBMS), many of them watching the rise in distributed databases with trepidation. It seems like every day we read about the awesomeness of NoSQL or NewSQL or [insert trendy database here] and how the crushing flood of big data will soon make your trustworthy relational database irrelevant. We can hear the internal conversation: "What planet are these guys on? Sure, some special applications need new database architectures. But our mission-critical applications run on a time-tested relational setup -- Oracle, SQL Server, MySQL, sometimes replicating to an online slave -- and deliver pretty good performance, thanks very much. And dammit, we need joins."
We hear you. Most IT teams have made some operational improvements over the years, maybe abandoning mixing SQL queries with code and HTML in PHP or ASP in favor of MVC frameworks and object-relational mapping. So it's OK that your databases look similar to how they did 10 years ago, right?
Only if your users' and customers' expectations are also frozen in time. Twenty years ago, financial institutions would shut down for three hours each night, the distributed database expert Eric Brewer, developer of the CAP theorem (a.k.a. Brewer theorem), said in a NoSQL discussion that's well worth a listen. Of course, credit cards were taken via those mechanical carbon-copy metal sliders, too. Today, how often do Twitter, Facebook, LinkedIn, Amazon, and Google have scheduled downtime? Or any downtime, for that matter?
That's the standard to which you're being held.
As tempting as it may be to cling to your current setup, either you need distributed databases in your architecture now, or you will soon enough. Don't think you can just jump into the NoSQL game a few weeks before launching a service, either. Bad choices have sunk entire companies (see: Friendster). Nor can you delegate database selection to your developers. IT leaders need to huddle with the business and assign consistency and availability requirements to all critical data before making a move.
Still not convinced? We'll make the case, but first, let's lay some groundwork. In its simplest form, a distributed database is one that stretches across multiple nodes. In practice, we exclude conventional relational database management systems that are distributed only in terms of having a master and one or more slaves (though we would consider various multimaster RDBMS setups as being distributed). The reason we're drawing a hard line between systems with a single write node and those with multiple write nodes is because the rise of distributed databases is about the mandate to scale writes, and you will quickly hit limits on scaling with only one node. It's all about availability.
Despite the importance of availability, IT teams typically run web applications on a single master (best case: a simple master-slave setup) with an expected hardware lifespan of three or so years. If they don't get hit with a drive failure or enough traffic to cripple their architecture, they think they're smart. They're wrong. Bottom line: Conventional relational database deployments are not good at being available all the time. There are plenty of reasons for that, and we'll explore them in the context of what needs to change. But suffice it to say that the rise in distributed databases comes from a concerted attempt to make systems more available. And fundamentally, you make systems more available by having a lot of different sets of hardware responding to requests, so that failures are not fatal. Availability is an easily solvable problem for systems that are read-only; deploy lots of data replicas, and you're done. But when you need to write to your system on a regular basis, things get complicated for three main reasons.
First, if you live in a world with partitions -- meaning some nodes in a distributed system may get cut off -- you can have either strict availability or strict consistency, but not both (that's the CAP theorem). Fact: We live in a world with partitions -- systems go down and timeouts happen all the time. When system A can't talk to system B, and both are told separately to update the same record with different information, either the systems stop writing anything (so availability suffers), or they become inconsistent (at least for a little while).
Second, the conversation has moved from the initial CAP theorem to a more nuanced view of the availability/consistency spectrum. Brewer's presentation gives an overview, but essentially, databases that focus on availability are getting more consistent, and databases that focus on consistency are getting more available, and hopefully we'll meet in the middle one day. But we're not there yet. The best way to improve availability and consistency now is to think about how systems should run in light of a timeout and how systems should heal after timeouts end. We're describing partitions here in terms of timeouts because it's not always possible to know whether you have a partition, but you can know whether you've had a timeout.
When a timeout happens, the system has to decide whether to take some action and fix problems later or throw away the request. Note that retrying after a timeout is just delaying making a decision. If the decision is to fix problems later, the system also needs to decide when -- at the last minute or as soon as possible?
Finally, there's the question of when (or if) you tell the user what's transpired.
Real-world business processes tend to be highly available and eventually consistent (as opposed to less available and strictly consistent). For example, your inventory system indicates that you have one widget left. A customer buys it on your website, but when you go to fulfill the order, that last widget is broken. So you now have to unwind the transaction because your inventory system was inconsistent with reality.
Some of today's NoSQL and NewSQL distributed databases let you tune parameters around availability and consistency. Others are more fixed. But these products are so new that the marketing claims are very often different from performance reality, Kyle Kingsbury finds in his Call Me Maybe project.
Security presents the third big challenge for distributed databases. Relational databases don't exactly have a great history here -- problems range from SQL injection to limited granularity to ensure that users are allowed to modify only certain fields at the database level. Spreading your database across multiple servers in multiple datacenters will make security even more difficult, because there will be more avenues -- physically and virtually -- to compromise those servers.
Distributed databases are likely to be less secure than conventional relational database environments for a few other reasons, as well. These products are still fairly immature, and the development focus tends to be more on new functionality and less on security niceties like adding granularity in access or having security experts review code to spot potential problems. New versions come out constantly, and each one increases the likelihood that developers have added a security hole. These databases aren't battle tested, and if there's still an element of security by obscurity, it won't last.
By now you're probably thinking, "Thanks but no thanks" to distributed databases. But hear us out. Availability is important enough that it's worth powering through the problems. Let's look at some real-world examples to show why. We'll also take a moment to discuss the wisdom of essentially making your application part of your database -- a tactic that's appealing but ultimately risky.
InformationWeek Tech Digest, Nov. 10, 2014Just 30% of respondents to our new survey say their companies are very or extremely effective at identifying critical data and analyzing it to make decisions, down from 42% in 2013. What gives?