Facebook Blames Outage On Database Failure
Programming chief says lengthy downtime was traced to an automated system designed to detect and fix error conditions.
Facebook officials said the lengthy outage that hit the site Thursday was the result of a glitch in the social media network's database software.
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," said Robert Johnson, Facebook's director of software engineering, in a blog post.
Johnson said software that's designed to detect and fix such errors backfired, compounding the original problem. "An automated system for verifying configuration values ended up causing much more damage than it fixed," said Johnson.
"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store," Johnson continued. "This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid," he said.
"We made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," said Johnson.
"As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover," said Johnson.
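The failure mode Johnson describes can be illustrated with a minimal Python sketch. It is not Facebook's code: the `ConfigClient` class, the `is_valid` check, the `count_queries` helper and the key name `limit` are all hypothetical, and the "database" is just an in-memory dictionary with a query counter. The sketch only shows why a bad value in the persistent store turns every read into a database query; it does not model the overloaded cluster or the resulting retries.

```python
from typing import Dict


def is_valid(value: str) -> bool:
    # Hypothetical validity rule: a config value must be a whole number.
    return value.isdigit()


class ConfigClient:
    """Hypothetical client that 'repairs' invalid cached config values."""

    def __init__(self, cache: Dict[str, str], store: Dict[str, str]) -> None:
        self.cache = cache      # fast, possibly stale copy
        self.store = store      # stands in for the persistent database
        self.db_queries = 0     # how many times the store was queried

    def read(self, key: str) -> str:
        value = self.cache.get(key)
        if value is not None and is_valid(value):
            return value
        # Cached value is missing or invalid: refetch from the persistent
        # store and overwrite the cache, as the automated system did.
        self.db_queries += 1
        fresh = self.store[key]
        self.cache[key] = fresh
        return fresh


def count_queries(store_value: str, reads: int) -> int:
    """Start with a corrupt cache entry and count store queries."""
    client = ConfigClient(cache={"limit": "corrupt"},
                          store={"limit": store_value})
    for _ in range(reads):
        client.read("limit")
    return client.db_queries


if __name__ == "__main__":
    # Transient cache problem, healthy store: one query repairs the cache.
    print(count_queries("100", 1000))
    # Invalid value in the store itself: every read queries the store.
    print(count_queries("oops", 1000))
```

With a healthy store, the first read repairs the cache and later reads never touch the database. With an invalid value in the store, the "repair" writes the same bad value back, so each read from each client becomes another query, which is the amplification Johnson describes.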
The glitch brought the world's most popular social networking site to a standstill for two and a half hours Thursday. Johnson called it "the worst outage we've had in over four years."
Johnson did not say whether the company planned to offer credits or other compensation to advertisers whose campaigns were offline during the outage.
In just a few years, Facebook has grown from founder Mark Zuckerberg's college project into a multibillion-dollar global software empire that is beginning to rival Google and Microsoft on some fronts. As Facebook enters the tech industry's big league, customers and users have become less forgiving of outages and other glitches.
Johnson seemed to recognize that in his blog post. "We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously," he said.