9/24/2010
09:31 AM

Facebook Blames Outage On Database Failure

Programming chief says lengthy downtime was traced to an automated system designed to detect and fix error conditions.

Facebook officials said the lengthy outage that hit the site Thursday was the result of a glitch in the social media network's database software.

"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," said Robert Johnson, Facebook's director of software engineering, in a blog post.

Johnson said software that's designed to detect and fix such errors backfired, compounding the original problem. "An automated system for verifying configuration values ended up causing much more damage than it fixed," said Johnson.

"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store," Johnson continued. "This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid," he said.
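
The check-and-replace behavior Johnson describes can be pictured with a minimal Python sketch. This is an illustration only, not Facebook's code: the `Client` class, the `is_valid` rule, and the `max_connections` key are all hypothetical names chosen for the example.

```python
def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


class Client:
    """Toy client that repairs invalid cached config from a persistent store."""

    def __init__(self, cache, store):
        self.cache = cache          # fast, possibly stale copy (a plain dict here)
        self.store = store          # persistent copy (also a plain dict here)
        self.store_queries = 0      # how many times we had to hit the store

    def get_config(self, key):
        value = self.cache.get(key)
        if is_valid(value):
            return value
        # Cached value looks invalid: assume a transient cache problem and
        # refresh it from the persistent store.
        self.store_queries += 1
        fresh = self.store[key]
        self.cache[key] = fresh
        return fresh


# Transient cache problem: one store query repairs the cache.
client = Client(cache={"max_connections": -1}, store={"max_connections": 100})
client.get_config("max_connections")
client.get_config("max_connections")
print(client.store_queries)  # 1

# Invalid persistent value: every read goes back to the store.
client = Client(cache={"max_connections": -1}, store={"max_connections": -1})
for _ in range(5):
    client.get_config("max_connections")
print(client.store_queries)  # 5
```

In the second case the "repair" writes the same invalid value back into the cache, so the client never stops querying the store.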

"We made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," said Johnson.

"As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover," said Johnson.
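
A toy model can show why such a loop does not settle on its own. The Python sketch below assumes, purely for illustration, that each failed request is retried on the next tick in addition to the new requests arriving; the `simulate` function and all of the numbers are hypothetical and are not taken from Johnson's post.

```python
def simulate(new_per_tick, capacity, ticks, retries_per_failure=1):
    """Return the request load seen by the database cluster on each tick.

    Assumed model: requests beyond `capacity` fail, and each failure
    generates `retries_per_failure` extra requests on the next tick.
    """
    load = new_per_tick
    history = []
    for _ in range(ticks):
        history.append(load)
        failed = max(0, load - capacity)
        load = new_per_tick + retries_per_failure * failed
    return history


# Demand below capacity: load stays flat.
print(simulate(new_per_tick=50, capacity=100, ticks=4))   # [50, 50, 50, 50]

# Demand above capacity: failures feed back in and load keeps climbing.
print(simulate(new_per_tick=150, capacity=100, ticks=4))  # [150, 200, 250, 300]
```

In this simplified model, once demand exceeds capacity the load only grows, which matches the article's point that the databases could not recover while the requests kept coming.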

The glitch brought the world's most popular social networking site to a standstill for two and one-half hours Thursday. Johnson called it "the worst outage we've had in over four years."

Johnson did not state whether the company planned to offer credits or other compensation to advertisers whose campaigns were offline during the outage.

Facebook in just a few years has grown from founder Mark Zuckerberg's college project into a multibillion-dollar global software empire that's beginning to rival Google and Microsoft on some fronts. As the company enters the tech industry's big league, customers and users have become less forgiving of outages and other glitches.

Johnson seemed to recognize that in his blog post. "We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously," he said.
