January 14, 2014
Deadly Downtime: The Worst Network Outages Of 2013
Dropbox says it knows how to avoid the type of outage that struck its service Friday: In the future, it will check the state of its servers before applying updates to them.
That should prevent an operating system update from being applied to a running production server, which typically takes the server down. Enterprise users who rely on Dropbox for offsite storage already understand the principle. Some may be wondering whether Dropbox has the operational smarts to be relied upon for the long term.
Dropbox was upgrading some servers' operating systems Friday evening during "a routine maintenance episode" when a buggy script applied some of the updates to live production servers, making the effort anything but routine. Customers lost access to the service for about 2.5 hours, and some features remained down for much of the weekend.
Dropbox uses thousands of database servers to track pictures, documents, and other user data. Each database system includes a master database server and two slaves, an approach that leaves two copies of the data intact in case of a server hardware failure. The maintenance script appears to have triggered operating system reinstalls on database servers that were still running.
"A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down," wrote Akhil Gupta, head of infrastructure, in a post-mortem blog Sunday.
[Some cloud customers are getting fed up with outages. See Amazon Cloud Outage Causes Customer To Leave.]
Dropbox went off the air abruptly Friday between 5:30 and 6:00 p.m. Pacific time. The site remained dark for roughly two hours before reappearing at 8:00 p.m., according to user observations posted to Twitter and other sites. Users were able to log in again starting about 8:30 p.m. PT.
It wasn't clear from Gupta's post mortem how many servers had been directly affected; at one point he referred to "a handful." Gupta assured customers, "Your files were never at risk during the outage. These databases do not contain file data. We use them to provide some of our features (for example, photo album sharing, camera uploads, and some API features)."
On the other hand, operation of some of the paired master/slave database systems appears to have been entirely lost, something a cloud operator tries at all times to avoid. Normally, if a master is lost, the affected database is taken offline just long enough for one of its two slaves to be promoted to master and for a third copy of the data to be re-created from the survivors.
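In rough terms, that normal failover path might be sketched as follows. This is an illustration of the general master/slave recovery pattern only; the class and names here are hypothetical, not Dropbox's actual tooling.

```python
# Illustrative sketch of master/slave failover: when a master is lost,
# promote a surviving slave and clone a replacement so three copies of
# the data exist again. If all replicas are gone, the only option left
# is restoring from backup -- which is what Dropbox had to do.

class ReplicaSet:
    """One master plus slave copies of a database shard."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def handle_master_failure(self):
        if not self.slaves:
            # Both slaves lost too: nothing left to promote.
            raise RuntimeError("all replicas lost; restore from backup")
        # Promote the most up-to-date slave to master...
        self.master = self.slaves.pop(0)
        # ...and clone a fresh replica from a surviving copy so the set
        # returns to one master and two slaves.
        self.slaves.append(f"clone-of-{self.master}")
        return self.master


rs = ReplicaSet("db1-master", ["db1-slave-a", "db1-slave-b"])
new_master = rs.handle_master_failure()  # promotes db1-slave-a
```

The point of the pattern is that recovery needs only a promotion and a copy, not a restore from backup, as long as at least one replica survives.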
Figure 2: (Source: Wikipedia.)
Gupta explained in the blog, "To restore service as fast as possible, we performed the recovery from our backups." This suggests that Dropbox had to load stored copies of database systems to get its production systems running again.
"We were able to restore most functionality within 3 hours," he wrote, "but the large size of some of our databases slowed recovery, and it took until 4:40 p.m. PT [Sunday] for core service to fully return," Gupta wrote. This was at least 46 hours and 40 minutes after the outage began. Dropbox Photo Lab service was still being worked on after 48 hours.
Two-and-a-half hours into the outage, Dropbox responded to rumors and denied that its site had been hacked. At 8:30 p.m. Friday, the company tweeted: "Dropbox site is back up! Claims of leaked user info are a hoax. The outage was caused during internal maintenance. Thanks for your patience!"
One Twitter user agreed: "Dropbox not hacked, just stupid."
Gupta's post mortem took forthright responsibility for the outage, admitting Dropbox caused it with the faulty operating system upgrade script. It reassured users about their data, while explaining why it had taken so long to bring all services back online. But the fact that master/slave systems seem to have gone down together in "a routine maintenance episode" is not fully explained. If the operating system upgrade were staged so that only one of three database servers was changed at a time, two systems would have remained intact and recovery would have been faster.
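The staging discipline that paragraph implies can be sketched simply: upgrade one member of each three-server set at a time, verifying health before moving on, so a single failed step never touches more than one copy of the data. The function names below are illustrative assumptions, not anything from Dropbox's post mortem.

```python
# Hypothetical sketch of a staged (rolling) upgrade across one
# master/two-slave replica set. At most one server is in flight at a
# time, so two intact copies of the data always survive a failed step.

def upgrade_replica_set(servers, do_upgrade, is_healthy):
    """Upgrade replica servers one at a time.

    `do_upgrade` applies the OS upgrade to one server; `is_healthy`
    confirms the server rejoined the set before the next is touched.
    Returns (servers upgraded successfully, first server that failed).
    """
    upgraded = []
    for server in servers:
        do_upgrade(server)
        if not is_healthy(server):
            # Stop immediately: the remaining members are untouched,
            # so at least two copies of the data are still intact.
            return upgraded, server
        upgraded.append(server)
    return upgraded, None


# Example: the second server fails its post-upgrade health check, and
# the rollout halts before touching the third.
done, failed = upgrade_replica_set(
    ["master", "slave-a", "slave-b"],
    do_upgrade=lambda s: None,        # stand-in for the real upgrade
    is_healthy=lambda s: s != "slave-a",
)
```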
Gupta said Dropbox has learned from the incident. The maintenance process already checked whether a running server's data was in active use before applying an update, a safeguard that should have protected the production servers. The post mortem didn't explain why that check failed, but Gupta said Dropbox servers will receive an added layer of protection.
In the future, servers being updated will be called to verify their state before executing an incoming update command. "This enables machines that self-identify as running critical processes to refuse potentially destructive operations," he wrote.
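The safeguard Gupta describes amounts to a local pre-flight check: the machine itself verifies its state and refuses a destructive command if it is running anything critical. A minimal sketch of that idea, with illustrative names and a hypothetical command set, might look like this:

```python
# Illustrative sketch only: a machine self-identifies as running
# critical processes and refuses potentially destructive operations,
# as the post mortem describes. Names here are assumptions, not
# Dropbox's actual code.

CRITICAL_PROCESSES = {"mysqld"}  # e.g. an active database server
DESTRUCTIVE_COMMANDS = {"reinstall_os", "wipe_disk"}

def running_critical_process(process_list):
    """Does this machine's process list include anything critical?"""
    return bool(CRITICAL_PROCESSES & set(process_list))

def handle_update_command(command, process_list):
    """Verify local state before honoring an incoming command."""
    if command in DESTRUCTIVE_COMMANDS and running_critical_process(process_list):
        return "refused: critical process active"
    return f"executing {command}"

# A production database server refuses the reinstall...
print(handle_update_command("reinstall_os", ["mysqld", "sshd"]))
# ...while an idle machine slated for upgrade accepts it.
print(handle_update_command("reinstall_os", ["sshd"]))
```

The design point is that the check runs on the target machine itself, so a bug in the central maintenance script can no longer take down a server that knows it is in production.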
Gupta noted that Dropbox has grown quickly to serve "hundreds of millions of users," and this growth has required Dropbox to regularly upgrade and repurpose servers.
He came to the heart of the issue at the very end of his blog: "When running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup."
Dropbox, like many other web services, makes extensive use of the MySQL open source database system. Its strengths are in the speed of reading and serving data, not in backup and recovery. "The standard tool used to recover MySQL data from backups is slow when dealing with large data sets," Gupta noted, a fact that MySQL database administrators have known for years.
Rapidly growing services usually focus on the simplest, cheapest technologies that will help them deliver the service, and such components often perform admirably. To make MySQL recovery faster, Dropbox has developed a tool that pulls data from MySQL server logs and replays the events leading up to the failure. The tool can extract the data in parallel, which "enables much faster recovery from large MySQL backups," Gupta wrote.
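The idea behind that parallel extraction can be sketched as: split the logs into segments, parse the segments concurrently, then apply the parsed events strictly in sequence order. Everything below is illustrative; the article does not show the real tool's interfaces, and the log format here is invented for the example.

```python
# Minimal sketch of parallel log extraction followed by in-order
# replay. A production tool would use process-level parallelism for
# CPU-bound parsing; a thread pool keeps this sketch short and portable.

from concurrent.futures import ThreadPoolExecutor

def parse_segment(segment):
    """Parse one chunk of 'seq:payload' log lines into (seq, payload)."""
    events = []
    for line in segment:
        seq, _, payload = line.partition(":")
        events.append((int(seq), payload))
    return events

def recover(log_segments):
    # Parsing is the expensive, parallelizable step.
    with ThreadPoolExecutor() as pool:
        parsed = pool.map(parse_segment, log_segments)
    # Application still happens strictly in sequence order, preserving
    # the consistency of the replayed database state.
    ordered = sorted(event for chunk in parsed for event in chunk)
    return [payload for _, payload in ordered]


segments = [["1:insert a", "2:insert b"], ["3:update a", "4:delete b"]]
print(recover(segments))  # -> ['insert a', 'insert b', 'update a', 'delete b']
```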
Ending on a positive note, Gupta said Dropbox "plans to open source this tool, so others can benefit from what we've learned." Another lesson might be: rapid growth tends to favor the tools that support growth, not the tools that support recovery when something goes wrong.
Charles Babcock is an editor-at-large for InformationWeek, having joined the publication in 2003. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week.