Post-mortem analysis says Friday's cloud service outage was caused by a bad script in a routine maintenance update.
Gupta said Dropbox has learned from the incident. It was already checking the state of a running server during an update to see whether its data was in active use, a red flag that should have protected the production servers. The post-mortem didn't explain why that check failed to protect them, but Gupta said Dropbox servers will receive an added layer of protection.
In the future, servers being updated will be called on to verify their own state before executing an incoming update command. "This enables machines that self-identify as running critical processes to refuse potentially destructive operations," he wrote.
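The safeguard Gupta describes can be pictured with a minimal sketch. This is not Dropbox's code: the `Server` class, the `DESTRUCTIVE` command names, and the `handle` method are hypothetical, chosen only to illustrate a machine checking its own state and refusing a destructive command.

```python
from dataclasses import dataclass, field

# Hypothetical command names; the post-mortem does not list the real ones.
DESTRUCTIVE = {"reinstall_os", "wipe_disk"}


@dataclass
class Server:
    name: str
    # Names of critical processes the machine believes it is running.
    critical_processes: set = field(default_factory=set)

    def is_critical(self) -> bool:
        """The machine self-identifies as critical if any such process is running."""
        return bool(self.critical_processes)

    def handle(self, command: str) -> str:
        """Check local state first, then refuse or execute the command."""
        if command in DESTRUCTIVE and self.is_critical():
            return "refused"
        return "executed"


production = Server("db-primary", {"mysqld"})
idle = Server("spare-01")

print(production.handle("reinstall_os"))  # the production machine refuses
print(idle.handle("reinstall_os"))        # the idle machine proceeds
```

The point of the design is that the final check lives on the target machine, so a faulty script on the sending side cannot bypass it.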
Gupta noted that Dropbox has grown quickly to serve "hundreds of millions of users," and this growth has required Dropbox to regularly upgrade and repurpose servers.
He came to the heart of the issue at the very end of his blog post: "When running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup."
Dropbox, like many other web services, makes extensive use of the MySQL open source database system. Its strengths are in the speed of reading and serving data, not in backup and recovery. "The standard tool used to recover MySQL data from backups is slow when dealing with large data sets," Gupta noted, a fact that MySQL database administrators have known for years.
Rapidly growing services are usually focused on the simplest, cheapest technologies that will help them deliver the service, and such components often perform admirably well. To make MySQL function better and faster in recovery, Dropbox has developed a tool that pulls the data from MySQL database server logs to replay the events leading up to failure. The Dropbox tool can extract the data in parallel, which "enables much faster recovery from large MySQL backups," Gupta wrote.
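The general idea of replaying logged events in parallel can be sketched as follows. This is an illustration only, not the Dropbox tool, whose internals the post-mortem does not describe. It assumes a simplified in-memory log of `(table, op, key, value)` tuples rather than MySQL's actual log format, and it assumes events for different tables are independent, so each table can be replayed by a separate worker while order is preserved within a table. The function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def replay_table(events):
    """Apply one table's events in log order and return its final rows."""
    rows = {}
    for op, key, value in events:
        if op == "set":
            rows[key] = value
        elif op == "delete":
            rows.pop(key, None)
    return rows


def parallel_recover(log, workers=4):
    """Group log entries by table, then replay each table concurrently."""
    per_table = defaultdict(list)
    for table, op, key, value in log:
        per_table[table].append((op, key, value))  # keeps per-table order

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {t: pool.submit(replay_table, ev) for t, ev in per_table.items()}
        return {t: f.result() for t, f in futures.items()}


log = [
    ("users", "set", 1, "alice"),
    ("files", "set", 9, "report.txt"),
    ("users", "set", 1, "alice2"),
    ("files", "delete", 9, None),
]
print(parallel_recover(log))
```

A real recovery tool would also have to handle transactions that span tables, which this sketch deliberately ignores.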
Ending on a positive note, he said Dropbox "plans to open source this tool, so others can benefit from what we've learned." Another lesson learned might be: rapid growth usually emphasizes the tools that support growth, not the tools that support recovery when something goes wrong.
Charles Babcock is an editor-at-large for InformationWeek, having joined the publication in 2003. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week.