1/14/2014
10:27 AM

Dropbox Takes Blame For Cloud Outage

Post-mortem analysis says Friday's cloud service outage was caused by a bad script in a routine maintenance update.

Gupta said Dropbox has learned from the incident. It was already checking the state of a running server during an update to see whether its data was in active use, a red flag that should have protected the production servers. The post-mortem didn't explain why they weren't protected, but Gupta said Dropbox servers will receive an added layer of protection.

In the future, servers being updated will be called to verify their state before executing an incoming update command. "This enables machines that self-identify as running critical processes to refuse potentially destructive operations," he wrote.
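The self-check Gupta describes can be pictured with a minimal Python sketch. This is not Dropbox's implementation: the `ServerState` record, the `runs_critical_process` flag, and the `handle_update_command` function are all hypothetical names chosen for illustration, and the sketch assumes each host can tell locally whether it is running a critical process.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical snapshot of what a host knows about itself."""
    hostname: str
    runs_critical_process: bool  # assumption: e.g. an active database instance


class RefusedOperation(Exception):
    """Raised when a host declines a potentially destructive command."""


def handle_update_command(state: ServerState, command: str, destructive: bool) -> str:
    """Verify local state before executing an incoming update command."""
    # A machine that self-identifies as critical refuses destructive work,
    # no matter what the central update script believes about it.
    if destructive and state.runs_critical_process:
        raise RefusedOperation(
            f"{state.hostname} is running a critical process; refusing '{command}'"
        )
    return f"{state.hostname}: executed '{command}'"
```

The point of the design is that the final decision sits on the machine being updated, so a faulty script that targets the wrong hosts gets an error back instead of wiping an active database server.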

Gupta noted that Dropbox has grown quickly to serve "hundreds of millions of users," and this growth has required Dropbox to regularly upgrade and repurpose servers.

He came to the heart of the issue at the very end of his blog: "When running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup."

Dropbox, like many other web services, makes extensive use of the MySQL open source database system. Its strengths are in the speed of reading and serving data, not in backup and recovery. "The standard tool used to recover MySQL data from backups is slow when dealing with large data sets," Gupta noted, a fact that MySQL database administrators have known for years.

Rapidly growing services are usually focused on the simplest, cheapest technologies that will help them deliver the service, and such components often perform admirably well. To make MySQL function better and faster in recovery, Dropbox has developed a tool that pulls the data from MySQL database server logs to replay the events leading up to failure. The Dropbox tool can extract the data in parallel, which "enables much faster recovery from large MySQL backups," Gupta wrote.
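As a rough illustration of why parallel extraction speeds up recovery, here is a minimal Python sketch. It is not the Dropbox tool, whose internals the post does not describe. It assumes a hypothetical backup layout in which each table has its own ordered list of logged events, so tables can be replayed independently; the names `Event`, `replay_table`, and `restore_in_parallel` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

# Hypothetical log record: (operation, row key, new value or None for deletes).
Event = Tuple[str, str, Optional[str]]


def replay_table(events: List[Event]) -> Dict[str, str]:
    """Replay one table's events in order to rebuild its final contents."""
    rows: Dict[str, str] = {}
    for op, key, value in events:
        if op == "set" and value is not None:
            rows[key] = value
        elif op == "delete":
            rows.pop(key, None)
    return rows


def restore_in_parallel(
    backup: Dict[str, List[Event]], workers: int = 4
) -> Dict[str, Dict[str, str]]:
    """Rebuild every table concurrently.

    Order is preserved within a table; across tables it does not matter
    under the independence assumption stated above.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(replay_table, backup.values())
        return dict(zip(backup.keys(), results))
```

A serial restore would replay one table after another. Here the work is split per table, which helps most when each unit spends its time waiting on disk or network I/O; a CPU-bound replay in Python would need processes rather than threads.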

Ending on a positive note, he noted Dropbox "plans to open source this tool, so others can benefit from what we've learned." Another lesson learned might be: rapid growth usually emphasizes the tools that support growth, not the tools that support recovery when something goes wrong.

Charles Babcock is an editor-at-large for InformationWeek, having joined the publication in 2003. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week.

Comments
jemison288
User Rank: Ninja
1/15/2014 | 1:43:01 PM
Re: Really? Has anyone looked at Amazon's outage history?
You can't compare Dropbox's internal database servers with Amazon running Infrastructure-as-a-Service for thousands (tens? hundreds? of thousands) of customers. They're not equivalent; Dropbox's internal database servers shouldn't go down like they did.  Amazon has *never* had an internal database failure like Dropbox had here, and Amazon does a heck of a lot more than Dropbox does.


Running database servers so they don't go down is a well-understood and known problem. Running patches on production servers is verboten and shouldn't happen. This is reminiscent of Dropbox's security issues (allowing anyone to access anyone else's account for more than 4 hours; employees putting Dropbox customer information into an employee Dropbox); it makes it look like Dropbox has a lazy culture toward security and operational integrity.
Stratustician
User Rank: Ninja
1/15/2014 | 11:38:38 AM
Really? Has anyone looked at Amazon's outage history?
In all fairness, as a service, Dropbox has been pretty reliable. Sure, it's suffered from the odd glitch, but the reality is that these technologies are still managed by humans, and we make errors. I'd understand if a) you were running critical databases in Dropbox (hopefully not) or b) it was a paid service (most of us are using the free version).

Let's compare it to the wonderful track record of the Amazon AWS folks.  Suddenly Dropbox doesn't seem too bad, does it?

Good for Dropbox to be upfront about the cause. For users to call them stupid is overreacting a bit, I think; mistakes happen. Show me someone who hasn't accidentally glitched something and I'll happily tip my hat to their pure awesomeness.
Charlie Babcock
User Rank: Author
1/14/2014 | 3:19:45 PM
Dropbox damage should have been limited, wasn't
In response to Number 6, I would have to agree that Dropbox has a good record, but if you look at the Twitter comments on the outage, you can see doubt creeping into some of its users' confidence in their offsite storage supplier. And on one important point, the explanation for the outage is not an explanation. The updating of the operating system in some database master/slave combinations resulted in all three systems being lost. They then had to be reconstructed from backup, as I read the explanation. Anyone who reads this differently is welcome to explain a different interpretation. But my position is, this isn't supposed to happen in the cloud. The cloud software allows for, and compensates for, the failure of any one component, and the service as a whole keeps running. The damage from what went wrong at Dropbox should have been contained, but it wasn't.
avaya
User Rank: Apprentice
1/14/2014 | 3:19:22 PM
Re: Responsibility, yes, but not transparency
I gotta agree with your assessment.
avaya
User Rank: Apprentice
1/14/2014 | 3:18:15 PM
Re: Responsibility, yes, but not transparency
All your files are just one subtle bug away :). The postmortem explanation is not thorough and internal controls seem to be lacking. Overwriting a production server or a handful of servers doesn't happen even in rookie tech firms.

Also, it is hard to believe the pranksters timed the hoax perfectly.  How do they know the unplanned maintenance schedule or is it pure coincidence? Many unanswered questions here.

 
pat.white
IW Pick
User Rank: Apprentice
1/14/2014 | 2:44:16 PM
Re: Responsibility, yes, but not transparency
I'd actually offer that, in the industry, Dropbox is known for being extremely cautious and thorough. They make very slow, deliberate updates to their products when things are perfect. This seems a bit out of character for them, and I personally wonder if they're feeling pressure to move faster as the competition with Box heats up.
Number 6
User Rank: Moderator
1/14/2014 | 2:23:22 PM
Shocking! Dropbox Not Perfect!
"Some may be wondering whether Dropbox has the operational smarts to be relied upon for the long term."

As opposed to all those other suppliers upon which we rely that have never, ever committed a mistake in their existence? 

And "Some?" Who exactly has raised this question? It's easy to write "some say" followed by an overly broad assertion made only by the author without backing it up (pun not intended).
cbabcock
User Rank: Strategist
1/14/2014 | 1:06:01 PM
Responsibility, yes, but not transparency
Dropbox is taking responsibility here for a human programming error, but the explanation falls well short of transparency.