Cloud // Infrastructure as a Service
05:29 PM
Charles Babcock

Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

The snafus in the cloud, it turns out, aren't so different from those occurring in the overworked, under-automated and undocumented processes of the average data center. According to Amazon's post mortem explanation of its recent hours-long outage, the failure was apparently triggered by a human error.

If so, processes susceptible to human error are not going to be good enough in the future, if the cloud is going to be a permanent platform for enterprise computing.

Amazon's recent outage, which would have been more of a disaster than it was but for the low Easter holiday traffic, was the result of a configuration error in a scheduled network update. The change was attempted in the middle of the night--at 3:47 a.m. in Northern Virginia--"as part of our normal scaling activity," according to the official explanation. That sounds as if the EC2 data center was anticipating the start of early morning activity, when big customers such as Bizo or Reddit start refreshing hundreds of websites in preparation to meet the day's earliest readers.

The primary network serving one of the four availability zones in EC2's U.S. East-1 data center needed more network capacity. The attempt to provide it mistakenly shifted the traffic off the primary network onto a secondary, lower-bandwidth network used for backup purposes. This is a change that has probably been implemented correctly thousands of times. It's the kind of error an operator could make by choosing the wrong item on a menu or by entering the name of the last network worked on instead of the one needed. In short, it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse.

However, I thought the Amazon Web Services cloud used more automated procedures than that. I thought obvious errors had been anticipated and worked through, with defenses in place. Two lines of logic, checking the operator's decision, would have halted him in his tracks. A simple network configuration error should not be the source of a monumental hit to confidence in cloud computing. But apparently it is.
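To make the "two lines of logic" concrete, here is a minimal sketch of a pre-flight check on a traffic shift. It is an illustration only, not Amazon's tooling: the `Network` record, the `role` labels, the capacity figures and the `check_traffic_shift` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """Hypothetical description of a network that traffic can be shifted onto."""
    name: str
    role: str             # assumed labels: "primary" or "backup"
    capacity_gbps: float  # assumed capacity figure, for illustration


class UnsafeShiftError(Exception):
    """Raised when a proposed traffic shift fails the pre-flight check."""


def check_traffic_shift(target: Network, current_load_gbps: float) -> None:
    """Refuse a shift onto a backup network, or onto one too small for the load."""
    # The "two lines of logic": one check on the role, one on the capacity.
    if target.role != "primary":
        raise UnsafeShiftError(f"{target.name} is a {target.role} network")
    if target.capacity_gbps < current_load_gbps:
        raise UnsafeShiftError(
            f"{target.name} cannot carry {current_load_gbps} Gbps"
        )
```

Under these assumptions, calling `check_traffic_shift(Network("backup-a", "backup", 10.0), 40.0)` raises `UnsafeShiftError` before any traffic moves, which is the kind of stop the operator's menu choice apparently did not get.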

What happened next is not so different from what we speculated in Cloud Takes A Hit; Amazon Must Fix EC2 a week ago, based on the cryptic postings on the Services Health Dashboard. Eight minutes after the change came the start of what the Amazon Service Health Watch dashboard described as "a networking event." The misconfiguration choked the backup network, and as a result "a large number of EBS nodes in a single EBS cluster lost connection to their replicas."

An EBS cluster consists of servers and disks serving as short-term storage for running workloads in a given availability zone. The preceding description doesn't sound like much of an event, but in the cloud, it triggers a massive response. Suddenly large sets of data no longer knew whether their backup copy still existed on the cluster, and a central tenet of the cluster's operation is that a backup copy is always available--in case of a hardware failure.

The networking error in itself was relatively minor and easily rectified. But the error set up a massive "re-mirroring storm," a new and valuable addition to the computing lexicon's already long list of disaster terms. So many Elastic Block Store volumes were trying to find disk space on which to recreate themselves that when they failed to find it, they aggressively tried again, tying up disk operations in the zone. You get the picture.
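The aggressive retrying described above is the classic shape of a retry storm. A common general-purpose defense is capped exponential backoff with random jitter, sketched below. This is a generic technique, not something the explanation quoted here attributes to Amazon, and `backoff_delay`, `remirror_with_backoff` and the `try_allocate` callback are hypothetical names chosen for the illustration.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a random wait in [0, min(cap, base * 2**attempt)] seconds."""
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def remirror_with_backoff(try_allocate: Callable[[], bool],
                          max_attempts: int = 5,
                          sleep: Callable[[float], None] = time.sleep,
                          rng: Optional[random.Random] = None) -> bool:
    """Retry a hypothetical space-allocation call, waiting longer after each failure."""
    for attempt in range(max_attempts):
        if try_allocate():
            return True
        # Spread retries out so many volumes do not hammer the cluster in lockstep.
        sleep(backoff_delay(attempt, rng=rng))
    return False
```

The jitter matters as much as the growing delay: if every volume waited exactly the same interval, they would all come back at the same moment and tie up disk operations again.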
