Amazon Meltdown Required Reading At Fail University

We're starting to learn more about what happened during the Great AWS Outage last month. Perhaps the fault lies not in our servers, but in ourselves.

Dave Methvin, Contributor

May 2, 2011


Now the course material has switched to cloud computing, and AFU has come along to deliver a tough lesson for anyone who didn't study hard before moving their data and services to the cloud. Amazon's technical post-mortem is a lot of reading and full of AWS-specific terms, but most IT professionals will get something out of going through it closely. The CliffsNotes version is that Amazon's Elastic Block Store (EBS) service failed in one data center because of a mistake the staff made during a service upgrade. That set off cascading failures as other servers recognized the problem and tried to move data off the failed EBS nodes, swamping what capacity remained. It was a classic congestive failure.
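If "congestive failure" sounds abstract, here's a toy sketch of the dynamic in Python. The node counts, the load numbers, and the even-spread rule are all invented for illustration, and this is nothing like a real model of EBS, but it shows how re-replicating data onto nodes that are already running hot can knock them over too:

```python
# Toy illustration of a "congestive" cascade: when one storage node fails,
# its data must be re-replicated onto the survivors. If the survivors are
# already near capacity, the extra load pushes some of them over, which
# orphans even more data, and so on. Numbers and the even-spread rule are
# invented for illustration; this is not a model of EBS.

def cascade(loads, capacity=1.0, first_failure=0):
    """Return how many nodes end up failing after one initial failure."""
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return len(failed)
        # Data from every failed node is spread evenly across the survivors.
        extra = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + extra > capacity}
        if not newly_failed:
            return len(failed)
        failed |= newly_failed

# Nodes running hot: the first failure eventually takes down the whole pool.
print(cascade([0.70, 0.75, 0.80, 0.85, 0.90]))  # all 5 nodes fail

# The same failure with real headroom is absorbed quietly.
print(cascade([0.40, 0.45, 0.50, 0.55, 0.60]))  # only the original node fails
```

With headroom, the first failure is a non-event; without it, every recovery attempt creates the next failure. That's the whole story of a congestive collapse in miniature.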

The most important take-away lesson from Amazon's latest AFU class is this: There is no better way to find a single point of failure than to have it fail. Some customers have learned this lesson. In last week's column I mentioned that Netflix is a big AWS customer but seemed relatively unscathed by this problem. Netflix did their own post-mortem describing how they fared. If you're looking to copy someone's class notes on cloud failures, I'd recommend theirs.

One key secret that allowed Netflix to weather a cloud failure is that they actually design with failure in mind. But doesn't everyone say they do that? Yes, but Netflix uses a "chaos monkey" to randomly kill services, which proves they can survive the loss and still have enough capacity to ride out passing storms. Still, this most recent mass failure was of a kind Netflix hadn't anticipated, and they're using the lessons from it to fine-tune their cloud tactics. They're even considering a "chaos gorilla" to test the effects of more widespread failures.
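Chaos Monkey itself is Netflix's own internal tool, but the idea is simple enough to sketch. Here's a rough, hypothetical version in Python; the boto3 calls are real AWS SDK calls, but the "chaos" tag, the dry-run default, and everything else about the setup are assumptions for this sketch, not Netflix's design:

```python
# A bare-bones take on the chaos-monkey idea: pick one opted-in instance at
# random and terminate it, so the team proves on a calm weekday that the
# service survives losing a box. This is NOT Netflix's Chaos Monkey; the
# "chaos=eligible" tag and the dry-run default are assumptions.

import random
import boto3

def pick_victim(ec2):
    """Return the ID of one running, opted-in instance, or None."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["eligible"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(candidates) if candidates else None

def unleash_the_monkey(dry_run=True):
    ec2 = boto3.client("ec2")
    victim = pick_victim(ec2)
    if victim is None:
        print("No eligible instances; the monkey goes back to sleep.")
    elif dry_run:
        print(f"[dry run] would terminate {victim}")
    else:
        print(f"Terminating {victim}")
        ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)  # flip to False only on purpose
```

The point isn't the script, it's the discipline: if losing a random instance on a quiet Tuesday morning is routine, losing one during a real outage is survivable.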

None of this is meant to excuse Amazon's role in this mess. AWS should deliver the most reliable service possible to their customers, and they certainly bear some responsibility for lost data and downtime. That's especially true in this case, since their own staff actions started the cascade of failures. Their post-mortem acknowledges that they are working on specific procedures and measures to address some of the causes. Like most catastrophic scenarios, it's unlikely we'll see this specific problem happen again. But you can bet this won't be the last lesson taught by a big cloud failure, if we're willing to learn. Stay in school, kids.

