Amazon Meltdown Required Reading At Fail University

We're starting to learn more about what happened during the Great AWS Outage last month. Perhaps the fault lies not in our servers, but in ourselves.

Dave Methvin, Contributor

May 2, 2011


Now the course material has switched to cloud computing, and AFU has come along to deliver a tough lesson for anyone who didn't study hard before moving their data and services to the cloud. Amazon's technical post-mortem is a lot of reading and full of AWS-specific terms, but most IT professionals will get something out of going through it closely. The CliffsNotes version is that Amazon's Elastic Block Store (EBS) service failed in one data center because of a mistake the staff made during a service upgrade. That set off cascading failures as other servers recognized the problem and tried to move data off the failed EBS nodes, swamping what capacity remained. It was a classic congestive failure.
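If "congestive failure" sounds abstract, here's a toy sketch of the dynamic in Python. The node counts, the load numbers, and the even-spread rule are all invented for illustration, and this is nothing like a real model of EBS, but it shows how re-replicating data onto nodes that are already running hot can knock them over too:

```python
# Toy illustration of a "congestive" cascade: when one storage node fails,
# its data must be re-replicated onto the survivors. If the survivors are
# already near capacity, the extra load pushes some of them over, which
# orphans even more data, and so on. Numbers and the even-spread rule are
# invented for illustration; this is not a model of EBS.

def cascade(loads, capacity=1.0, first_failure=0):
    """Return how many nodes end up failing after one initial failure."""
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return len(failed)
        # Data from every failed node is spread evenly across the survivors.
        extra = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + extra > capacity}
        if not newly_failed:
            return len(failed)
        failed |= newly_failed

# Nodes running hot: the first failure eventually takes down the whole pool.
print(cascade([0.70, 0.75, 0.80, 0.85, 0.90]))  # all 5 nodes fail

# The same failure with real headroom is absorbed quietly.
print(cascade([0.40, 0.45, 0.50, 0.55, 0.60]))  # only the original node fails
```

With headroom, the first failure is a non-event; without it, every recovery attempt creates the next failure. That's the whole story of a congestive collapse in miniature.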

The most important take-away lesson from Amazon's latest AFU class is this: There is no better way to find a single point of failure than to have it fail. Some customers have learned this lesson. In last week's column I mentioned that Netflix is a big AWS customer but seemed relatively unscathed by this problem. Netflix did their own post-mortem describing how they fared. If you're looking to copy someone's class notes on cloud failures, I'd recommend theirs.

One key secret that allowed Netflix to weather a cloud failure is that they actually design with failure in mind. But doesn't everyone say they do that? Yes, but Netflix uses a "chaos monkey" to randomly kill services, which proves they can survive the loss and still have enough capacity to ride out passing storms. Still, this most recent mass failure was of a kind Netflix hadn't anticipated, and they're using the lessons from it to fine-tune their cloud tactics. They're even considering a "chaos gorilla" to test the effects of more widespread failures.
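Chaos Monkey itself is Netflix's own internal tool, but the idea is simple enough to sketch. Here's a rough, hypothetical version in Python; the boto3 calls are real AWS SDK calls, but the "chaos" tag, the dry-run default, and everything else about the setup are assumptions for this sketch, not Netflix's design:

```python
# A bare-bones take on the chaos-monkey idea: pick one opted-in instance at
# random and terminate it, so the team proves on a calm weekday that the
# service survives losing a box. This is NOT Netflix's Chaos Monkey; the
# "chaos=eligible" tag and the dry-run default are assumptions.

import random
import boto3

def pick_victim(ec2):
    """Return the ID of one running, opted-in instance, or None."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["eligible"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(candidates) if candidates else None

def unleash_the_monkey(dry_run=True):
    ec2 = boto3.client("ec2")
    victim = pick_victim(ec2)
    if victim is None:
        print("No eligible instances; the monkey goes back to sleep.")
    elif dry_run:
        print(f"[dry run] would terminate {victim}")
    else:
        print(f"Terminating {victim}")
        ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)  # flip to False only on purpose
```

The point isn't the script, it's the discipline: if losing a random instance on a quiet Tuesday morning is routine, losing one during a real outage is survivable.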

None of this is meant to excuse Amazon's role in this mess. AWS should deliver the most reliable service possible to their customers, and they certainly bear some responsibility for lost data and downtime. That's especially true in this case, since their own staff actions started the cascade of failures. Their post-mortem acknowledges that they are working on specific procedures and measures to address some of the causes. Like most catastrophic scenarios, it's unlikely we'll see this specific problem happen again. But you can bet this won't be the last lesson taught by a big cloud failure, if we're willing to learn. Stay in school, kids.

