Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

Charles Babcock, Editor at Large, Cloud

April 29, 2011

In building high availability into cloud software, we've escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we've discovered that we've entered a higher atmosphere of operations and a larger plane on which potential failures may occur.

The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn't work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That's an unanticipated event in cloud architecture because it isn't supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn't access anything, more servers couldn't be initiated, and for all practical purposes the cloud ceased functioning in one of its availability zones for over 12 hours.

The accounts that I have paid the most attention to in the aftermath have been those whose operations didn't fail, despite the Amazon architecture's breakdown: accounts like the one from Donnie Flood, VP of engineering at Bizo, or Oren Michels, CEO of Mashery. When I talked to Jesse Lipson, CEO of ShareFile, an original EC2 beta customer in 2008 and still a customer, he said, "We're pretty paranoid about betting on any company, even if it's Amazon." His firm invoked the option of redirecting its traffic to Amazon's West Coast data center when it found its servers failing. ShareFile, which supplies a file sharing and storage service to businesses, maintains its own "heartbeat" monitoring system for its servers, and the system detected ShareFile servers disappearing after the "network event" in EC2. The system automatically shifted ShareFile traffic toward the servers in the West Coast data center.
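The heartbeat-and-failover idea can be sketched in a few lines of Python. This is a minimal illustration and not ShareFile's actual system: the class name `HeartbeatMonitor`, the region labels, the 30-second timeout and the 50 percent health threshold are all assumptions made for the example. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor: tracks the last heartbeat per server and picks a region."""
    timeout: float                      # seconds of silence before a server counts as gone
    min_healthy_fraction: float = 0.5   # assumed threshold below which a region is abandoned
    last_seen: dict = field(default_factory=dict)  # (region, server) -> last heartbeat time

    def record(self, region: str, server: str, now: float) -> None:
        """Store the time of the latest heartbeat from one server."""
        self.last_seen[(region, server)] = now

    def healthy_fraction(self, region: str, now: float) -> float:
        """Share of known servers in a region that reported within the timeout."""
        stamps = [t for (r, _), t in self.last_seen.items() if r == region]
        if not stamps:
            return 0.0
        alive = sum(1 for t in stamps if now - t <= self.timeout)
        return alive / len(stamps)

    def choose_region(self, primary: str, fallback: str, now: float) -> str:
        """Route to the primary region unless too many of its servers have gone silent."""
        if self.healthy_fraction(primary, now) >= self.min_healthy_fraction:
            return primary
        return fallback


monitor = HeartbeatMonitor(timeout=30.0)
for name in ("web-1", "web-2", "web-3"):
    monitor.record("east", name, now=0.0)
    monitor.record("west", name, now=0.0)

# Later, every West server reports again but only one East server does.
monitor.record("east", "web-1", now=50.0)
for name in ("web-1", "web-2", "web-3"):
    monitor.record("west", name, now=50.0)

target = monitor.choose_region("east", "west", now=60.0)  # East is mostly silent
```

Note that this sketch never checks the health of the fallback region; a production system would need to, or it could shift traffic onto a site that is also in trouble.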

I think Amazon itself should have a traffic shifting system that reroutes the bulk of customer traffic when an availability zone or whole data center is no longer available. It should shift it, as individual customers did, from East to West, degrading service no doubt, but keeping customers online. Lipson points out, however, that linking data centers might allow the harm to spread. Inside the Northern Virginia data center, the trouble spread like a contagion across availability zones, which are subdivisions of the data center operating independently. Backup measures that worked in individual cases or across a small set cascaded out of control when invoked on a scale that had previously been unanticipated.

Despite that risk, I still think Amazon must link data centers, but it must also include a circuit breaker that queues up traffic or shunts it away if it turns into a threat to the functioning facility. Within a data center, availability zones need to be, well, available, even if there is trouble in one of them. I think that means architecting services so that they operate in some isolation in one zone from troubles in another. In this incident, the EBS and RDS services operated across availability zones, and freezing them in one froze them in all.
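To make the circuit breaker idea concrete, here is a small Python sketch of a breaker that sits in front of the healthy data center and decides what to do with redirected traffic. It is an illustration under stated assumptions, not a description of anything Amazon runs: the class name `InboundBreaker`, the per-interval capacity and the queue limit are all hypothetical. It trips on load rather than on error counts, so it is closer to admission control than to the classic failure-counting circuit breaker pattern.

```python
from collections import deque


class InboundBreaker:
    """Hypothetical guard for a healthy site receiving traffic redirected from a failed one."""

    def __init__(self, capacity: int, queue_limit: int):
        self.capacity = capacity        # redirected requests accepted per interval (assumed)
        self.queue_limit = queue_limit  # requests allowed to wait before being shunted away
        self.accepted = 0               # requests accepted in the current interval
        self.queue = deque()            # requests waiting for the next interval

    def offer(self, request) -> str:
        """Accept the request, queue it, or shunt it away, in that order of preference."""
        if self.accepted < self.capacity:
            self.accepted += 1
            return "accepted"
        if len(self.queue) < self.queue_limit:
            self.queue.append(request)
            return "queued"
        return "shunted"

    def next_interval(self) -> list:
        """Start a new interval, admitting queued requests before any new arrivals."""
        self.accepted = 0
        admitted = []
        while self.queue and self.accepted < self.capacity:
            admitted.append(self.queue.popleft())
            self.accepted += 1
        return admitted


breaker = InboundBreaker(capacity=2, queue_limit=1)
outcomes = [breaker.offer(f"req-{i}") for i in range(5)]
# The first two requests are accepted, the third waits, the rest are turned away.
released = breaker.next_interval()  # the waiting request is admitted first
```

The design choice here is that the functioning facility protects itself first: it takes what it can handle, holds a bounded backlog, and refuses the rest, so a surge from a failed site degrades service without bringing the healthy site down.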

All of this is much easier said than done when operating on the scale and complexity of Amazon's EC2. Amazon has done such a good job of pioneering the cloud that there is an immense reservoir of faith among its customers that it will eventually get it right. No one I've talked to says they're willing to switch. Cloud computing may have had a setback, but it will make a quick comeback. There is a widespread belief that when it does, it will be better. Still, it remains to be said: Amazon has got to do better than this. It has got to get it right.

Charles Babcock is an editor-at-large for InformationWeek.

About the Author

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
