Global CIO: Why The Amazon Cloud Outage Is Irrelevant

Having lived with generations of products that promised to eliminate human error, I know this: They won't. Plan for failure.

Jonathan Feldman, CIO, City of Asheville, NC

May 6, 2011

Much has been written about how Amazon's recent outage is a major setback for cloud computing, and, oh look, that means private clouds are going to be on the upswing. But the Amazon outage is irrelevant to those of us planning the future of computing at our organizations.

The root cause of Amazon's several-day service interruption, which took down sites such as Foursquare and Heroku, was that, during an attempt to add capacity, a human error shifted traffic onto a secondary, slower network used for backup purposes. Or, in layman's terms, customers had a need, and in the attempt to fulfill that need, an Amazon technician made an error. Amazon has promised customers that it will automate future changes of this kind as much as possible.

My colleague Charles Babcock wrote on April 29 that "processes susceptible to human error are not going to be good enough in the future." But having lived with generations of products that promised (falsely) to eliminate all human error, I know this: All processes, even automated ones, are set up and modified by humans, and therefore they're all subject to human error. As I pointed out in a blog post in 2009, as long as cloud service providers employ humans, customers had better plan for failure. As Charlie put it, "it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse."

The cloud, at least infrastructure as a service, promises to bring lots of granular computing resources to the table. Low-cost, small, virtualized servers, along with smart apps that know to call for more servers when needed, are indeed a paradigm shift. It's like unleashing lots of piranhas on a given problem. It's much better than the last "solution to all of our problems," the highly redundant and overbuilt server.

Those servers, touted as the equivalent of badass sharks, had multiple RAID cards, multiple drives, multiple processors, multiple power supplies. At the time, enterprises invested a heck of a lot of money in them. Some infrastructure folks in some enterprises still think these are the only types of servers to buy. Problem is, it's still easy to grind these suckers to a halt with a batch of bad patches, or with one well-intentioned misconfiguration.

With cloud computing, those schools of piranha are going to be a lot more resilient to attack. Any one server dying on you isn't going to be a big deal. And you don't have to worry about "dead backplane syndrome." Even with redundant power supplies, CPUs, and the like, those shark servers have only one backplane, and if that goes out, you're in trouble. I've seen it happen more times than I care to think about.

So hooray for cloud. But take heed, cloudies: "a lot more resilient" doesn't mean "infallible."

Whenever a fancy new IT product, service, or way of doing things shows up, people start to get excited. Especially when the benefits are significant, otherwise sharp and canny professionals start to act like they've never been in IT before. They start to buy into the dysfunctional belief that this time things are going to be different. This time, the products/services are going to eliminate failure!

Plan for failure. Plan for failure. Plan for failure. I can't emphasize that enough. Once your staff has implemented the coolest, newest, most resilient system available, take some time out and plan for it failing.

This can be a difficult conversation, because people will look at you as if you're wasting their time. But do it, and explain why. You won't be sorry.

I read a tweet the other day arguing that the Amazon outage underscores the need for change management techniques, which seek to bring discipline to yesterday's willy-nilly configuration changes. Specifically, such techniques involve change management boards, so that no one person is allowed to authorize a change.

I'm a big fan of the frameworks that espouse change management, notably ITIL, and of more disciplined IT service management in general. A focus on change control is incredibly helpful. I'm also a big believer in the Hawthorne effect: The mere act of tracking business technology downtime is one of the best ways to reduce it. Continuing the time-honored techniques of investing in test gear and insisting that employees do test builds will help reduce downtime further.

But all of those best practices won't eliminate failure. It comes down to this: You will have change in your environment, and your service provider will have change in its environment. Any change requires some level of human intervention, whether it's on the dev or ops side.

And that's why the Amazon outage is irrelevant. CIOs should be evaluating the cloud as just another service-delivery mechanism, and planning for its failure.

Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at [email protected] or at @_jfeldman.

About the Author

Jonathan Feldman

CIO, City of Asheville, NC

Jonathan Feldman is Chief Information Officer for the City of Asheville, North Carolina, where his business background and work as an InformationWeek columnist have helped him to innovate in government through better practices in business technology, process, and human resources management. Asheville is a rapidly growing and popular city; it has been named a Fodor top travel destination, and is the site of many new breweries, including New Belgium's east coast expansion. During Jonathan's leadership, the City has been recognized nationally and internationally (including the International Economic Development Council New Media, Government Innovation Grant, and the GMIS Best Practices awards) for improving services to citizens and reducing expenses through new practices and technology. He is active in the IT, startup and open data communities, was named a "Top 100 CIO to follow" by the Huffington Post, and is a co-author of Code For America's book, Beyond Transparency. Learn more about Jonathan at Feldman.org.
