Strategic CIO // Executive Insights & Innovation
Commentary
5/6/2011
02:14 PM
Connect Directly
LinkedIn
Twitter
RSS
E-Mail
50%
50%

Global CIO: Why The Amazon Cloud Outage Is Irrelevant

Having lived with generations of products that promised to eliminate human error, I know this: They won't. Plan for failure.

Much has been written about how Amazon's recent outage is a major setback for cloud computing, and, oh look, that means private clouds are going to be on the upswing. But the Amazon outage is irrelevant to those of us planning the future of computing at our organizations.

The root cause of Amazon's several-day service interruption, which took down sites such as Foursquare and Heroku, was that, in the attempt to add capacity, human error shifted traffic into a secondary, slower network used for backup purposes. Or, in layman's terms, customers had a need, and in the attempt to fulfill that need, an Amazon technician made an error. Amazon promised customers that it will automate future changes of this kind as much as possible.

My colleague Charles Babcock wrote on April 29 that "processes susceptible to human error are not going to be good enough in the future." But having lived with generations of products that promised (falsely) to eliminate all human error, I know this: All processes, even automated ones, are set up and modified by humans, and therefore they're all subject to human error. As I pointed out in a blog in 2009, as long as cloud service providers employ humans, customers had better plan for failure. As Charlie put it, "it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse."

The cloud, at least infrastructure as a service, promises to bring lots of granular computing resources to the table. Low-cost, small, virtualized servers, along with smart apps that know to call for more servers when needed, are indeed a paradigm shift. It's like unleashing lots of piranhas on a given problem. It's much better than the last "solution to all of our problems," the highly redundant and overbuilt server.

Those servers, touted as the equivalent of bad ass sharks, had multiple RAID cards, multiple drives, multiple processors, multiple power supplies. At the time, enterprises invested a heck of a lot of money in them. Some infrastructure folks in some enterprises still think these are the only types of servers to buy. Problem is, it's still easy to grind these suckers to a halt with a batch of bad patches, or with one well intentioned misconfiguration.

With cloud computing, those schools of piranha are going to be a lot more resilient to attack. Any one server dying on you isn't going to be a big deal. And you don't have to worry about "dead backplane syndrome." Even with redundant power supplies, CPUs, and the like, those shark servers have only one backplane, and if that goes out, you're in trouble. I've seen it happen more times than I care to think about.

So hooray for cloud. But take heed, cloudies: "a lot more resilient" doesn't mean "infallible."

Whenever a fancy new IT product, service, or way of doing things shows up, people start to get excited. Especially when the benefits are significant, otherwise sharp and canny professionals start to act like they've never been in IT before. They start to participate in the dysfunction of belief that this time things are going to be different. This time, the products/services are going to eliminate failure!

Plan for failure. Plan for failure. Plan for failure. I can't emphasize that enough. Once your staff has implemented the coolest, newest, most resilient system available, take some time out and plan for it failing.

This can be a difficult conversation, because people will look at you as if you're wasting their time. But do it, and explain why. You won't be sorry.

I read a tweet the other day which argued that the Amazon outage underscores the need for change management techniques, which seek to bring discipline to yesterday's willy-nilly configuration changes. Specifically, such techniques involve change management boards, so that no one person is allowed to authorize a change.

I'm a big fan of the frameworks that espouse change management, notably ITIL, and of more disciplined IT service management in general. A focus on change control is incredibly helpful. I'm also a big believer in the Hawthorne effect: The mere act of tracking business technology downtime is one of the best ways to reduce it. Continuing the time-honored techniques of investing in test gear and insisting that employees do test builds will help reduce downtime further.

Global CIO
Global CIOs: A Site Just For You
Visit InformationWeek's Global CIO -- our online community and information resource for CIOs operating in the global economy.

But all of those best practices won't eliminate failure. It comes down to this: You will have change in your environment, and your service provider will have change in its environment. Any change requires some level of human intervention, whether it's on the dev or ops side.

And that's why the Amazon outage is irrelevant. CIOs should be evaluating the cloud as just another service-delivery mechanism, planning for its failure.

Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at jf@feldman.org or at @_jfeldman.

Comment  | 
Print  | 
More Insights
The Business of Going Digital
The Business of Going Digital
Digital business isn't about changing code; it's about changing what legacy sales, distribution, customer service, and product groups do in the new digital age. It's about bringing big data analytics, mobile, social, marketing automation, cloud computing, and the app economy together to launch new products and services. We're seeing new titles in this digital revolution, new responsibilities, new business models, and major shifts in technology spending.
Register for InformationWeek Newsletters
White Papers
Current Issue
InformationWeek Tech Digest, Nov. 10, 2014
Just 30% of respondents to our new survey say their companies are very or extremely effective at identifying critical data and analyzing it to make decisions, down from 42% in 2013. What gives?
Video
Slideshows
Twitter Feed
InformationWeek Radio
Archived InformationWeek Radio
Join us for a roundup of the top stories on InformationWeek.com for the week of November 16, 2014.
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.