Global CIO: Why The Amazon Cloud Outage Is Irrelevant
Having lived with generations of products that promised to eliminate human error, I know this: They won't. Plan for failure.
Much has been written about how Amazon's recent outage is a major setback for cloud computing, and, oh look, that means private clouds are going to be on the upswing. But the Amazon outage is irrelevant to those of us planning the future of computing at our organizations.
The root cause of Amazon's several-day service interruption, which took down sites such as Foursquare and Heroku, was human error: During an attempt to add capacity, traffic was shifted onto a secondary, slower network used for backup purposes. Or, in layman's terms, customers had a need, and in the attempt to fulfill that need, an Amazon technician made an error. Amazon has promised customers that it will automate future changes of this kind as much as possible.
My colleague Charles Babcock wrote on April 29 that "processes susceptible to human error are not going to be good enough in the future." But having lived with generations of products that promised (falsely) to eliminate all human error, I know this: All processes, even automated ones, are set up and modified by humans, and therefore they're all subject to human error. As I pointed out in a blog in 2009, as long as cloud service providers employ humans, customers had better plan for failure. As Charlie put it, "it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse."
The cloud, at least infrastructure as a service, promises to bring lots of granular computing resources to the table. Low-cost, small, virtualized servers, along with smart apps that know to call for more servers when needed, are indeed a paradigm shift. It's like unleashing lots of piranhas on a given problem. It's much better than the last "solution to all of our problems," the highly redundant and overbuilt server.
Those servers, touted as the equivalent of badass sharks, had multiple RAID cards, multiple drives, multiple processors, multiple power supplies. At the time, enterprises invested a heck of a lot of money in them. Some infrastructure folks in some enterprises still think these are the only types of servers to buy. Problem is, it's still easy to grind these suckers to a halt with a batch of bad patches, or with one well-intentioned misconfiguration.
With cloud computing, those schools of piranha are going to be a lot more resilient to attack. Any one server dying on you isn't going to be a big deal. And you don't have to worry about "dead backplane syndrome." Even with redundant power supplies, CPUs, and the like, those shark servers have only one backplane, and if that goes out, you're in trouble. I've seen it happen more times than I care to think about.
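To put rough numbers on the piranha-versus-shark argument, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the availability figures, and above all the idea that small servers fail independently of one another. It is not a model of any real provider.

```python
from math import comb


def pool_availability(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n servers are up.

    Assumes (hypothetically) that each server is up with probability
    p_up and that failures are independent of one another.
    """
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


def single_server_availability(p_backplane_up: float) -> float:
    """A heavily redundant server is still capped by its one backplane."""
    return p_backplane_up


# Illustrative figures only, not measurements.
shark = single_server_availability(0.99)
# Ten cheap servers, each less reliable than the shark; the app needs 7.
piranhas = pool_availability(n=10, k=7, p_up=0.95)

print(f"one big server:     {shark:.4f}")
print(f"7 of 10 small ones: {piranhas:.4f}")
```

Under these made-up numbers the school of small servers comes out ahead of the single big one, even though each small server is individually less reliable. The catch is the independence assumption: a shared network, or one person's mistake applied to every server at once, takes out the whole school together.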
So hooray for cloud. But take heed, cloudies: "a lot more resilient" doesn't mean "infallible."
Whenever a fancy new IT product, service, or way of doing things shows up, people start to get excited. Especially when the benefits are significant, otherwise sharp and canny professionals start to act as if they've never been in IT before. They start to buy into the dysfunctional belief that this time things are going to be different. This time, the products/services are going to eliminate failure!
Plan for failure. Plan for failure. Plan for failure. I can't emphasize that enough. Once your staff has implemented the coolest, newest, most resilient system available, take some time out and plan for it failing.
This can be a difficult conversation, because people will look at you as if you're wasting their time. But do it, and explain why. You won't be sorry.
I read a tweet the other day arguing that the Amazon outage underscores the need for change management techniques, which seek to bring discipline to yesterday's willy-nilly configuration changes. Specifically, such techniques involve change management boards, so that no one person is allowed to authorize a change.
I'm a big fan of the frameworks that espouse change management, notably ITIL, and of more disciplined IT service management in general. A focus on change control is incredibly helpful. I'm also a big believer in the Hawthorne effect: The mere act of tracking business technology downtime is one of the best ways to reduce it. Continuing the time-honored techniques of investing in test gear and insisting that employees do test builds will help reduce downtime further.
But all of those best practices won't eliminate failure. It comes down to this: You will have change in your environment, and your service provider will have change in its environment. Any change requires some level of human intervention, whether it's on the dev or ops side.
And that's why the Amazon outage is irrelevant. CIOs should be evaluating the cloud as just another service-delivery mechanism, planning for its failure.
Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at firstname.lastname@example.org or at @_jfeldman.