The SLA says customer instances--their application workloads running in Amazon virtual machines--need to be up and running 99.95% of the time. During EC2's Easter outage, most instances that were running before the trouble started continued to run. It might have been impossible to get a sleeping instance started, but sleepers aren't covered by the SLA. Also, the major failure wasn't in the core EC2 instances covered by the SLA but in the services on which the instances depend. Those services are not mentioned as being available 99.95% of the time in the SLA, even if your site depends on them. They're not mentioned at all.
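To put the 99.95% figure in perspective, the arithmetic can be sketched as follows. This is a minimal illustration that assumes the percentage is measured over a 365-day year; the function name `allowed_downtime_hours` is hypothetical and not part of any Amazon tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day measurement period


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The downtime budget is the share of the period not promised as uptime.
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    hours = allowed_downtime_hours(99.95)
    print(f"{hours:.2f} hours per year")
```

Under that assumption the budget works out to a little under four and a half hours a year.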
In forthrightly describing the problem and owning up to it, Amazon has gone beyond the terms of the SLA: it compensated customers affected by the outage with 10 days of free use of EC2. But make no mistake, it didn't have to, and there's no guarantee it would do so in the future.
Here are examples of companies that kept running and those that didn't. CloudSleuth, an Amazon EC2 monitoring service, had two test applications running in Amazon's U.S. East-1 availability zone as the incident began. It confirmed that those apps kept running all through the Easter weekend outage. All they could do was send back a ping confirming they were up and running, but that's all they were designed to do.
The many websites depending on that zone, however, from Blue Sombrero to Zencoder to the better-known HootSuite and Reddit sites, were dead in the water for the better part of 12 to 24 hours, and some for three days. What's the difference between them and a CloudSleuth app? While the core Reddit apps were running, they needed data delivered by two EC2 services: Elastic Block Store, which draws customer data off of disks, and Relational Database Service, which draws data out of MySQL databases. The apps use that data to maintain and update their sites. These services were not available.
Amazon mistakenly shifted primary network traffic onto a network that wasn't designed for it. That network choked, prompting Elastic Block Store to discover that backup copies of data that it expected to be there were no longer available. That discovery set off a furious "remirroring storm," which in turn froze operations in one section of Amazon's U.S. East data center, then spread to other availability zones.
Again, Amazon did what was right. But SLAs exist so companies don't have to depend on goodwill. I am reminded of one irate website maintainer's post in the midst of the crisis: "Amazon's updates [to its Service Health Dashboard] read as if they were written by their attorneys and accountant, who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy."
Bryson Koehler, senior VP at InterContinental Hotel Group, once made this comment to me in an interview: "EC2 is a best effort" service, not a sure thing. That assessment is reinforced by the narrow definition of protection in Amazon's SLA.
At San Francisco's Engine Yard, a service that hosts Ruby applications in EC2, the trouble brewing in the middle of the night April 21 was spotted right away. Technical support staff members were called in, and they started using a beta service Engine Yard had in place to move customer EC2 instances from U.S. East to other Amazon data centers, primarily U.S. West in Northern California but also to centers in Dublin, Ireland; Asia; and Japan.
Several times, Engine Yard's own management dashboard ceased to function for a few minutes, but it always came back, and the process continued until all customers had been transferred or had transferred themselves. Engine Yard posted instructions on how to use the service and made it available to everyone.
When it comes to cloud computing, this example may illustrate where the real guarantee of service continuity lies--with your contingency planning. Engine Yard depends on Amazon's infrastructure as a service, but Mike Piech, VP of product management, said, "Amazon has a strictly defined SLA" and it wouldn't cover most of the cases impacted by the recent outage.
That's why cloud users need to figure out up front what they're going to do in the event of a cloud data center failure--besides go back and finally read the fine print in their SLAs.
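A contingency plan of the kind described above can start as nothing more than a ranked list of fallback locations decided in advance. The sketch below is illustrative only: the region labels, the `choose_failover_region` function, and the health map are hypothetical, and it makes no calls to any Amazon API.

```python
from typing import Mapping, Optional, Sequence


def choose_failover_region(preferred: Sequence[str],
                           healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first region in preference order that is reported healthy."""
    for region in preferred:
        # Regions missing from the health map are treated as unavailable.
        if healthy.get(region, False):
            return region
    # No region in the plan is usable; the caller must handle this case.
    return None


if __name__ == "__main__":
    # Hypothetical labels and health data, for illustration only.
    plan = ["us-east", "us-west", "dublin"]
    status = {"us-east": False, "us-west": True, "dublin": True}
    print(choose_failover_region(plan, status))
```

The point of writing the order down ahead of time is that nobody has to decide where to move workloads in the middle of the night while the outage is under way.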