Amazon Issues Post Mortem On Cloud Outage

Amazon says it will better protect key backup components to avoid another outage.
Restoration of the block storage service, EBS, however, took longer, and that delay also neutralized the value of any EC2 workload instance that was already restored. EBS took longer to restore because so many storage volumes had lost power, and the spare storage capacity wasn't sufficient to handle all the re-mirroring requests. While EBS attempts to replicate a data set, writes to that volume are blocked, whether the workload is new or existing. To many functioning workloads, it appeared their storage volumes were "stuck," as their writes were blocked, the post mortem reported.

To get unstuck, AWS needed to add more spare capacity, which meant rousting truck drivers and loaders out of bed to bring in more servers from offsite storage. "We brought in additional labor to get more onsite capacity online and trucked in servers. ... There were delays as this was nighttime in Dublin and the logistics of trucking required mobilizing transportation some distance from the datacenter. Once the additional capacity was available, we were able to recover the remaining volumes."

If a customer used one availability zone and all his data was in the zone that lost power, and writes were going on to EBS volumes, it was unclear whether writes in every instance had been completed before the power went out. If they were not, the customer's data would be inconsistent with what the customer's systems expect.

"For the volumes we assumed were inconsistent, we produced a recovery snapshot to enable customers to create a new volume and check its consistency before trying to use it. The process of producing recovery snapshots was time-consuming because we had to first copy all of the data from each node to Amazon Simple Storage Service (Amazon S3), process that data to turn it into the snapshot storage format, and re-copy the data to make it accessible from a customer's account."

Thirty-one hours and 23 minutes after the incident, Amazon had 38% of the recovery snapshots available. Well past the second day of the outage, or 52 hours and 50 minutes afterward, it had 85% of the recovery snapshots delivered. Ninety-eight percent were available 69 hours and 44 minutes after the incident, a little before the three-day mark at 72 hours. The remaining 2% needed to be rebuilt manually.
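The recovery timeline above can be checked with a little arithmetic. The following is a minimal Python sketch that uses only the figures reported in this article; the variable and function names are illustrative and do not come from the post mortem.

```python
from datetime import timedelta

# Milestones as reported above:
# (time elapsed since the incident, percent of recovery snapshots available)
milestones = [
    (timedelta(hours=31, minutes=23), 38),
    (timedelta(hours=52, minutes=50), 85),
    (timedelta(hours=69, minutes=44), 98),
]

def describe(elapsed, percent):
    """Express a milestone in fractional days for easier comparison."""
    days = elapsed / timedelta(days=1)
    return f"{percent}% after {days:.2f} days"

lines = [describe(elapsed, percent) for elapsed, percent in milestones]

# How far short of the three-day (72-hour) mark the last milestone fell.
time_to_three_days = timedelta(hours=72) - milestones[-1][0]

# Share of snapshots left for manual rebuilding.
remaining_percent = 100 - milestones[-1][1]

for line in lines:
    print(line)
print("Margin before the three-day mark:", time_to_three_days)
print("Left for manual rebuild:", remaining_percent, "%")
```

By this arithmetic, the 98% milestone came 2 hours and 16 minutes before the 72-hour mark.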

As previously reported, AWS also encountered a software error in its snapshot recovery system that was unrelated to the power disruption but which affected the rate of recovery.

Amazon said it was taking steps to ensure that a similar event doesn't disrupt its operations in the future. It said it will add isolation to its programmable logic controllers for the backup generators "so they are isolated from other failures." In addition to active PLCs, it will in the future keep a cold, isolated PLC available as further backup, as soon as its power supply vendors figure out how to install such a unit.

"We are going to address the resource saturation that affected API calls at the beginning of the disruption" by putting in an ability to quickly remove failed API management servers from production. "It will take us several more months to complete some changes that we're making, and we will test and roll out these changes carefully," the post mortem said.

It will seek to "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes. ... We will create the capability to recover volumes directly on the EBS servers upon restoration of power, without having to move the data off of those servers."

As for the software bug, "we are instrumenting an alarm that will alert us if there are any unusual situations discovered by the snapshot cleanup identification process, including blocks falsely flagged as being unreferenced." The "unreferenced blocks" were produced by the bug and in some instances disrupted accurate snapshot recoveries.
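The idea behind such an alarm can be illustrated in a few lines of Python. This is a hypothetical sketch only, not Amazon's implementation: the function names, the block IDs, and the mapping of snapshots to blocks are all assumptions made for illustration. It cross-checks the blocks a cleanup process has flagged as unreferenced against the blocks that snapshots still reference, and raises an alert on any overlap.

```python
def find_falsely_flagged(flagged_unreferenced, snapshot_block_refs):
    """Return blocks flagged as unreferenced that a snapshot still references.

    flagged_unreferenced: set of block IDs the cleanup process wants to delete
    snapshot_block_refs: dict mapping snapshot ID -> set of block IDs it uses
    (Both structures are hypothetical stand-ins for illustration.)
    """
    still_referenced = set()
    for blocks in snapshot_block_refs.values():
        still_referenced.update(blocks)
    return sorted(set(flagged_unreferenced) & still_referenced)


def check_cleanup(flagged_unreferenced, snapshot_block_refs):
    """Return an alarm message if any flagged block is still in use."""
    bad = find_falsely_flagged(flagged_unreferenced, snapshot_block_refs)
    if bad:
        return f"ALARM: {len(bad)} block(s) flagged unreferenced but still in use: {bad}"
    return "OK"


# Example with made-up IDs: block "b3" is flagged but snap-b still needs it.
snapshots = {"snap-a": {"b1", "b2"}, "snap-b": {"b2", "b3"}}
print(check_cleanup({"b3", "b9"}, snapshots))
print(check_cleanup({"b9"}, snapshots))
```

In this toy example, the first check raises the alarm for block "b3" and the second passes, since "b9" is not referenced by any snapshot.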

Communicating with customers during these outages remains a challenge, the AWS team acknowledged. It attempted to present more detailed updates this time around, compared to its Easter weekend outage.

In addition, it will staff up more quickly in response to events and it plans to let customers more easily tell it if their workloads have been impacted.

"We've been hard at work on developing tools to allow you to see via the APIs if your instances/volumes are impaired, and hope to have this to customers in the next few months," the post mortem said.

At times, the post mortem took a contrite tone: "As we were sending customers recovery snapshots, we could have been clearer and more instructive on how to run the recovery tools, and provided better detail on the recovery actions customers could have taken. We sometimes assume a certain familiarity with these tools that we should not."

Amazon will grant a 10-day credit to all EC2 instance workload customers in the affected EU West availability zone. A 30-day credit will be granted to those affected by the bug in the snapshot recovery system. Those customers also will get free access to Premium Support engineers.
