Amazon Issues Post Mortem On Cloud Outage

Amazon says it will better protect key backup components to avoid another outage.

Charles Babcock, Editor at Large, Cloud

August 15, 2011

When the power failed at Amazon Web Services' data center in Dublin, Ireland, backup generators should have come online and produced replacement power. They did not do so because a programmable logic controller (PLC) that matches the phase of the generated power to the power being used in the data center also failed.

"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task," said an Amazon post mortem of the Aug. 7 event at EU West.

The post mortem doesn't identify the source of the ground fault or describe its nature. Ground faults occur when something disrupts the normal flow of electricity in on a hot wire and out on a neutral wire. A person who accidentally touches a hot wire produces a ground fault, as electricity flows into the body instead of through the circuit. A ground fault circuit interrupter, commonly installed in home bathrooms, detects the difference and automatically shuts off the current in a fraction of a second.

AWS' electricity supplier reported a transformer explosion and fire Aug. 7 that it first attributed to a lightning strike, and such an event might cause a ground fault. But the utility has since concluded it did not experience a lightning strike. "The utility provider ... is continuing to investigate root cause," the post mortem said, leaving a basic question hanging in the air. The post mortem was published to the AWS website Saturday.

The outage triggered Amazon's uninterruptible power supply, its battery backup system, which, it said, "quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."

The post mortem continued: "We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the region."

An Amazon availability zone is a separate data center working with others in the same region; each zone has its own power and telecommunications supplies. Once connections were lost between availability zones, a new problem arose. When customers targeted API requests for basic services at the impacted zone (requests to start a virtual machine in the Elastic Compute Cloud or create a storage volume in the Elastic Block Store service), the requests failed. The API management servers in the surviving availability zones began to queue up API requests to the impacted zone and attempt to process them, leading to API backlogs within 24 minutes of the event.

"Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."

An hour and 19 minutes after the event, AWS disabled all requests for services in the zone that was out and removed its failed management servers from service. EC2 launch times in the zones still functioning then started to recover.
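The backlog described above can be illustrated with a toy model. The sketch below is not AWS's implementation; the `ZoneRouter` class, its method names, and the queue limit are hypothetical, chosen only to show how bounding a per-zone queue and fencing off a failed zone keeps requests for a dead zone from piling up on shared management servers.

```python
from collections import deque


class ZoneRouter:
    """Toy API front end that queues requests per availability zone.

    Hypothetical illustration only; names and limits are assumptions.
    """

    def __init__(self, max_queue_per_zone):
        self.max_queue_per_zone = max_queue_per_zone
        self.queues = {}        # zone name -> deque of pending requests
        self.disabled = set()   # zones that have been fenced off

    def disable_zone(self, zone):
        """Fence off a zone: drop its backlog and reject new requests for it."""
        self.disabled.add(zone)
        dropped = len(self.queues.pop(zone, ()))
        return dropped

    def submit(self, zone, request):
        """Queue the request, or reject it rather than grow the backlog."""
        if zone in self.disabled:
            return "rejected"
        queue = self.queues.setdefault(zone, deque())
        if len(queue) >= self.max_queue_per_zone:
            return "rejected"
        queue.append(request)
        return "queued"

    def backlog(self):
        """Total number of requests waiting across all zones."""
        return sum(len(q) for q in self.queues.values())


router = ZoneRouter(max_queue_per_zone=3)
# Five launch requests aimed at a zone that is not answering.
results = [router.submit("zone-a", f"launch-{i}") for i in range(5)]
print(results)                        # three queued, then two rejected
print(router.disable_zone("zone-a"))  # backlog entries dropped for zone-a
print(router.submit("zone-b", "launch-x"), router.backlog())
```

Rejecting quickly is a deliberate trade-off in this sketch: a caller gets an immediate error for the failed zone, while requests for healthy zones are not left waiting behind work that cannot complete.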

But this part of the incident illustrates that AWS has not so far safeguarded the services in one availability zone from mishaps in another. The details in the post mortem make it clear that what started out as a problem of electrical supply to one availability zone spread different problems to other zones, even though a customer's use of a second availability zone to run a backup system is supposed to ensure high availability.

Later in the post mortem, the AWS team stated: "At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption." But there was at least a slowdown as the API servers in the unaffected zones bogged down.

Amazon operations staff reacted quickly to the unexpected loss of both primary and backup power. They brought some of the backup generators online an hour and 13 minutes after the outage by manually synchronizing their phase with the rest of the data center's. This restored power to many EC2 customer workloads and Elastic Block Store volumes, but it could not restore power to much of the networking gear. "So these restored instances were still inaccessible," the post mortem noted. Three hours and eight minutes after the event, power was restored sufficiently to get the network running again and restore connectivity, making workloads accessible.

Restoration of the EBS storage service, however, took longer, and that delay also neutralized the value of any EC2 workload instance that was already restored. EBS took longer to restore because so many storage volumes had lost power, and the spare storage capacity wasn't sufficient to handle all the re-mirroring requests. While EBS attempts to re-mirror a volume, writes to that volume are blocked, for new and existing workloads alike. To many functioning workloads, it appeared their storage volumes were "stuck," as their writes were blocked, the post mortem reported.
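The capacity shortfall can be sketched with a few lines of arithmetic. This is a simplified illustration, not a description of how EBS actually places replicas: the `plan_remirroring` function, the volume IDs, and the sizes are all made up, and the sketch assumes each volume needs spare space equal to its own size to create a new mirror.

```python
def plan_remirroring(volume_sizes_gb, spare_capacity_gb):
    """Greedily assign spare capacity to volumes that need a new replica.

    Returns (remirrored, stuck) as lists of volume IDs. Volumes are handled
    in the order given; one that does not fit in the remaining spare
    capacity stays "stuck" with its writes blocked.
    """
    remirrored, stuck = [], []
    remaining = spare_capacity_gb
    for volume_id, size_gb in volume_sizes_gb.items():
        if size_gb <= remaining:
            remaining -= size_gb
            remirrored.append(volume_id)
        else:
            stuck.append(volume_id)
    return remirrored, stuck


# Hypothetical volumes that all lost a replica at the same time.
volumes = {"vol-1": 100, "vol-2": 200, "vol-3": 150}

print(plan_remirroring(volumes, spare_capacity_gb=250))
# → (['vol-1', 'vol-3'], ['vol-2'])

print(plan_remirroring(volumes, spare_capacity_gb=450))
# → (['vol-1', 'vol-2', 'vol-3'], [])
```

With 250 GB spare, one volume stays stuck until more capacity arrives; with 450 GB, every volume can re-mirror at once.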

To get unstuck, AWS needed to add more spare capacity, which meant rousting truck drivers and loaders out of bed to bring in more servers from offsite storage. "We brought in additional labor to get more onsite capacity online and trucked in servers. ... There were delays as this was nighttime in Dublin and the logistics of trucking required mobilizing transportation some distance from the datacenter. Once the additional capacity was available, we were able to recover the remaining volumes."

If a customer used one availability zone and all of that customer's data was in the zone that lost power, and writes were in progress on EBS volumes, it was unclear whether every write had been completed before the power went out. If not, the customer's data would be inconsistent with what the customer's systems expect.

"For the volumes we assumed were inconsistent, we produced a recovery snapshot to enable customers to create a new volume and check its consistency before trying to use it. The process of producing recovery snapshots was time-consuming because we had to first copy all of the data from each node to Amazon Simple Storage Service (Amazon S3), process that data to turn it into the snapshot storage format, and re-copy the data to make it accessible from a customer's account."

Thirty-one hours and 23 minutes after the incident, Amazon had 38% of the recovery snapshots available. Well past the second day of the outage, or 52 hours and 50 minutes afterward, it had 85% of the recovery snapshots delivered. Ninety-eight percent were available 69 hours and 44 minutes after the incident, a little before the three-day mark at 72 hours. The remaining 2% needed to be rebuilt manually.
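The recovery milestones above can be checked with a short calculation. The sketch uses only the elapsed times quoted in the article; the `format_elapsed` helper and the dictionary labels are illustrative names, not anything from the post mortem.

```python
from datetime import timedelta

# Elapsed time after the incident for each recovery milestone (from the article).
milestones = {
    "38% of recovery snapshots": timedelta(hours=31, minutes=23),
    "85% of recovery snapshots": timedelta(hours=52, minutes=50),
    "98% of recovery snapshots": timedelta(hours=69, minutes=44),
}

three_days = timedelta(hours=72)


def format_elapsed(delta):
    """Render a timedelta as hours and minutes, e.g. '2h 16m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


for label, elapsed in milestones.items():
    margin = three_days - elapsed
    print(f"{label}: {format_elapsed(elapsed)} after the event, "
          f"{format_elapsed(margin)} before the 72-hour mark")
# The last line shows the 98% milestone landing 2h 16m before the 72-hour mark.
```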

As previously reported, AWS also encountered a software error in its snapshot recovery system that was unrelated to the power disruption but that affected the rate of recovery.

Amazon said it was taking steps to ensure that a similar event doesn't disrupt its operations in the future. It said it will add isolation to its programmable logic controllers for the backup generators "so they are isolated from other failures." In addition to active PLCs, it will in the future keep a cold, isolated PLC available as further backup, as soon as its power supply vendors figure out how to install such a unit.

"We are going to address the resource saturation that affected API calls at the beginning of the disruption" by putting in an ability to quickly remove failed API management servers from production. "It will take us several more months to complete some changes that we're making, and we will test and roll out these changes carefully," the post mortem said.

It will seek to "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes. ... We will create the capability to recover volumes directly on the EBS servers upon restoration of power, without having to move the data off of those servers."

As for the software bug, "we are instrumenting an alarm that will alert us if there are any unusual situations discovered by the snapshot cleanup identification process, including blocks falsely flagged as being unreferenced." The "unreferenced blocks" were produced by the bug and in some instances disrupted accurate snapshot recoveries.

Communicating with customers during these outages remains a challenge, the AWS team acknowledged. It attempted to present more detailed updates this time around, compared to its Easter weekend outage.

In addition, it will staff up more quickly in response to events and it plans to let customers more easily tell it if their workloads have been impacted.

"We've been hard at work on developing tools to allow you to see via the APIs if your instances/volumes are impaired, and hope to have this to customers in the next few months," the post mortem said.

At times, the post mortem took a contrite tone: "As we were sending customers recovery snapshots, we could have been clearer and more instructive on how to run the recovery tools, and provided better detail on the recovery actions customers could have taken. We sometimes assume a certain familiarity with these tools that we should not."

Amazon will grant a 10-day credit to all EC2 instance workload customers in the affected EU West availability zone. A 30-day credit will be granted to those affected by the bug in the snapshot recovery system. Those customers also will get free access to Premium Support engineers.

About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
