Amazon Issues Post Mortem On Cloud Outage - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Cloud // Infrastructure as a Service
03:50 PM
Connect Directly

Amazon Issues Post Mortem On Cloud Outage

Amazon says it will better protect key backup components to avoid another outage.

Slideshow: Amazon's Case For Enterprise Cloud Computing
Slideshow: Amazon's Case For Enterprise Cloud Computing
(click image for larger view and for full slideshow)
When the power failed at Amazon Web Services data center in Dublin, Ireland, backup generators should have come online and produced replacement power. They did not do so because a programmable logic controller (PLC) that matches the phase of the generated power to the power being used in the data center also failed.

"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task," said an Amazon post mortem of the Aug. 7 event at EU West.

This post mortem doesn't define a source of the ground fault event or describe its nature. Ground faults occur when something disrupts the normal flow electricity in on a hot wire and out on a neutral wire. A person, accidentally touching a hot electricity wire, produces a ground fault as electricity flows into his body instead of through the circuit. A ground fault circuit interrupter, commonly installed in home bathrooms, detects the difference and automatically shuts off the current in a fraction of a second.

AWS' electricity supplier reported a transformer explosion and fire Aug. 7 that it first attributed to a lightning strike, and such an event might cause a ground fault. But the utility has since concluded it did not experience a lightning strike. "The utility provider ... is continuing to investigate root cause," the post mortem said, leaving a basic question hanging in the air. The post mortem was published to the AWS website Saturday.

The outage triggered Amazon's uninteruptible power supply, its battery backup system, which it said, "quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."

The post mortem continued: "We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the region."

An Amazon availability zone is a separate data center working with others in the same region; each zone has its own power and telecommunications supplies. Once connections were lost between availability zones, a new problem arose. When customers targeted API requests for basic services--requests to start a virtual machine in the Elastic Compute Cloud or create temporary storage in the Elastic Block Store services--in the impacted zone, they failed. The API management servers in the surviving availability zones began to queue up API requests to the impacted zone and attempt to process them, leading to API backlogs within 24 minutes of the event.

"Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."

An hour and 19 minutes after the event, all requests for services in the zone that was out were disabled and its failed management servers were removed from service, and EC2 launch times in the zones still functioning started to recover.

But this part of the incident illustrates that AWS has not so far safeguarded the services in one availability zone from mishaps in another. The details in the post mortem make it clear that what started out as a problem of electrical supply to one availability zone spread different problems to other zones, even though a customer's use of a second availability zone to run a backup system is supposed to ensure high availability.

Later in the post mortem, the AWS team stated: "At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption." But there was at least a slowdown as the API servers in the unaffected zones bogged down.

Amazon operations staff reacted quickly to the unexpected loss of both primary and backup power. They brought some of its backup generators online an hour and 13 minutes after the outage by manually synchronizing their phase with the rest of the data center's. This restored power to many EC2 customer workloads and Elastic Block Store volumes, but it could not restore power to much of the networking gear. "So these restored instances were still inaccessible," the post mortem noted. Three hours and eight minutes after the event, power was restored sufficiently to get the network running again and restore connectivity, making workloads accessible.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
1 of 2
Comment  | 
Print  | 
More Insights
Learning: It's a Give and Take Thing
James M. Connolly, Editorial Director, InformationWeek and Network Computing,  1/24/2020
IT Careers: Top 10 US Cities for Tech Jobs
Cynthia Harvey, Freelance Journalist, InformationWeek,  1/14/2020
Predictions for Cloud Computing in 2020
James Kobielus, Research Director, Futurum,  1/9/2020
White Papers
Register for InformationWeek Newsletters
Current Issue
The Cloud Gets Ready for the 20's
This IT Trend Report explores how cloud computing is being shaped for the next phase in its maturation. It will help enterprise IT decision makers and business leaders understand some of the key trends reflected emerging cloud concepts and technologies, and in enterprise cloud usage patterns. Get it today!
Flash Poll