Amazon Issues Post Mortem On Cloud Outage - InformationWeek
Cloud // Infrastructure as a Service
03:50 PM
Connect Directly
[Dark Reading Crash Course] Finding & Fixing Application Security Vulnerabilitie
Sep 14, 2017
Hear from a top applications security expert as he discusses key practices for scanning and securi ...Read More>>

Amazon Issues Post Mortem On Cloud Outage

Amazon says it will better protect key backup components to avoid another outage.

Slideshow: Amazon's Case For Enterprise Cloud Computing
Slideshow: Amazon's Case For Enterprise Cloud Computing
(click image for larger view and for full slideshow)
When the power failed at Amazon Web Services data center in Dublin, Ireland, backup generators should have come online and produced replacement power. They did not do so because a programmable logic controller (PLC) that matches the phase of the generated power to the power being used in the data center also failed.

"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task," said an Amazon post mortem of the Aug. 7 event at EU West.

This post mortem doesn't define a source of the ground fault event or describe its nature. Ground faults occur when something disrupts the normal flow electricity in on a hot wire and out on a neutral wire. A person, accidentally touching a hot electricity wire, produces a ground fault as electricity flows into his body instead of through the circuit. A ground fault circuit interrupter, commonly installed in home bathrooms, detects the difference and automatically shuts off the current in a fraction of a second.

AWS' electricity supplier reported a transformer explosion and fire Aug. 7 that it first attributed to a lightning strike, and such an event might cause a ground fault. But the utility has since concluded it did not experience a lightning strike. "The utility provider ... is continuing to investigate root cause," the post mortem said, leaving a basic question hanging in the air. The post mortem was published to the AWS website Saturday.

The outage triggered Amazon's uninteruptible power supply, its battery backup system, which it said, "quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."

The post mortem continued: "We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the region."

An Amazon availability zone is a separate data center working with others in the same region; each zone has its own power and telecommunications supplies. Once connections were lost between availability zones, a new problem arose. When customers targeted API requests for basic services--requests to start a virtual machine in the Elastic Compute Cloud or create temporary storage in the Elastic Block Store services--in the impacted zone, they failed. The API management servers in the surviving availability zones began to queue up API requests to the impacted zone and attempt to process them, leading to API backlogs within 24 minutes of the event.

"Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."

An hour and 19 minutes after the event, all requests for services in the zone that was out were disabled and its failed management servers were removed from service, and EC2 launch times in the zones still functioning started to recover.

But this part of the incident illustrates that AWS has not so far safeguarded the services in one availability zone from mishaps in another. The details in the post mortem make it clear that what started out as a problem of electrical supply to one availability zone spread different problems to other zones, even though a customer's use of a second availability zone to run a backup system is supposed to ensure high availability.

Later in the post mortem, the AWS team stated: "At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption." But there was at least a slowdown as the API servers in the unaffected zones bogged down.

Amazon operations staff reacted quickly to the unexpected loss of both primary and backup power. They brought some of its backup generators online an hour and 13 minutes after the outage by manually synchronizing their phase with the rest of the data center's. This restored power to many EC2 customer workloads and Elastic Block Store volumes, but it could not restore power to much of the networking gear. "So these restored instances were still inaccessible," the post mortem noted. Three hours and eight minutes after the event, power was restored sufficiently to get the network running again and restore connectivity, making workloads accessible.

1 of 2
Comment  | 
Print  | 
More Insights
Newest First  |  Oldest First  |  Threaded View
How Enterprises Are Attacking the IT Security Enterprise
How Enterprises Are Attacking the IT Security Enterprise
To learn more about what organizations are doing to tackle attacks and threats we surveyed a group of 300 IT and infosec professionals to find out what their biggest IT security challenges are and what they're doing to defend against today's threats. Download the report to see what they're saying.
Register for InformationWeek Newsletters
White Papers
Current Issue
2017 State of IT Report
In today's technology-driven world, "innovation" has become a basic expectation. IT leaders are tasked with making technical magic, improving customer experience, and boosting the bottom line -- yet often without any increase to the IT budget. How are organizations striking the balance between new initiatives and cost control? Download our report to learn about the biggest challenges and how savvy IT executives are overcoming them.
Twitter Feed
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.
Flash Poll