Amazon Issues Post Mortem On Cloud Outage

Amazon says it will better protect key backup components to avoid another outage.

When the power failed at the Amazon Web Services data center in Dublin, Ireland, backup generators should have come online and produced replacement power. They did not do so because a programmable logic controller (PLC) that matches the phase of the generated power to the power being used in the data center also failed.

"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task," said an Amazon post mortem of the Aug. 7 event at EU West.


The post mortem doesn't identify the source of the ground fault or describe its nature. Ground faults occur when something disrupts the normal flow of electricity in on a hot wire and out on a neutral wire. A person who accidentally touches a hot wire produces a ground fault, as electricity flows into the body instead of through the circuit. A ground fault circuit interrupter, commonly installed in home bathrooms, detects the difference between the current going out and the current coming back and automatically shuts off the current in a fraction of a second.

AWS' electricity supplier reported a transformer explosion and fire Aug. 7 that it first attributed to a lightning strike, and such an event might cause a ground fault. But the utility has since concluded it did not experience a lightning strike. "The utility provider ... is continuing to investigate root cause," the post mortem said, leaving a basic question hanging in the air. The post mortem was published to the AWS website Saturday.

The outage triggered Amazon's uninterruptible power supply, its battery backup system, which, it said, "quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."

The post mortem continued: "We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the region."

An Amazon availability zone is a separate data center working with others in the same region; each zone has its own power and telecommunications supplies. Once connections were lost between availability zones, a new problem arose. API requests that customers targeted at the impacted zone for basic services--requests to start a virtual machine in the Elastic Compute Cloud or create temporary storage in the Elastic Block Store service--failed. The API management servers in the surviving availability zones began to queue up API requests to the impacted zone and attempt to process them, leading to API backlogs within 24 minutes of the event.

"Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."

An hour and 19 minutes after the event, Amazon disabled all requests for services in the failed zone and removed that zone's failed management servers from service. EC2 launch times in the zones still functioning then started to recover.
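The effect described above, in which requests bound for a dead zone tie up shared management capacity until they are refused outright, can be illustrated with a toy model. This is only a sketch under stated assumptions, not Amazon's implementation: the function name `simulate`, the zone names, the fixed pool of worker slots, and the rule that a healthy request completes instantly are all hypothetical simplifications chosen for illustration.

```python
def simulate(requests, workers, fail_fast, dead_zone="zone-a"):
    """Toy model of a management server with a fixed pool of worker slots.

    requests:  zone names, one per incoming API request, in arrival order.
    workers:   number of slots the (hypothetical) server has.
    fail_fast: if True, requests to the dead zone are refused on arrival.

    A request to a healthy zone completes at once and returns its slot.
    A request to the dead zone never completes, so it holds its slot
    for the rest of the run unless fail_fast is set.
    Returns a tuple (served, rejected, stuck).
    """
    free = workers
    served = rejected = stuck = 0
    for zone in requests:
        if zone == dead_zone:
            if fail_fast:
                rejected += 1   # refused immediately; no slot consumed
            elif free > 0:
                free -= 1       # slot now waits on the dead zone
                stuck += 1
            else:
                rejected += 1   # no slot left to wait in
        elif free > 0:
            served += 1         # completes instantly, slot is returned
        else:
            rejected += 1       # healthy-zone request starved of a slot
    return served, rejected, stuck


# Ten requests for the dead zone interleaved with ten for a healthy zone.
traffic = ["zone-a", "zone-b"] * 10
print("queue and wait:", simulate(traffic, workers=4, fail_fast=False))
print("fail fast:     ", simulate(traffic, workers=4, fail_fast=True))
```

In this model, letting dead-zone requests wait exhausts the four slots after a handful of arrivals, and healthy-zone requests are then turned away as well. Refusing dead-zone requests on arrival keeps every healthy-zone request served, which mirrors the recovery the post mortem describes once requests to the failed zone were disabled.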

But this part of the incident illustrates that AWS has not yet safeguarded the services in one availability zone from mishaps in another. The details in the post mortem make it clear that what started out as a problem of electrical supply to one availability zone spread different problems to other zones, even though a customer's use of a second availability zone to run a backup system is supposed to ensure high availability.

Later in the post mortem, the AWS team stated: "At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption." But there was at least a slowdown as the API servers in the unaffected zones bogged down.

Amazon operations staff reacted quickly to the unexpected loss of both primary and backup power. They brought some of the backup generators online an hour and 13 minutes after the outage by manually synchronizing their phase with the rest of the data center's. This restored power to many EC2 customer workloads and Elastic Block Store volumes, but it could not restore power to much of the networking gear. "So these restored instances were still inaccessible," the post mortem noted. Three hours and eight minutes after the event, power was restored sufficiently to get the network running again and restore connectivity, making workloads accessible.

