Amazon Issues Post Mortem On Cloud Outage
Amazon says it will better protect key backup components to avoid another outage.
"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task," said an Amazon post mortem of the Aug. 7 event at EU West.
This post mortem doesn't define a source of the ground fault event or describe its nature. Ground faults occur when something disrupts the normal flow of electricity in on a hot wire and out on a neutral wire. A person who accidentally touches a hot wire produces a ground fault as electricity flows into his body instead of through the circuit. A ground fault circuit interrupter, commonly installed in home bathrooms, detects the difference between the current on the two wires and automatically shuts off the current in a fraction of a second.
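The detection principle described above can be sketched in a few lines of Python. This is only an illustration of comparing current on the hot and neutral wires. The function name and the trip threshold of roughly 5 milliamps are assumptions chosen for the example. Nothing here comes from the Amazon post mortem or describes how the PLC in question works.

```python
# Assumed, illustrative trip threshold of about 5 milliamps.
TRIP_THRESHOLD_AMPS = 0.005


def ground_fault_detected(hot_amps: float, neutral_amps: float,
                          threshold_amps: float = TRIP_THRESHOLD_AMPS) -> bool:
    """Return True when current on the hot wire is not matched on the neutral.

    In a healthy circuit the two currents are equal; a difference means
    some current is leaking to ground by another path.
    """
    leakage_amps = abs(hot_amps - neutral_amps)
    return leakage_amps > threshold_amps


# Balanced circuit: all current returns on the neutral wire.
print(ground_fault_detected(hot_amps=10.0, neutral_amps=10.0))

# About 20 milliamps is leaking somewhere other than the neutral wire.
print(ground_fault_detected(hot_amps=10.0, neutral_amps=9.98))
```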
AWS' electricity supplier reported a transformer explosion and fire Aug. 7 that it first attributed to a lightning strike, and such an event might cause a ground fault. But the utility has since concluded it did not experience a lightning strike. "The utility provider ... is continuing to investigate root cause," the post mortem said, leaving a basic question hanging in the air. The post mortem was published to the AWS website Saturday.
The outage triggered Amazon's uninterruptible power supply, its battery backup system, which it said, "quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."
The post mortem continued: "We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the region."
An Amazon availability zone is a separate data center working with others in the same region; each zone has its own power and telecommunications supplies. Once connections were lost between availability zones, a new problem arose. When customers targeted API requests for basic services--requests to start a virtual machine in the Elastic Compute Cloud or create temporary storage in the Elastic Block Store services--in the impacted zone, they failed. The API management servers in the surviving availability zones began to queue up API requests to the impacted zone and attempt to process them, leading to API backlogs within 24 minutes of the event.
"Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."
An hour and 19 minutes after the event, all requests for services in the downed zone were disabled and its failed management servers were removed from service, after which EC2 launch times in the zones still functioning started to recover.
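A toy model helps show why queuing requests for a dead zone slowed the healthy zones, and why disabling those requests brought relief. The Python sketch below is a simplified illustration, not AWS's implementation. The `ManagementServer` class, the zone names, and the four-slot capacity are all hypothetical. It assumes a request to the downed zone holds a worker slot indefinitely.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementServer:
    """Hypothetical API management server with a fixed number of worker slots."""
    slots: int
    down_zones: set = field(default_factory=set)      # zones that cannot respond
    disabled_zones: set = field(default_factory=set)  # zones rejected up front
    stuck: int = 0  # slots held by requests waiting on a down zone

    def handle(self, zone: str) -> str:
        if zone in self.disabled_zones:
            return "rejected"    # fail fast: no slot is consumed
        if self.stuck >= self.slots:
            return "overloaded"  # every slot is waiting on a down zone
        if zone in self.down_zones:
            self.stuck += 1      # request queues and never completes
            return "queued"
        return "ok"              # healthy zone: the slot is freed right away


def run(server: ManagementServer, requests: list) -> dict:
    """Count outcomes per (zone, outcome) pair for a stream of requests."""
    results = {}
    for zone in requests:
        key = (zone, server.handle(zone))
        results[key] = results.get(key, 0) + 1
    return results


# zone-a stands in for the zone that lost power; zone-b is healthy.
traffic = ["zone-a", "zone-b"] * 10

queueing = ManagementServer(slots=4, down_zones={"zone-a"})
fail_fast = ManagementServer(slots=4, down_zones={"zone-a"},
                             disabled_zones={"zone-a"})

print("queueing: ", run(queueing, traffic))
print("fail fast:", run(fail_fast, traffic))
```

In the queuing case, requests for the healthy zone start failing once every slot is tied up waiting on the dead zone. In the fail-fast case, requests for the dead zone are turned away immediately and the healthy zone is served normally.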
But this part of the incident illustrates that AWS has not so far safeguarded the services in one availability zone from mishaps in another. The details in the post mortem make it clear that what started out as a problem of electrical supply to one availability zone spread different problems to other zones, even though a customer's use of a second availability zone to run a backup system is supposed to ensure high availability.
Later in the post mortem, the AWS team stated: "At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption." But there was at least a slowdown as the API servers in the unaffected zones bogged down.
Amazon operations staff reacted quickly to the unexpected loss of both primary and backup power. They brought some of the backup generators online an hour and 13 minutes after the outage by manually synchronizing their phase with the rest of the data center's. This restored power to many EC2 customer workloads and Elastic Block Store volumes, but it could not restore power to much of the networking gear. "So these restored instances were still inaccessible," the post mortem noted. Three hours and eight minutes after the event, power was restored sufficiently to get the network running again and restore connectivity, making workloads accessible.