December 14, 2009
Amazon Web Services has attributed a 44-minute outage in part of its Northern Virginia data center last week to a power supply failure in one "availability zone" of the data center, soon followed by the failure of a second component in the redundant system.
Users of the Amazon EC2 cloud with workloads in Amazon's Northern Virginia data center experienced problems early in the morning of December 9, with some operations in a part of the data center interrupted during a five-hour period.
Amazon started notifying customers of a problem at 4:08 a.m. Eastern. By 9:41 a.m., its Amazon Service Health Dashboard reported that "we have completed recovery of most instances affected by this event."
The postings first mentioned a connectivity issue, then acknowledged a power issue. In following up on the postings, InformationWeek asked Amazon whether the power issue was inside the data center or an issue with an external supplier.
Amazon spokesmen responded that "a single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power."
Forty-three minutes after the first notice, Amazon Web Services posted messages such as "The underlying power issue has been addressed, instances have begun to recover" at 4:51 a.m. Eastern and "most affected instances and are operating normally" at 5:11 a.m.
The next day it added an explanation: "A single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power."
The actual outage ran from 3:34 a.m. to 4:19 a.m. Eastern, according to a monitoring service that gathers information by pinging traffic over the Internet and off the accounts it maintains inside Amazon facilities.
Apparent Networks set up the monitoring service because it wanted to illustrate what its PathView Cloud could do for companies making use of cloud computing. It said it maintains 20 accounts in the data center that experienced the outage and six of them went down. Apparent Networks spokesmen were careful to say they have no way of knowing if their experience applied to the data center as a whole.
By using a network path to monitor the data center, Apparent Networks can see something that Hyperic's systems management system, CloudStatus, cannot. It tracked its own pinging and command traffic to a router in Northern Virginia, where the traffic stopped short of the virtual server that Apparent was running there. Amazon is known to operate a data center near McLean, Va., but company officials don't name specific locations in communications. Likewise, the Amazon Service Health Dashboard avoids naming locations beyond a region, in which it might have several data centers. In this case it referred only to the US-East-1 region.
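The outside-in view described above can be illustrated with a very rough sketch. It is not how PathView Cloud works, which the article does not detail; it is a much simpler stand-in that checks whether a TCP connection to each monitored instance succeeds, then summarizes how many checks failed. The function names and the host addresses in the usage comment are hypothetical.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def summarize(results: dict) -> str:
    """Report how many named probes failed, listing the failures by name."""
    down = sorted(name for name, ok in results.items() if not ok)
    return f"{len(down)} of {len(results)} probes failed: {', '.join(down) or 'none'}"


# Hypothetical usage (addresses are placeholders from a documentation range):
#   targets = {"account-01": ("192.0.2.10", 22), "account-02": ("192.0.2.11", 22)}
#   results = {name: probe(host, port) for name, (host, port) in targets.items()}
#   print(summarize(results))
```

A check like this only says whether the endpoint answered. It cannot tell where along the path the traffic stopped, which is the kind of detail the article credits to path-based monitoring.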
If a user of Apparent Networks' PathView Cloud found evidence of a service outage, that user could match up that information with Amazon's own CloudWatch service or Hyperic's CloudStatus to see how his individual virtual machines were performing and learn more, noted Javier Soltero, CTO of management products at SpringSource, a unit of VMware.
"On the whole, Amazon is extremely consistent," said Soltero. That consistency isn't simply in operating data centers but in its willingness to report incidents to customers through the service dashboard. In this instance, however, "we saw a gap between the actual outage" and when the service notices started to appear. The gap was 34 minutes long, if Apparent Networks' outage times are right, which is either a short time or an unbearably long time. For an EC2 customer in the affected data center, the view of that gap depends on whether the workloads running there were time-sensitive.
Amazon's incident notice language is also location non-specific. Customers can't tell from the notices whether they have a virtual machine running where the incident is taking place. They must subscribe either to Amazon's CloudWatch or to a third-party service, such as PathView Cloud or CloudStatus, that's looking at the cloud from the outside.
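The timing figures quoted in this account can be checked with a few lines of arithmetic. The sketch below takes the clock readings reported above (all Eastern, on December 9) and computes the gap between the outage start seen by Apparent Networks and Amazon's first notice, along with the time from that notice to the "power issue has been addressed" message. The variable and function names are illustrative only.

```python
from datetime import datetime

DAY = "2009-12-09"  # date of the incident, per the article


def at(clock: str) -> datetime:
    """Parse an a.m. clock reading such as '3:34' on the day of the incident."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed from one clock reading to a later one."""
    return int((at(end) - at(start)).total_seconds() // 60)


outage_start = "3:34"  # Apparent Networks: outage begins
first_notice = "4:08"  # Amazon: first customer notice
power_fixed = "4:51"   # Amazon: "underlying power issue has been addressed"

notice_gap = minutes_between(outage_start, first_notice)
notice_to_fix = minutes_between(first_notice, power_fixed)

print(f"Outage start to first notice: {notice_gap} minutes")
print(f"First notice to power fix message: {notice_to_fix} minutes")
```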