
Amazon IDs Cause Of Data Center Outage

The failure of two power components at a Virginia data center affected some EC2 operations on December 9th, Amazon Web Services says.
Apparent Networks set up the monitoring service because it wanted to illustrate what its PathView Cloud could do for companies making use of cloud computing. It said it maintains 20 accounts in the data center that experienced the outage and six of them went down. Apparent Networks spokesmen were careful to say they have no way of knowing if their experience applied to the data center as a whole.

By using a network path to monitor the data center, Apparent Networks can see something that Hyperic's systems management service, CloudStatus, cannot. Apparent Networks tracked its own ping and command traffic to a router in Northern Virginia, where the traffic stopped short of the virtual server that Apparent was running there. Amazon is known to operate a data center near McLean, Va., but company officials don't name specific locations in communications. Likewise, the Amazon Service Health Dashboard avoids naming locations beyond a region in which it might have several data centers. In this case it referred only to the US-East-1 region.
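
The idea of traffic that reaches a router but stops short of the server behind it can be sketched in a few lines of Python. This is only an illustration, not how PathView Cloud works internally: the `Hop` record, the function names, and the hop labels in the usage note are all hypothetical, and the probe results are supplied by hand rather than measured.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hop:
    """One step on a monitored network path (hypothetical record)."""
    address: str     # label or IP address of the hop
    responded: bool  # True if a probe sent to this hop got a reply


def last_reachable_hop(path: Sequence[Hop]) -> Optional[Hop]:
    """Return the last hop that replied before the first silent hop."""
    last = None
    for hop in path:
        if not hop.responded:
            break  # traffic stops here; later hops are not considered
        last = hop
    return last


def path_reaches_target(path: Sequence[Hop]) -> bool:
    """True only if every hop, including the final target, replied."""
    return bool(path) and all(hop.responded for hop in path)
```

Given a made-up path such as `[Hop("monitor-gateway", True), Hop("northern-virginia-router", True), Hop("ec2-instance", False)]`, `last_reachable_hop` returns the router hop and `path_reaches_target` returns `False`, which is the pattern described above: the probes get as far as the router but not to the virtual server.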

If a user of Apparent Networks' PathView Cloud found evidence of a service outage, that user could match up that information with Amazon's own CloudWatch service or Hyperic's CloudStatus to see how his individual virtual machines were performing and learn more, noted Javier Soltero, CTO of management products at SpringSource, a unit of VMware.

"On the whole, Amazon is extremely consistent," said Soltero. That consistency lies not simply in operating data centers but in its willingness to report incidents to customers through the service dashboard. In this instance, however, "we saw a gap between the actual outage" and when the service notices started to appear. The gap was 34 minutes long, if Apparent Networks' outage times are right, which is either a short time or an unbearably long one. For an EC2 customer in the affected data center, the answer depends on whether the workloads running there were time-sensitive.

Amazon's incident notices are also non-specific about location. Customers can't tell from the notices whether they have a virtual machine running where the incident is taking place. They must subscribe either to Amazon's CloudWatch or to a third-party service, such as PathView Cloud or CloudStatus, that's looking at the cloud from the outside.