Amazon Outage Leaves Latency Mystery

A 49-minute outage at Amazon's retail operation appears to have slowed AWS services in Dublin, Ireland, and sped up others, according to a monitoring service.
Amazon has declined to comment on these operational details, so observers are left to speculate. One possible explanation is that Dublin serves as a backup site to Amazon's Ashburn, Va., service site. Under that theory, a problem in Northern Virginia, Amazon's most heavily trafficked site, led to work being shifted east to Dublin, and the impact showed up in Dublin's AWS cloud services, such as EC2 and S3. They kept running but slowed as latencies rose. Meanwhile, all of Amazon's U.S. sites, including Ashburn, Va., and the two U.S. West sites, showed a slight speedup during the North American outage. So did other Amazon sites around the world.

Part of the explanation has to be the most obvious fact: With retail down, the firm's data centers were freed of one of their major workloads -- retail operations -- and applied more networking and processing power to the remaining cloud services work.

The exception, of course, is Dublin, where the cloud work slowed as the retail trouble developed. That fact suggests Dublin shares in load balancing with Ashburn, or possibly is the primary backup if something goes awry with services in Ashburn. That's a hunch, not a conclusion or anything clearly established by the facts.

But one thing does seem clear. There appears to be a relationship between the efficiency of AWS cloud services and the health of retail. When there's trouble with Amazon retail, that relationship might make the cloud services faster or slower, depending on which data centers back each other up or otherwise depend on each other's operations.

At first glance, Monday's 49-minute retail outage would appear to be completely unrelated to the higher latencies in Dublin, which rose and fell over a 12-hour period. But as Amazon's 2011 Easter outage showed, once something goes wrong in a cloud data center, automated corrective actions kick in that themselves impose a heavy processing burden. What was termed "a re-mirroring storm," meant to fix the seeming disappearance of customer data sets, tied up systems and crippled services far longer than did the human error that set off the storm in the first place.

Some similar event, less drastic in nature, caused Amazon's all-important retail portal to go dark for 49 minutes. For unexplained reasons, that appears to have affected Amazon's Dublin operations by imposing a latency penalty, which slowed its cloud services.

On such slender evidence, enterprise IT managers are trying to make decisions on the safest ways to deploy their workloads to the Amazon cloud. A forthright explanation by Amazon of the outage, now three days old, would help them with that task.