A 49-minute outage at Amazon's retail operation appears to have slowed AWS services in Dublin, Ireland, and sped up others, according to a monitoring service.
The outage that beset Amazon.com's retail home page lasted longer than some observers first believed. Reports have put the outage variously at 15 minutes, 25 minutes or "just under a half an hour," as Forbes reported soon after the incident.
In fact, it lasted 49 minutes, according to a monitoring service at Compuware, the owner of CloudSleuth cloud service monitoring and the Gomez Web application performance monitoring system, now part of the Compuware APM service. Despite inquiries, Amazon.com spokesmen have remained silent on the cause of the incident Monday and its duration. News reports have been sketchy.
Compuware staff double-checked the Monday incident, in which the Amazon.com home page went dark around noon Pacific time, with customers getting an "Oops" message, and did not become available again until 49 minutes later. Only North American users appear to have lost service. Europe and other parts of the world were unaffected, contrary to an earlier report.
Amazon Web Services continued as usual, suggesting something went wrong with Amazon.com's ecommerce software. Retail operations depend on the same infrastructure as Amazon Web Services for cloud users, although the two were once separate. Both emanate from the same data centers.
One AWS cloud service, the AWS management console for customers, became inaccessible at about the same time as the Amazon.com home site. The console worked for those already logged in, but customers who were not logged in were denied access during a 47-minute period between 11:45 a.m. and 12:32 p.m., according to Amazon's own Service Health Dashboard. That is a near match for Compuware's observation of the retail Amazon.com outage. But unlike the retail site, the management console was inaccessible to users in Europe, Asia-Pacific and South America as well as North America, according to the AWS health dashboard.
One of the people who noticed it was inaccessible was Forrester Research's lead cloud analyst, James Staten, who tweeted: "can't manage #AWS from the console -- outage. 12:58 p.m. Pacific."
With no explanation forthcoming from Amazon, it's hard to know what these seemingly unrelated events mean. But another interesting set of facts comes from a second online cloud monitoring service, Cedexis Radar.
Cedexis Radar observed increased latency, meaning slower response times, at Amazon Web Services' primary European traffic center in Dublin, Ireland. The slowdown built up to about 60 milliseconds of added response time, not crippling, but a noticeable and unwanted increase for most cloud services.
At the same time, Cedexis Radar recorded a speedup in AWS operations at all other Amazon data center sites, such as Amazon West in northern California and Oregon and Amazon South America. The cloud service slowdown in Dublin started at about the time of the Amazon.com outage in North America, built to its peak seven hours later, then tapered off over the following five hours, ending at midnight. The curve marking this latency buildup closely matched the curves showing speedier responses at the other Amazon sites.
How could Dublin's slowdown occur when response times appear to have improved at all the other sites? Cedexis did not list every Amazon site individually; it lumped locations such as Hong Kong, Australia and Japan together as Asia-Pacific. But the pattern held for the major regions listed: The improvements in latencies at Amazon sites -- except Dublin -- show a more muted curve but one of similar length. The sites show their shortest latencies at about 7 p.m. Pacific, then return to normal by midnight.