11/21/2014
10:44 AM

Microsoft Azure Outage: Questions Remain

Microsoft Azure's East and South Central US data centers' performance plummeted while West remained relatively unaffected.

On Nov. 18, Microsoft rolled out an update to its Azure storage service that contained an unintended infinite loop for a certain operation buried in the code. The triggering of that infinite loop in normal operations caused the service to basically freeze. Third-party cloud service monitoring services show that as the update rolled out, each Azure data center experienced a growing service latency followed by an outage about an hour later.

A chart supplied at the request of InformationWeek by former Compuware application performance management service Dynatrace showed a 241.75-millisecond (about a quarter of a second) latency building up before failure -- a disastrous slowdown in storage operations. Earlier that day it had run at an average 15-16 milliseconds.

Likewise, charts obtained from the Cedexis Radar monitoring service showed a consistent 95%-97% success rate in attempted connections to Azure cloud services throughout most of the day. Shortly after 4 p.m. PST -- 45 minutes prior to the time Microsoft acknowledged the trouble in its Azure Status page -- that response level started to fall off a cliff. Over the next hour, it dropped from 95%-97% to 7%-8% -- a virtual freeze for most users.

Interestingly, the Cedexis data also show that the problem didn't affect all Azure data centers equally. In a chart showing three US and two European data centers (below), Cedexis metrics illustrate that a little after midnight UTC (4 p.m. PST), Microsoft's US East (in northern Virginia) and South Central (in Dallas) are affected more than the other three -- their performance drop is the most precipitous. An hour later, US East is down to a 7%-8% level, while South Central has ground to a halt and is accepting no connections. South Central's complete outage goes on until about 6:45 p.m. PST (my estimate from a graph without fine calibrations).

(Source: Cedexis)

Azure's US West data center in Northern California shows degraded service but continues to chug along at 55%-60% throughout this period.

In an interview, Cedexis service strategist Pete Mastin said any drop in user connections spells trouble for cloud services, since a failed connection will usually be retried right away. That builds unnecessary traffic and tends to increase the failure rate. In other words, it takes only a small decrease in a service's availability to impact its latency rate. In this case, once the trouble started, Azure's latencies built up rapidly.
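Mastin's point about retries can be put as rough arithmetic. The sketch below assumes, purely for illustration, that every failed connection is retried immediately until it succeeds and that each attempt succeeds independently with the same probability. Under that assumption the expected number of attempts per request is 1 divided by the success rate. The function name and the independence assumption are ours, not Cedexis's.

```python
def expected_attempts(success_rate: float) -> float:
    """Expected attempts per request when failures are retried until success.

    Assumes independent attempts (a geometric distribution), which is an
    illustrative simplification rather than a model of Azure.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


# Success rates in the ranges quoted in the article: a normal day vs. the trough.
for rate in (0.96, 0.075):
    print(f"{rate:.1%} success -> {expected_attempts(rate):.1f} attempts per request")
```

On these assumptions, a service at a 7%-8% success rate would see on the order of 13 connection attempts for every request that eventually gets through, which is the kind of self-inflicted traffic Mastin describes.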

[Want to learn more about the Azure storage outage? See Microsoft Azure Storage Service Outage: Postmortem.]

What happened with the two European data centers, however, is even more of an anomaly than the varied outcomes in the US data centers.

Azure's West Europe data center (in Dublin, Ireland) and North Europe center (in Amsterdam) initially showed relatively little impact. North Europe fell off a barely perceptible 1%-2% (my estimate, since the graph doesn't show fine detail), while West Europe dropped to 80% effectiveness.

All five data centers then started to recover at about the same time -- about 1 a.m. UTC (Greenwich Mean Time) or 5 p.m. PST -- and the recovery continued for two hours. At about 3 a.m. UTC (7 p.m. PST), operations were back to normal at both the US and the European data centers.
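Because the account switches between UTC and PST, a small conversion sketch may help in checking the times. It assumes a fixed UTC-8 offset for PST (valid in November, when daylight saving time is not in effect) and a Nov. 19 UTC calendar date for the recovery. Both are our assumptions; the article gives only Nov. 18 as the date of the rollout.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset; assumes standard time (no daylight saving) in November.
PST = timezone(timedelta(hours=-8), "PST")


def utc_to_pst(hour_utc: int, day: int = 19) -> datetime:
    """Convert an hour on Nov. <day>, 2014 (UTC) to PST."""
    moment = datetime(2014, 11, day, hour_utc, tzinfo=timezone.utc)
    return moment.astimezone(PST)


# Onset, start of recovery, and return to normal, as given in the article.
for hour in (0, 1, 3):
    print(f"{hour:02d}:00 UTC -> {utc_to_pst(hour):%b %d %H:%M} PST")
```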

Then Azure's North Europe data center in Amsterdam suffered a second precipitous drop. From 5 a.m. to 8 a.m. UTC, its user connection rate plummeted from 96% to 37%. Meanwhile, the Dublin-based West Europe center maintained close-to-normal operations, showing a decline of only a few percentage points over the same period.

The performance drop across all data centers makes sense if the storage service update was rolled out simultaneously around the globe -- but, Mastin wondered, why would [Microsoft] do that? Is it considered a best practice to roll out a cloud service update everywhere at the same time? In a statement, corporate VP for Azure Jason Zander said the update had been tested both in isolation and in a limited live production deployment, a process Microsoft calls "flighting." The update had passed all tests.

When it became evident there was a problem with the rollout, why did the Amsterdam data center take a second performance dive just as the business day was getting underway in its time zone? If business activity served as the trigger, why didn't Dublin show a similar drop an hour later? Had troubleshooters implemented a rollback there before the code glitch had time to cause major problems?

It's possible. But then there's the question of the varied responses among the US data centers. Why, at 4 p.m., was the US West data center -- still in the most active part of the business day -- less affected than the US South Central and US East data centers?

Asked about these variations, Mastin was guarded in his response. Having served previously as operations staff at an Internap data center, he's aware of the many individual circumstances and anomalies that can occur in the course of a system update. "We measure things," he said. "We don't necessarily understand why it happened."

That said, he added, "I worked at Internap for four years. We knew rolling out an update to all data centers at one time was never a good idea."
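The alternative Mastin alludes to, updating one data center at a time and stopping when health degrades, might be sketched as follows. Everything here is hypothetical: the region names come from the article, but `deploy`, `success_rate`, and the 90% threshold are illustrative stand-ins, not Microsoft's tooling or process.

```python
from typing import Callable, Iterable


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    success_rate: Callable[[str], float],
    min_success: float = 0.90,  # hypothetical health threshold
) -> list[str]:
    """Deploy region by region; stop after the first unhealthy region.

    Returns the regions that were updated before the rollout halted.
    """
    updated = []
    for region in regions:
        deploy(region)
        updated.append(region)
        if success_rate(region) < min_success:
            # Halt so the remaining regions keep running the old version.
            break
    return updated


# Toy run with made-up readings: the second region degrades, so the rollout stops.
observed = {"US West": 0.96, "US South Central": 0.07, "US East": 0.96}
done = staged_rollout(observed, deploy=lambda r: None, success_rate=observed.get)
print(done)
```

In this toy run the third region is never touched, so a fault that surfaces only under production load is confined to the regions updated so far.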

To collect data, Cedexis has 500 enterprise clients embed JavaScript in pages downloaded by their customers each day. The JavaScript triggers a query from the user's location back to the site visited, capturing response times and reporting the results to Cedexis. The company collects 2 to 3 billion such user samples daily from 29,000 Internet service providers.
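The aggregation step implied here, turning individual browser samples into a per-data-center success rate like those in the charts, can be sketched as follows. The record layout (a data center name plus a success flag) and the sample values are assumptions for illustration; the article does not describe Cedexis's actual data format.

```python
from collections import defaultdict


def success_rates(samples):
    """Return {data_center: fraction of successful connections}.

    `samples` is an iterable of (data_center, succeeded) pairs. This is a
    hypothetical layout, not Cedexis's real schema.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for data_center, succeeded in samples:
        totals[data_center] += 1
        if succeeded:
            successes[data_center] += 1
    return {dc: successes[dc] / totals[dc] for dc in totals}


# Made-up samples, only to show the shape of the calculation.
samples = [
    ("US East", True), ("US East", False), ("US East", False), ("US East", False),
    ("US West", True), ("US West", True), ("US West", False),
]
print(success_rates(samples))
```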

Microsoft's postmortem is still to come. Let's hope it can explain the outage and address solutions to prevent similar events from happening again.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
Charlie Babcock,
User Rank: Author
11/24/2014 | 6:54:14 PM
No question South Central Azure was down
One more point from the Cedexis data on this outage: US East in Northern Virginia is a heavily trafficked Azure site, but it never went entirely off the air. Between roughly 7 and 8 p.m. Eastern Standard Time, users trying to access Azure from many different points around the globe suffered a similar falloff in service, which took it from at least 95% available to about 5% available. Services continued running but were obviously impaired. South Central Azure, in Dallas, however, did simply go off the air, with services 0% accessible for 80-90 min. during the same period. What was the difference? I could speculate, but would rather Microsoft explain.
pete23,
User Rank: Apprentice
11/24/2014 | 4:21:01 PM
A couple of additional points about the Azure outage.
One of the more important lessons that can be learned from this (or any) outage is that, from the enterprise perspective, there is no reason to let these types of outages hinder your business. It's long been a truism in IT: "Don't buy one of anything." What that means is that you don't have one firewall at the top of your stack, you don't have one IP provider from a data center, you don't have one CDN delivering your content -- and what we are learning is that you don't have one homogeneous cloud vendor as your only cloud. It's not enough to be deployed in multiple regions with the same vendor. You need to spread your compute across multiple vendors in multiple availability zones. This is the new best practice that is taking hold. Enterprises that fail to take this lesson will be brought to task. Innovators that understand this lesson will profit. Cedexis is one of the key innovators in this space -
jring281,
User Rank: Apprentice
11/24/2014 | 12:46:39 PM
Azure unintended infinite loop
If MSFT is willing to share the Azure code we will tell them which statement(s) are the fault.
CherylY937,
User Rank: Apprentice
11/23/2014 | 12:15:12 PM
Small correction
The Dublin data centre defines North Europe, not West Europe, and vice versa for Amsterdam. This often confuses people because Dublin is a few hundred kilometers further west than Amsterdam. There again, it is also a little further north. It's difficult to make out from the low-res graphic, but I think it was Dublin that suffered the major problems, not Amsterdam.
Charlie Babcock,
User Rank: Author
11/21/2014 | 7:30:32 PM
Why France?
Without being well informed on Azure operations, it's hard to know what some of the Cedexis data means. But another interesting oddity is that, of the countries worst affected, France stands out with a deeper and longer-lasting trough than other countries. That shows up in two performance charts highlighting country-by-country results provided by Cedexis, along with a look at the Orange network segment there.
Stratustician,
User Rank: Ninja
11/21/2014 | 3:39:42 PM
Re: Post a Status update sooner?
Instances like this just point out that you can't rely 100% on cloud providers; real life sadly gets in the way and these things happen. That being said, I agree with Charlie that yes, Microsoft should've been more timely with the announcement and updates.
Charlie Babcock,
User Rank: Author
11/21/2014 | 2:04:23 PM
Post a Status update sooner?
So far, Microsoft has tended to be fairly forthright on the fact there was a problem, and later, in a postmortem, on the nature of the problem. The timing of the notice on the Azure Status page in this case lagged the actual existence of the problem by about 45 min., but in my experience, all service providers operate that way. My interpretation: they try to get on top of the problem or fix it first, then supply notice of it when it's becoming obvious to users. Not sure we'll reform them on that point any time soon.