Microsoft Azure Outage: Questions Remain - InformationWeek




Microsoft Azure's East and South Central US data centers' performance plummeted while West remained relatively unaffected.


On Nov. 18, Microsoft rolled out an update to its Azure storage service that contained an unintended infinite loop buried in the code of a certain operation. When normal operations triggered that loop, the service essentially froze. Third-party cloud monitoring services show that as the update rolled out, each Azure data center experienced growing service latency followed by an outage about an hour later.

A chart supplied at InformationWeek's request by Dynatrace, formerly Compuware's application performance management service, showed latency building to 241.75 milliseconds (about a quarter of a second) before failure -- a disastrous slowdown in storage operations. Earlier that day, the service had averaged 15-16 milliseconds.

Likewise, charts obtained from the Cedexis Radar monitoring service showed a consistent 95%-97% success rate in attempted connections to Azure cloud services throughout most of the day. Shortly after 4 p.m. PST -- 45 minutes before Microsoft acknowledged the trouble on its Azure Status page -- that response level started to fall off a cliff. Over the next hour, it dropped from 95%-97% to 7%-8% -- a virtual freeze for most users.

Interestingly, the Cedexis data also show that the problem didn't affect all Azure data centers equally. In a chart showing three US and two European data centers (below), Cedexis metrics illustrate that a little after midnight UTC (4 p.m. PST), Microsoft's US East (in northern Virginia) and South Central (in Dallas) are affected more than the other three -- their performance drop is the most precipitous. An hour later, US East is down to a 7%-8% level, while South Central has ground to a halt and is accepting no connections. South Central's complete outage goes on until about 6:45 p.m. PST (my estimate from a graph without fine calibrations).

(Source: Cedexis)

Azure's US West data center in Northern California shows degraded service but continues to chug along at 55%-60% throughout this period.

In an interview, Cedexis service strategist Pete Mastin said any drop in user connections spells trouble for cloud services, since a failed connection is usually retried right away. Those retries add unnecessary traffic and tend to push the failure rate higher. In other words, even a small decrease in a service's availability can quickly inflate its latency. In this case, once the trouble started, Azure's latencies built up rapidly.
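The retry amplification Mastin describes is why client libraries typically back off instead of retrying immediately. The sketch below is illustrative only -- the function and parameter names are hypothetical, not Azure's or Cedexis's code -- but it shows the standard countermeasure: exponential backoff with jitter, so failing clients don't pile retries onto an already degraded service in lockstep.

```javascript
// Hypothetical sketch: retry a request with exponential backoff and jitter
// instead of retrying immediately, which amplifies load during an outage.
async function fetchWithBackoff(attemptFn, maxRetries = 4, baseDelayMs = 100) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await attemptFn();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // The delay doubles on each attempt; random jitter spreads clients
      // out so their retries don't arrive in synchronized waves.
      const delayMs = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With immediate retries, each failed call turns into several calls within milliseconds; with backoff, the same failures are spread over seconds, giving the degraded service room to recover.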

[Want to learn more about the Azure storage outage? See Microsoft Azure Storage Service Outage: Postmortem.]

What happened with the two European data centers, however, is even more of an anomaly than the varied outcomes in the US data centers.

Azure's West Europe data center (in Dublin, Ireland) and North Europe center (in Amsterdam) initially showed relatively little impact. North Europe fell off by a barely perceptible 1%-2% (my estimate, since the graph doesn't show fine detail), while West Europe dropped to 80% effectiveness.

All five data centers then started to recover at about the same time -- about 1 a.m. UTC (5 p.m. PST) -- and the recovery continued for two hours. By about 3 a.m. UTC (7 p.m. PST), operations were back to normal at both the US and the European data centers.

Then Azure's North Europe data center in Amsterdam suffered a second precipitous drop. From 5 a.m. to 8 a.m. UTC, its user connection rate plummeted from 96% to 37%. Meanwhile, the Dublin-based West Europe center maintained close-to-normal operations, showing a decline of only a few percentage points over the same period.

The performance drop across all data centers makes sense if the storage service update was rolled out simultaneously around the globe -- but, Mastin wondered, why would [Microsoft] do that? Is it considered a best practice to roll out a cloud service update everywhere at the same time? In a statement, corporate VP for Azure Jason Zander said the update had been tested both in isolation and in a limited live production deployment, a process Microsoft calls "flighting." The update had passed all tests.
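The alternative to a simultaneous global push is a staged, region-by-region rollout gated on health checks. The sketch below is a hypothetical illustration of that pattern -- the region names and the deploy/healthCheck functions are stand-ins, not Microsoft's deployment tooling -- showing how a bad update would be contained to one region instead of the whole fleet.

```javascript
// Hypothetical sketch of a staged (region-by-region) rollout with a
// health gate. deploy and healthCheck are caller-supplied stand-ins.
async function stagedRollout(regions, deploy, healthCheck) {
  for (const region of regions) {
    await deploy(region);
    // Gate: halt the rollout the moment a region degrades, limiting the
    // blast radius to one region rather than every data center at once.
    const healthy = await healthCheck(region);
    if (!healthy) {
      throw new Error(`Rollout halted: ${region} failed health check`);
    }
  }
  return regions.length; // number of regions successfully updated
}
```

Under this scheme, the infinite-loop bug would have frozen the first region in the sequence and stopped there -- which is presumably why Mastin called a simultaneous all-region update "never a good idea."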

When it became evident there was a problem with the rollout, why did the Amsterdam data center take a second performance dive just as the business day was getting underway in its time zone? If business activity served as the trigger, why didn't Dublin show a similar drop an hour later? Had troubleshooters implemented a rollback there before the code glitch had time to cause major problems?

It's possible. But then there's the question of the varied responses among the US data centers. Why, at 4 p.m. PST, was the US West data center -- still in the most active part of its business day -- less affected than the US South Central and US East data centers?

Asked about these variations, Mastin was guarded in his response. Having served previously as operations staff at an Internap data center, he's aware of the many individual circumstances and anomalies that can occur in the course of a system update. "We measure things," he said. "We don't necessarily understand why it happened."

That said, he added, "I worked at Internap for four years. We knew rolling out an update to all data centers at one time was never a good idea."

To collect data, Cedexis has 500 enterprise clients embed JavaScript in pages downloaded by their customers each day. The JavaScript triggers a query from the user's location back to the site visited, capturing response times and reporting the results to Cedexis. The company collects 2 to 3 billion such user samples daily from 29,000 Internet service providers.
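That real-user-monitoring pattern -- page-embedded script times a request and reports the sample -- can be sketched roughly as below. This is an illustration of the general technique only; the URLs, field names, and beacon transport are assumptions, not Cedexis's actual protocol.

```javascript
// Rough sketch of the real-user-monitoring pattern described above:
// script on a customer's page times a request to a target endpoint and
// ships the sample (latency plus success/failure) to a collector.
async function measureAndReport(
  targetUrl,
  collectorUrl,
  // navigator.sendBeacon delivers the sample without delaying page
  // unload; the transport is injectable so this can run outside a browser.
  send = (url, body) => navigator.sendBeacon(url, body)
) {
  const start = performance.now();
  let ok = true;
  try {
    await fetch(targetUrl, { cache: 'no-store' });
  } catch (err) {
    ok = false; // a failed connection counts against availability
  }
  const elapsedMs = performance.now() - start;
  send(collectorUrl, JSON.stringify({ target: targetUrl, elapsedMs, ok }));
  return { elapsedMs, ok };
}
```

Aggregated across billions of such samples from real users, this is how a monitoring service can chart both the latency climb and the connection-success collapse described earlier.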

Microsoft's postmortem is still to come. Let's hope it can explain the outage and address solutions to prevent similar events from happening again.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Reader Comments
Charlie Babcock (Author), 11/24/2014, 6:54 PM
No question South Central Azure was down
One more point from the Cedexis data on this outage: US East in Northern Virginia is a heavily trafficked Azure site, but it never went entirely off the air. Between roughly 7 and 8 p.m. Eastern Standard Time, users trying to access Azure from many different points around the globe saw a similar fall-off in service, which took it from at least 95% available to about 5% available. Services continued running but were obviously impaired. South Central Azure, in Dallas, however, did simply go off the air, with services 0% accessible, for 80-90 minutes during the same period. What was the difference? I could speculate; I would rather Microsoft explain.
Charlie Babcock (Author), 11/21/2014, 7:30 PM
Why France?
Without being well informed on Azure operations, it's hard to know what some of the Cedexis data mean. But another interesting oddity is that, of the countries worst affected, France stands out with a deeper and longer-lasting trough than other countries. That shows up in two performance charts provided by Cedexis highlighting country-by-country results, along with a look at the Orange network segment there.
Charlie Babcock (Author), 11/21/2014, 2:04 PM
Post a Status update sooner?
So far, Microsoft has tended to be fairly forthright about the fact that there was a problem and, later in a postmortem, about the nature of the problem. The timing of the notice on the Azure Status page in this case lagged the actual onset of the problem by about 45 minutes, but in my experience, all service providers operate that way. My interpretation: they try to get on top of the problem or fix it first, then supply notice of it when it's becoming obvious to users. Not sure we'll reform them on that point any time soon.