Microsoft Azure Storage Service Outage: Postmortem - InformationWeek


Cloud // Cloud Storage
News
11/20/2014
10:40 AM

Microsoft Azure Storage Service Outage: Postmortem

Azure outage Tuesday produced disruptions to MSN website, Office 365, Xbox Live, and third-party services, as well as possible data integrity problems.




Microsoft's Azure Storage Service experienced a service outage beginning around 5:00 p.m. Pacific time Tuesday. As a result, Microsoft's popular MSN news and information website was inaccessible for an undisclosed period of time, and some third-party websites built on Azure ceased to function. In addition, access to Office 365 and the Xbox Live gaming platform was interrupted, and Application Insights, an application performance monitoring service for hosted Web applications, stopped functioning.

The disruptions appear to have been the most severe in Western Europe, where those services were slow to come back online Wednesday.

Corporate VP Jason Zander posted a blog entry Wednesday, after most issues had been resolved for North American customers, apologizing for the outage.

"I want to first sincerely apologize for the disruption this has caused. We know our customers put their trust in us and we take that very seriously," he wrote. The outages extended to parts of Asia as well as Europe and the US, he said.

[This isn't the first time Azure has gone down. See Microsoft Azure Outage Explanation Doesn't Soothe.]

The outage was a reminder of how dependent other cloud services are on the storage system. Amazon's most serious outage occurred on Easter weekend in 2011 when a storage network line was inadvertently choked off by human error. The loss of access to data set off a "re-mirroring storm" as the automated systems attempted to recreate many apparently missing data sets. That activity further choked storage networks so that Elastic Block Store became inaccessible to many applications, databases, and users who were depending on it.

According to Zander, on Tuesday Azure administrators were rolling out an update to Azure Storage designed to improve its performance, a competitive factor in the race for customers among IBM, Google, Amazon, and Microsoft. The update had been tested for several weeks in a customer-facing subset of the storage service and had functioned smoothly there, a staged-exposure process known as "flighting" inside the ranks of Azure operations managers.

"The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service," Zander wrote. But when Microsoft rolled it out to the general service, something went awry.

"During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues," he said.
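Zander's account, with front ends looping and refusing traffic in a way the limited flight never surfaced, illustrates why staged rollouts are usually gated on health metrics at each stage. The sketch below is a hypothetical illustration of that pattern, not Microsoft's actual tooling; the stage names, metric names, and thresholds are invented for the example.

```python
# Hypothetical sketch of a "flighted" (staged) rollout with a health gate.
# All names and thresholds are illustrative, not Microsoft's deployment system.

def healthy(stage_metrics):
    """A front end stuck in an infinite loop stops accepting requests,
    so its throughput collapses while its CPU stays pegged."""
    return stage_metrics["requests_per_sec"] > 0 and stage_metrics["cpu"] < 0.95

def rollout(stages, deploy, metrics_for, rollback):
    """Deploy stage by stage; if any stage looks unhealthy, undo
    everything deployed so far, in reverse order, and stop."""
    done = []
    for stage in stages:            # e.g. ["flight", "one-region", "all-regions"]
        deploy(stage)
        done.append(stage)
        if not healthy(metrics_for(stage)):
            for s in reversed(done):
                rollback(s)
            return False            # rollout halted and reverted
    return True                     # every stage passed its health check
```

The key design point is that the gate runs after every widening of exposure, so a regression that only appears at scale is still caught before, or at least during, the final stage, rather than after the fleet has converged on the bad build.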

Microsoft set about rolling the changes back, but then needed to restart the storage service's customer front ends to complete the process. That took time and apparently didn't always go smoothly. "Most of our customers started seeing the availability improvement," he wrote Wednesday afternoon, but "a limited subset of customers are still experiencing intermittent issues."

Microsoft's Visual Studio Online service for developers was affected, in addition to other services, and its maintenance team posted notices about the outage as well. One of them suggested that updates to data sets may have been delayed.

On Wednesday morning Pacific time, Microsoft reported: "We have restored the system completely. Our monitoring is green and all services are running as expected."

However, it also warned that customers who had been affected "will see a data gap during the impacted window (from 11/19/2014 01:00 UTC till 11/19/2014 05:00 UTC)." That Universal Time window corresponds to 5 p.m. Pacific Tuesday to 1 a.m. Pacific Wednesday. "Data gap" was not defined in the posting, nor was there advice on how to cope with one if found. But if the Azure Storage Service was not functional, updates attempted during that period will probably need to be repeated. For some customers, data integrity may be an issue.
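One conventional way for a client to cope with a window like that is to keep its own log of attempted writes and replay the ones that fell inside the gap, using idempotent writes so a replay never clobbers a later successful update. The sketch below is a hypothetical illustration of that pattern; the helper names and log format are invented, and only the gap window itself comes from Microsoft's notice.

```python
# Hypothetical sketch: replaying client writes attempted during a
# provider-side "data gap". Helper names and log format are invented;
# the window itself is the one Microsoft published.
import datetime

GAP_START = datetime.datetime(2014, 11, 19, 1, 0)   # 11/19/2014 01:00 UTC
GAP_END   = datetime.datetime(2014, 11, 19, 5, 0)   # 11/19/2014 05:00 UTC

def writes_to_replay(local_log):
    """local_log: (timestamp_utc, key, value) tuples the client recorded
    for every write it attempted. Returns those that fell in the gap."""
    return [(key, value) for ts, key, value in local_log
            if GAP_START <= ts < GAP_END]

def replay(store, pending):
    # Idempotent replay: only fill keys that are still missing, so a
    # successful later write is never overwritten by an old retry.
    for key, value in pending:
        store.setdefault(key, value)
```

The pattern only works if the client was already journaling its writes before the outage, which is why cloud providers generally recommend client-side logging and idempotent request design as a standing practice rather than a recovery-time improvisation.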

"Our engineering and support teams are actively engaged to help customers through this time," wrote Zander. The company has promised to make public a full postmortem on the incident at an undisclosed date.

Visual Studio Online, an Azure service, experienced an outage Aug. 14, and a broader set of services went down on Aug. 18. The incident included Azure Cloud Services, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, and StorSimple in multiple data centers and regions.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld, and former technology editor of Interactive ...

Comments
nasimson, 11/20/2014 | 12:50:54 PM
To err is human, to tolerate is machine.
> Amazon's most serious outage occurred on Easter weekend in 2011 when
> a storage network line was inadvertently choked off by human error.

In this day and age, it seems surprising that systems are not fault tolerant enough to avoid being choked by "human errors." Machines by now should have become intelligent enough not to be disrupted by human error. To err is human, to tolerate is machine.
D. Henschen, 11/20/2014 | 3:54:25 PM
Computer overlords?
Computers are still dumb in many ways, unable to tell humans when they're about to make a mistake. It's our job to test and test again before going live. Sounds like the "Flighting" project wasn't adequately tested. That's quite a data gap to overcome. Let that be a reminder -- one of many cloud customers have had -- that cloud infrastructure is as fallible as their own data center infrastructure.
Charlie Babcock, 11/20/2014 | 5:17:48 PM
Source of cloud failures
In both the Amazon Easter outage and the Microsoft Leap Year outage, an initial small human error led to automated systems drawing the wrong conclusions and setting off a chain reaction that for all practical purposes froze up parts of the infrastructure. Clouds experience failures, the same as enterprise data centers, yes, but the failures are different, and I think the operators are learning from them. Still not a foolproof proposition.
Thomas Claburn, 11/20/2014 | 7:06:05 PM
Re: Source of cloud failures
Given the fallibility of people, I wonder whether the headlines for these types of stories should be more along the lines of Software & People Still Prone To Error.
Charlie Babcock, 11/20/2014 | 10:33:58 PM
Automated management adds unforeseen complications
There's much about this outage that hasn't been fully explained. To be fair, Microsoft hasn't pretended to offer up its full post mortem yet. But I'm looking at data from third parties that says the trouble started at all Azure data centers simultaneously but didn't affect them all the same way. I'd like to learn more.