Microsoft Azure Storage Service Outage: Postmortem
Tuesday's Azure outage disrupted the MSN website, Office 365, Xbox Live, and third-party services, and may have caused data integrity problems.
Microsoft's Azure Storage Service experienced a service outage beginning around 5:00 p.m. Pacific time Tuesday. As a result, Microsoft's popular MSN news and information website was inaccessible for an undisclosed period of time, and some third-party websites based on Azure ceased to function. In addition, access to Office 365 and the Xbox Live gaming platform was interrupted, and Application Insights Services, an application performance monitoring service for hosted Web applications, stopped functioning.
The disruptions appear to have been the most severe in Western Europe, where those services were slow to come back online Wednesday.
Corporate VP Jason Zander posted a blog entry Wednesday, after most issues had been resolved for North American customers, apologizing for the outage.
"I want to first sincerely apologize for the disruption this has caused. We know our customers put their trust in us and we take that very seriously," he wrote. The outages extended to parts of Asia as well as Europe and the US, he said.
[This isn't the first time Azure has gone down. See Microsoft Azure Outage Explanation Doesn't Soothe.]
The outage was a reminder of how dependent other cloud services are on the storage system. Amazon's most serious outage occurred on Easter weekend in 2011 when a storage network line was inadvertently choked off by human error. The loss of access to data set off a "re-mirroring storm" as the automated systems attempted to recreate many apparently missing data sets. That activity further choked storage networks so that Elastic Block Store became inaccessible to many applications, databases, and users who were depending on it.
According to Zander, Azure administrators on Tuesday were implementing an update to Azure Storage designed to improve its performance, a competitive factor in the race for customers among IBM, Google, Amazon, and Microsoft. The update had been tested for several weeks on a customer-facing subset of the storage service, a process known as "flighting" inside the ranks of Azure operations managers, and had functioned smoothly.
"The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service," Zander wrote. But when Microsoft rolled it out to the general service, something went awry.
"During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues," he said.
Microsoft set about rolling the changes back, but then needed to restart the storage service's customer front ends to complete the process. That took time and apparently didn't always go smoothly. "Most of our customers started seeing the availability improvement," he wrote Wednesday afternoon, but "a limited subset of customers are still experiencing intermittent issues."
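The sequence Zander describes (test on a subset, widen the deployment, roll back when something breaks) can be illustrated with a short sketch. This is not Microsoft's deployment tooling; the function `staged_rollout`, the node names, and the health check are all hypothetical, chosen only to show the general pattern of a staged rollout with a rollback path.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    stages: Iterable[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; undo every update if a stage fails its check."""
    updated: List[str] = []
    for stage in stages:
        for node in stage:
            deploy(node)
            updated.append(node)
        if not all(is_healthy(node) for node in stage):
            # Undo in reverse order of deployment, then stop the rollout.
            for node in reversed(updated):
                rollback(node)
            return False
    return True


# Hypothetical front ends; "fe-3" stands in for a node where the update
# misbehaves even though the first, smaller stage looked fine.
version = {"fe-1": "old", "fe-2": "old", "fe-3": "old", "fe-4": "old"}
bad_nodes = {"fe-3"}

ok = staged_rollout(
    stages=[["fe-1"], ["fe-2", "fe-3"], ["fe-4"]],
    deploy=lambda node: version.__setitem__(node, "new"),
    is_healthy=lambda node: node not in bad_nodes,
    rollback=lambda node: version.__setitem__(node, "old"),
)
print(ok, version)
```

In this toy run the first stage passes, the second fails, and every updated node is returned to the old version before the final stage is touched. The sketch also shows the limit of such a gate: a small early stage that passes says nothing about a fault that appears only on other nodes or under broader load.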
Microsoft's Visual Studio Online service for developers was affected, in addition to other services, and its maintenance team posted notices about the outage as well. One of them suggested that updates to data sets may have been delayed.
On Wednesday morning Pacific time, Microsoft reported: "We have restored the system completely. Our monitoring is green and all services are running as expected."
However, it also warned that customers who had been affected "will see a data gap during the impacted window from (from 11/19/2014 01:00 UTC till 11/19/2014 05:00 UTC)." Those Universal Time readings correspond to 5 p.m. to 9 p.m. Pacific time Tuesday. "Data gap" was not defined in the posting, nor was there advice on how to cope with a data gap if one is found. But if the Azure Storage Service was not functional, it probably means updates that were attempted during that period will need to be repeated. For some customers, data integrity may be an issue.
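Readers who want to check the UTC window against their own logs can do the conversion in a few lines of Python. The sketch below assumes a fixed UTC-8 offset for Pacific Standard Time, which holds for this date because daylight saving time had already ended for 2014; the helper name `to_pacific` is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time (UTC-8) applies, since daylight saving
# time had ended before Nov. 19, 2014.
PST = timezone(timedelta(hours=-8), "PST")


def to_pacific(utc_text: str) -> datetime:
    """Parse an 'MM/DD/YYYY HH:MM' UTC timestamp and convert it to PST."""
    utc_time = datetime.strptime(utc_text, "%m/%d/%Y %H:%M")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(PST)


window_start = to_pacific("11/19/2014 01:00")
window_end = to_pacific("11/19/2014 05:00")

print(window_start.isoformat())   # 2014-11-18T17:00:00-08:00
print(window_end.isoformat())     # 2014-11-18T21:00:00-08:00
print(window_end - window_start)  # 4:00:00
```

The posted window spans four hours and falls entirely on Nov. 18 in Pacific time.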
"Our engineering and support teams are actively engaged to help customers through this time," wrote Zander. The company has promised to make public a full postmortem on the incident at an undisclosed date.
Visual Studio Online, an Azure service, experienced an outage Aug. 14, and a broader set of services went down on Aug. 18. The Aug. 18 incident affected Azure Cloud Services, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, and StorSimple in multiple data centers and regions.