Microsoft Azure Outage Blamed On Bad Code

Microsoft's analysis of the Nov. 18 Azure outage indicates that engineers' decision to widely deploy misconfigured code triggered the major cloud outage.

Charles Babcock, Editor at Large, Cloud

December 22, 2014

3 Min Read


Microsoft has corrected a software bug in its Azure cloud storage that triggered a massive outage when deployed widely on Nov. 18. In some cases the error sent storage servers into infinite loops, tying them up and dragging much of the Azure cloud into a drastic slowdown.

As a result, connection rates to Azure on Nov. 18 dropped from 97% to 7%-8% after 7 p.m. Eastern in Northern Virginia. The Azure data center in Dallas suffered a complete outage for a short while. Data centers in Europe didn't recover until deep into the following day.

Microsoft had tested the storage update code before deployment, but contrary to its own best practices, Azure administrators rolled it out to Azure storage services as a whole instead of "flighting" -- limiting the roll-out to small sections at a time.

"The standard flighting deployment policy of incrementally deploying changes across small slices was not followed," wrote Microsoft's Jason Zander, corporate VP for the Azure team, in a blog Dec. 17.
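The flighting policy Zander describes can be pictured as a loop over small slices of the fleet, with a health check gating each step. The following Python sketch is illustrative only; the names `flighted_rollout`, `deploy`, and `health_check` are assumptions, not Azure's actual tooling.

```python
def flighted_rollout(slices, deploy, health_check):
    """Deploy a change one small slice at a time, halting on the first failure.

    A minimal sketch of incremental "flighting": if the change misbehaves,
    the blast radius is limited to the slice where the health check failed,
    rather than the whole fleet.
    """
    for s in slices:
        deploy(s)                      # push the change to one slice only
        if not health_check(s):        # verify before expanding the rollout
            return f"halted at slice {s}"
    return "fully deployed"
```

A bad change caught by the health check stops the rollout early, which is the safety property a fleet-wide push forfeits:

```python
deployed = []
result = flighted_rollout(
    ["slice-1", "slice-2", "slice-3"],
    deploy=deployed.append,
    health_check=lambda s: s != "slice-2",  # simulate a failure in slice 2
)
# result == "halted at slice slice-2"; slice-3 was never touched
```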

[Want to learn more about the cloud outage's impact? See Microsoft Azure Storage Service Outage: Postmortem.]

The key problem, however, was a configuration issue: a switch meant for Azure Table storage front-ends was flipped on for the wrong service. "The configuration switch was incorrectly enabled for Azure Blob storage front-ends," wrote Zander. Blob is a service for storing large amounts of unstructured data; Table storage front-ends record the sequence of the different data types going into a Blob and can be used to guide the data's retrieval. The misapplied configuration switch appears to have caused an infinite loop.
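How a feature flag enabled on the wrong front-end type can produce an infinite loop is easy to see in miniature. This Python sketch is purely hypothetical (the `frontend_type` values, the loop shape, and the `max_iterations` safety cap are all assumptions for illustration, not Azure's code): the optimized path's exit condition holds only for the data layout the flag was designed for.

```python
def process_request(frontend_type, optimization_enabled, max_iterations=1000):
    """Illustrative sketch: the optimized code path can only terminate on
    "table" front-ends; enabling it on "blob" front-ends spins forever.

    The max_iterations cap stands in for the real server, which had no such
    cap and simply stayed tied up.
    """
    iterations = 0
    done = False
    while not done:
        iterations += 1
        if iterations > max_iterations:
            return "stuck: server tied up"   # stand-in for the infinite loop
        if optimization_enabled:
            # exit condition only ever holds for the Table data layout
            done = (frontend_type == "table")
        else:
            done = True                       # legacy path always terminates
    return f"completed in {iterations} iteration(s)"
```

The same switch is safe where it was tested and pathological where it was not, which is why incremental rollout, not just pre-production testing, matters.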


The original change was meant to improve Azure Storage performance. In test after test, including pre-production staging, it did so and proved reliable, wrote Zander. That may have led to overconfidence and haste in deploying the update to realize the performance gains.

Whatever the cause, Azure administrators have since implemented automated controls that won't allow a human decision to overrule the "flighting" best practice of rolling new code into production in separate, limited increments.

In perhaps the clearest outcome of the incident, Zander wrote: "Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling that relied on human decisions ... With the tooling updates, the policy is now enforced by the deployment platform itself."
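Moving the policy from guidelines into the deployment platform itself means the tooling rejects an over-broad push before it happens. A minimal Python sketch of such a gate, under assumed names (`validate_deployment`, `PolicyViolation`) and an assumed 10% per-step limit; note the operator override flag is deliberately ignored, which is the point of tooling-enforced policy:

```python
class PolicyViolation(Exception):
    """Raised when a deployment request exceeds the per-step slice limit."""


def validate_deployment(target_slices, total_slices, operator_override=False):
    """Hypothetical deployment gate: reject any push touching more than a
    fixed fraction of the fleet in one step, regardless of human override."""
    max_fraction = 0.10  # assumed per-step limit, for illustration only
    if len(target_slices) > total_slices * max_fraction:
        # operator_override is intentionally never consulted:
        # the policy lives in the platform, not in a human decision
        raise PolicyViolation(
            f"deploy targets {len(target_slices)}/{total_slices} slices; "
            f"limit is {max_fraction:.0%} per step"
        )
    return "approved"
```

With this shape, a fleet-wide rollout like the Nov. 18 one would fail validation even if an engineer explicitly asked for it.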

Zander acknowledged that cloud operations must become more reliable and said Microsoft will continue to work on that goal. "We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services," he wrote.


About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
