Microsoft has corrected a software bug in its Azure Cloud Storage that, when deployed widely on Nov. 18, triggered a massive outage. In some cases the error triggered infinite loops and tied up storage servers, dragging down much of the Azure cloud into a drastic slowdown.
As a result, connection rates to Azure on Nov. 18 dropped from 97% to 7%-8% after 7 p.m. Eastern in Northern Virginia. The Azure data center in Dallas suffered a complete outage for a short while. Data centers in Europe didn't recover until deep into the following day.
Microsoft had tested the storage update code before deployment, but contrary to its own best practices, Azure administrators rolled it out to Azure storage services as a whole instead of "flighting" -- limiting the roll-out to small sections at a time.
"The standard flighting deployment policy of incrementally deploying changes across small slices was not followed," wrote Microsoft's Jason Zander, corporate VP for the Azure team, in a blog Dec. 17.
[Want to learn more about the cloud outage's impact? See Microsoft Azure Storage Service Outage: Postmortem.]
The key problem, however, was a configuration issue in Azure Table storage front ends. "The configuration switch was incorrectly enabled for Azure Blob storage front-ends," wrote Zander. Table storage front-ends record the sequence of the different data types going into a Blob (a service for storing large amounts of unstructured data) and can be used to guide the data's retrieval. The error in the configuration switch appears to have caused an infinite loop.
The original change was meant to improve Azure Storage performance. In test after test, including pre-production staging, it did so and proved reliable, wrote Zander. That may have lead to overconfidence and haste in attempting to deploy the update and realize the performance gains.
Whatever the cause, Azure administrators have implemented automated practices that won't allow a human decision to overrule its "flighting" best practice -- using separate and limited implementations for putting new code into production.
In perhaps the clearest outcome of the incident, Zander wrote: "Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling that relied on human decisions ... With the tooling updates, the policy is now enforced by the deployment platform itself."
Zander acknowledged that cloud operations must become more reliable and said Microsoft will continue to work on that goal. "We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services," he wrote.
Network Computing's new Must Reads is a compendium of our best recent coverage of storage. In this issue, you'll learn why storage arrays are shrinking for the better, discover the ways in which the storage industry is evolving towards 3D flash, find out how to choose a vendor wisely for cloud-based disaster recovery, and more. Get the Must Reads: Storage issue from Network Computing today.