Azure Outage Caused By False Failure Alarms: Microsoft

Microsoft's post-mortem on recent Azure cloud outage says a deep Azure security mechanism contained a bug that didn't recognize Leap Day. This set off a cascade of false hardware failure notices.
Microsoft had anticipated such a problem within a cluster, and had established a circuit-breaker mechanism against too many HI signals generating too many server shutdowns and VM transfers. Once a certain threshold of HI warnings was reached, the 1,000-unit cluster was placed in HI status, which halts the automatic service healing and gives human operators a chance to preserve what's still running while halting a cascading bug.

Laing explained how the bug spread anyway:

"At the same time, many clusters were also in the midst of the rollout of a new version of the Fabric Controller, Host Agent, and Guest Agent. That ensured that the bug would be hit immediately in those clusters and the server HI threshold hit precisely 75 minutes (3 times 25-minute timeout) later at 5:15 p.m. Pacific" Feb. 28, wrote Laing in his blog. The Pacific time translates to 1:15 a.m. Greenwich Mean Time in Dublin, or 75 minutes into Leap Day.

Microsoft engineers spotted the correlation between HI notices and the three 25-minute outages occurring after the start of Leap Day. They were able to identify the bug in the Guest Agent's certificate creator two hours and 38 minutes after it first triggered in Dublin. They quickly developed a rollout plan for the repair and had code ready seven hours and 20 minutes after the first incident. But in its eagerness to counteract the bug, Microsoft rolled out a fix that caused an additional problem.

It tested the repaired Guest Agent code on a test cluster, used it with its own applications on another cluster, then rolled it out to its first Azure production cluster at 2:11 a.m. Pacific standard time (10:11 a.m. in Dublin) on Feb. 29, or a little more than 10 hours after the initial triggering of the bug (at midnight Dublin time, or 4 p.m. PST Feb. 28).

Seven 1,000-server clusters in Azure were still in the midst of a partial update of the Fabric Controller, Host Agent, and Guest Agent. Because of that, the fix did not work with them. Microsoft's attempt to use a "blast" fix to all units at the same time, instead of a more measured, gradual deployment brought down not only the newly starting VMs but some existing healthy ones in the seven clusters.

Running workloads rely on Azure's Access Control Service, Azure Service Bus, and other services. "... Any application using them was now impacted because of the loss of services on which they depended," Laing conceded.

Microsoft fixed this problem as well for the seven affected clusters, and got Windows Azure back to a "largely operational" status by 8 a.m. PST, or a total of 16 hours after the first incident for the last seven clusters. Mopping up problems continued for several hours through human intervention.

From the start of his blog, Laing struck an apologetic note: "We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues ... Again, we sincerely apologize for the disruption, downtime, and inconvenience this incident has caused. We will be proactively issuing a service credit to our impacted customers ... Rest assured that we are already hard at work using our learnings to improve Windows Azure," Laing wrote.

The outage demonstrates a hitherto unknown security feature built into the Microsoft's Azure cloud. In order for an application to be started, the Guest Agent on the application's operating system creates a transfer certificate. The certificate includes a public key that it can hand off to the host agent to begin transfers of encrypted information between the hypervisor and application.

This is applying a secure method of operation both deep inside the cloud and inside the virtualized host. Once the Host Agent has been handed the key by the Guest Agent, the host server can send security certificates and other encrypted information to the application. If these transmissions were intercepted or for some reason ended up in the wrong hands, only the intended recipient has the means to decrypt them.

This is an application of basic public key infrastructure (PKI), widely used between remote parties exchanging information securely over the Internet. But a virtual machine on a server is very close to the hypervisor managing its affairs. In fact, they occupy bounded parts of the same physical RAM on the server. Using PKI in this setting is a new and fine-grained use that's not been commonly resorted to in the past and an example of an extra measure of security built into a public cloud infrastructure.

It also introduced greater complexity and several steps in which a hidden software bug, like a Leap Day problem, could cause something to go wrong. But in the long run, this type of architecture is likely to build greater confidence in public infrastructure as a service.

As enterprises ramp up cloud adoption, service-level agreements play a major role in ensuring quality enterprise application performance. Follow our four-step process to ensure providers live up to their end of the deal. It's all in our Cloud SLA report. (Free registration required.)

Editor's Choice
Brandon Taylor, Digital Editorial Program Manager
Jessica Davis, Senior Editor
Terry White, Associate Chief Analyst, Omdia
Richard Pallardy, Freelance Writer
Cynthia Harvey, Freelance Journalist, InformationWeek
Pam Baker, Contributing Writer