A process meant to detect failed hardware in Microsoft's Azure cloud was inadvertently triggered by a Leap Day software bug that set invalid expiration dates for security certificates. The bad certificates caused virtual machine startups to stall, which in turn generated more and more readings of hardware failure until Microsoft had a full-blown service outage on its hands.
There was no real epidemic of hardware failure. It was simply one cloud mechanism misinterpreting what was going with another and overreacting. The software bug itself was isolated with 2.5 hours and corrected within 7.5 hours of the first incident.
Microsoft's Bill Laing, corporate VP for the server and cloud division, posted a blog late Friday, following up on a March 1 blog on the outage, that aired the root cause analysis of its Azure Feb. 29th outage.
Here's a blow-by-blow description of what caused it.
When a virtual machine workload in Azure fails to start for the third time, what's known as the Host Agent on the server concludes it has detected a likely hardware problem and reports it to the governor of the server cluster, the Fabric Controller. The controller moves virtual machines off the "failing hardware" and restarts them elsewhere. At the start of Feb. 29, this Azure protection process inadvertently spread the Leap Day bug to more servers and eventually, other 1,000-unit clusters supervised by other controllers.
The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland, Chicago, and San Antonio.
[ Want to learn more about previous cloud outages? See Post Mortem: When Amazon's Cloud Turned On Itself. ]
Only some services were affected; most running workloads continued operating. Azure Compute's ability to spin up new workloads was affected, although it continued to run existing workloads in many cases. Also, out were: Data Sync Services, Azure Service Bus, and SQL Azure Portal (the SQL Azure cloud database system itself continued operating). In some cases, these services were out for up to 16 hours.
As Feb. 28 rolled over to Leap Day at midnight in Dublin (and 4 p.m. Pacific time Feb. 28), a security certificate date-setting mechanism in the Guest Agent--an agent assigned to a customer workload by Azure--tried to set an expiration date 365 days later--Feb. 29, 2013, a date that does not exist. The lack of a valid expiration date caused the security certificate to fail to be recognized by a server's Host Agent, and that in turn caused the virtual machine it was meant to enable to stall in its initialization phase.
As virtual machines failed repeatedly, the Host Agent monitoring the server used an established workload standard--three failures in a row--as a sign of a likely or pending failure of some component of the server host's hardware. It reported the repeat failures to an automated supervisor, the cluster Fabric Controller.
Per the standard, after three repeat stalls, the Azure Fabric Controller places the server in a non-use, human investigate (HI), state. In the Azure cloud, HI is a mark of repeated failure of a virtual machine (VM) in which no customer code has run. That's suspicious because, theoretically, the VM has attempted to start up with known, clean code. In such circumstances, a hardware fault is the prime suspect.
The Fabric Controller, with 1,000 Azure servers under its domain, can act by moving the VMs off a failing server and onto other servers. The problem with that maneuver was that a software bug in the Guest Agent assigned to each customer's workload was again activated to generate a new security certificate in the new location. The SSL certificate is the first link established between the guest workload and host agent so that encrypted information may flow between the two and the workload initialized. More failed startups led to more hardware failure reports.
Once a server is placed in an HI state, the Fabric Controller performs an automated "service healing" by moving VMs off it to healthy servers. "In a case like this, when the VMs are moved to available servers, the Leap Day bug will reproduce during [Guest Agent] initialization, resulting in a cascade of servers that move to HI," wrote Laing in the blog post.
As Leap Day got underway in Microsoft's Dublin data center, this automated mechanism meant that the problem was going to be propagated beyond the original server cluster onto others, which would start generating certificates that would also fail. Eventually, the frequent attempts to move at-risk virtual machines on "failing" hardware would spread the Leap Day bug around a 1,000-server cluster.