Microsoft's Azure cloud outage Wednesday was apparently caused by a glitch related to leap day, according to a post-mortem offered by the computer giant. Late Wednesday, the Microsoft Azure team blogged that it had moved quickly once it discovered the leap year bug to protect customers' running systems. But it could not prevent access being blocked to services in several Azure data centers.
There was good news and bad news in the disclosure. Bill Laing, corporate VP for server and cloud, wrote in a blog Wednesday afternoon that his engineers had realized there was a leap day bug affecting the compute service at 1:45 a.m. Greenwich Mean Time Wednesday, which was 5:45 p.m. Tuesday in the Pacific Northwest. They discovered it early, while many of the affected slept.
The bug is likely to have been first detected at the Microsoft Azure data center in Dublin. "While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year," wrote Laing. The computer clocks of its Dublin facility would have been well into their leap day at 1:45 a.m. GMT.
"Once we discovered the issue, we immediately took steps to protect customer services that were already up and running and began creating a fix for the issue," Laing wrote. In other words, Microsoft appears to have given priority to protecting running systems and did so at the expense of granting access to incoming requests for service. Few would quarrel with the decision.
But for some reason, the United Kingdom's recently launched government CloudStore, which is hosted in the North Europe region, went offline, according to a Computer Business Review report.
"The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57 a.m. PST," or a little over nine hours later, Microsoft's Laing wrote.
But that wasn't the end of the story; Laing continued: "However, some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues."
Which customers are affected, how are they affected, and what is the nature of the ongoing outage? Instead of touching upon any of these points in a transparent way, Laing's sharp focus has faded to fuzzy gray, with the thrice-cited "issues" serving as a substitute for saying anything concrete about the remaining problems.
The sub-regions most directly affected by the original loss of access were named on the Azure Service Dashboard Wednesday as North Europe, which best estimates suggest is served by the Microsoft data center in Dublin, Ireland, along with North Central US and South Central US. Microsoft operates Azure data centers in Chicago and San Antonio, Texas, both in the Central time zone.
Microsoft also stated that its Azure Storage service was never down or inaccessible.
Prior to Laing's disclosures, Microsoft had stated that "incoming traffic may not go through for a subset of hosted services … Deployed applications will continue to run …" The subset of services affected included the SQL Azure Database and SQL Azure Data Sync services, SQL Azure Reporting, and Windows Azure Service Management.
While some services were not available in particular regions, Azure Service Management was out worldwide, an event that happened early--and was probably the first sure sign of trouble. On the other hand, the Azure Compute service continued as normal until 10:55 a.m. GMT, when the dashboard signaled that new service couldn't be granted to incoming requests in three sub-regions.
This incident is a reminder that the best practices of cloud computing operations are still a work in progress, not an established science. And while prevention is better than cure, infrastructure-as-a-service operators may not know everything they need to about these large-scale environments. The Azure Chicago facility is built to hold 300,000 servers, with a handful of people running it.
It might seem foreseeable that security certificates or system clocks could experience problems on the 29th day of February. Many were probably attended to or engineered correctly, but there's always one sleeper able to wake up and cause trouble. Thus, Microsoft's "cert issue triggered by 2/29/2012" announcement early Wednesday can join Amazon's "remirroring storm" of April 22-24, 2011. Microsoft's cryptic message suggests a security certificate was unprepared for the leap year.
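Microsoft has not published the code involved, but a bug of this general shape is easy to sketch. A hypothetical routine that stamps a one-year validity period on a certificate by simply incrementing the year produces the nonexistent date Feb. 29, 2013, when run on leap day; the function names below are illustrative, not Microsoft's:

```python
from datetime import date

def naive_one_year_validity(issued: date) -> date:
    """Naive approach: bump the year, keep month and day.
    Raises ValueError when issued on Feb. 29, since the
    following year has no Feb. 29."""
    return issued.replace(year=issued.year + 1)

def safe_one_year_validity(issued: date) -> date:
    """Defensive variant: fall back to Feb. 28 when the
    year-incremented date does not exist."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:  # only possible for a Feb. 29 issue date
        return issued.replace(year=issued.year + 1, day=28)
```

On any of the 365 other days of the year the naive version works fine, which is exactly why such a bug can sleep in production for years before leap day wakes it.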
And don't forget the Dublin lightning strike last Aug. 7. It was said to have hit a utility transformer near the Amazon and Microsoft facilities, robbing them of power for an hour. In the aftermath, repeating what they had been told by the utility, Amazon operators said the force of the charge had been so great that it disrupted the phase coordination of backup generators coming online, causing them to fail.
The only problem was the utility concluded three days later there had been no lightning strike. It said instead there had been an unexplained equipment failure.
The lightning strike, in its way, had been a more acceptable explanation. What does it say about the cloud if random equipment failures disrupt it as well as acts of God? You can begin to see the boxes cloud providers end up in after quick explanations for reliability failures. It might be wise in the event of the next outage to remember that there are still things we don't understand about operating at the scale of today's cloud.