Cloud // Infrastructure as a Service
Commentary
Charles Babcock
2/29/2012 10:51 PM

Microsoft Azure Outage Explanation Doesn't Soothe

Microsoft leader's post mortem on Azure cloud outage cites a human error factor, but leaves other questions unanswered. Does this remind you of how Amazon played its earlier lightning strike incident?

Microsoft's Azure cloud outage Wednesday was apparently caused by a glitch related to leap day, according to a post mortem offered by the computer giant. Late Wednesday, the Microsoft Azure team blogged that it had moved quickly, once it discovered the leap year bug, to protect customers' running systems. But it could not prevent access being blocked to services in several Azure data centers.

There was good news and bad news in the disclosure. Bill Laing, corporate VP for server and cloud, wrote in a blog Wednesday afternoon that his engineers had realized there was a leap day bug affecting the compute service at 1:45 a.m. Greenwich Mean Time Wednesday, which was 5:45 p.m. Tuesday in the Pacific Northwest. They discovered it early, while many of the affected slept.

The bug is likely to have been first detected at the Microsoft Azure data center in Dublin. "While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year," wrote Laing. The computer clocks of its Dublin facility would have been well into their leap day at 1:45 a.m. GMT.
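
Microsoft has not published the offending code, but a plausible illustration of this class of bug is a "valid for one year" calculation that simply increments the year field. The Python sketch below is purely hypothetical and is not Microsoft's code; it shows how a date issued on Feb. 29 produces an expiration date that does not exist:

    from datetime import date

    def naive_one_year_later(issued):
        # Naive "valid for one year" math: bump the year, keep month and day.
        # For an issue date of Feb. 29, 2012 this asks for Feb. 29, 2013,
        # a date that does not exist.
        return issued.replace(year=issued.year + 1)

    try:
        expiry = naive_one_year_later(date(2012, 2, 29))
    except ValueError as err:
        # Python refuses to construct the invalid date; code that builds the
        # date by string formatting or raw field arithmetic may instead pass
        # a bogus timestamp to whatever consumes it downstream.
        print("leap-day bug:", err)

A common defensive convention is to clamp such calculations to Feb. 28, or roll forward to March 1, when the target year has no leap day.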

"Once we discovered the issue, we immediately took steps to protect customer services that were already up and running and began creating a fix for the issue," Laing wrote. In other words, Microsoft appears to have given priority to protecting running systems and did so at the expense of granting access to incoming requests for service. Few would quarrel with the decision.

[ Want to learn more about a possible route out of a cloud that's experiencing a service failure? See Amazon Cloud Outage Proves Importance Of Failover Planning. ]

But for some reason, the United Kingdom's recently launched government CloudStore, which is hosted in the North Europe region, went offline, according to a Computer Business Review report.

"The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57 a.m. PST," or a little over nine hours later, Microsoft's Laing wrote.

But that wasn't the end of the story; Laing continued: "However, some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues."

Which customers are affected, how are they affected, and what is the nature of the ongoing outage? Instead of touching upon any of these points in a transparent way, Laing's sharp focus has faded to fuzzy gray, with the thrice-cited "issues" serving as a substitute for saying anything concrete about the remaining problems.

The sub-regions most directly affected by the original loss of access were named on the Azure Service Dashboard Wednesday as North Europe, which best estimates place at the Microsoft data center in Dublin, Ireland, and North Central and South Central United States. Microsoft operates Azure data centers in Chicago and San Antonio, Texas, both in the Central time zone.

Microsoft also stated that its Azure Storage service was never down or inaccessible.

Prior to Laing's disclosures, Microsoft had stated that "incoming traffic may not go through for a subset of hosted services … Deployed applications will continue to run …" The subset of services affected included the SQL Azure Database and SQL Azure Data Sync services, SQL Azure Reporting, and Windows Azure Service Management.

While some services were not available in particular regions, Azure Service Management was out worldwide, an event that happened early--and was probably the first sure sign of trouble. On the other hand, the Azure Compute service continued as normal until 10:55 a.m. GMT, when the dashboard signaled that new service couldn't be granted to incoming requests in three sub-regions.

This incident is a reminder that the best practices of cloud computing operations are still a work in progress, not an established science. And while prevention is better than cure, infrastructure-as-a-service operators may not know everything they need to about these large-scale environments. The Azure Chicago facility is built to hold 300,000 servers, with a handful of people running it.

It might seem foreseeable that security certificates or system clocks could run into problems on the 29th day of February. Many were probably attended to or engineered correctly, but there's always one sleeper able to wake up and cause trouble. Thus, Microsoft's "cert issue triggered by 2/29/2012" announcement early Wednesday can join Amazon's "remirroring storm" of April 22-24, 2011. Microsoft's cryptic message suggests a security certificate was unprepared for the leap year.

And don't forget the Dublin lightning strike last Aug. 7. It was said to have hit a utility transformer near the Amazon and Microsoft facilities, robbing them of power for an hour. In the aftermath, repeating what they had been told by the utility, Amazon operators said the force of the charge had been so great that it disrupted the phase coordination of backup generators coming online, causing them to fail.

The only problem was that the utility concluded three days later that there had been no lightning strike. It said instead there had been an unexplained equipment failure.

The lightning strike, in its way, had been a more acceptable explanation. What does it say about the cloud if random equipment failures disrupt it as well as acts of God? You can begin to see the boxes cloud providers end up in after quick explanations for reliability failures. It might be wise in the event of the next outage to remember that there are still things we don't understand about operating at the scale of today's cloud.

As enterprises ramp up cloud adoption, service-level agreements play a major role in ensuring quality enterprise application performance. Follow our four-step process to ensure providers live up to their end of the deal. It's all in our Cloud SLA report. (Free registration required.)

Comments
KLC, User Rank: Apprentice
3/4/2012 | 8:16:02 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
There is a very simple lesson to be learned from the Azure outage (and last year's Amazon outage): You must perform detailed Business Continuity (DR) planning. Identify all potential single points of failure. The lesson here is that an "entire vendor" can be a single point of failure. Amazon advertised "availability zones" to protect their clients from outages. Oooops, Amazon suffers a multi-availability-zone outage. This Azure outage was multi-data center. Murphy is alive and thriving in the cloud community, just like he(she) is in corporate data centers. So, plan for it.

If you are using cloud services, you must have contingency plans in place for a complete vendor failure, whether that is bringing critical apps back in-house, or switching to another provider. The cloud providers are victims of their own hype, in that the growth is too rapid for them to cover all their bases. We all know that change is public enemy #1 to reliability. The growth within our cloud providers requires constant change as they expand their environments, especially when it pushes the limits of their architectures.
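
KLC's point about treating an entire vendor as a single point of failure can be sketched in code. The Python fragment below is a hypothetical illustration only; the endpoints are placeholders, and a real contingency plan would also have to cover data replication, DNS or load-balancer cutover, and application state:

    import urllib.request

    # Hypothetical health-check endpoints, in order of preference.
    PROVIDERS = [
        ("primary-cloud", "https://app.primary.example.com/health"),
        ("secondary-cloud", "https://app.secondary.example.com/health"),
    ]

    def first_healthy_provider(timeout=3.0):
        """Return the name of the first provider whose health check answers 200."""
        for name, url in PROVIDERS:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return name
            except OSError:
                continue  # provider unreachable or timed out; try the next one
        return None  # complete multi-provider failure: fall back to the in-house plan

Routing traffic is the easy half; getting the data and the application onto the secondary provider ahead of time is the part that has to be planned and rehearsed.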

There is another aspect of these outages that I find disconcerting, namely, how the providers handle the problems, especially as it relates to client communications. This was not a major issue with outsourcing, because clients had dedicated account management teams. With the commodity pricing of cloud, we don't have the luxury of Customer Relationship Management with the providers. The cloud community must address this, either by providing far better on-line communications vehicles, or biting the bullet and having account managers that can serve as communications conduits in either direction.

Cloud is here to stay, and I am sure we will witness an evolutionary process. With what we are witnessing and experiencing, it is time for the cloud providers to acknowledge their lack of enterprise class maturity and develop the plans to bridge the gaps.
YMOM100, User Rank: Apprentice
3/3/2012 | 3:39:54 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
No wonder! How should Microsoft have anticipated this brand new concept of a leap day? After all, when Azure was designed there was no Feb 29 on the calendar.
rchard, User Rank: Apprentice
3/2/2012 | 6:27:20 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
I had the misfortune of having data on the first cloud that Microsoft bought. Due to active misconduct by executives, they lost 1/3 of my data. Publicly they claimed to have recovered or compensated everyone, but somehow that did not include anyone I knew whom they trashed. They still do not have the mindset and understanding of what it takes to run a cloud that could be trusted. I have long since written them off the list of providers I would ever trust again.
Sam Iam, User Rank: Apprentice
3/2/2012 | 8:19:08 AM
re: Microsoft Azure Outage Explanation Doesn't Soothe
This is not a "cloud" issue. This is a Microsoft issue.
parkercloud, User Rank: Apprentice
3/1/2012 | 9:03:13 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
How can you possibly state this is a "cloud" operations issue and not just a bad operations issue? Bad operations is just bad operations, regardless of whether it is for a cloud service or not.

"This incident is a reminder that the best practices of cloud computing operations are still a work in progress"