3/5/2012
10:03 PM

Microsoft Azure Leap Day Outage Questions Persist

Is a defective security certificate from a third party to blame for the Azure outage? Microsoft won't say yet.

Microsoft continues to do root cause analysis on why its Windows Azure cloud identity management service failed to operate properly in three regions over eight hours on Feb. 29. As of this week, users remain in the dark regarding full outage details.

Microsoft's cryptic explanation so far is that the cause of downtime for the Access Control Service "has been traced back to a cert issue triggered on 2/29/2012 GMT." In other words, a glitch in certificate-processing software occurred an hour and forty-five minutes after the start of Leap Day, Microsoft said in its first explanation after the outage.

A spokesman said on Monday that conclusions from the follow-up analysis "will come soon" and will be made public.

Microsoft spokesmen declined to confirm that a faulty security certificate was at fault, or to say who might have supplied it. Security certificates can be issued either by a cloud service provider or by independent third parties, such as Symantec's VeriSign unit, GoDaddy, or Comodo. The bulk of security certificates are issued by third parties.

[Want more background on Azure's leap day service outage? See Microsoft Azure Outage Explanation Doesn't Soothe]

That leaves open the prospect that a trusted third party supplied a faulty certificate whose date and time stamp couldn't be accepted in Azure operations as correct. That fault might occur if the certificate issuer had failed to account for Feb. 29 in a leap year.
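To illustrate how a date fault of this kind could arise, here is a minimal Python sketch. It is not Microsoft's or any certificate issuer's code, and nothing in Microsoft's statements confirms this mechanism. The function names `naive_expiry` and `safe_expiry` are hypothetical, and the one-year validity period is an assumption chosen only for the example.

```python
from datetime import date


def naive_expiry(issued: date) -> date:
    # Hypothetical buggy helper: bumps the year and reuses month and day.
    # For Feb. 29 this asks for a date that does not exist the next year.
    return date(issued.year + 1, issued.month, issued.day)


def safe_expiry(issued: date) -> date:
    # Hypothetical fix: fall back to Feb. 28 when the anniversary is missing.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_expiry(leap_day)
    naive_ok = True
except ValueError:
    # Feb. 29, 2013 is not a valid calendar date, so construction fails.
    naive_ok = False

print("naive calculation succeeded:", naive_ok)
print("safe expiry:", safe_expiry(leap_day))
```

On any other day of the year the two helpers agree, which is why a fault like this could go unnoticed until a leap day arrives.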

Asked whether an in-house or third-party issuer supplied the problem certificate, Microsoft spokesmen said only that the company was continuing its analysis.

Whatever the origin, the faulty certificate interfered with the operation of Windows Azure's Access Control Service. The service is used by Web application builders, who build it into their applications to provide a combination of single sign-on and authentication.

When the service was unavailable, applications that depended on it could not obtain identity confirmations or authorize visitors to reach parts of the application that they would normally be able to use.

No figures have been issued on the number of Web application owners affected, or the visitor traffic that may have been lost to their applications.

At one point as the outage unfolded, the Microsoft Service Dashboard indicated that "less than 3.8% of hosted services" were affected.

Several other services frequently depend on Access Control Service, including SQL Azure Data Sync, SQL Azure Database, and SQL Azure Reporting. Operators at CloudSleuth, a cloud service monitoring system provided by Compuware, confirmed that CloudSleuth test servers in Azure continued to respond to test pings during the service outage. That provides outside verification of Microsoft's claim that running servers were not taken down, at least not in most cases.

Bill Laing, Microsoft's corporate VP for server and cloud, wrote in a blog post on Feb. 29, on the heels of the outage, that Microsoft had given priority to keeping active systems up and running. Requests for new services that involved identifying users were declined until the certificate problem was rectified, he wrote.

Spokesmen for another cloud monitoring service, the French company Cedexis, sponsor of the Cedexis Radar, said feedback from end users around the world had documented the Access Control Service outage and plotted its duration.
