Microsoft is continuing its root cause analysis of why its Windows Azure cloud identity management service failed to operate properly in three regions over eight hours on Feb. 29. As of this week, users remain in the dark about the full details of the outage.
Microsoft's cryptic explanation so far is that the cause of downtime for the Access Control Service "has been traced back to a cert issue triggered on 2/29/2012 GMT." In other words, a certificate-processing software glitch occurred an hour and forty-five minutes after the start of Leap Day, Microsoft said in its first explanation after the outage.
A spokesman said on Monday that conclusions from the follow-up analysis "will come soon" and will be made public.
Microsoft spokesmen declined to confirm that the cause was a faulty security certificate, or to say who might have supplied the certificate. Security certificates can be issued either by a cloud service provider or by independent third parties, such as Symantec's VeriSign unit, GoDaddy, or Comodo. The bulk of security certificates are issued by third parties.
That leaves open the prospect that a trusted third party supplied a faulty certificate whose date and time stamp couldn't be accepted in Azure operations as correct. That fault might occur if the certificate issuer had failed to account for Feb. 29 in a leap year.
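Microsoft has not confirmed the mechanism, so the following is only a minimal sketch of the kind of leap-day date error described above, not an account of what happened inside Azure or at any certificate issuer. The function names are hypothetical. The sketch shows how computing a one-year validity period by simply incrementing the year fails for a start date of Feb. 29, and one possible way to handle it.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical helper: add one year by bumping the year field only.

    For a start date of Feb. 29 this raises ValueError, because the
    following year has no Feb. 29.
    """
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Hypothetical helper: add one year, falling back to Feb. 28.

    The fallback is an assumption made for illustration; an issuer could
    equally choose March 1.
    """
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here, so clamp the day to the 28th.
        return start.replace(year=start.year + 1, day=28)
```

Called with `date(2012, 2, 29)`, the naive version raises a `ValueError`, while the safe version returns Feb. 28, 2013. On any other day of the year the two behave identically, which is why a fault of this type could go unnoticed until a leap day arrives.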
Asked whether an in-house or a third-party issuer supplied the problem certificate, Microsoft spokesmen said only that the company was continuing its analysis.
Whatever the origin, the faulty certificate interfered with the operation of Windows Azure's Access Control Service. The service is used by Web application builders, who build it into their applications to provide a combination of single sign-on and authentication.
While the service was unavailable, applications that depended on it could not obtain identity confirmations or authorize visitors to reach parts of the applications that would normally be open to them.
No figures have been issued on the number of Web application owners affected, or the visitor traffic that may have been lost to their applications.
At one point as the outage unfolded, the Microsoft Service Dashboard indicated that "less than 3.8% of hosted services" were affected.
Behind Access Control Service are several other services that frequently depend on it, including SQL Azure Data Sync, SQL Azure Database, and SQL Azure Reporting. Operators at CloudSleuth, a cloud service monitoring system provided by Compuware, confirmed that CloudSleuth test servers in Azure continued to respond to test pings during the service outage. That provides outside verification of Microsoft's claim that running servers were not taken down, at least not in most cases.
Bill Laing, Microsoft's corporate VP for server and cloud, wrote in a blog post on Feb. 29, on the heels of the outage, that Microsoft had given priority to keeping active systems up and running. Requests for new services that involved identifying users were declined until the certificate problem was rectified, he wrote.
Spokesmen for another cloud monitoring service, the French company Cedexis, sponsor of the Cedexis Radar, said feedback from end users around the world had documented the Access Control Service outage and plotted its duration.