Microsoft's Windows Azure cloud froze up momentarily for a significant fraction of customers over October 29 and October 30 due to a glitch in its API for staging systems. In an interview, Azure general manager Mike Neil discussed the disruption as he described recent Azure operations.
A bug in the software API for staging systems had escaped detection in testing. That was the conclusion of his "preliminary" investigation, Neil said. His finding has been reported up the chain of command to Qi Lu, president of online services, and Satya Nadella, president of the server and tools business, as is usual in such cases, Neil said during an interview in San Francisco earlier this month. A more complete explanation, based on a root cause analysis, will be published at a later date, he said.
In effect, whether it was a customer or Microsoft itself using the staging API, the process of swapping out one virtual IP address for another, which allows a system to go from staging into production, was periodically failing. Neil couldn't specify the circumstances that caused it to stall, but the resulting disruption affected enough customers to set off a flurry of Twitter comments.
[Want to learn more about how cloud providers set their underlying architecture? See Cloud Infrastructure Buyers: Job 1 Is Studying Architecture.]
"A simple API call was failing... It was the first service-run of an API that enables swapping of virtual IP addresses," he said. The API had been thoroughly tested and functioned as expected in tests, right up until the moment it started failing in actual production. Normally, the process of swapping a patched and updated preproduction system for a production system is invisible to the systems' users: Calls that had been going to the old system are smoothly moved to the updated one. On October 29, the process became all too visible, when the API call periodically failed to recognize the swapped-in system.
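The swap Neil describes can be pictured as an exchange of address bindings between two deployments. The sketch below is purely illustrative, with hypothetical names, and is not Azure's actual staging API:

```python
# Illustrative sketch of a virtual-IP (VIP) swap between a staging and a
# production deployment. All names here are hypothetical, not Azure's API.

class Deployment:
    def __init__(self, name, vip):
        self.name = name
        self.vip = vip  # the virtual IP address clients resolve to

def vip_swap(production, staging):
    """Exchange the VIPs so traffic bound for the production address
    now reaches the newly staged (updated) deployment."""
    production.vip, staging.vip = staging.vip, production.vip

prod = Deployment("v1.0", "10.0.0.1")   # live system, serving traffic
stage = Deployment("v1.1", "10.0.0.2")  # patched, updated preproduction copy

vip_swap(prod, stage)
# Clients still call 10.0.0.1, but it now points at the updated deployment.
print(stage.vip)  # -> 10.0.0.1
```

The point of the design is that callers never change the address they use; when the swap call itself fails partway, as it did here, callers can briefly fail to reach the system they expect.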
Microsoft uses the API for staging updates to Azure software itself and also makes it available to customers for staging their own production workloads. In disclosing the ailment -- it slowed or momentarily stalled some tasks, but systems did not actually stop functioning -- Neil also described some aspects of Microsoft's Azure architecture that aren't generally known.
When the trouble occurred, Microsoft concentrated not on identifying the exact nature of the problem and its fix, but on getting operations recovered and back to normal in the shortest period possible. His team's focus was on mean time to recovery, Neil said. Deeper analysis and a long-term fix can follow in the root cause analysis done at a later date. By October 31, everything was running normally again.
Normally a failure on Azure occurs in a limited region of the overall cloud. That's because Microsoft has organized its Azure data centers into 1,000-server cells, or "stamps," as it likes to call them. Microsoft, Google, and Amazon Web Services all organize their cloud infrastructures into regions, and within each region they have one or more data centers, which are themselves divided into subregions or "availability zones." Like an Amazon availability zone, a Microsoft stamp has its own power supply and network and communications links, as well as 1,000 servers and related storage. A failure that affected one stamp theoretically would be contained within its boundaries and not affect neighboring stamps.
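That containment idea can be shown with a toy model: a region holds several independent stamps, and losing one leaves the rest serving. The classes and counts below are assumptions for illustration, not Microsoft's internal structures:

```python
# Toy model of Azure-style "stamps": ~1,000-server cells, each with its own
# power and network, so a failure stays inside one stamp's boundary.
# All names and numbers here are illustrative.

class Stamp:
    SERVERS_PER_STAMP = 1000

    def __init__(self, stamp_id):
        self.stamp_id = stamp_id
        self.healthy = True

    def fail(self):
        # A power or network fault takes out this stamp only.
        self.healthy = False

class Region:
    def __init__(self, name, stamp_count):
        self.name = name
        self.stamps = [Stamp(i) for i in range(stamp_count)]

    def healthy_servers(self):
        return sum(Stamp.SERVERS_PER_STAMP for s in self.stamps if s.healthy)

region = Region("us-east", stamp_count=8)
region.stamps[3].fail()           # one stamp goes dark...
print(region.healthy_servers())   # -> 7000: the other stamps are unaffected
```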
The failure of the API implementation was so evident because it occurred across multiple regions around the world at the same time. Usually, any software change to Azure is carefully phased in: part of a data center site, then a whole center, then a region, and so on. But the new API, after early phase-in success, had to be implemented across regions at roughly the same time; it was too basic to continued operations to be brought in piecemeal. That made the resulting slowdown and balky operations much more widespread and noticeable, not limited to a spot in one region, Neil conceded.
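The phased rollout Neil describes, advancing from part of a site to a whole center to a region to everywhere, can be sketched as a staged loop with a health check gating each stage. The stage names and callbacks below are hypothetical, not Azure's deployment system:

```python
# Sketch of a phased rollout: a change advances through ever-larger scopes,
# and a failed health check halts it before the blast radius grows.
# Scopes and checks here are illustrative assumptions.

STAGES = ["partial data center", "whole data center", "one region", "all regions"]

def phased_rollout(deploy, health_check):
    for stage in STAGES:
        deploy(stage)
        if not health_check(stage):
            return f"halted at: {stage}"  # contain the fault; roll back here
    return "rolled out everywhere"

# A change like the staging API could not use this ladder: it had to land in
# all regions at roughly the same time, so a failure showed up everywhere.
result = phased_rollout(
    deploy=lambda stage: None,                          # stand-in deploy step
    health_check=lambda stage: stage != "one region",   # simulated regional fault
)
print(result)  # -> halted at: one region
```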
Azure's most dramatic outage was its Leap Day failure on February 29, 2012, when faulty security certificates convinced Azure governing software that more and more of its servers were failing and virtual machines needed to be transferred off them. The transfers set off more false alarms as new VMs were accompanied by improperly dated certificates, and the problem cascaded. It took Microsoft 2.5 hours to isolate the bug and 7.5 hours to correct it.
Other cloud providers have suffered similar mishaps. One of the most noted was Amazon's Easter weekend failure in 2011, when a night technician unplugged a main trunk network and replugged it into a smaller-scale backup network. That move set off "a re-mirroring storm" as virtual machines that could no longer access their usual data source called for a backup copy to be made.
Why can't the cloud have 100 percent uptime? "Hardware will fail. Software will have bugs. People will make mistakes," Neil said philosophically.