VMware Cloud Foundry Suffers Service Outage

As the beta development platform was recovering from a minor power supply problem, human error worsened the setback.

Charles Babcock, Editor at Large, Cloud

May 3, 2011


Assembling a distributed computing architecture in the cloud isn't easy to do. The more resources you try to bring together, the more that can go wrong. No, you're not hearing about Amazon's recent EC2 outage again. This time it's VMware.

VMware recently launched a development platform as a set of services at CloudFoundry.org, a new hosting service for developers. On April 25, Cloud Foundry experienced a service disruption. In trying to recover later that day, it suffered an outage that continued into April 26.

Not only is VMware finding it more difficult than anticipated to keep a cloud up and running, it's sharing another experience with its much bigger and better established fellow cloud supplier: It's refusing to talk about the mishap beyond what appears in an official and carefully worded blog.

"The Cloud Foundry blog is the best resource for you, which details the account of the outage. VMware is continuing to keep that updated regularly to maintain transparency with the community," was the response from VMware to a request for more information Tuesday.

The blog cited was written by Dekel Tankel, one of the primary builders and managers of CloudFoundry.org. In a post dated April 29, he said the trouble started at 6:11 a.m. April 25, when a power supply in a storage cabinet failed. That deprived users of access to a single logical unit number (LUN), the identifier of a disk or set of disks, in Cloud Foundry. The power supply malfunction wasn't an unexpected event. Clouds are designed to detect and survive lost power supplies, either by invoking a redundant source or by routing around them using a backup copy.
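The "routing around" idea can be illustrated with a minimal Python sketch. It is not Cloud Foundry's code: the function names, the `StorageUnavailable` exception, and the block number are all hypothetical, and the sketch simply assumes that a read is tried against the primary storage path first and falls back to a backup copy if that path is unavailable.

```python
class StorageUnavailable(Exception):
    """Raised when a storage path cannot be reached (hypothetical error type)."""


def read_with_failover(block_id, sources):
    """Try each (name, read_function) pair in order.

    Returns (name, data) from the first source that responds, and raises
    StorageUnavailable only if every source fails.
    """
    errors = []
    for name, read in sources:
        try:
            return name, read(block_id)
        except StorageUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise StorageUnavailable("; ".join(errors))


def primary_lun(block_id):
    # Hypothetical primary path: simulates the LUN that lost its power supply.
    raise StorageUnavailable("power supply failure")


def backup_copy(block_id):
    # Hypothetical backup path: returns placeholder data for the block.
    return f"data-for-{block_id}"


source_used, data = read_with_failover(
    42, [("primary", primary_lun), ("backup", backup_copy)]
)
print(source_used, data)
```

In this toy version the caller never sees the primary failure as long as one source answers, which is the behavior the design is meant to provide.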

"While not a 'normal event,' it is something that can and will happen from time to time," he wrote, and VMware thought it was prepared for it.

But, Tankel continued in the blog, "In this case, our software, our monitoring systems, and our operational practices were not in synch." The loss of a LUN was an event "that we did not properly handle and the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations."

The Cloud Controller is the overall manager of many server nodes, each running a local droplet execution agent (DEA). A DEA is a supervisor of applications running on a single server. Each server has a DEA, and they can be configured in different ways to optimize their operation for a particular type of application.
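The relationship between the controller and its agents can be sketched in a few lines of Python. This is an illustrative model only, not Cloud Foundry's implementation: the class names, the `profile` field, and the "least loaded matching agent" placement rule are assumptions introduced here to show one controller supervising one agent per server.

```python
from dataclasses import dataclass, field


@dataclass
class DropletExecutionAgent:
    """Supervises the applications on a single server (hypothetical model)."""
    host: str
    profile: str = "general"  # assumed tuning label, e.g. per application type
    apps: list = field(default_factory=list)

    def start(self, app_name):
        self.apps.append(app_name)


@dataclass
class CloudController:
    """Overall manager that knows about every agent (hypothetical model)."""
    agents: list = field(default_factory=list)

    def register(self, agent):
        self.agents.append(agent)

    def place(self, app_name, profile="general"):
        # Assumed policy: pick the least loaded agent with a matching profile.
        candidates = [a for a in self.agents if a.profile == profile]
        if not candidates:
            raise LookupError(f"no agent with profile {profile!r}")
        target = min(candidates, key=lambda a: len(a.apps))
        target.start(app_name)
        return target.host


controller = CloudController()
controller.register(DropletExecutionAgent("node-1"))
controller.register(DropletExecutionAgent("node-2", profile="ruby"))
placed_on = controller.place("blog", profile="ruby")
print(placed_on)
```

The point of the sketch is the dependency: every placement decision goes through the controller, so anything that stalls the controller affects operations across all nodes.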

The loss of the LUN disrupted the Cloud Controller beyond the single storage unit that had lost power, and that in turn disrupted other Cloud Foundry operations. It took Tankel and team several hours to realize that they hadn't lost any data in the incident and that the storage cabinet was once again functioning reliably, as was the Cloud Controller.

But having recovered, the novice cloud management team proceeded to compound the error. Wanting to learn a lesson from the incident, operations engineers immediately set about recording what went wrong and writing the playbook of correct procedures to avoid a future outage.

"This was to be a paper-only, hands-off-the-keyboards exercise, until the playbook was reviewed," Tankel writes.

You can almost guess what's coming next.

"Unfortunately, at 10:15 p.m. Pacific, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry," he continued.

"This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry" for more than 13 hours, until service was restored at 11:30 a.m. April 26, Tankel explained in the blog three days later.
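The length of the second outage follows from the two times given above. A quick check in Python, assuming both times are in the same time zone (the blog states Pacific only for the 10:15 p.m. start):

```python
from datetime import datetime

# Times from the article; both are assumed to be Pacific time.
outage_start = datetime(2011, 4, 25, 22, 15)      # 10:15 p.m. April 25
service_restored = datetime(2011, 4, 26, 11, 30)  # 11:30 a.m. April 26

outage = service_restored - outage_start
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} h {minutes} min")  # 13 h 15 min
```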

The first loss of service cut off access to data for some users. During the second, all Cloud Foundry applications kept running, but the developers who normally use the cloud couldn't see that or reach their applications. "We were the only ones who knew that the system was up," said Tankel in what was perhaps an unintentional understatement. For developers with no access, Cloud Foundry for all practical purposes was down.

Cloud Foundry was established as a beta service on April 12, so developers in the Cloud Foundry forums have mostly given it a pass since the outage. It's meant to host development projects that make use of Ruby on Rails, Groovy on Grails, or the Spring Tool suite of the Spring Framework for Java developers. VMware owns Spring Framework and its related tools.

The VMware outage differs from the Amazon incident of April 21 but bears an eerie resemblance to it, given the human error involved in the second and greater service outage, which ran from late April 25 through the first half of April 26.

The Amazon EC2 outage was caused by what AWS called "a network event." It appears that a cloud operations administrator mistakenly switched traffic from a primary network onto a secondary backup network. The network needed to be scaled up for early morning business activity, but instead it was drastically scaled down. Before the error could be corrected, the governing cloud software concluded that a large amount of data on a volume of the Elastic Block Store storage service was no longer available, setting off a "re-mirroring storm" as the service attempted to restore the data. That in turn choked two key services, EBS and Relational Database Service, in Amazon's Northern Virginia data center.

About the Author(s)

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
