
VMware Cloud Foundry Suffers Service Outage

Charles Babcock
Editor At Large, InformationWeek

As the beta development platform was recovering from a minor power supply problem, human error worsened the setback.

Assembling a distributed computing architecture in the cloud isn't easy to do. The more resources you try to bring together, the more that can go wrong. No, you're not hearing about Amazon's recent EC2 outage again. This time it's VMware.

VMware recently launched a development platform as a set of services at CloudFoundry.org, a new hosting service for developers. On April 25, Cloud Foundry experienced a service disruption. In trying to recover later that day, it suffered an outage that continued into April 26.


Not only is VMware finding it more difficult than anticipated to keep a cloud up and running, it's sharing another experience with its much bigger and better-established fellow cloud supplier: It's refusing to talk about the mishap beyond what appears in an official, carefully worded blog.

"The Cloud Foundry blog is the best resource for you, which details the account of the outage. VMware is continuing to keep that updated regularly to maintain transparency with the community," was the response from VMware to a request for more information Tuesday.

The blog cited was written by Dekel Tankel, one of the primary builders and managers of CloudFoundry.org. In a post published April 29, he said the trouble started at 6:11 a.m. April 25, when a power supply in a storage cabinet experienced an outage. That deprived users of access to a single logical unit number (LUN), or identifier of a disk or set of disks, in Cloud Foundry. The power supply malfunction wasn't an unexpected event. Clouds are designed to detect and survive lost power supplies, either by invoking a redundant source or by routing around the failure using a backup copy.

"While not a 'normal event,' it is something that can and will happen from time to time," he wrote, and VMware thought it was prepared for it.

But, Tankel continued in the blog, "In this case, our software, our monitoring systems, and our operational practices were not in synch." The loss of a LUN was an event "that we did not properly handle and the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations."
