VMware Cloud Foundry Suffers Service Outage
As the beta development platform was recovering from a minor power supply problem, human error worsened the setback.
VMware recently launched a development platform as a set of services at CloudFoundry.org, its new hosting service for developers. On April 25, Cloud Foundry experienced a service disruption. While trying to recover later that day, it suffered an outage that continued into April 26.
Not only is VMware finding it more difficult than anticipated to keep a cloud up and running, it's sharing another experience with its much bigger and better-established fellow cloud supplier: It's refusing to talk about the mishap beyond what's presented in an official, carefully worded blog post.
"The Cloud Foundry blog is the best resource for you, which details the account of the outage. VMware is continuing to keep that updated regularly to maintain transparency with the community," was the response from VMware to a request for more information Tuesday.
The blog post cited was written by Dekel Tankel, one of the primary builders and managers of CloudFoundry.org. In the post, dated April 29, he said the trouble started at 6:11 a.m. April 25, when a power supply in a storage cabinet failed. That deprived users of access to a single logical unit number (LUN), an identifier for a disk or set of disks, in Cloud Foundry. A power supply malfunction isn't an unforeseen event. Clouds are designed to detect and survive lost power supplies, either by invoking a redundant source or by routing around the failed unit using a backup copy.
"While not a 'normal event,' it is something that can and will happen from time to time," he wrote, and VMware thought it was prepared for it.
But, Tankel continued, "In this case, our software, our monitoring systems, and our operational practices were not in synch." The loss of a LUN was an event "that we did not properly handle and the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations."