The loss of the LUN disrupted the Cloud Controller beyond the single storage unit that had lost power, which in turn disrupted other Cloud Foundry operations. It took Tankel and his team several hours to realize that they hadn't lost any data in the incident and that the storage cabinet was once again functioning reliably, as was the Cloud Controller.
But having recovered, the novice cloud management team proceeded to compound the error. Wanting to learn a lesson from the incident, operations engineers immediately set about recording what went wrong and writing the playbook of correct procedures to avoid a future outage.
"This was to be a paper-only, hands-off-the-keyboards exercise, until the playbook was reviewed," Tankel writes.
You can almost guess what's coming next.
"Unfortunately, at 10:15 p.m. Pacific, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry," he continued.
"This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry" through the next 13 hours, until service was restored at 11:30 a.m. April 26, Tankel explained in the blog three days later.
The first loss of service cut off access to data for some users. During the second, all Cloud Foundry applications continued to run, but the developers who normally use the cloud could neither see that nor access them. "We were the only ones who knew that the system was up," said Tankel, in perhaps an unintentional understatement. For developers with no access, Cloud Foundry was, for all practical purposes, down.
Cloud Foundry was launched as a beta service on April 12, so developers in the Cloud Foundry forums have mostly given it a pass since the outage. It's meant to host development projects that use Ruby on Rails, Groovy on Grails, or the Spring Tool Suite of the Spring Framework for Java developers. VMware owns the Spring Framework and its related tools.
The VMware outage differs from the April 21 Amazon incident but bears an eerie resemblance to it, given the human error involved in its second and greater service outage, which ran from late April 25 through the first half of the 26th.
The Amazon EC2 outage was caused by what AWS called "a network event." It appears that a cloud operations administrator mistakenly switched traffic from a primary network onto a secondary backup network. The network needed to be scaled up for early-morning business activity, but instead it was drastically scaled down. Before the error could be corrected, the governing cloud software concluded that a large amount of data on a volume of the Elastic Block Store service was no longer available, setting off a "re-mirroring storm" as the service attempted to restore the data. That in turn choked two key services, EBS and the Relational Database Service, in Amazon's Northern Virginia data center.