Outage By SUV
Rackspace's managed hosting business and the embryonic Mosso Cloud running in the same Dallas data center were taken offline for several hours by an errant SUV on Nov. 13, 2007.
The driver of a large four-wheel-drive vehicle, who had diabetes, passed out behind the wheel. Instead of drifting to the edge of the street, the vehicle accelerated straight ahead, failed to turn at a T-intersection and jumped the curb to climb a grass berm on the far side. The berm acted as a ramp, launching the SUV into the air over a row of parked cars. When it came down, it slammed into a building housing a power transformer for the Rackspace facility, knocking the transformer out as a power source.
The building's cooling system stopped while switching equipment transferred the load to a secondary utility power source. Processing was not interrupted, because the compute equipment kept running on backup batteries installed for exactly this kind of event. Facility staff had begun restarting the building's chillers when the utility learned that emergency crews were trying to extract the driver from the wrecked vehicle, which was lodged in live transformer equipment. The utility then shut off all power to the facility, cutting off Rackspace's secondary utility source as well.
Again, battery power kicked in and the emergency generators started on cue, as the disaster recovery plan called for. So far, data center processing had not been interrupted, despite the accident and two losses of grid power. However, the second power cut had interrupted the multi-step startup process for the cooling system's large chillers midway through the restart, and some of them could not be restarted without further troubleshooting.
Rackspace president Lew Moorman told customers in a blog post soon after the incident that "two chillers did not restart, prompting the data center to overheat." The heat generated by the compute equipment was enough to send temperatures soaring, so Rackspace managers implemented "a phased equipment shut down lest equipment be damaged" and customer data be lost.
The outage lasted until 10:50 p.m., five hours after the accident. Software-as-a-service provider 37signals, a Rackspace hosting customer, posted its own message to its users: "This 'perfect storm' chain of events beat both our set up and our data center's sophisticated back-up systems. We will work hard to further diversify our systems in order to make any future downtime event like this even more rare." Besides raising the risk of losing customers, the event reportedly cost Rackspace $3.5 million in refunds.