Google Vs. Zombies -- And Worse
In Google's War Room
"The war room is actually a physical room..." said Krishnan. "We have a lot of tech leads and a lot of coordinators sitting together. So we're communicating with each other constantly. It's a very adrenaline-filled room. Very little sleep. Everybody, when something goes down, we're all stressed and alert. And we are supposed to know everything that's going on at any given point in time."
Because the testing is global, the exercises often keep participants up in the middle of the night. They're powered by caffeine and donuts, which pretty much covers the hacker food pyramid.
The cause of the problem, whether it's rotting vegetables, zombies taking over a data center or something more mundane, isn't as important as the problems that are revealed and the response to them. DiRT exists to increase the likelihood that Google can keep its equipment and operations up and running.
One recent DiRT exercise, for example, involved an earthquake near the company's headquarters that took down a data center housing several internal Google systems. It revealed not only systems that didn't have adequate backups but also unexpected dependencies. Some engineers had systems failing over to workstations at offices in Mountain View, but these became inaccessible when the "earthquake" caused authentication mechanisms to fail.
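The kind of hidden dependency the earthquake drill exposed can be illustrated with a small sketch. It is not Google's tooling: the system names and the dependency map below are hypothetical, chosen only to mirror the scenario in which a failover target quietly relied on authentication that lived in the downed data center. Given a map of what each system needs, the function walks outward from the failed component to find everything that would go down with it.

```python
from collections import deque

# Hypothetical dependency map, invented for illustration only.
# Each system lists the components it needs in order to function.
DEPENDS_ON = {
    "build-dashboard": ["workstation-failover"],
    "workstation-failover": ["auth-service"],
    "auth-service": ["datacenter-a"],
    "wiki": ["datacenter-b"],
}


def affected_by(outage, depends_on):
    """Return every system that directly or transitively depends on `outage`."""
    # Invert the map so we can walk from a failed component to its dependents.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    # Breadth-first walk over the inverted map.
    seen = set()
    queue = deque([outage])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    return seen


if __name__ == "__main__":
    # The failover target looks independent of datacenter-a until the
    # authentication dependency is followed.
    print(sorted(affected_by("datacenter-a", DEPENDS_ON)))
```

In this made-up map, the workstation failover appears safe on its own, yet it is swept up in the outage because it needs the authentication service. A static map like this only catches dependencies somebody has written down, which is one reason a live exercise can still surface surprises.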
Real disasters, such as Hurricane Sandy, have informed how Google deals with imagined ones.
"Sandy corrected a lot of our assumptions," said Krishnan. "That was a real world application of a lot of the things that we've done. We found a lot of gaps that we hadn't addressed at all. We found that some of the things that we decided would work are somewhat contrived and we should fix that."
Google's technical infrastructure weathered Sandy just fine, according to Krishnan. The problems that arose had to do with people: people who had to deal with flooded homes or family emergencies, people who didn't have power or who had lost Internet access, people who didn't have the information they needed to contact others, and people whose concerns overwhelmed incident managers. Google ended up sending many employees based in the New York area home during the crisis.
The problems exposed by Sandy showed up in a subsequent DiRT test: Sandy created a lot of internal company email, "so we actually simulated that environment during our recent DiRT exercise," Krishnan explained. "We started bombarding our incident managers with, 'Hey, I have a flight home. I don't know how to get there. Tell me. The airline is costing me a bazillion dollars. Can you expense this for me? Will you pay for me?'"
To Krishnan's surprise, the test participants responded well. They self-organized and dealt with the problems, she said.
The learning isn't always so swift. During the first DiRT test, only one person was able to find the emergency communication plan and dial in to the conference call at the designated time. A follow-up test produced a far better response, so good in fact that the number of callers exceeded the bridge line's capacity. And a subsequent call was undone by someone who called in and then placed the call on hold, subjecting the other conference call participants to "hold music" and revealing the lack of a mechanism to eject the absent caller or silence the music.
It turns out that it isn't easy to crash Google's systems. Krishnan recounted an attempt to simulate network packet loss that proved ineffective. "Our test bombed on us," she explained. "Then we realized that it was because we chose a certain time of day when there was almost no traffic. And based on the traffic, we had to actually cause a full outage to notice anything. There was no amount of packet loss we could create for anybody to notice anything. We were super-resilient to that."
The goal of another DiRT exercise, Krishnan said, was to test executive decision making. An alert was issued. "Within 15 minutes, some of our most senior executives showed up on a phone bridge," she said. "They were making decisions all over the place. The beauty of the whole thing is -- even through those decisions -- the first thing they thought about was their users." She characterized the call as "the most inspirational eight minutes of any DiRT so far."
DiRT, in conjunction with other quality-control regimes, has helped make Google software better. Over the years, engineers have whitelisted certain applications to exempt them from tests when they know the applications cannot pass. Krishnan says that the number of whitelisted applications has been declining and that now hardly any applications have to be excluded.
"We spend a ton of energy and time building top-notch products for our users," Krishnan explained. "But we also want to build top-notch infrastructure, so that our users believe our systems are reliable and available. DiRT tests to make sure that is true."
Heroism might not be a sustainable model for dealing with disasters, but at Google, it's more than a fictional framework; it's part of the job. "[W]e have had scenarios of zombies, and the incredibly axe- and baseball-bat skilled site reliability engineers and hardware operations techs who save the day," Krishnan said in an email. "In reality, however, all credit goes to the people who work tirelessly each year to make this happen: the DiRT team, the incident commanders, and the rest of those at Google who respond to these intense exercises."
It's often said that failing to plan is planning to fail. However, the converse is not true: Planning to fail isn't failing to plan. Rather, as Google demonstrates, planning to fail is preparing to succeed.