How Columbia Sportswear Will Survive Next Tsunami: Cloud
International sportswear provider pursues hybrid cloud disaster recovery plan to avoid another data center shutdown like the one that followed the 2011 Japan tsunami.
Columbia Sportswear, the $1.7 billion-revenue outdoor clothing retailer, learned a hard lesson in March 2011 when a tsunami struck the Fukushima region of Japan. Its data center in Tokyo remained intact, but soon stopped operating due to frequent electrical power interruptions.
"A clothing manufacturer is not high on the list of those getting emergency diesel power," noted Michael Leeper, director of global technology for Columbia. IT staff in Tokyo had access via the Internet to facilities in other parts of the country and the world, and could have reached duplicate systems located elsewhere if Columbia had had any. But "the country was going through turmoil as we still tried to conduct business. We couldn't sustain power. We had to shut off the equipment for days at a time," he recalled in an interview.
As it was, even if they found a site in Japan with steady power, they'd need to retrieve data off tapes. If a three-petabyte data transfer took several days, then the Tokyo data center was likely to be disrupted somewhere in the middle of it. The lesson irrevocably sank in. "We didn't realistically have a disaster recovery plan ... we'd have to pray that the data came back off the tape," he said.
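To put the transfer-time problem in rough numbers, the following is a minimal sketch of the arithmetic. It assumes decimal petabytes and a perfectly sustained link; the link speeds and the three-day window are illustrative assumptions, not figures reported by Columbia.

```python
# Back-of-the-envelope transfer-time estimate. All inputs are assumptions
# chosen for illustration, not figures reported by Columbia.

PETABYTE_BYTES = 10**15  # decimal petabyte (assumption)
SECONDS_PER_DAY = 86_400


def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link of link_gbps gigabits per second."""
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    seconds = data_bytes * 8 / usable_bits_per_second
    return seconds / SECONDS_PER_DAY


def required_gbps(data_bytes: float, days: float) -> float:
    """Sustained gigabits per second needed to move data_bytes within the given days."""
    return data_bytes * 8 / (days * SECONDS_PER_DAY) / 1e9


if __name__ == "__main__":
    data = 3 * PETABYTE_BYTES
    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gbps link: {transfer_days(data, gbps):,.1f} days")
    print(f"Needed for a 3-day transfer: {required_gbps(data, 3):.1f} Gbps")
```

Under these assumptions, a sustained 10 Gbps link would need roughly 28 days to move three petabytes, and finishing in three days would take more than 90 Gbps. That is why a bulk restore in the middle of rolling power cuts was not a realistic plan.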
Leeper, however, is part of the generation of managers that has enthusiastically embraced virtualization. In the U.S., Columbia has two data centers, one in its headquarters city of Portland, Ore., and another in Denver, with one facility serving as the recovery center for the other. Leeper has pushed VMware virtualization deep into the heart of the U.S. data centers, moving from about 15% virtualized to 96% in about nine months. It's routine to move virtual machines from one rack to another via live migration, known as vMotion. But in Japan, he had no way for his systems to escape the aftermath of the tsunami.
"We use vMotion to move servers 30 feet, 50 feet, or 100 feet in the data center. Why not use it to move them hundreds of miles?" he asked. In effect, he was looking for a hybrid cloud style of operation where the cloud could take up the slack for the 3-4 weeks needed to get a company facility reliably operational again. That would require a smooth transfer of operational responsibility from one site to another. It would also require an up-to-date data stream that could be shut down at one site and resurrected at another.
Leeper talked to large cloud providers, including Terremark (a unit of Verizon) and Savvis (a unit of CenturyLink), about using them as disaster recovery sites in the U.S. "They wanted us to bring them our workloads to run in their data centers. Then we'd set up disaster recovery. They didn't understand what we were talking about," he recalled.
Columbia is a VMware ESX Server and vSphere user. Leeper found a smaller, VMware-compatible, regional infrastructure service provider in Tier 3, a firm in nearby Seattle and a VMware partner. Columbia experimented with Tier 3 and found that, using VMware vCloud Suite with vSphere, vCloud Director, and vCenter Site Recovery Manager, it functioned fine as a hosting service at which Columbia systems could be quickly initiated.
But for Tier 3 to function as a temporary disaster recovery site, Columbia needed a way to continuously, but cheaply, replicate business data. Without that, it might not succeed in pulling everything needed off tapes. Even if all the taped data materialized, there would be some inevitable setback: some data missing from the day of the disaster, or possibly from the several days since the last tape had been made, which would delay the business's ability to reopen all operations.
Columbia likes the concept of a cloud facility providing a backup and recovery center, but Leeper's staff is studying how to replicate data to it at a price that Columbia considers acceptable. "How do we seed the data to this site before we need the disaster recovery?" he asked.
For that matter, even if Columbia created such a facility tomorrow at Tier 3, "they don't quite have the global reach we need," Leeper added, thinking of his Japan operation. That is, the latencies caused by the distance between Seattle and Tokyo would mean greater delays in processing than the company wants to put up with. If Columbia's going to move disaster recovery to the cloud, the cloud provider needs to have facilities in the parts of the world that are key to Columbia. Then he wants to be assured Columbia's backup and recovery systems can be moved from one site to another, since no one can be sure where disaster will strike next. And he doesn't want to be forced to establish full disaster recovery sites in each part of the world where Columbia operates.
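The distance penalty can be estimated with a short calculation. This sketch computes a physical lower bound on round-trip time from the great-circle distance between two cities. The city coordinates and the speed of light in optical fiber (about 200,000 km per second) are assumptions for illustration; real network paths are longer and add switching and queuing delay.

```python
import math

# Illustrative lower bound on network round-trip time between two cities.
# Coordinates and the fiber propagation speed are assumptions.

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_SECOND = 200_000.0  # roughly two-thirds of the speed of light

SEATTLE = (47.6062, -122.3321)  # approximate latitude, longitude
TOKYO = (35.6762, 139.6503)     # approximate latitude, longitude


def great_circle_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Haversine distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    d_phi = phi2 - phi1
    d_lambda = math.radians(b[1] - a[1])
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds over fiber for the given distance."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


if __name__ == "__main__":
    distance = great_circle_km(SEATTLE, TOKYO)
    print(f"Great-circle distance: {distance:,.0f} km")
    print(f"Minimum round trip:    {min_round_trip_ms(distance):.0f} ms")
```

With these inputs the distance comes out near 7,700 km, which puts the floor on a single round trip at roughly 77 milliseconds. Any operation that waits on several round trips across the Pacific pays that cost each time, before any real-world network overhead is added.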
Columbia continues to use Tier 3--"they're a great partner," said Leeper--but he is trying to find a less expensive way to replicate data to the cloud than those available today, one that can move data between his data centers or between the cloud data centers that his firm has chosen. He knows hybrid cloud will work for his company, because Columbia's staff has done enough experimentation at Tier 3 in moving recovery systems around to provide proof of concept.
If a disaster occurred in Tokyo, he'd like the option of being able to move his systems at a moment's notice to Shanghai, where replicated data sits waiting to run. He knows that's a goal that may not be far off, but it's still something that he can't do today.
Columbia has 4,000 employees and 50 locations around the world, along with additional data centers in Hong Kong and Strasbourg, France. The concept of hybrid cloud would give it the ability to recover from any type of disaster in any location, instead of setting up a duplicate site near each one.
Leeper wants to be able to move workloads to a temporary site in the cloud, recover--or even move--an existing data center to a new location, and bring it back up again "without end users knowing there's been an outage, except for a 15-20 second pause in their systems." He knows it will be possible as soon as he and his staff solve the problem of frequent data replication.