Our Cloud Disaster Recovery Story
I put my money where my mouth is with cloud DR, and it not only benefited my organization but also earned us a prestigious award.
As a student and practitioner of enterprise cloud computing, I've written a lot over the past five years about what I’ve learned. So I'm very pleased that my organization has won a prestigious award for one of our cloud computing efforts, competing against organizations four times our size. Here’s our story and a bit of analysis behind this "five-year overnight success."
What exactly did we win? The Amazon "City In A Cloud" competition for midsized cities, which came with a pretty hunk of Lucite and a $50,000 service credit. Sweet.
Mind you, there aren't a huge number of cities even experimenting with cloud computing. Nonetheless, our city, Asheville, N.C., with a daytime population of 120,000, was up against a field of innovators that included Tel Aviv, Israel (population 414,000), Almere, Netherlands (196,000), and Santa Clarita, Calif. (209,000).
Lest you say this award was all about vendor shenanigans, the third-party judges included Scott Case of Startup America; St. Paul, Minn., Mayor Christopher Coleman, president of the National League of Cities; Bob Sofman, co-executive director of Code for America; and luminaries from The Aspen Institute, the White House Office of Social Innovation and Civic Participation, and other organizations.
What did we win for? We used automation software to do real-time syncing of production systems to cloud storage, which meant paying for the software and the storage, but no compute -- until we needed it. It meant being able to fail over when needed with a high level of confidence, knowing that the disaster recovery system is exactly the same as the production system, or at least within a few hours of that state.
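I won't pretend to publish our exact runbook here, but the "pay for storage, not compute, until you need it" idea is simple enough to sketch. Here's a minimal illustration, assuming AWS with the boto3 SDK; the AMI IDs, subnet, and instance type are hypothetical placeholders, not our production configuration.

```python
# Minimal sketch of the "storage now, compute only at failover" pattern.
# Assumes AWS + boto3; IDs below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Machine images kept current by the continuous sync (hypothetical IDs).
DR_IMAGES = {
    "app-server": "ami-0123456789abcdef0",
    "db-server": "ami-0fedcba9876543210",
}

def declare_disaster(subnet_id="subnet-0a1b2c3d4e5f67890"):
    """Launch compute from the synced images only when we actually fail over."""
    launched = {}
    for name, ami in DR_IMAGES.items():
        resp = ec2.run_instances(
            ImageId=ami,
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": f"dr-{name}"}],
            }],
        )
        launched[name] = resp["Instances"][0]["InstanceId"]
    return launched
```

Until something like declare_disaster runs, the only recurring bill is for the replication software and the storage behind those images.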
Why is that important? Well, let’s start with a stat from the InformationWeek survey underpinning my recent Cloud Disaster Recovery Tech Digest. Among the 430 business technology pros who responded to the survey, all of whom are involved with their organizations’ backup systems, just 23% said they're extremely confident they could get the business up and running again in a reasonable time frame after a disaster that takes out the main datacenter.
If you’re part of the 77% who aren’t extremely confident, you're not alone. We weren't confident.
A happy problem
After 20-plus years supporting public safety, I'm prone to thinking about catastrophes. So when I arrived at my present organization, I asked: How do we handle disaster recovery? The answer: We have a DR center. Hooray! Then I discovered that it was two blocks away from the main site. Oy, vey!
I've always thought that DR is best handled by external providers, but in this case, with facilities available to us and no provider able to deliver the level of service we expected at a price we could afford, we decided to go with an internal solution.
So we planned to build a regional DR center as a capital project, an add-on to a planned fire station. We had to be patient: Construction wouldn’t happen for a couple of years. Then the project was canceled in 2011. Enormous problem, or fortuitous opportunity?
Having dabbled in cloud computing since 2009, we started thinking: Maybe we don't need another data center. I had joined the organization around the time of Hurricane Katrina and vividly remembered the problems with regional datacenters during that disaster. So even moving our DR center 12 miles away, as we had planned, might not be good enough. Moving the "virtual DR center" several states away, to a cloud datacenter nowhere near us, made a lot of sense.
At the time, it was daunting to move virtual machines into any public cloud, especially with the level of automation we were looking for. So we kept experimenting and investigating.
Startup risks and rewards
Then I got pitched by a startup vendor, CloudVelox, about automated cloud disaster recovery. (I’m startup-friendly, though I do delete at least 49 of every 50 vendor pitches I receive due to lack of relevance or understanding of our business goals.) I read CloudVelox's pitch. I was interested, but we were talking about production systems important enough to merit the type of investment that DR demands, so I wasn't about to approach this willy-nilly.
The question was: How to approach the risk, and how to get permission from those who own the systems?
In terms of approach, we took it slow, following a "small jump, medium jump, high jump" progression: We deployed one low-risk server using the startup vendor's methodology, then one mid-risk server, then a mid-risk n-tier application. Armageddon didn't ensue.
In terms of permission, our IT organization has earned credibility with other business units in our city. We offer a high level of uptime. If we screw up, we admit it and communicate about it. Although we must enforce policy, we aren't the No Police. And we recognize that we aren’t the owners of systems; we're the custodians.
All of that cred added up in this case to approval to gradually move production systems into a new type of disaster recovery: automated synchronization and deployment into a public cloud provider, in this case Amazon Web Services.
Our application staff was all in favor of a DR system that would be automated and available on an ad hoc, easy-to-test basis. Our infrastructure staff was understandably a little freaked out: Production systems in the cloud? Security nightmare!
We put the app staff in charge of the project -- they had skin in the game because the old failover methods were labor-intensive and hard to test. And because our infrastructure folks had legit, specific concerns beyond their initial emotional reactions, we designed the deployment to address those concerns. We also hired an auditor to put the system through its paces before going into production.
When we went into production, we were amazed. Not only could we fail over in less than an hour (compare that result to those weekend-long DR exercises where IT runs around like the Keystone Cops trying to figure out patch levels and whether apps are working OK), but the performance of the systems was also pretty awesome, even though we were running in a West Coast AWS region. (We're in North Carolina.)
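Part of what makes the hour-long failover believable is scripting the "are the apps working OK?" check instead of eyeballing screens. Below is a stripped-down sketch of that kind of post-failover smoke test; the host names and health paths are made up for illustration, not our actual systems.

```python
# Hypothetical post-failover smoke test: confirm each recovered app answers
# on its DR endpoint before declaring the failover complete.
import urllib.request

DR_ENDPOINTS = {
    "permits-app": "https://permits-dr.example.org/health",
    "finance-app": "https://finance-dr.example.org/health",
}

def smoke_test(timeout=10):
    results = {}
    for name, url in DR_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[name] = (resp.status == 200)
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for app, ok in smoke_test().items():
        print(f"{app}: {'OK' if ok else 'FAILED'}")
```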
Lessons learned
It wasn't, and still isn't, rainbows and sunshine. We learned several things worth sharing.
DNS. Your enterprise DNS is probably more messed up than you think. It boils down to the enterprise propensity to define an "internal" versus an "external" DNS zone. No surprise: When you expect your app to be available via AWS, and you want to plan for your headquarters being gone due to an earthquake or other disaster, you might want to use globally distributed DNS, even for internal apps. You don't want to worry about manually configuring clients to use global, rather than internal, DNS entries when you're worried about your headquarters falling into a gigantic crack in the Earth's crust.
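I won't claim any particular DNS product is required, but to make the idea concrete: With a globally distributed service such as Amazon Route 53, one record name can answer with the production address normally and the cloud DR address when a health check fails, so nobody reconfigures clients during the crisis. Here's an illustrative sketch using boto3; the hosted zone ID, record name, addresses, and health-check ID are hypothetical.

```python
# Sketch of globally distributed DNS failover, assuming Amazon Route 53 + boto3.
# Zone ID, record name, IPs, and health-check ID are hypothetical.
import boto3

r53 = boto3.client("route53")
ZONE_ID = "Z0123456789EXAMPLE"

def upsert_failover_pair(name, primary_ip, dr_ip, health_check_id):
    """One record name resolves to production normally and to the DR copy
    in the cloud when the primary's health check fails."""
    changes = [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": name, "Type": "A", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": primary_ip}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": name, "Type": "A", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": dr_ip}],
        }},
    ]
    r53.change_resource_record_sets(
        HostedZoneId=ZONE_ID, ChangeBatch={"Changes": changes})

# e.g. upsert_failover_pair("permits.example.org.", "203.0.113.10",
#                           "198.51.100.20", "abcdef01-2345-6789-abcd-ef0123456789")
```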
Licensing. It can be a bear when you move systems. To wit: Many proprietary systems rely on a license key based on a host ID, and when the host ID changes, the system won't work, or it will work in substandard "evaluation mode." A quick call to our vendor revealed exactly what the procedure is for emergency/disaster recovery. It's a good thing to know before the disaster.
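For the curious, here's a toy illustration (not any real vendor's scheme) of why those keys break: If the key is derived from a host identifier such as the primary MAC address, the recovered VM in the cloud presents a different identifier and the check fails.

```python
# Illustration only: why host-ID-bound licenses break after a move to the cloud.
# This toy check is not any real vendor's scheme.
import hashlib
import uuid

def host_fingerprint():
    # uuid.getnode() returns the primary MAC address as an integer;
    # a recovered VM in the cloud gets a new virtual NIC, so this changes.
    return hashlib.sha256(str(uuid.getnode()).encode()).hexdigest()[:16]

def license_is_valid(license_key):
    # The vendor issued license_key against the original host's fingerprint;
    # on the recovered VM the fingerprint differs, the check fails, and the
    # product drops into "evaluation mode" until the vendor reissues a key.
    return license_key == host_fingerprint()
```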
Bandwidth. Synchronizing your databases to the cloud periodically isn't for the faint of heart. Our DR vendor synchronizes at the block level in Windows Server, which creates a lot of traffic; it just about doubled our average utilization. We're fortunate that we have more broadband providers in town than the usual duopoly, so bandwidth is both plentiful and relatively inexpensive. So the question becomes: Is it more cost-effective to double your bandwidth, or to farm DR out to a provider or build a private DR center? In our case, upping our bandwidth is the far better alternative, but I know that's not true everywhere.
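If you're sizing this for your own shop, the back-of-envelope math is straightforward. The sketch below uses hypothetical change rates and overhead factors, not our actual numbers; plug in your own before comparing against a DR provider contract or a second data center.

```python
# Back-of-envelope estimate of the WAN load from continuous block-level sync.
# The change rate and overhead factor are hypothetical assumptions.
def sync_bandwidth_mbps(daily_changed_gb, sync_window_hours=24, overhead=1.3):
    """Average upstream Mbps needed to replicate a day's changed blocks,
    padded for protocol and retransmit overhead."""
    megabits = daily_changed_gb * 8 * 1000  # GB -> megabits (decimal units)
    return megabits * overhead / (sync_window_hours * 3600)

if __name__ == "__main__":
    need = sync_bandwidth_mbps(daily_changed_gb=400)  # hypothetical change rate
    print(f"~{need:.0f} Mbps of sustained headroom for replication")
    # Compare the cost of that extra circuit capacity to a DR provider
    # contract or the capital cost of a second data center.
```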
Ultimately, we probably will still move our current alternative datacenter to another location to back up things like VoIP and public safety radio. But I can tell you this: That new datacenter will cost far less, and it will be far smaller, than the one we initially planned to build. And we won't waste money buying duplicate gear, either. Another important outcome: Because of the cost reduction (about a tenth of the capital cost, according to our infrastructure manager), we now also protect systems that are "important, but not urgent," systems that were too expensive to protect in the past.