Commentary | Kris Domich | 7/20/2012

In BC/DR, It's The Small Stuff That Can Get You

Companies that prep for hurricanes, floods, and earthquakes need to make sure they're not sunk by a power strip.

I see many organizations that believe they're well-insulated from major disasters--a sentiment that often grows into complacency, which eventually breeds the mindset that business continuity and disaster recovery planning and testing are unnecessary expenses. In the 2012 InformationWeek State of Storage Survey, only 38% of the more than 300 respondents asked about their disaster recovery and business continuity strategies said they have BC/DR processes in place and test them regularly.

After all, how often do thunderstorms knock out big swathes of the Northeast, right?

Given the extreme weather lately, calamities that can affect data centers happen more often than one might expect. But the bigger-picture answer is, disasters come in many forms. If your BC/DR planning centers on catastrophic events like a Sept. 11-style terrorist attack or the 2011 earthquake in the Washington, D.C., metro area that cut power for days, you've only addressed part of the risk. Common, everyday events like component failure, data corruption, a telecom outage, or plain old human error can cause the same level of service disruption.

While it's impossible to plan for every potential contingency, at least for those of us without unlimited budgets, there are a few simple best practices that can ensure appropriate levels of BC/DR for most businesses.

First, you need a mission statement: BC/DR planning must define appropriate measures to protect your organization against conceivable threats that may harm employees, customers, your ability to maintain service-level agreements (SLAs), your brand, your reputation, or any of your corporate values.

When thus encapsulated, it should become apparent that every organization--for-profit or otherwise--must take measures to protect itself from downtime, no matter how mundane or complex the cause. In this column, let's focus on the mundane side of the equation.

Common component failure--we're talking things like physical interface cards, fans, and power supplies--is one of the most frequent causes of service and application downtime in smaller data centers. Case in point: one of my clients, a moderately sized assisted-living center, de-energized several critical servers during a planned outage scheduled to last two hours. The servers had been running nonstop for approximately two years. After cooling to ambient data center temperature for almost two hours, several of the servers' power supplies failed to initialize when re-energized. Even though this client had an on-site support agreement covering the power supplies, it had no spares on hand, and it took nearly five hours to receive the replacement units--effectively tripling the planned duration of the outage.

No critical system within a data center should rely solely on any single instance of these components; they should always be redundant. Most data-center-grade equipment is designed to have redundant instances of these components; however, not all organizations take advantage of them. For example, a redundant power supply that is not plugged in or is plugged in to the same power strip or power distribution unit (PDU) as the primary one won't do you much good.

The key concept here is "separate": To maximize the capability of dual-power-supply systems, the power supplies must be plugged into separate PDUs fed from separate breakers in separate power panels routed from separate UPS units. The UPS units should be fed by commercial power and backed up by emergency generator power. While commercial power and generators can fail, they are typically the least likely of these components to suffer frequent or long-term outages; by separating everything downstream, you eliminate single points of failure in the places where failure is most likely to occur.
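
One way to make the "separate" rule auditable rather than aspirational is to record each power feed's full upstream path in your asset inventory and check it for overlap. Here's a minimal sketch in Python; the inventory structure and every device, PDU, panel, and UPS name below are invented for illustration:

```python
# Hypothetical audit: flag dual-PSU devices whose two power feeds
# share any upstream component (PDU, power panel, or UPS).
# All names and the inventory format are invented for illustration.

inventory = {
    "db-server-01": [
        {"pdu": "PDU-A1", "panel": "PP-A", "ups": "UPS-A"},
        {"pdu": "PDU-B1", "panel": "PP-B", "ups": "UPS-B"},
    ],
    "web-server-02": [
        {"pdu": "PDU-A1", "panel": "PP-A", "ups": "UPS-A"},
        {"pdu": "PDU-A1", "panel": "PP-A", "ups": "UPS-A"},  # both feeds on one PDU
    ],
}

def shared_components(feeds):
    """Return the upstream components two power feeds have in common."""
    a, b = feeds
    return [level for level in ("pdu", "panel", "ups") if a[level] == b[level]]

for device, feeds in inventory.items():
    if len(feeds) < 2:
        print(f"{device}: only one power feed -- no redundancy at all")
        continue
    shared = shared_components(feeds)
    if shared:
        print(f"{device}: single point of failure at {', '.join(shared)}")
    else:
        print(f"{device}: power paths fully separated")
```

Run against a real inventory export, a check like this catches exactly the failure mode described above: a "redundant" supply that quietly shares a power strip with its primary.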

In addition to ensuring that critical equipment is under an on-site support agreement, IT can combat the problem of common component failure by keeping a reasonable inventory of common spares on site. Most components today, even power supplies and equipment fans, can be field-replaced by a Level 1 or 2 data center technician. Carrying common spares is a cost-effective way to mitigate the risk of outages; think of the expense as an insurance policy to cover the time elapsed between the failure of a component and the installation of a replacement part under the on-site support agreement.

Decide which spares to keep on hand by comparing each system's downtime tolerance with the SLA of the on-site repair contract. If a server can be down for only an hour but the vendor's contracted response time is two or four hours, it makes sense to stock spares for that system.
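
That comparison is simple enough to automate across an equipment list. A minimal sketch, with hypothetical system names and hours; the rule is just that a spare belongs on the shelf whenever the contracted response time exceeds the downtime tolerance:

```python
# Hypothetical spares-planning check: stock a spare on site whenever the
# vendor's contracted response time exceeds the system's downtime tolerance.
# The systems and hours below are invented for illustration.

systems = [
    # (name, max tolerable downtime in hours, contracted response in hours)
    ("billing-db",    1, 4),
    ("file-server",   8, 4),
    ("edge-firewall", 1, 2),
]

for name, tolerance_hrs, response_hrs in systems:
    if response_hrs > tolerance_hrs:
        print(f"{name}: keep spares on hand "
              f"(tolerates {tolerance_hrs}h, vendor responds in {response_hrs}h)")
    else:
        print(f"{name}: vendor contract alone is sufficient")
```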

At one time, BC/DR generally meant a multisite, active-active or active-passive data center configuration involving redundant hardware that frequently sat idle. There are still cases where this is reality, such as when a regimented risk assessment warrants the investment. For most of us, however, virtualization and other factors have reduced that redundancy. That's good for budgets, but it can be dangerous as well. Don't let a fried NIC worth a couple hundred dollars cost your company thousands or more in downtime.

Comments

jsun_EVault | User Rank: Apprentice | 7/24/2012 6:57:11 PM
re: In BC/DR, It's The Small Stuff That Can Get You
Kris, you make some excellent points (with examples) that "common, everyday events like component failure, data corruption, a telecom outage, or plain old human error can cause the same level of service disruption" as a natural disaster, and that organizations need to plan and test for both. Forrester analyst Rachel Dines wrote a report last year that broached this topic, calling it IT service continuity management.

Regarding your summation, I wanted to point out that virtualization and other factors, such as new storage technologies and the cloud, also allow for new types of recovery services that can help organizations address IT service continuity challenges. These cloud recovery services not only protect against natural disasters but can also deliver proactive failover support, providing a zero-downtime alternative for planned maintenance, site outages, and upgrades, as well as the examples you outlined above.