
In BC/DR, It's The Small Stuff That Can Get You

Companies that prep for hurricanes, floods, and earthquakes need to make sure they're not sunk by a power strip.
I see many organizations that believe they're well-insulated from major disasters--a sentiment that often grows into complacency, which eventually breeds the mindset that business continuity and disaster recovery (BC/DR) planning and testing are unnecessary expenses. In the 2012 InformationWeek State of Storage Survey, only 38% of the more than 300 respondents asked about their disaster recovery and business continuity strategies said they have BC/DR processes and test them regularly.

After all, how often do thunderstorms knock out big swathes of the Northeast, right?

Given the extreme weather lately, calamities that can affect data centers happen more often than one might expect. But the bigger-picture answer is that disasters come in many forms. If your BC/DR planning centers on catastrophic events like a Sept. 11-style terrorist attack or the June 2012 derecho in the Washington, D.C., metro area that cut power for days, you've only addressed part of the risk. Common, everyday events like component failure, data corruption, a telecom outage, or plain old human error can cause the same level of service disruption.

While it's impossible to plan for every potential contingency, at least for those of us without unlimited budgets, there are a few simple best practices that can ensure appropriate levels of BC/DR for most businesses.

First, you need a mission statement: BC/DR planning must define appropriate measures to protect your organization against conceivable threats that may harm employees, customers, your ability to maintain service-level agreements (SLAs), your brand, your reputation, or any of your corporate values.

When thus encapsulated, it should become apparent that every organization--for-profit or otherwise--must take measures to protect itself from downtime, no matter how mundane or complex the cause. In this column, let's focus on the mundane side of the equation.

Common component failure--we're talking things like physical interface cards, fans, and power supplies--is one of the most frequent causes of service and application downtime in smaller data centers. Case in point: One of my clients, a moderately sized assisted living center, de-energized several critical servers during a planned outage scheduled to last two hours. The servers had been running nonstop for approximately two years. After the servers cooled to ambient data center temperature for almost two hours, several of their power supplies failed to initialize when re-energized. Even though this client had an on-site support agreement covering the power supplies, there were no spares on hand, and it took nearly five hours to receive the replacement units. That wait alone stretched the outage to more than triple its planned duration.

No critical system within a data center should rely solely on any single instance of these components; they should always be redundant. Most data-center-grade equipment is designed to have redundant instances of these components; however, not all organizations take advantage of them. For example, a redundant power supply that is not plugged in, or is plugged into the same power strip or power distribution unit (PDU) as the primary one, won't do you much good.

The key concept here is "separate": To maximize the capability of dual-power supply systems, the power supplies must be plugged into separate PDUs fed from separate breakers in separate power panels routed from separate UPS units. The UPS units should be fed by commercial power and protected by emergency generator power. While commercial power and generators can fail, these are typically the least likely to have frequent or long-term outages when compared with the downstream components; thus you have eliminated single points of failure in the places where failure is most likely to occur.
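To make the "separate" rule concrete, here is a minimal Python sketch of an audit that compares the two power paths feeding a dual-supply server and reports any layer they share. Everything in it is an assumption for illustration: the PowerPath record, the component labels, and the four layers checked (PDU, breaker, panel, UPS) are hypothetical stand-ins for whatever your own inventory records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerPath:
    """Upstream chain feeding one power supply (all labels are hypothetical)."""
    pdu: str
    breaker: str
    panel: str
    ups: str


def shared_components(a: PowerPath, b: PowerPath) -> list[str]:
    """Return the layers at which both paths rely on the same component."""
    layers = ("pdu", "breaker", "panel", "ups")
    return [layer for layer in layers if getattr(a, layer) == getattr(b, layer)]


# Example: separate PDUs and breakers, but both feeds land on one panel and one UPS.
psu_a = PowerPath(pdu="PDU-1", breaker="A-12", panel="PANEL-A", ups="UPS-1")
psu_b = PowerPath(pdu="PDU-2", breaker="A-07", panel="PANEL-A", ups="UPS-1")

print(shared_components(psu_a, psu_b))  # ['panel', 'ups']
```

Any layer the check returns is a single point of failure for that server; an empty list means the two supplies are separate at every layer the sketch examines.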

In addition to ensuring that critical equipment is under an on-site support agreement, IT can combat the problem of common component failure by keeping a reasonable inventory of common spares on site. Most components today, even power supplies and equipment fans, can be field-replaced by a Level 1 or 2 data center technician. Carrying common spares is a cost-effective way to mitigate the risk of outages; think of the expense as an insurance policy to cover the time elapsed between the failure of a component and the installation of a replacement part under the on-site support agreement.

Decisions on which spares to keep on hand should be made based on the downtime tolerance for any given system as compared with the SLA of the on-site repair contract. If a server can only be down for an hour, but the vendor's contracted response time is two or four hours, it makes sense to have spares on hand for that system.
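That comparison is simple enough to sketch in a few lines of Python. The system names and hour figures below are made up for illustration, and the rule treats a vendor response time equal to the downtime tolerance as too slow, on the assumption that installing the part takes additional time after it arrives.

```python
def needs_onsite_spare(downtime_tolerance_hours: float,
                       vendor_response_hours: float) -> bool:
    """True when the contracted response time alone uses up the downtime tolerance."""
    return vendor_response_hours >= downtime_tolerance_hours


# Hypothetical inventory: name -> (downtime tolerance, vendor response), in hours.
systems = {
    "billing-db": (1, 4),
    "file-archive": (24, 4),
    "badge-reader": (4, 4),
}

stock_spares_for = [
    name for name, (tolerance, response) in systems.items()
    if needs_onsite_spare(tolerance, response)
]

print(stock_spares_for)  # ['billing-db', 'badge-reader']
```

A system that can tolerate a day of downtime under a four-hour contract drops off the list, which keeps the spares budget focused on the equipment the support agreement cannot cover in time.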

At one time, BC/DR generally meant a multisite, active-active or active-passive data center configuration involving redundant hardware that frequently sat idle. There are still cases where this is the reality, such as when a regimented approach to assessing risk warrants the investment. For most of us, however, virtualization and other factors have reduced that redundancy. That's a good thing for budgets, but it can be dangerous as well. Don't let a fried NIC worth a couple hundred dollars cost your company thousands or more in downtime.