In June 2021, a research study sponsored by iland and Zerto revealed that only 54% of organizations had a formal disaster recovery (DR) plan, fewer than half of those tested their plans at least annually, and 7% of organizations never tested their DR plans at all.
Given the daily work pressures on IT, these results are not entirely surprising. However, that lack of testing creates risks when systems fail and the DR plans need to be activated.
I experienced this first-hand one day when we decided to perform a DR test of our core systems with an offsite DR provider. Why were we performing this test? Because given the project pressures we had been under, we knew that we had been remiss. This failover hadn’t been tested for two years, and we knew we’d better do it.
We coordinated with our offsite data center for the test. All of us figured that the test would be an easy and straightforward failover because, to our best knowledge, all system configurations at both sites were identical.
Together, we watched the failover test fail! The reason, unbeknownst to us prior to testing, was that the offsite provider had not maintained its underlying operating system and subsystems at the same revision levels that we had. This caused the core application systems, which were configured for the software levels at our internal site, to fail at the offsite data center.
We were disappointed, but the best news for us from this effort was that we were only in a test. Working with our outside provider, we made the necessary software configuration adjustments. We updated our procedures for assuring software synchronization and committed to testing failover scenarios twice each year. Nevertheless, I kept thinking to myself: What if we had actually needed failover of production systems because of some disastrous event? The failover would have failed.
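A software synchronization check of the kind described above can be partly automated. The following is a minimal sketch, assuming each site can export an inventory that maps component names to revision levels. The function name, component names and version strings are all hypothetical, and a real check would pull these inventories from each site's own tooling.

```python
def find_version_drift(primary, dr_site):
    """Return {component: (primary_version, dr_version)} for every mismatch.

    A component missing from one site is reported with None on that side.
    """
    drift = {}
    for component in sorted(set(primary) | set(dr_site)):
        ours = primary.get(component)
        theirs = dr_site.get(component)
        if ours != theirs:
            drift[component] = (ours, theirs)
    return drift


# Hypothetical inventories, invented for illustration only.
primary_site = {"os": "8.4", "database": "13.3", "app_runtime": "11.0.11"}
offsite_dr = {"os": "8.2", "database": "13.3"}

for component, (ours, theirs) in find_version_drift(primary_site, offsite_dr).items():
    print(f"{component}: primary={ours} dr={theirs}")
```

Running a comparison like this on a schedule would not replace a failover test, but it could surface mismatched revision levels between tests.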
Today, many CIOs are rolling the dice on this same issue. They report to their boards and upper management that the DR plan is fully documented and in place. An outside auditor even comes in and determines that every item on the audit checklist is accounted for, giving the plan a thumbs up.
But does the plan actually work?
How to Know Your DR Plan Really Works
The only way you’ll know that your DR plan really works in practice is to test it in a simulated production scenario.
DR testing can be time consuming. That makes it a formidable challenge for companies and their IT groups. There is also that human tendency to put projects like DR testing at the back of the to-do list, since the likelihood of a full-blown disaster actually happening is small.
However, this doesn’t mean that you can simply write your DR plan and forget about it.
CIOs and IT leaders have to develop a middle-ground DR strategy that includes time for testing to ensure that the DR plan actually works.
Developing DR Testing Strategies
Developing DR testing strategies means that you are actually going to test the viability of your DR plan at regular intervals. How do you do this when the perceptions of staff, management and even the board are that DR plan testing is a “back of the line” project that you only work on when you have time (which you never have)?
Here are four key steps:
1. Define DR plan testing as a fundamental building block of your risk management strategy.
Every organization looks at risk management today. They evaluate risk when deciding how much liability coverage they want to pay for. They “shock” their financials to simulate how the company will perform under both lower and higher revenue outlooks. They invest in cybersecurity software to prevent data breaches and intellectual property theft.
The IT disaster recovery plan, and a commitment to testing it regularly to ensure that it works in practice, should be part of the corporate risk management strategy. Unfortunately, most corporate risk management strategies leave DR plan testing out. CIOs should be making the case for including it to their CEOs and boards.
2. Schedule regular testing with your offsite DR and failover providers.
If you are backing up core systems for failover at an offsite provider, arrange with the provider to test the DR plan failover at least annually. The test will confirm whether seamless failover of production actually works as documented.
No provider is going to volunteer this, so it is up to IT to make the arrangements and to budget the time, staff and money for them. IT should present DR plan testing to upper management and the board as a fundamental risk management measure.
3. Update policies, procedures, and training.
One area in which IT most often underperforms is documentation and training. No one likes to take time away from “real work” to document. The task of training or retraining personnel is even less welcome.
If testing reveals a needed change in your DR failover plan, or in the policies and procedures that support it, both your own IT group and your outside failover provider (if you use one) should make these changes and train staff promptly. A commitment to getting this work done within two weeks of the failover test is a good metric. This ensures that any changes you make will be fresh in everyone’s minds.
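The two-week metric is simple enough to track in a few lines. This is a minimal sketch, assuming follow-up tasks are recorded with a completion date (or none if still open). The task names and dates are hypothetical.

```python
from datetime import date, timedelta

FOLLOW_UP_WINDOW = timedelta(days=14)  # the two-week target suggested above


def follow_up_due(test_date):
    """Date by which documentation and training updates should be finished."""
    return test_date + FOLLOW_UP_WINDOW


def overdue_items(test_date, items, today):
    """Return the tasks that missed the window.

    `items` maps a task name to its completion date, or None if still open.
    """
    due = follow_up_due(test_date)
    return [
        name
        for name, done_on in items.items()
        if (done_on is None and today > due)
        or (done_on is not None and done_on > due)
    ]


# Hypothetical follow-up list after a failover test on June 1, 2021.
tasks = {
    "update runbook": date(2021, 6, 10),
    "retrain operators": None,
    "provider procedure update": date(2021, 6, 20),
}
print(overdue_items(date(2021, 6, 1), tasks, today=date(2021, 6, 22)))
```

Here the runbook update met the June 15 deadline, while the other two tasks would be flagged.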
4. Communicate and make DR plan testing and readiness a part of your corporate culture.
I don’t know of anyone in IT who enjoys testing or updating disaster recovery plans. Users like DR plan testing even less because it can make systems unavailable, interfering with users' ability to get their work done.
This is why IT should maintain open communications with end users and management about when DR plans will be tested, and how long systems will be unavailable. IT should also be thoughtful about when it executes these DR plan tests. For example, don’t plan a DR production failover test at a financial month-end close.
Users (and management) might not like the idea of DR plan testing rendering systems temporarily unavailable, but they will understand why this testing is necessary, and advance notice from IT will enable them to adjust their work plans.
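The month-end close example above can be turned into a simple scheduling guard. This is a minimal sketch, assuming the close occupies the last three days of a month and the first five days of the next. Those window sizes and the function name are assumptions for illustration, and each finance team will have its own calendar.

```python
import calendar
from datetime import date


def in_month_end_close(day, days_before=3, days_after=5):
    """True if `day` falls in the assumed close window around a month end."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    # Last few days of the current month.
    if day.day > last_day - days_before:
        return True
    # First few days of the month, while the prior month's books are closing.
    return day.day <= days_after


# Hypothetical candidate dates for a failover test.
for candidate in (date(2021, 6, 15), date(2021, 6, 29), date(2021, 7, 2)):
    status = "avoid" if in_month_end_close(candidate) else "ok"
    print(candidate.isoformat(), status)
```

A check like this is only a starting point. The advance notice to users and management still has to come from IT.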