The Most Crucial Elements of Any Cyber-Resilience Testing Plan

By taking a methodical approach to resiliency testing, organizations can vastly improve their ability to bounce back quickly and securely.

Reza Morakabati, CIO, Commvault

June 13, 2024


When it comes to cybersecurity, resilience may prove even more crucial to a business than any mere defense.    

Trying to keep bad actors out is important, of course. But ransomware and other forms of attacks are becoming more pervasive -- even inevitable. And the ability to get data systems back online after an incident is just as crucial as trying to detect and defend against an attack in the first place.    

But the only way to truly know the readiness and resiliency of the IT environment is to test it. The trouble is, too few companies conduct adequate testing, if they do it at all. No amount of preparation can ever handle all contingencies, of course. However, by taking a methodical approach to resiliency testing, organizations can vastly improve their ability to bounce back quickly and securely.

It’s a lot like training a hockey team. At first, the plays may seem chaotic. Players don’t immediately grasp how their individual movements contribute to the team’s overall flow. But by breaking it down into processes, and then perfecting each one through practice, the coordination improves. And increasingly, the team can execute more complex plays.    

Understanding the Landscape 

When building a testing strategy, companies should map their recovery operations into three segments: the people, the process, and the technology involved.

Understanding how they all must work together will lead to a stronger and faster recovery:   

  • People: These should be the roles that are critical to returning to a safe state as quickly as possible. This could include investigators, forensics analysts, and the security response team.   

  • Process: Many companies will have some incident response plan in place. But in the chaos of an incident, these well-thought-out protocols can quickly break down. That’s why testing is so critical. Companies should put the security and IT teams through trial runs, including tabletop exercises. Then, they can pinpoint any troublesome areas. And they can start to detect areas for potential automation.    

  • Technology: Businesses need to know what their most critical IT systems are. These are the programs that, in the event of an attack, natural disaster or other incident, must be back up and running the fastest. Often, these aren’t apparent to the security teams. End users can help pinpoint the most critical apps.     

Create a 'Recovery' Role

Today, recovery rarely has its own dedicated specialist. Instead, the role is spread across many different positions. Or there’s no one person actually tasked with ensuring the business can get back online.   

That must change. If the business can afford it, hiring a dedicated individual to oversee data recovery is the most effective option. This person should spend their time talking to employees across the business to figure out what IT tools are critical to their jobs. Then, the company can begin to categorize applications by their necessity. Naturally, the more vital the system, the quicker it will need to be restored.   
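As a rough illustration of that categorization step, the sketch below groups applications into recovery tiers based on how many surveyed employees named them as critical. The function name `tier_applications`, the thresholds, and the sample responses are all hypothetical, chosen only to show the idea; a real program would weigh business impact, not just head counts.

```python
from collections import Counter

def tier_applications(survey_responses, tier1_share=0.5, tier2_share=0.25):
    """Group applications into recovery tiers by the share of
    respondents who named each one as critical to their job.

    The thresholds are illustrative assumptions, not a standard.
    """
    tiers = {1: [], 2: [], 3: []}
    total = len(survey_responses)
    if total == 0:
        return tiers
    # Count each application at most once per respondent.
    votes = Counter(app for response in survey_responses for app in set(response))
    for app, count in sorted(votes.items()):
        share = count / total
        if share >= tier1_share:
            tiers[1].append(app)   # restore first
        elif share >= tier2_share:
            tiers[2].append(app)   # restore next
        else:
            tiers[3].append(app)   # restore when capacity allows
    return tiers

# Hypothetical answers from five employees about the tools they rely on.
responses = [
    ["email", "payroll"],
    ["email", "crm"],
    ["email", "crm", "wiki"],
    ["email"],
    ["crm", "payroll"],
]
tiers = tier_applications(responses)
print(tiers)
```

The lower the tier number, the sooner the system would need to be restored, which gives the recovery owner a starting order to validate with the business.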

Someone in a dedicated recovery role can also spend more time running through possible scenarios in their head. Then, they can build the right response plans to make recovery much faster.   

However, many IT and security teams won’t have the budget for a specialist. In those instances, adding the responsibility of overseeing recovery to someone’s daily job could make a difference. That will mean removing some other, less vital responsibilities from this individual’s workload. Otherwise, they won’t be able to adequately focus on testing the company’s ability to bounce back from an incident.   

And small improvements in testing can make a big difference. Even dedicating a few hours to testing could lead to faster and more efficient recoveries. But it all begins with making someone responsible for examining the existing threat landscape, pinpointing the biggest areas of risk, and then establishing the right response plans.    

Taking an Iterative Approach 

Many IT teams won’t have the money, time, or capability to test their entire environment. Instead, companies should start where they can, then strive to gradually test more.   

A business might only be able to test 20% of their environment. That’s why they should start with the most important 20%. Eventually, the company will be able to more seamlessly run these recovery tests. Then, they can gradually add other applications to the mix. Soon, they're testing an increasingly higher portion.   
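To make the idea concrete, here is a minimal sketch of choosing a test scope that grows over time. It assumes each application already has a criticality score; the scores, the application names, and the `select_test_scope` helper are hypothetical.

```python
import math

def select_test_scope(apps, coverage):
    """Return the names of the most critical apps that fit within
    the given coverage fraction of the environment.

    apps is a list of (name, criticality_score) pairs; higher scores
    mean more critical. The scores are assumed inputs.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be greater than 0 and at most 1")
    # Most critical first; ties keep their original order.
    ranked = sorted(apps, key=lambda item: item[1], reverse=True)
    # Always test at least one application when any exist.
    count = max(1, math.floor(len(ranked) * coverage))
    return [name for name, _ in ranked[:count]]

# Hypothetical inventory of ten applications with criticality scores.
apps = [
    ("wiki", 2), ("payroll", 8), ("email", 9), ("crm", 7), ("identity", 10),
    ("intranet", 3), ("billing", 8), ("chat", 5), ("analytics", 4), ("printing", 1),
]

first_round = select_test_scope(apps, 0.20)   # start with the top 20%
later_round = select_test_scope(apps, 0.40)   # widen the scope once tests run smoothly
print(first_round)
print(later_round)
```

Because each later round is a superset of the earlier one, the team keeps rehearsing its most important recoveries while adding new applications to the mix.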

But it’s unlikely enterprises will ever test their full IT environment. In fact, even striving for 100% might be a mistake. It’s a goal few organizations will be able to achieve. They don't have the ability to take critical systems -- like Active Directory -- offline. It would cripple operations. Cleanrooms, which are secure, separate spaces designed to be isolated from infected software or hardware, are helping organizations work around that constraint. The technology provides a safe testing area in the cloud. That makes it possible to run recoveries without interrupting daily operations.

Ultimately, being able to quickly and efficiently get even a small section of the IT environment back online can often suffice. It indicates the company’s strategy is sound and, ideally, will be scalable in the event of a more drastic incident.  

Shoot for Failure 

It may sound counter-intuitive, but companies actually want to fail these tests -- at least, some of the time.   

No business can expect constant perfection. Failures highlight the gaps. And ultimately, the more of these gaps the company identifies and fixes, the more resilient the organization will be in the end.   

Increasingly, the failures may not even be a result of any underlying problems. Instead, they're a result of companies running into complications as they try to automate more of the process. Many security functions still rely on manual tasks by overworked specialists. The eventual goal, though, is to be able to run a full restore with just the click of a button. That requires training. Each failure helps an organization get closer to a fully automated process.

At the end of the day, it doesn’t matter if the business is testing 5% of its environment or 50%. The key is just to start testing -- and to establish a repeatable and scalable process to continually loop in a wider swath of technologies.

About the Author(s)

Reza Morakabati

CIO, Commvault

With more than 30 years in the industry, Commvault CIO Reza Morakabati believes that you not only have to have the right strategy, but you must have consistent operations, strong execution, and the ability to scale to differentiate a company. 
He has demonstrated this at Commvault by working with his team to build the organizational structure and operational discipline needed to provide the company with best-in-class IT operations and a scalable business technology framework. 
Prior to joining Commvault, Reza served as Vice President of Business Technology & Operations at Puppet and spent several years in leadership at Pivotal and EMC. Reza, who earned his MBA from MIT Sloan, keeps his entrepreneurial spirit alive by helping small to mid-sized companies advance their operations. 
