Cyber Resiliency: How CIOs Can Prepare for a Cloud Outage

The dangers posed by a cloud outage are clear and omnipresent. Here's how to prepare your organization for the inevitable worst-case cloud scenario.

John Edwards, Technology Journalist & Author

August 1, 2022

Only a few things in life are certain, including death, taxes, and social media rants. Cloud outages are also on that very short list.

Cloud outages can occur for a variety of reasons, including power failures caused by severe weather, equipment failure, code misconfiguration, or even lurking deployment issues, says Tim Potter, a principal at Deloitte Consulting. Most outages are limited in scope. “The outage doesn’t impact all services or all regions where the provider delivers cloud services,” he notes. Sometimes, however, an outage may be widespread or even complete.

Regardless of the scope, when a cloud outage does occur, many organizations are surprised to discover that a quick fix may be impossible. Often, the problem lies solely with the cloud vendor, leaving customers with no choice but to wait for eventual service restoration.

Depending on the outage's severity, customers stand to lose far more than just temporary cloud access. “At this point anything is possible, such as client data leaks and the possibility of valuable intellectual property being stolen,” warns Precious Washington, a senior IT auditor for Schellman, a global independent security and privacy compliance assessor.

Damage Control

Beyond security issues, cloud outages can open the door to cascading disruptions affecting both routine business and mission-critical applications. “This can lead to [issues] ranging from revenue loss to more serious impacts -- such as putting lives at risk in the case of critical health care applications,” explains Ravikanth Ganta, a senior director at business consulting firm Capgemini Americas.

A cloud outage’s seriousness hinges on several factors, including organization preparedness, the zones and regions affected, and the services impacted. “In many cases, businesses that build and run their applications in the cloud can endure a cloud outage with little to no impact if they architect their applications to take advantage of the automated failover capabilities readily available in the cloud,” Potter notes.

Modular applications designed to leverage loosely coupled services will typically experience only a minor drop in availability or performance during a vendor outage and, in many cases, may not be affected at all. “Customers that ... haven’t architected their applications to gracefully failover or redirect traffic to unimpacted zones or regions, will face greater availability challenges when a cloud provider experiences an outage,” Potter says.

Protection Preparation

Guarding against a cloud outage requires a shift in mindset, Potter says. “Historically, CIOs have made large investments in hardening the infrastructure that hosts their applications -- they sought to eliminate incidents that would lead to an outage,” he notes. But in today’s increasingly software-defined cloud world, IT leaders should assume that their infrastructure will undoubtedly fail at some point. To address this inevitability, it's now important to design applications that can instantly route traffic and services around failures to a different cluster, zone, region, or even another cloud service provider.
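
The routing idea described here can be illustrated with a minimal Python sketch. It is not drawn from Potter or any vendor: the `Endpoint` type, the `pick_endpoint` function, and every name and URL in it are hypothetical, and the health check is simulated rather than performed over the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    """A hypothetical deployment target, such as one region of one provider."""
    name: str
    url: str


def pick_endpoint(
    endpoints: Iterable[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Priority order: primary region, secondary region, then a second provider.
# All names and URLs below are made up for illustration.
ENDPOINTS = [
    Endpoint("primary-region", "https://app.region-a.example.com"),
    Endpoint("secondary-region", "https://app.region-b.example.com"),
    Endpoint("other-provider", "https://app.other-cloud.example.com"),
]

# Simulate an outage of the primary region.
down = {"primary-region"}
chosen = pick_endpoint(ENDPOINTS, lambda e: e.name not in down)
print(chosen.name if chosen else "no healthy endpoint")
```

In a real deployment this decision is usually delegated to a load balancer or DNS-based traffic manager; the sketch only shows the priority-ordered fallback logic.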

Archna Bhardwaj, a consulting manager at business advisory firm EY Technology, stresses the importance of detecting and eliminating any single point of failure, particularly for critical workloads. “Applications need to be designed to be fully redundant across zones and/or regions,” she states. Since there's a cost element to consider when creating a fully redundant and highly available system, Bhardwaj advises running a cost-benefit analysis before designing the environment. She also suggests consulting with experts with experience in end-to-end technology transformation projects.
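
A rough version of such a cost-benefit analysis can be sketched in Python. The model is a deliberate simplification and every input is an assumption: it treats zone or region failures as independent, and the availability, outage cost, and redundancy cost figures are placeholders rather than numbers from Bhardwaj or any provider.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability when any one of N independent replicas can serve traffic."""
    return 1.0 - (1.0 - single) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundancy_pays_off(
    single: float,
    replicas: int,
    cost_per_hour_down: float,
    extra_annual_cost: float,
) -> bool:
    """Compare the value of avoided downtime with the cost of the extra replicas."""
    before = expected_downtime_hours(single)
    after = expected_downtime_hours(combined_availability(single, replicas))
    return (before - after) * cost_per_hour_down > extra_annual_cost


# Placeholder inputs: 99.9% per-region availability, two regions,
# $10,000 lost per hour of outage, $50,000 per year for the second region.
print(redundancy_pays_off(0.999, 2, 10_000, 50_000))
```

Real failures are often correlated (a bad deployment can hit every region at once), so the independence assumption makes redundancy look better than it is; treat the result as an upper bound on the benefit.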

Diversifying applications across several cloud providers -- multi-cloud or hybrid cloud -- can go a long way toward reducing the risk of suffering a crippling cloud outage. “Companies can have different providers for different cloud requirements, like IaaS, PaaS, and SaaS solutions,” Bhardwaj notes.

Yet another way to avoid a serious outage is to deploy monitoring and notification technologies. Such tools, once in place, constantly examine the cloud environment's health and status, automatically alerting IT staff when a situation requires immediate attention. “Most cloud providers offer managed services to perform such activities for their customers,” Bhardwaj says. There are also many third-party tools and services for organizations that prefer not to manage such operations internally, she adds.
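
The monitor-and-alert pattern can be reduced to a short Python sketch. It does not represent any particular provider's managed service or third-party tool: the `HealthMonitor` class, its failure threshold, and the simulated probe results are all hypothetical, and a real probe would call an actual health endpoint.

```python
from typing import Callable, List


class HealthMonitor:
    """Alert once when a probe fails several times in a row."""

    def __init__(
        self,
        probe: Callable[[], bool],
        alert: Callable[[str], None],
        failure_threshold: int = 3,
    ) -> None:
        self._probe = probe
        self._alert = alert
        self._threshold = failure_threshold
        self._consecutive_failures = 0
        self._alerted = False

    def check_once(self) -> bool:
        """Run one probe; return True if healthy, False otherwise."""
        try:
            healthy = bool(self._probe())
        except Exception:
            # A probe that raises is treated the same as an unhealthy result.
            healthy = False

        if healthy:
            # Recovery resets the counter and re-arms the alert.
            self._consecutive_failures = 0
            self._alerted = False
            return True

        self._consecutive_failures += 1
        if self._consecutive_failures >= self._threshold and not self._alerted:
            self._alert(f"{self._consecutive_failures} consecutive failed checks")
            self._alerted = True
        return False


# Simulated probe results standing in for real health checks.
results = iter([True, False, False, False, False, True])
alerts: List[str] = []
monitor = HealthMonitor(lambda: next(results), alerts.append)
for _ in range(6):
    monitor.check_once()
print(alerts)
```

Requiring several consecutive failures before alerting is one common way to avoid paging staff over a single transient blip; the right threshold depends on how often checks run.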

Building a Reliable Strategy

A well-planned strategy is essential to cloud service reliability. “It's important to run platforms that are self-healing and to deploy as much automation as possible across the infrastructure and application layers,” Ganta says. “By doing this, recovery will be fast and error-free.”

When developing a cloud reliability strategy, it's important to ensure that security will be maintained during outages. CIOs should work with the CISO to define a framework that's functional, effective, and operational, Washington suggests. “It's important to trust the CISO with full authority and responsibility,” she adds.

Washington also advises organizations to conduct regular backups and to create a duplicate cloud that can be quickly accessed if an outage occurs. “Always plan for the worst and test plans frequently,” she recommends.

Companies Most at Risk

Potter notes that organizations running a large number of legacy applications that were never designed to resist cloud outages, as well as enterprises lacking a robust resiliency culture, tend to be the most vulnerable to cloud service interruptions.

Organizations clinging to a single-region cloud strategy, giving little consideration to high availability and disaster recovery safeguards, are also playing a dangerous game. Such organizations should consider partnering with an experienced global system integrator (GSI) to help define a risk-balanced cloud strategy, Ganta says.

CIOs should urge their teams to keep challenging the status quo and to make their cloud environments as strong and redundant as possible. “The cost and complexity needed to architect an application to run across regions or even multiple cloud providers’ platforms has decreased significantly in the past few years,” Potter notes.

Meanwhile, continuing advances in AIOps (artificial intelligence for IT operations) are helping IT teams anticipate and react to cloud connectivity issues faster and more effectively. Coupled with automated failover routines, these tools can help organizations achieve greater business resiliency at a relatively low cost. “Consider running competitions to inspire innovative approaches that will increase your organization's ability to maintain service availability, even if your cloud provider experiences an outage,” Potter suggests. “You'll likely be surprised by the solutions generated by your team.”

Any organization developing a cloud strategy should design its environment in a way that meets its unique requirements. Additionally, once a cloud strategy is operational, it's important to ensure that it's functioning properly and meeting its anticipated performance levels. “Having a multi-cloud, multi-vendor environment makes it crucial ... to have the proper mechanisms in place to ensure that service level agreements are in place and that key performance indicators are being met consistently,” Bhardwaj says.
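
One simple mechanism of this kind is a periodic check of measured availability against each provider's service level target. The Python sketch below is illustrative only: the function names, the 99.9% target, and the downtime figures are assumptions, not terms from any actual agreement.

```python
def measured_availability(total_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the period during which the service was available."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: float, downtime_minutes: float, target: float) -> bool:
    """True if measured availability is at or above the agreed target."""
    return measured_availability(total_minutes, downtime_minutes) >= target


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Hypothetical downtime per provider over a 30-day period, in minutes.
downtime_by_provider = {"provider-a": 30, "provider-b": 60}

for provider, downtime in downtime_by_provider.items():
    ok = meets_sla(MINUTES_IN_30_DAYS, downtime, target=0.999)
    print(provider, "meets target" if ok else "misses target")
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime, which is why 30 minutes passes and 60 minutes does not in this example.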

The cloud is maturing rapidly, but so are best practices and tools. “It's important to build a risk-balanced strategy and create cloud architectures that enable applications to benefit from the constant cloud evolution,” Ganta says.

About the Author(s)

John Edwards

Technology Journalist & Author

John Edwards is a veteran business technology journalist. His work has appeared in The New York Times, The Washington Post, and numerous business and technology publications, including Computerworld, CFO Magazine, IBM Data Management Magazine, RFID Journal, and Electronic Design. He has also written columns for The Economist's Business Intelligence Unit and PricewaterhouseCoopers' Communications Direct. John has authored several books on business technology topics. His work began appearing online as early as 1983. Throughout the 1980s and 90s, he wrote daily news and feature articles for both the CompuServe and Prodigy online services. His "Behind the Screens" commentaries made him the world's first known professional blogger.
