Business Continuity: To Err Is Human, To Plan Is Divine
Although disasters make headlines, up to 80% of all IT outages are caused by human error. To defend against downtime or service interruptions, organizations need to maintain strong business continuity plans.
The term "business continuity" today conjures images as varied as flooded data centers, cascading power outages, and waves of cyber attacks. But the real reason to plan ahead for a business disruption is likely to be much more mundane: Some well-intentioned system administrator makes an ill-advised change to a server in your data center, causing all hell to break loose.
Up to 80% of all IT outages are caused by improper changes to the IT environment, Bob Vieraitis, VP of marketing for change control software vendor Solidcore Systems, told InformationWeek. And this is only going to get worse in increasingly complex IT environments where databases, servers, desktops, and other systems are managed by different groups within a given company.
"People who own the OS and servers are trying to keep them up and running," Vieraitis said. "The businesses on the other hand, if they own the application, they have new features that they have to get out there to be competitive in the marketplace, and that obviously involves change."
The emergence of virtual servers running on physical servers further adds to this complexity. "In a virtual environment, you no longer can tell where the OS is running," which makes it more difficult to determine where a change should be made and the potential impact of that change on the rest of the system, said Bill Lapcevic, Solidcore's VP of alliances. He added, "You become one step further away from understanding how change will affect your apps and environment."
Of course, given the nature of their business, Vieraitis and Lapcevic have every reason to believe that growing complexity will make it harder to manage change and avoid downtime. Nonetheless, Solidcore is also in a position to see the impact that ill-advised changes have on its customers' IT environments.
One example came when a system administrator at WebEx Communications tweaked a server and took down service to some of the company's customers. This was a few years ago, prior to Cisco's plans to buy the provider of on-demand Web collaboration apps. The system administrator identified the need to make a change to a file on one of the company's servers. As soon as the change was made, however, the server went offline, interrupting service to some of WebEx's customers. "Customers were dropped and reconnected, so that impacted our availability numbers," said Randy Barr, chief information security officer for WebEx, which operates about 2,000 servers across seven data centers.
WebEx's data center operations center was the first to see the alert. "As soon as something turns red, they set up a conference line," Barr said. The first rule of troubleshooting is to look at any changes made in the IT environment. It didn't take long to find the problem, since the system administrator who made the change saw the alert and admitted the problem to operations center staff, who re-routed traffic from that server to another server in the cluster.
At the time, system administrators had to inform WebEx's security team as well as the company's change-control committee whenever they wanted to make a change to the company's systems, but there was no fail-safe way to block ad hoc changes that ultimately would prove detrimental to the IT environment.
Today, Solidcore's S3 Control software allows changes to be made to WebEx's systems only when such changes have an approved change ticket issued by the company's BMC Remedy Change Management application. If there's no ticket, then changes cannot be made to the system. This is an important function because the leading cause of WebEx's systems being unavailable is changes applied to the company's production environment.
To defend against downtime or service interruptions caused by changes to applications or IT systems, organizations need to set policies that dictate what changes are permitted, who's allowed to make these changes, and when these changes can be made. Further, these changes must be tested and approved before the changed systems are put back online.
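As a rough illustration of such a policy, the following Python sketch gates a change on an approved, tested ticket that names what may be changed, who may change it, and when. It is a minimal sketch under stated assumptions, not how Solidcore's S3 Control or BMC Remedy actually work: the ChangeTicket class, its fields, and the is_change_allowed function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChangeTicket:
    """Hypothetical approved-change record; every field name is illustrative."""
    ticket_id: str
    approved: bool           # signed off by the change-control process
    tested: bool             # change was tested before going to production
    assignee: str            # who is allowed to make the change
    target: str              # what system is allowed to be changed
    window_start: datetime   # when the change may begin
    window_end: datetime     # when the change must be finished


def is_change_allowed(
    ticket: Optional[ChangeTicket],
    user: str,
    target: str,
    now: datetime,
) -> Tuple[bool, str]:
    """Return (allowed, reason). With no ticket, no change is permitted."""
    if ticket is None:
        return False, "no change ticket"
    if not ticket.approved:
        return False, "ticket not approved"
    if not ticket.tested:
        return False, "change not tested"
    if user != ticket.assignee:
        return False, "user not authorized by ticket"
    if target != ticket.target:
        return False, "target not covered by ticket"
    if not (ticket.window_start <= now <= ticket.window_end):
        return False, "outside approved change window"
    return True, "ok"
```

A caller would look up the ticket for a requested change, pass it in along with the requesting user, the target system, and the current time, and refuse the change unless the first element of the result is True. Returning a reason alongside the decision keeps a record of why a change was blocked, which helps when reviewing an outage afterward.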
But not all companies see business continuity planning as a top priority. In fact, about 30% of the 1,000 U.S. IT executives surveyed by AT&T for the company's annual business continuity and disaster recovery preparedness survey released in May stated that it was "not a priority." A quarter of the executives surveyed said they don't even have a business continuity plan in place. Among the reasons given for viewing business continuity as a low priority are that other issues take higher priority, that the probability of a disaster causing business disruption is small, and that business continuity planning is too expensive.
The AT&T survey also indicated that, while 57% of companies have had their business continuity plans updated in the past 12 months, only 41% have had the plans tested during the same time period.
Testing is critical to ensuring smooth failover during an emergency, particularly for companies that deliver software as a service. "The one thing that everyone should think through: you have to test your continuity plan and update it," Barr said. He added, "Some folks don't realize that if you work in a building, you can contact building management to help with planning for an incident." Nothing compares to actually putting a business continuity plan through its paces.