How to Keep an IT Mistake From Leading to a Cascade of Errors
Don’t let your IT organization fall victim to a runaway “error string.” Preventative planning can help you stop a possible cavalcade of flaws.
An “error string” describes a software scenario in which one negative result leads to another incorrect outcome, which then leads to a runaway string of inaccuracies. It’s a problem that more than a few IT leaders have already experienced.
“When one mistake can make it past your checks and balances system, it can easily lead to more mistakes being made,” warns Troy Portillo, director of operations for Studypool, an online learning platform.
Whenever a mistake is detected, the best policy is always to fix the error immediately, Portillo advises. “The longer you wait once a mistake has been brought to your attention, the higher the probability it will lead to a cascade of mistakes,” he says. “No system is perfect but acting quickly once you’re alerted to an issue is the best way to avoid future mistakes stemming from the original error.”
Root Causes
IT functions something like a circuit in which all the breakers in a line must be shut before a current can flow through it, explains Tommy Gardner, CTO at HP Federal. “Often, multiple errors have to occur sequentially prior to … a dramatic loss of data,” he says.
A cascading series of errors can occur when essential preventative steps are overlooked. “For example, continuity of operations can cause the IT team to focus on a perceived threat,” Gardner says. “With the time and attention of the staff diverted, perhaps the routine updates aren’t done.”
Perils of Poor Design
Error strings can often be traced back to poor system design. “If a system isn’t designed to properly handle errors between subsystems, a single mistake can cause the system to fail, leading to a series of errors,” explains Tom Chisholm, principal training solutions engineer at software developer Perforce.
“For example, if a web application doesn’t handle database connection errors correctly, a single error can cause the entire application to crash if other parts of the system aren’t resilient to such a failure,” Chisholm says. Poorly written database queries can also initiate a failure chain. “This can lead to database deadlocks, which in turn can lead to cascading failures in the frontend,” he notes.
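As a rough illustration of the database-connection case Chisholm describes, the sketch below catches the connection or query error at the subsystem boundary and returns a degraded result instead of letting the exception crash the caller. It uses Python's standard `sqlite3` module; the `products` table and the `fetch_product_names` function are hypothetical names chosen for this example, not anything from Perforce or the systems discussed here.

```python
import logging
import sqlite3

log = logging.getLogger(__name__)


def fetch_product_names(db_path):
    """Return (names, degraded).

    degraded is True when the database could not be reached or queried,
    so the caller can show a fallback page instead of crashing.
    The 'products' table is a hypothetical example schema.
    """
    try:
        conn = sqlite3.connect(db_path, timeout=1.0)
        try:
            rows = conn.execute("SELECT name FROM products").fetchall()
        finally:
            # Always release the connection, even if the query failed.
            conn.close()
        return [row[0] for row in rows], False
    except sqlite3.Error as exc:
        # Contain the failure here: log it and hand back an empty,
        # clearly flagged result rather than propagating the error.
        log.warning("database unavailable, serving fallback: %s", exc)
        return [], True


if __name__ == "__main__":
    names, degraded = fetch_product_names("shop.db")
    if degraded:
        print("Catalog temporarily unavailable")
    else:
        print(names)
```

The point of returning an explicit `degraded` flag is that the rest of the application can keep running and decide how to respond, so one failed connection does not become the first link in an error string.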
Preventative Measures
Gardner believes that building a proactive defense is the best way to prevent a potential catastrophe. “IT teams should think through multiple problems at once, understand the limits and constraints of their system, and build structured protocols to adhere to on an ongoing basis,” he says. Gardner suggests training team members on IT best practices so small user errors don’t morph into larger problems.
Gardner also advises independently testing software, as well as scheduling regular code updates, to ensure that both open-source and proprietary software aren’t harboring any vulnerable soft spots.
Chisholm agrees. He suggests thoroughly testing software before launching it. “Find the single points of failure in a controlled environment before you find them in production,” he recommends.
Meanwhile, the best way to prevent or recover from cascading failures is to build fault tolerance into every subsystem and to periodically test for fault tolerance, Chisholm says.
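One widely used way to build fault tolerance between subsystems is the circuit-breaker pattern. Chisholm does not name a specific technique, so the following is only a minimal sketch under stated assumptions: the `CircuitBreaker` class, its thresholds, and the `CircuitOpenError` exception are all hypothetical names invented for this example. After a set number of consecutive failures, the breaker stops calling the failing dependency for a cool-down period, then allows a single trial call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call. A single further
            # failure will reopen the breaker immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success resets the failure count.
        self.failures = 0
        return result
```

Because the clock is passed in, the breaker's behavior can be exercised in a test without waiting for real time to pass, which fits the advice to test fault tolerance periodically rather than assume it works.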
Chisholm also recommends using monitoring tools to keep an eye on system health and performance. “Be proactive in addressing any issues that arise,” he states. “Additionally, regularly reviewing logs and metrics can help identify potential problems before they become major issues.”
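As one simple, hypothetical example of turning logs and metrics into an early warning, the sketch below tracks the outcome of the most recent requests and flags when the error rate crosses a threshold. The `ErrorRateMonitor` class, the window size, and the 5% threshold are assumptions made for illustration; real monitoring tools offer far richer alerting.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent request outcomes and flag an elevated error rate."""

    def __init__(self, window=100, threshold=0.05):
        # Keep only the most recent `window` outcomes (True = success).
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    def error_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        # Alert only when the rate strictly exceeds the threshold.
        return self.error_rate() > self.threshold
```

A caller would invoke `record()` after each request and check `should_alert()` on a schedule, so a rising error rate is noticed while it is still one problem rather than several.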
Breaking the String
Despite careful planning, a project can still fall victim to sequential errors. “System knowledge is the best way to avoid a string of errors,” Gardner counsels. Creating a break anywhere in the process can stop an error string in its tracks.
Practice makes perfect -- usually. Gardner suggests presenting IT team members with intentionally wild error string scenarios, and then challenging them to create effective ways of stopping them. “This can be a fun tabletop exercise and, if you can create collaboration across the security and IT teams, you’re better positioned to avoid vulnerabilities and product functionality issues,” he says.
Keep Calm and Carry On
Escaping from a cascading string of failures requires an unemotional and reasoned response. “Often, by the time you have a cascade, it’s too late to handle it quickly and gracefully,” Chisholm observes. “Moreover, you’re likely to be operating in panic mode, and rash attempts to halt the cascade may end up just making it worse.”
Chisholm’s advice: Step away, get some coffee or another beverage of choice, and breathe deeply. “Then slowly and rationally evaluate the cause of the cascading failures.”
Perhaps most important, once operations have returned to normal, is investigating exactly what went wrong. “Analyze the failure, and lack of fault tolerance that led to disaster, and update your systems so similar failures won’t take you down again,” Chisholm suggests.