IT Outages, Who's Really at Fault? - InformationWeek



Systems do go down, and sometimes the cause seems obvious, but it may be too obvious. Employ root cause analysis methods to find the real cause of failure.

As our IT infrastructures grow increasingly complex thanks to advanced technologies such as virtualization, cloud computing and software-defined networking (SDN), understanding the root cause of an IT outage becomes more difficult. More importantly, current troubleshooting techniques for finding the fault behind an outage focus solely on the technical side of the IT department. In truth, many root causes go beyond technology and stem from poor policy and management decisions.

For anyone who has been involved in enterprise IT support, root cause analysis (RCA) is one of the first troubleshooting methodologies one needs to learn. Thanks to the ever-increasing complexities of network infrastructures and distributed computing platforms, the visible symptom that end users experience is usually not the true cause of the problem. Instead, RCA teaches us to continue drilling down into the cause-and-effect chain to ultimately find the core issue at hand.

There are multiple RCA approaches that one can use. The problem I often see is that training on the many RCA tools and techniques typically fixates on one of two kinds of root cause. First, the root cause reported is often a hardware- or software-related error on the production infrastructure. Second, the root cause is attributed to human error, such as a misconfiguration or poor communication between team members.

In many cases, one of these two areas is indeed the true root cause of the problem. Once discovered, the cause can be documented and fixed, and the continuous improvement cycle starts over. But in some situations, finding the core of the problem requires a different perspective. Because RCA methods ask us to constantly drill down into a problem, we never take a step back and look at it from a big-picture perspective. That’s precisely what needs to be done.


For example, if an outage was caused by a hardware failure somewhere on the network, was the true root cause faulty gear, or was the hardware past its life expectancy? If the latter, one must then consider why hardware that has outlived its mean time between failures (MTBF) is still being relied upon in a production environment. Keep digging, and you may discover that IT support staff recommended replacing this hardware long ago, but that the budget dollars never materialized.
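
For readers who think in code, the hardware example can be sketched as a simple cause-and-effect chain. This is only an illustration, not part of any formal RCA tool: the `Cause` class, the `drill_down` function and the three category labels ("technical", "human", "management") are hypothetical names chosen for this sketch, and the chain entries simply restate the scenario described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cause:
    """One link in a cause-and-effect chain (hypothetical structure)."""
    description: str
    category: str  # illustrative labels: "technical", "human", "management"
    caused_by: Optional["Cause"] = None


def drill_down(symptom: Cause) -> List[Cause]:
    """Follow caused_by links from the visible symptom to the deepest known cause."""
    chain = []
    node: Optional[Cause] = symptom
    while node is not None:
        chain.append(node)
        node = node.caused_by
    return chain


# The hardware scenario from the text, deepest cause first.
budget = Cause("Budget dollars for replacement never materialized", "management")
ignored = Cause("IT staff recommended replacement long ago", "management", budget)
aged = Cause("Hardware was past its life expectancy", "technical", ignored)
failure = Cause("Hardware failure on the network", "technical", aged)
outage = Cause("Outage seen by end users", "technical", failure)

chain = drill_down(outage)
root = chain[-1]

# Print the chain, indenting one level per step down.
for depth, cause in enumerate(chain):
    print(f"{'  ' * depth}[{cause.category}] {cause.description}")
```

Stopping at the second or third link would put "technical" in the report as the root cause. Only by following the chain to its end does the category change to "management", which is the step back that the drill-down habit tends to skip.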

Another common root cause that often goes overlooked involves staffing within the IT department. IT administrators have tremendous responsibilities when it comes to the uptime of an enterprise network. With just a few keystrokes or clicks of a mouse, an admin can inadvertently bring an infrastructure to its knees. While it’s often easy to simply lay blame on the administrator who made the mistake, it’s important to look more deeply at why the misstep was made in the first place. Do they have the proper training to competently perform their administration duties? Had the admin just completed a marathon work shift and was simply not thinking straight? In situations such as these, sound policy and proper IT management could have prevented the outage.

So, the next time you are reviewing an RCA report for an outage, make sure that the root cause indicated truly takes the troubleshooting process as far as the cause-and-effect chain can go. Despite the potentially uncomfortable situation of pointing out faults in management as the root cause, you owe it to your organization to find and fix these types of problems to keep them from occurring repeatedly. Only then does the RCA process perform the way it was intended.

Andrew has well over a decade of enterprise networking under his belt through his consulting practice, which specializes in enterprise network architectures and datacenter build-outs, and prior experience at organizations such as State Farm Insurance, United Airlines and the ...
Charlie Babcock,
User Rank: Author
5/16/2017 | 3:56:40 PM
Look beyond the specifics
Andrew's column can be summed up as: look for systemic causes after spotting specific root causes in your analysis.