As our IT infrastructures grow increasingly complex thanks to advanced technologies such as virtualization, cloud computing and software-defined networking (SDN), understanding the root cause of an IT outage becomes more difficult. But even more importantly, current troubleshooting techniques for finding the fault behind an outage focus solely on the technical side of the IT department. The truth is that many root causes go beyond technology and stem from poor policy and management decisions.
For anyone who has been involved in enterprise IT support, root cause analysis (RCA) is one of the first troubleshooting methodologies one needs to learn. Given the ever-increasing complexity of network infrastructures and distributed computing platforms, the visible symptom that end users experience is usually not the true cause of the problem. Instead, RCA teaches us to keep drilling down into the cause-and-effect chain to ultimately find the core issue at hand.
There are multiple RCA approaches that one can use. The problem I often see is that when people train on one of the many RCA tools and techniques, the root cause they identify is typically fixated on one of two areas. First, the root cause reported is often a hardware- or software-related error on the production infrastructure. Second, the root cause is attributed to human error caused by a misconfiguration or poor communication between team members.
In many cases, one of these two areas is indeed the true root cause of the problem. Once discovered, the cause can be documented and fixed, and the continuous improvement cycle starts over. But in some situations, finding the core of the problem requires a different perspective. Because RCA methods ask us to constantly drill down into a problem, we rarely take a step back and look at it from a big-picture perspective. That's precisely what needs to be done.
For example, if an outage was caused by a hardware failure somewhere on the network, was the true root cause faulty gear, or was the hardware past its life expectancy? If the latter, one must then ask why hardware that has exceeded its mean time between failures (MTBF) is still being relied upon in a production environment. Digging further, one may discover that IT support staff recommended long ago that this hardware be replaced – but that budget dollars never materialized.
Another common root cause that often goes overlooked deals with staffing within the IT department. IT administrators carry tremendous responsibility when it comes to the uptime of an enterprise network. With just a few keystrokes or clicks of a mouse, an admin can inadvertently bring an infrastructure to its knees. While it's easy to simply lay blame on the administrator who made the mistake, it's important to look more deeply at why the misstep was made in the first place. Did they have the proper training to competently perform their administration duties? Did the admin just complete a marathon work shift and was simply not thinking straight? In situations such as these, sound policy and proper IT management could have prevented the outage.
So, the next time you are reviewing an RCA report for an outage, make sure that the root cause indicated truly takes the troubleshooting process as far as the cause-and-effect chain can go. Despite the potentially uncomfortable situation of pointing to management faults as the root cause, you owe it to your organization to find and fix these types of problems to keep them from recurring. Only then does the RCA process perform as intended.