Minimizing Human Errors to Improve Data Center Reliability
A simple mistake made in a data center can have serious consequences. Here's how to detect and prevent common errors before they can damage schedules, budgets, revenue, and perhaps even lives.
To err is human, but that fact doesn't make it any easier to get a data center back on its feet after an innocent mistake interrupts vital services.
According to an Uptime Institute survey, human error accounts for approximately 70% of data center problems, leading to everything from systems downtime to costly security breaches. "While IT teams are vital to a successful data center, often human error stems from a lack of understanding of the equipment or simply failing to follow procedure," observed Said Tabet, a distinguished engineer in Dell Technologies' global CTO office. "Especially now, when teams are off-site or remote, standard processes and tasks are more likely to slip through the cracks."
The most dangerous person in any data center is the over-confident, uninformed, self-anointed "expert," noted John O'Connor, manager of technology infrastructure operations at news and media firm Bloomberg. "I much prefer working with people who know what they don't know, are highly skilled in critical thinking, and embrace a team dynamic where everyone teaches everyone."
Error reduction
Proper deployment practices can go a long way toward reducing data center mistakes. "Wiring needs to be organized and labeled properly," advised Joe McKenna, global CIO for IT consulting and services provider Syntax. "You need good documentation of the hardware in the rack."
Detailed diagrams help ensure that team members are accessing and working on the right equipment. "Double-checking everything before someone touches something helps reduce errors and outages," McKenna said.
Good practice when accessing data center systems remotely starts with closely examining the incident ticket or request to ensure that the technician works on the correct system. "If possible, for major actions like reboots and restarts, there should be a second set of eyes to review and approve what's about to be executed," McKenna suggested.
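As a rough illustration of the checks McKenna describes, the Python sketch below blocks an action when the system named in the ticket differs from the system about to be touched, or when a major action has no independent reviewer. The `ChangeRequest` fields, the `MAJOR_ACTIONS` set, and the `check_request` function are hypothetical names invented for this example, not part of any particular ticketing or automation product.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: which actions count as "major" is a local policy choice.
MAJOR_ACTIONS = {"reboot", "restart"}


@dataclass(frozen=True)
class ChangeRequest:
    ticket_id: str
    ticket_host: str            # system named in the incident ticket
    target_host: str            # system the technician is about to touch
    action: str
    requested_by: str
    approved_by: Optional[str] = None


def check_request(req: ChangeRequest) -> list[str]:
    """Return the reasons to block the action; an empty list means proceed."""
    problems = []

    # Check 1: the ticket and the session must refer to the same system.
    if req.ticket_host.strip().lower() != req.target_host.strip().lower():
        problems.append(
            f"ticket {req.ticket_id} names {req.ticket_host}, "
            f"not {req.target_host}"
        )

    # Check 2: major actions need a second set of eyes.
    if req.action in MAJOR_ACTIONS:
        if req.approved_by is None:
            problems.append("major action needs a second reviewer")
        elif req.approved_by == req.requested_by:
            problems.append("reviewer must differ from the requester")

    return problems


if __name__ == "__main__":
    req = ChangeRequest("INC-1", "db-01", "db-02", "reboot", "alice")
    for reason in check_request(req):
        print("BLOCKED:", reason)
```

A check like this is cheap to run before any remote session starts, and it returns every problem at once so the technician sees the full picture rather than fixing one issue at a time.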
Perhaps the most common human error is system misconfiguration, observed Amr Ahmed, an executive director at business and technology advisory firm EY Consulting. "More specifically, errors are made due to system patches or upgrades, such as upgrading storage firmware that's ... causing the storage platform to halt, or a backup power source misconfiguration," he noted.
The easiest way to manage errors is by deploying a strong change management discipline coupled with a solid understanding of the various data center environment interdependencies, Ahmed said. "These [tasks], alongside intelligent automation and orchestration, can help avoid a major cascade and, ultimately, a negative impact."
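One way to make interdependencies visible during change review is to walk a dependency map and list everything downstream of the component being changed. The sketch below is a minimal example under stated assumptions: the `dependents` map and the component names are made up for illustration, and a real environment would draw this data from its own inventory or configuration database.

```python
from collections import deque


def affected_by(change_target, dependents):
    """Return every component that directly or indirectly relies on change_target.

    `dependents` maps a component name to the components that depend on it
    directly. The walk is breadth-first and tolerates cycles.
    """
    seen = set()
    queue = deque([change_target])
    while queue:
        node = queue.popleft()
        for downstream in dependents.get(node, ()):
            if downstream != change_target and downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


if __name__ == "__main__":
    # Hypothetical dependency map for illustration only.
    dependents = {
        "storage-array": ["db-cluster", "backup-service"],
        "db-cluster": ["billing-app"],
    }
    impacted = affected_by("storage-array", dependents)
    print("Review before firmware upgrade:", sorted(impacted))
```

Listing the downstream systems before a patch or firmware upgrade gives reviewers a concrete picture of what a failed change could take down with it.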
Automation offers hope
Automating the tasks most prone to human error is an effective way to mitigate downtime and outage risks, Tabet said. "The use of AI-based methodologies to enhance staff training and efficiency helps improve productivity and accelerate proper deployment of new data center technologies," he added.
Yet even with the assistance of sophisticated control systems, there are frequently times when human intervention and on-site analysis and decision making are necessary, often immediately. "AI automation in data centers gives organizations powerful capabilities and insights, but without a team to manage the system and leverage those insights, organizations achieve less efficiency and optimization than projected," Tabet explained. "The right people need to either be hired or trained to manage the system."
Ahmed believes that the key to reducing the number of human mistakes is not just automation, but intelligent automation. "The modern data center and digital transformation era pose operation complexity and scalability challenges that hinder human operation and our ability to cope," he stated. The telemetry data and alerts that flood data center panels whenever an anomaly is detected are now almost impossible for humans to manage in real time. "There's clear interest and rise of artificial intelligence in operations (AIOps) adoption in the data center," Ahmed observed. "AI is a powerful enabling technology that helps people make better business and technical decisions."
The complexities inherent in managing a data center packed with racks full of IT equipment require careful planning when deploying data center systems, cabling, documentation, and validation, McKenna noted. "To successfully execute data center operations, it’s important to have well defined and tested procedures," he added. Regular personnel training is also important, as is the automation of standard procedures to reduce errors and improve speed and efficiencies. "Overall, [it] prepares folks for working in the data center environment with confidence and accuracy," McKenna said. "It reduces errors in the data center."
Takeaway
Despite best efforts and careful attention to detail, it's impossible to completely eliminate human errors within any complex socio-technical endeavor, O'Connor acknowledged. "This is well known in fields where lives are at risk, such as aviation or operating a nuclear power plant," he noted. "In data centers, where the pace of technology change is only accelerating, we strive for perfection even though we know it's unachievable."