Turn Failure Detection into a Team Sport
Here’s how Chaos GameDays and its spinoffs can enable enterprises to fortify their infrastructure resilience and detect failures before they occur.
Preventing IT infrastructure failure is serious business. So is Chaos GameDays, the somewhat whimsical name given to the series of “chaos engineering” exercises designed to detect failures before they occur.
Count me as one of Chaos GameDays’ many proponents. From an operational and business perspective, proactive failure detection is far more sensible than reactive failure response.
Played periodically under defined rules, Chaos GameDays is designed to simulate a wide range of scenarios, including attempts to hack into and break system components. This is done not just to predict system failure but also to build greater system resilience to prevent failure from ever occurring.
Think of it like a flu vaccine
As noted by the Gremlin Community, a good analogy for Chaos GameDays is a flu vaccine: injecting “a potentially harmful foreign body in order to prevent illness.”
Chaos GameDays is the gamification subset of Chaos Engineering, pioneered by Netflix circa 2010 just as the video-streaming company was transitioning to a distributed, cloud-based architecture. To protect these revolutionary yet extremely complex systems, Netflix -- soon joined by the world’s largest tech enterprises -- realized they needed new ways to predict failures in order to prevent them.
“If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most -- in the event of an unexpected outage,” Netflix wrote in its company blog soon after implementing the innovative approach. “The best way to avoid failure is to fail constantly.” And with so many more streaming services available today than a few years ago, Netflix certainly doesn’t want its existing customers to consider other options and stream elsewhere.
From there, the idea of Chaos GameDays was born, conceived by Orion Labs founder Jesse Robbins. His lightbulb moment occurred when he realized the best way to fix major failures was to create them -- and that gamifying the process would be a fun, team-oriented approach to develop crisis-preparedness frameworks that can maintain, protect and enhance an enterprise’s infrastructure.
GameDays or not, best practices remain the same
Time for a disclaimer: My company doesn’t engage in typical GameDays practices, but we do assemble DevOps teams that run similar types of infrastructure stress tests approximately every 15 weeks. These test runs are designed to mimic possible -- and sometimes even impossible -- hypothetical situations in order to determine how effectively our teams’ proposed solutions mitigate risk and prevent incidents, and how quickly our teams can respond when failure occurs.
Whether you follow the Chaos GameDays route or implement other team-oriented failure-detection exercises, following a few basic best practices will go a long way toward keeping your operations running optimally when it matters most. They include using AI-based data analysis to help identify whether certain combinations of incidents or recurring patterns of issues in each exercise point to specific disasters-in-waiting.
It’s also important to search for and identify points of failure, including gaps in personnel availability and readiness; to define keywords that describe each problem and how serious it is; and to refine your communication templates so that you aren’t wasting time composing one-off messages in an emergency.
Then, make sure every team member responds to questions like these to ensure that everybody has the same focus and objectives:
How would you respond to each incident?
What are the predicted times to resolution?
Do you understand our existing disaster-response policies?
Do we have communication messaging templates ready so that we aren’t wasting time in an emergency?
What should we include in our playbook for those responding to incidents?
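To make the points about severity keywords and ready-made communication templates concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the severity names and descriptions, the template wording, and the function and service names are hypothetical, not drawn from any particular ITSM tool or from my company’s playbook.

```python
from enum import Enum
from string import Template


class Severity(Enum):
    """Hypothetical severity keywords; choose names and meanings that suit your team."""
    SEV1 = "critical: customer-facing outage"
    SEV2 = "major: degraded service"
    SEV3 = "minor: limited impact, workaround available"


# Hypothetical status-update template; placeholders are filled in at incident time.
STATUS_UPDATE = Template(
    "[$severity] $service incident: $summary. "
    "Owner: $owner. Next update in $next_update_minutes minutes."
)


def render_status_update(severity, service, summary, owner, next_update_minutes=30):
    """Return a ready-to-send status message for the given incident details."""
    return STATUS_UPDATE.substitute(
        severity=severity.name,
        service=service,
        summary=summary,
        owner=owner,
        next_update_minutes=next_update_minutes,
    )


# Example drill: the service name and summary below are made up.
message = render_status_update(
    Severity.SEV2,
    "checkout-api",
    "elevated error rates during GameDay drill",
    "on-call SRE",
)
print(message)
```

Agreeing on the keywords and the wording during an exercise, rather than during a real outage, is the point: responders only fill in the blanks.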
All enterprises -- particularly those whose survival and success depend on delivering exceptional customer experiences -- require hyper-resilient infrastructures and the appropriate IT service management (ITSM) tools that can sift through, tag and route issues. The most successful businesses, though, know that diving into the chaos of incident-prediction and incident-prevention is critical to staying ahead of the game.
About the Author
Prasad Ramakrishnan is CIO of Freshworks, a customer engagement software company. With over 25 years of experience in the IT sector, Ramakrishnan manages the business systems, business intelligence and global IT infrastructure of Freshworks. Over the last decade he championed the transition to a cloud and SaaS-based infrastructure at companies like Veeva Systems, HotChalk, Bodhtree, Infoblox and FormFactor.