In today's always-on world, businesses need a way to manage and guarantee the reliability of their digital services. At PagerDuty, our customers rely on us to manage incidents when their systems are having trouble, and to do so on time, every time, day or night.
We're constantly seeking to improve how we address the following questions:
- How do I quickly identify who needs to be alerted when my system is down?
- What needs to happen for fast incident resolution?
- How can I surface the data needed to make decisions?
- Where can I improve moving forward?
In 2013, our engineering team grew tired of not finding issues early enough in production, so we implemented Failure Fridays, a weekly practice of injecting failure scenarios into our infrastructure. We were inspired by Netflix, a customer, which created an array of tools for controlled testing of infrastructure resilience and failure scenarios. Their objective: to proactively solve problems within their infrastructure in order to limit subscriber impact.
Our primary goal is to offer the most reliable, resilient, and durable incident management possible. We decided to see how quickly we could move from identifying an issue to resolving it. We outlined three goals:
- Understand common failure scenarios.
- Establish best practices for when things go wrong.
- Foster collaboration by bringing disparate parts of our organization together to solve problems – especially in the line of fire – using a controlled, intentional approach.
Failure Fridays have become a rallying cry for our organization, and are a natural extension of our mission to bring ease and intelligence into operations. Here are four things we've learned in the two-and-a-half years since we introduced this weekly ritual.
Keep Your Testing Fresh
Each week our team gets together and decides what new thing we're going to test. The failure scenario can start small, such as taking down a single process, host, or new service. Or, it can go big, such as taking down an entire data center.
We also test different classes of failures, ranging from hard shutdowns of services to connection timeouts and more. We prepare attacks in advance, but purposefully don't warn the team that owns the attacked service. The intent is to see how reliable the system is, and if something breaks, how quickly we can fix it.
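To make the difference between failure classes concrete, here is a minimal sketch in Python. It is not our actual tooling; `FaultInjector` and `lookup_on_call` are hypothetical names invented for illustration, and the failures are simulated inside the application rather than at the network or host level. It shows how a hard shutdown (the call is refused immediately) differs from a connection timeout (the caller waits, then gives up).

```python
import time


class FaultInjector:
    """Wraps a callable and injects one class of failure on demand.

    Hypothetical helper for illustration only.
    """

    def __init__(self, target, sleep=time.sleep):
        self.target = target
        self.sleep = sleep          # injectable so tests need not really wait
        self.mode = None            # None, "shutdown", or "timeout"
        self.delay_seconds = 0.0

    def inject(self, mode, delay_seconds=0.0):
        """Select the failure class to simulate; None restores normal behavior."""
        if mode not in (None, "shutdown", "timeout"):
            raise ValueError(f"unknown failure mode: {mode}")
        self.mode = mode
        self.delay_seconds = delay_seconds

    def call(self, *args, **kwargs):
        if self.mode == "shutdown":
            # Hard shutdown: the dependency refuses the connection outright.
            raise ConnectionRefusedError("injected: service is down")
        if self.mode == "timeout":
            # Connection timeout: the caller waits, then gives up.
            self.sleep(self.delay_seconds)
            raise TimeoutError("injected: request timed out")
        return self.target(*args, **kwargs)


def lookup_on_call(team):
    # Hypothetical stand-in for a real service call.
    return f"on-call engineer for {team}"


if __name__ == "__main__":
    injector = FaultInjector(lookup_on_call)
    print(injector.call("ops"))             # normal behavior
    injector.inject("shutdown")
    try:
        injector.call("ops")
    except ConnectionRefusedError as err:
        print(f"caller saw: {err}")
    injector.inject(None)                   # end of the exercise: restore service
    print(injector.call("ops"))
```

The two modes are worth testing separately because callers tend to handle them differently: a refused connection fails fast, while a timeout ties up the caller for the full wait before any fallback logic runs.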
We recently took down a third of our infrastructure to test all three of our offices. We set up war rooms and methodically went through the process of bringing it back up. During the course of the two-hour process, our customers were never affected, exactly as we had planned.
It was a proud moment because it brought us together across geographies and helped establish a clear response protocol for a significant event.
It's Not a Dress Rehearsal
On Failure Fridays, we're very intentional about which failure scenario we introduce. It's essential we conduct it in a live environment, not a pre-production or testing environment. Our business lives or dies by being at our best -- always on and available -- when our customers may not be, so it's paramount that we practice as if our jobs depended on it.
If you can manage it, ensure your tests are conducted in a simulation that's as close to real life as possible. That said, we carefully consider the design of our failure scenarios to ensure customers are never affected. That has to be part of your process and consideration.
"Gotchas" Make You Stronger
During Failure Fridays, we've uncovered four different bugs that had caused random cluster-wide lockups for years. Two bugs were located in ZooKeeper, and the other two were lurking in the Linux kernel. We also found out the hard way about the risks of running an old version of Apache ZooKeeper.
When we shut down an entire data center, the nodes that had been shut down wouldn't come back up. We ended up replacing the nodes on the fly, but couldn't do a rolling restart anymore due to the loss of quorum. We solved it by live patching some of our code to temporarily bypass ZooKeeper as we repaired the cluster.
Though it took many hours to complete, we did the repair without any customers noticing, and learned quickly about the importance of process and best practices. The practice of working through failure scenarios is as important as resolving the issue itself.
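For readers unfamiliar with why losing quorum rules out a rolling restart, here is a small sketch of the arithmetic. A ZooKeeper ensemble needs a strict majority of its servers to be up in order to serve requests. The function names and the five-node ensemble below are assumptions made for illustration; they are not our actual cluster layout.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size, healthy_nodes):
    """True if enough nodes are up for the ensemble to keep serving."""
    return healthy_nodes >= quorum_size(ensemble_size)


def can_restart_one_node(ensemble_size, healthy_nodes):
    """A rolling restart takes one healthy node offline at a time,
    so the nodes that remain must still form a majority."""
    return has_quorum(ensemble_size, healthy_nodes - 1)


if __name__ == "__main__":
    size = 5  # hypothetical ensemble size
    for healthy in range(size, 0, -1):
        print(
            f"{healthy}/{size} healthy: "
            f"quorum={has_quorum(size, healthy)}, "
            f"safe to restart one={can_restart_one_node(size, healthy)}"
        )
```

In this hypothetical five-node ensemble, a rolling restart is only safe while at least four nodes are healthy. Once enough nodes are gone that no majority exists, restarting nodes one at a time cannot help, because there is no working ensemble for a restarted node to rejoin.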
Hold a Blame-Free Post-Mortem
A post-mortem should be a blame-free, detailed description of exactly what went wrong, along with a list of steps needed to prevent similar incidents from recurring. We treat our post-mortems as a healthy opportunity to look back and learn together, with a focus on the actions we need to take next. If your team doesn't invest in taking actions to improve, then you're wasting time and effort, and leaving the business at risk.
Injecting failure and learning how to respond is crucial for any organization today that depends upon software infrastructure. Rituals like Failure Fridays help build a team culture based on trust and empathy. That way, when things go wrong -- and they will -- no one panics. You stay two steps ahead in terms of how you approach incidents, how you communicate, how you resolve them, and how you remain proactive with customers in moments of failure.