Although it is not new in industrial and manufacturing settings, chaos engineering is a relatively new discipline in digital engineering. It involves experimenting with software in production to better understand faults and build confidence in the system’s overall capability to withstand turbulence.
While chaos engineering principles have been gaining traction in recent years, clients and engineers are often (understandably) apprehensive because of the misconception that chaos engineering is all about deliberately breaking things. Terms like “blast radius” and “random terminations,” along with references to “chaos” or “storms” (Facebook’s name for it), don’t exactly help soothe their concerns.
However, most of the engineers who have spent a significant amount of time unravelling problems that weren’t discovered earlier appreciate the ‘Shift Left’ approach and value the ability to perform tests and fix bugs as early as possible in the digital lifecycle.
So, when an organization uncovers these issues earlier in the lifecycle, that must mean better-quality software and fewer late nights fixing unforeseen problems, right? If only that were true.
With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where chaos engineering comes in.
Traditional software testing verifies the code is doing what it’s supposed to (and continues to be an essential part of digital engineering). Chaos engineering, meanwhile, is a way of testing that the entire system is doing what you want it to, and code is just one part of the mix. To do this effectively, the system needs to be tested in production. This is because many other factors, like state, inputs, and how external systems behave, all play a part in the way a system runs.
This complexity has given rise to the idea of “dark debt,” referring to the unforeseen anomalies that happen in complex systems when different parts of the software and hardware interact with one another in ways that can’t be predicted. The term borrows from the concepts behind “technical debt” (IT) and “dark matter” (space) to suggest the inevitable, unseen complications that arise in complex systems. This is exactly what chaos engineering seeks to identify.
How that turbulence in production is managed is a critical part of the planning that needs to go into every experiment. Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers:
The best approach, or at least the one I advocate, is to talk to co-workers, explain your plans, and hold off on any experiment you already suspect will fail. (In that case, fix the weakness first.) Chaos engineering is no substitute for resiliency planning and patterns. Instead, organizations embarking on chaos engineering should carefully create the hypotheses they wish to prove and consider how to limit the blast radius of each experiment. The meticulously planned reality of chaos engineering is a far cry from how it was once described by Amazon’s Werner Vogels: “Break everything to see how your systems respond.”
Small is beautiful
Start small and limit the blast radius of your experiments. That includes taking into consideration when the experiment runs, and which departments and resources will be available once it has finished. By now, I hope it is clear that when I talk about chaos engineering, it’s never about cutting a cable or unplugging a machine randomly to see what happens. The goal is to prove a hypothesis. Even when fault tolerance is within acceptable margins, there are always insights to be gained from examining how the system responded.
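To make the idea of a hypothesis with a limited blast radius more concrete, here is a minimal sketch in Python. It is not taken from any particular chaos engineering tool: the Experiment class, the run_experiment function, and the inject_fault, remove_fault, and measure_error_rate callbacks are all hypothetical names chosen for illustration, and the thresholds are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical experiment definition; every field name is illustrative."""
    hypothesis: str
    blast_radius: float    # fraction of hosts allowed to receive the fault
    max_error_rate: float  # steady-state threshold the hypothesis asserts


def select_targets(hosts, blast_radius, rng):
    """Pick a bounded subset of hosts: at least one, never more than the cap."""
    count = max(1, int(len(hosts) * blast_radius))
    return rng.sample(hosts, count)


def run_experiment(exp, hosts, inject_fault, remove_fault,
                   measure_error_rate, rng=None):
    """Inject a fault into a small subset of hosts and check the hypothesis."""
    rng = rng or random.Random()

    # If the system is already outside its steady state, do not experiment;
    # fix the known weakness first.
    if measure_error_rate() > exp.max_error_rate:
        return {"ran": False, "targets": [], "hypothesis_held": None}

    targets = select_targets(hosts, exp.blast_radius, rng)
    injected = []
    try:
        for host in targets:
            inject_fault(host)
            injected.append(host)
        observed = measure_error_rate()
    finally:
        # Always roll back whatever was injected, even if something went wrong.
        for host in injected:
            remove_fault(host)

    return {
        "ran": True,
        "targets": targets,
        "hypothesis_held": observed <= exp.max_error_rate,
    }
```

In practice the callbacks would wrap whatever fault-injection and monitoring tooling an organization already uses. The point of the sketch is the shape: a stated hypothesis, a capped set of targets, a pre-check that skips the experiment when the system is already unhealthy, and a guaranteed rollback.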
The environment matters
If running experiments in a full production environment feels like a step too far into the abyss, that’s OK. For an organization taking its first steps in chaos engineering, production may be too risky. In this case, it should start in a different environment, but one that is as close to the production environment as possible. Quite simply, unless the environment is very similar, the findings will not be relevant enough to shed light on potential failures of the real system.
Software and systems are continuously being tweaked, so chaos engineering experiments should mirror this. It is not safe to assume that if a system responded to a fault injection test (FIT) in a particular way a month ago, the same holds true today. Many of these experiments can be automated, which enables engineers to focus on increasing the scope, intensity, and variety of tests.
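One way to automate this is to treat every experiment result as having an expiry date. The sketch below is an assumption about how that might look rather than a description of any specific tool: the experiments_due function, the seven-day window, and the experiment names are all hypothetical. A result is considered stale if the experiment has never run, if it last ran before the most recent deployment, or if it is older than the chosen window.

```python
from datetime import datetime, timedelta


def experiments_due(last_runs, last_deploy, now, max_age=timedelta(days=7)):
    """Return the names of experiments whose last result can no longer be trusted.

    last_runs maps an experiment name to the datetime of its last run,
    or to None if it has never run. The 7-day default is an assumption.
    """
    due = []
    for name, ran_at in last_runs.items():
        never_ran = ran_at is None
        predates_deploy = ran_at is not None and ran_at < last_deploy
        too_old = ran_at is not None and now - ran_at > max_age
        if never_ran or predates_deploy or too_old:
            due.append(name)
    return sorted(due)


if __name__ == "__main__":
    now = datetime(2024, 6, 30, 12, 0)
    runs = {
        "cache-node-loss": datetime(2024, 6, 29, 9, 0),   # recent, after deploy
        "db-latency-spike": datetime(2024, 6, 1, 9, 0),   # a month old
        "dns-failure": None,                              # never run
    }
    print(experiments_due(runs, last_deploy=datetime(2024, 6, 28), now=now))
```

A scheduler or pipeline step could call a check like this after each deployment and queue the stale experiments, leaving engineers free to design new ones.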
Once you’ve tested the system for one type of fault, it’s time to adapt the hypothesis. It may also be time to try other hypotheses. Organizations that embark on chaos engineering sometimes get “stage fright” after the initial few tests, especially if these have been fairly minor. The thinking goes a little like this: “I don’t think there’s a problem in service X, but it’s too big a deal to risk.” Wrong! Remember dark debt and the unforeseen anomalies inherent in complex systems? As Nora Jones from the original Netflix chaos engineering team has said, “Chaos engineering doesn’t cause problems. It reveals them.” Instead of getting cold feet when it matters most, organizations should absolutely tackle the big, important services, but do so in a careful, cautious way. When it comes to improving resiliency and confidence in systems, knowledge is power.
Manish Mistry is Chief Technology Officer of Infostretch, a Silicon Valley digital engineering professional services company.