Chaos Engineering: Withstanding Turbulence in Software Production - InformationWeek


Manish Mistry, Chief Technology Officer, Infostretch


Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers.

Although it is not new in industrial and manufacturing settings, chaos engineering is a relatively new discipline in digital engineering. It involves experimenting with software in production to better understand faults and build confidence in the system’s overall capability to withstand turbulence.

While chaos engineering principles have been gaining traction in recent years, clients and engineers are often (understandably) apprehensive because of the misconception that chaos engineering is all about deliberately breaking things. Additionally, the use of terms like “blast radius” or “random terminations” and references to “chaos” or “storms” (Facebook’s name for it) don’t exactly help soothe their concerns.

Image: iQoncept -

However, most engineers who have spent significant time unraveling problems that weren’t discovered earlier appreciate the ‘Shift Left’ approach and value the ability to perform tests and fix bugs as early as possible in the digital lifecycle.

So, when an organization uncovers these issues earlier in the lifecycle, that must mean better-quality software and fewer late nights fixing unforeseen problems, right? If only that were true.

With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where chaos engineering comes in.

Traditional software testing verifies the code is doing what it’s supposed to (and continues to be an essential part of digital engineering). Chaos engineering, meanwhile, is a way of testing that the entire system is doing what you want it to, and code is just one part of the mix. To do this effectively, the system needs to be tested in production. This is because many other factors, like state, inputs, and how external systems behave, all play a part in the way a system runs.

This complexity has given rise to the idea of “dark debt,” referring to the unforeseen anomalies that happen in complex systems when different parts of the software and hardware interact with one another in ways that can’t be predicted. The term borrows from the concepts behind “technical debt” (IT) and “dark matter” (space) to suggest the inevitable, unseen complications that arise in complex systems. This is exactly what chaos engineering seeks to identify.

How that turbulence in production is managed is a critical part of the planning that needs to go into every experiment. Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers:

No surprises

The best approach -- at least, the one I advocate -- is to talk to co-workers, explain your plans, and don't do anything if you suspect it will fail. (In that case, fix the weakness.) Chaos engineering is no substitute for resiliency planning and patterns. Instead, organizations embarking on chaos engineering should carefully create hypotheses they wish to prove, considering how to limit their blast radius. The meticulously planned reality of chaos engineering is a far cry from how it was once described by Amazon’s Werner Vogels: “Break everything to see how your systems respond.”
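The hypothesis-first workflow described above can be sketched in a few lines of code. This is a minimal illustration, not any particular chaos tool’s API; the service, the steady-state check, and the injected fault are all hypothetical stand-ins:

```python
def run_experiment(hypothesis, inject_fault, steady_state, rollback):
    """Run one chaos experiment: verify the steady state first,
    inject the fault, re-check the hypothesis, then always roll back."""
    if not steady_state():
        # The system is already unhealthy -- fix the weakness
        # instead of running the experiment.
        return "aborted: steady state not met"
    try:
        inject_fault()
        result = "confirmed" if steady_state() else "refuted"
    finally:
        rollback()  # limit the blast radius: always restore
    return f"{hypothesis}: {result}"

# Hypothetical service that should tolerate the loss of one replica.
replicas = {"a": True, "b": True, "c": True}
healthy = lambda: sum(replicas.values()) >= 2   # steady-state check
kill_one = lambda: replicas.update(b=False)     # the injected fault
restore = lambda: replicas.update(b=True)       # rollback

print(run_experiment("survives loss of one replica",
                     kill_one, healthy, restore))
```

The point of the structure is that the experiment either confirms or refutes a stated hypothesis, and the rollback runs no matter what -- the experiment is planned, bounded, and reversible rather than "break everything."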

Small is beautiful

Start small and limit the blast radius of your experiments. That includes taking into consideration when the experiment runs, and which departments and resources are available after the experiment runs. By now, I hope it is clear that when I talk about chaos engineering, it’s never about cutting a cable or unplugging a machine randomly to see what happens. The goal is to prove a hypothesis. Even when fault tolerance is within acceptable margins, there are always insights to be gained from examining how the system responded.
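One concrete way to keep the blast radius small is to cap how many targets an experiment may touch. The sketch below is illustrative only -- the host names, the 5% fraction, and the two-host cap are assumptions, not recommendations:

```python
import random

def pick_blast_radius(hosts, fraction=0.05, max_hosts=2):
    """Select a small, bounded subset of hosts for fault injection.
    Always picks at least one host, never more than max_hosts."""
    n = min(max_hosts, max(1, int(len(hosts) * fraction)))
    return random.sample(hosts, n)

hosts = [f"web-{i:02d}" for i in range(40)]  # hypothetical fleet
targets = pick_blast_radius(hosts)
print(len(targets))  # at most 2 of the 40 hosts are ever targeted
```

Hard-coding an upper bound like `max_hosts` means that even a misconfigured fraction cannot widen the experiment beyond what the team agreed to.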

The environment matters

If running experiments in a full production environment feels like a step too far into the abyss, that’s OK. For an organization’s baby steps in chaos engineering, production may be too risky. In this case, it should start in a different environment, but one that is as close to the production environment as possible. Quite simply, unless the environment is very similar, the findings will not be relevant enough to shed light on potential failures of the system.

Keep going

Software and systems are continuously being tweaked, so chaos engineering experiments should mirror this. It is not safe to assume that if a system responded to a fault injection test (FIT) in a particular way a month ago, the same holds true today. Many of these experiments can be automated, which enables engineers to focus on increasing the scope, intensity, and variety of tests.
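Automating a recurring fault injection test can be as simple as wrapping the injection and the measurement in a function and running it on a schedule. This is a hedged sketch: the latency probe, the 500 ms threshold, and the no-op injection are placeholders for whatever a real system would measure and inject:

```python
def latency_probe():
    # Stand-in for a real measurement of request latency (ms).
    return 120

def fit_run(inject, probe, threshold_ms=500):
    """One automated fault-injection test: inject, measure, judge."""
    inject()
    latency = probe()
    return {"latency_ms": latency, "passed": latency < threshold_ms}

# Re-run on a schedule (here, just twice for illustration) because
# last month's result says nothing about today's system.
results = [fit_run(lambda: None, latency_probe) for _ in range(2)]
print(all(r["passed"] for r in results))
```

Because each run returns a structured result, the history can be compared over time to spot regressions that a single one-off experiment would miss.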

Expanding efforts

Once you’ve tested the system for one type of fault, it’s time to adapt the hypothesis. It may also be time to try other hypotheses. Organizations that embark on chaos engineering sometimes get “stage fright” after the initial few tests, especially if these have been fairly minor. The thinking goes a little like this: “I don’t think there’s a problem in service X, but it’s too big a deal to risk.” Wrong! Remember dark debt and the unforeseen anomalies inherent in complex systems? As Nora Jones from the original Netflix chaos engineering team has said, “Chaos engineering doesn’t cause problems. It reveals them.” Instead of getting cold feet when it matters most, organizations should absolutely tackle the big, important services, but do so in a careful, cautious way. When it comes to improving resiliency and confidence in systems, knowledge is power.

Manish Mistry is Chief Technology Officer of Infostretch, a Silicon Valley digital engineering professional services company.
