Chaos Engineering: Withstanding Turbulence in Software Production - InformationWeek


9/25/2020
07:00 AM
Manish Mistry, Chief Technology Officer, Infostretch

Chaos Engineering: Withstanding Turbulence in Software Production

Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers.

Although it is not new in industrial and manufacturing settings, chaos engineering is a relatively new discipline in digital engineering. It involves experimenting with software in production to better understand faults and build confidence in the system’s overall capability to withstand turbulence.

While chaos engineering principles have been gaining traction in recent years, clients and engineers are often (understandably) apprehensive because of the misconception that chaos engineering is all about deliberately breaking things. Terms like “blast radius” and “random terminations,” along with references to “chaos” or “storms” (Facebook’s name for it), don’t exactly help soothe their concerns.

Image: iQoncept - stock.adobe.com

However, most of the engineers who have spent a significant amount of time unravelling problems that weren’t discovered earlier appreciate the ‘Shift Left’ approach and value the ability to perform tests and fix bugs as early as possible in the digital lifecycle.

So, when an organization uncovers these issues earlier in the lifecycle, that must mean better-quality software and fewer late nights fixing unforeseen problems, right? If only that were true.

With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where chaos engineering comes in.

Traditional software testing verifies the code is doing what it’s supposed to (and continues to be an essential part of digital engineering). Chaos engineering, meanwhile, is a way of testing that the entire system is doing what you want it to, and code is just one part of the mix. To do this effectively, the system needs to be tested in production. This is because many other factors, like state, inputs, and how external systems behave, all play a part in the way a system runs.

This complexity has given rise to the idea of “dark debt,” referring to the unforeseen anomalies that happen in complex systems when different parts of the software and hardware interact with one another in ways that can’t be predicted. The term borrows from the concepts behind “technical debt” (IT) and “dark matter” (space) to suggest the inevitable, unseen complications that arise in complex systems. This is exactly what chaos engineering seeks to identify.

How that turbulence in production is managed is a critical part of the planning that needs to go into every experiment. Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers:

No surprises

The best approach -- at least, the one I advocate -- is to talk to co-workers, explain your plans, and hold off if you suspect the experiment will fail. (In that case, fix the weakness first.) Chaos engineering is no substitute for resiliency planning and patterns. Instead, organizations embarking on chaos engineering should carefully create hypotheses they wish to prove, considering how to limit their blast radius. The meticulously planned reality of chaos engineering is a far cry from how it was once described by Amazon’s Werner Vogels: “Break everything to see how your systems respond.”
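
To make the idea of a planned, hypothesis-driven experiment concrete, here is a minimal Python sketch. It is an illustration only, not a description of any particular tool: the Experiment record, its field names, and the run function are all hypothetical, and a real setup would plug in actual health checks and fault injectors.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical experiment record; every field name here is illustrative."""
    hypothesis: str                         # what we expect to remain true
    max_blast_radius: float                 # largest fraction of traffic we may affect
    steady_state: Callable[[], bool]        # returns True while the system looks healthy
    inject_fault: Callable[[float], None]   # applies the fault to a fraction of traffic
    rollback: Callable[[], None]            # removes the fault again


def run(experiment: Experiment, fraction: float) -> str:
    """Run one experiment and report whether the hypothesis held."""
    if fraction > experiment.max_blast_radius:
        return "refused: exceeds blast radius"
    if not experiment.steady_state():
        # If the system is already unhealthy, fix the weakness before experimenting.
        return "refused: system not in steady state"
    experiment.inject_fault(fraction)
    try:
        held = experiment.steady_state()
    finally:
        experiment.rollback()  # always remove the fault, even if the check raises
    return "hypothesis held" if held else "hypothesis refuted"
```

The point of the sketch is that the refusal paths come first: an experiment that would exceed its agreed blast radius, or that starts from an unhealthy system, never injects anything.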

Small is beautiful

Start small and limit the blast radius of your experiments. That includes taking into consideration when the experiment runs, and which departments and resources are available after the experiment runs. By now, I hope it is clear that when I talk about chaos engineering, it’s never about cutting a cable or unplugging a machine randomly to see what happens. The goal is to prove a hypothesis. Even when fault tolerance is within acceptable margins, there are always insights to be gained from examining how the system responded.
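
One way to encode the "when does it run, and who is around afterwards" question is a simple pre-flight check. The sketch below is an assumption-laden example: the may_run function, its default window of 10:00 to 15:00 on weekdays, and the responders_available flag are all invented for illustration, and each organization would choose its own rules.

```python
from datetime import datetime


def may_run(now: datetime, responders_available: bool,
            start_hour: int = 10, end_hour: int = 15) -> bool:
    """Allow an experiment only on weekdays, inside a daytime window,
    and only when the people who would respond are available."""
    is_weekday = now.weekday() < 5              # Monday is 0, Sunday is 6
    in_window = start_hour <= now.hour < end_hour
    return is_weekday and in_window and responders_available
```

For example, may_run(datetime(2020, 9, 25, 11, 0), True) allows a run on a Friday morning, while the same call for a Saturday, or with responders_available set to False, does not.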

The environment matters

If running experiments in a full production environment feels like a step too far into the abyss, that’s OK. For an organization taking its first steps in chaos engineering, production may be too risky. In this case, it should start in a different environment, but one that is as close to the production environment as possible. Quite simply, unless the environment is very similar, the findings will not be relevant enough to shed light on potential failures of the system.

Keep going

Software and systems are continuously being tweaked, so chaos engineering experiments should mirror this. It is not safe to assume that if a system responded to a fault injection test (FIT) in a particular way a month ago, the same holds true today. Many of these experiments can be automated, which enables engineers to focus on increasing the scope, intensity, and variety of tests.
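
As a rough illustration of automating repeat runs, the Python sketch below re-runs a set of experiments and flags any whose outcome differs from the previous run. The rerun_all function and the shape of its inputs (a name mapped to a callable that returns True when the hypothesis held) are hypothetical choices made for this example.

```python
from typing import Callable, Dict, List


def rerun_all(experiments: Dict[str, Callable[[], bool]],
              last_results: Dict[str, bool]) -> List[str]:
    """Re-run every experiment and return the names whose outcome changed.

    `last_results` is updated in place so the next run compares against this one.
    """
    changed = []
    for name, check in experiments.items():
        result = check()
        if name in last_results and last_results[name] != result:
            changed.append(name)
        last_results[name] = result
    return changed
```

An experiment that held last month and fails today shows up in the returned list, which is exactly the signal that last month's result can no longer be assumed.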

Expanding efforts

Once you’ve tested the system for one type of fault, it’s time to adapt the hypothesis. It may also be time to try other hypotheses. Organizations that embark on chaos engineering sometimes get “stage fright” after the initial few tests, especially if these have been fairly minor. The thinking goes a little like this: “I don’t think there’s a problem in service X, but it’s too big a deal to risk.” Wrong! Remember dark debt and the unforeseen anomalies inherent in complex systems? As Nora Jones from the original Netflix chaos engineering team has said, “Chaos engineering doesn’t cause problems. It reveals them.” Instead of getting cold feet when it matters most, organizations should absolutely tackle the big, important services, but do so in a careful, cautious way. When it comes to improving resiliency and confidence in systems, knowledge is power.
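
Tackling bigger services "in a careful, cautious way" can be expressed as a stepped ramp rather than a single leap. The Python sketch below is one possible shape for that, with assumed defaults: the ramp function, the 1% starting point, the doubling factor, and the 25% cap are illustrative numbers, not recommendations.

```python
from typing import List


def ramp(start_pct: int = 1, factor: int = 2, cap_pct: int = 25) -> List[int]:
    """Return the blast-radius steps (in percent) from a small start up to a cap.

    Each step would only be attempted after the previous one held.
    """
    if start_pct <= 0 or factor <= 1 or cap_pct < start_pct:
        raise ValueError("need start_pct > 0, factor > 1 and cap_pct >= start_pct")
    steps = []
    pct = start_pct
    while pct < cap_pct:
        steps.append(pct)
        pct *= factor
    steps.append(cap_pct)  # never go past the agreed cap
    return steps


print(ramp())  # [1, 2, 4, 8, 16, 25]
```

The cap is the important part: however the intermediate steps are chosen, the largest experiment is bounded by a limit agreed in advance.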

Manish Mistry is Chief Technology Officer of Infostretch, a Silicon Valley digital engineering professional services company.
