Many of the challenges surrounding consistent software reliability are due to the complexity of modern cloud environments.

Nathan Eddy, Freelance Writer

July 30, 2023

5 Min Read

The expanding definition of reliability has brought a huge increase in the complexity and effort of developing and operating software.

Traditionally, reliability focused on functional quality -- that the software works correctly -- while today's users expect software to be performant, secure, and always available. This, coupled with a massive increase in devices and running environments, makes meeting all user expectations a huge challenge.

Engineers are also expected to work with reliability as an important core factor. Traditionally, development teams produce features as primarily defined by an internal product team.

Now the requirements have expanded to include users' expectations of reliability -- so developers must treat these requirements as an equal part of their work. This affects time to completion and testing complexity and, more importantly, creates a greater dependence on good architecture.

Stu Hume, vice president of product engineering at Kobiton, explains that these challenges drive a focus on testing and a huge increase in time spent developing observability and monitoring systems for operating software.

“We can only deliver on the expectations of users if we can understand how our system operates, which means a large investment in tooling to monitor and auto respond to anomalous situations,” he says.

This, in turn, raises the expectation that IT professionals have a much broader set of knowledge.

Developers and quality assurance professionals must address many more requirements during development -- now covering concerns such as scale, performance, and usage impact.

Developers Need Access to Data

Ben Sigelman, general manager of ServiceNow cloud observability and co-founder of Lightstep and OpenTelemetry, says developers need actionable insights in their normal workflows, as they have different needs than IT professionals.

“Developers need to fix bugs, address performance regressions, build features, and get deep insights about particular service or feature level interactions in production,” he says.

That means they need access to necessary data in views, graphs, and reports that make a difference to their workflows.

“However, this data must be integrated and aligned with IT operators to ensure teams are working across the same data sets,” he says.

Sigelman says IT operations is a crucial part of an organization’s overall reliability and quality posture.

“By working with developers to connect cloud-native systems such as Kubernetes with traditional IT applications and systems of record, the entire organization can benefit from a centralized data and workflow management pane,” he says.

From this point, event and change management can be combined with observability instruments, such as service level objectives, to provide not only a single view across the entire IT estate, but to demonstrate the value of reliability to the entire organization.
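The service level objectives mentioned above can be made concrete with a short sketch. The function below is a minimal, hypothetical illustration of an SLO error-budget check (the 99.9% target and the request counts are invented for the example, not drawn from any vendor's tooling):

```python
# Minimal sketch of a service-level-objective (SLO) error-budget check.
# The SLO target and request counts are illustrative assumptions.

def error_budget_remaining(total_requests, failed_requests, slo_target=0.999):
    """Return the fraction of the SLO error budget still unspent (0.0-1.0)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# e.g., 1,000,000 requests with 400 failures against a 99.9% target:
# the budget allows 1,000 failures, so 60% of the budget remains.
remaining = error_budget_remaining(1_000_000, 400)
```

Tracking a number like `remaining` over time is one way to turn raw telemetry into the single, organization-wide view of reliability Sigelman describes.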

Finding the Signal in the Noise

Aakash Shah, CTO and co-founder of oak9, explains “finding signal in the noise” is the biggest issue regarding software reliability today.

“We have a lot of tooling that provides a ton of visibility but connecting the dots to understand the issue is a problem,” he says. “We are building large complex systems with interdependent parts and understanding dependencies and correlating telemetry across components becomes quite hard.”
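One common way to "connect the dots" Shah describes is to correlate telemetry from independent components by a shared trace ID, in the spirit of OpenTelemetry-style tracing. The log records below are invented purely for illustration:

```python
# Sketch of correlating telemetry across components by trace ID.
# The services, events, and IDs are invented for illustration.
from collections import defaultdict

logs = [
    {"trace_id": "a1", "service": "gateway", "event": "request received"},
    {"trace_id": "a1", "service": "orders",  "event": "db timeout"},
    {"trace_id": "b2", "service": "gateway", "event": "request received"},
    {"trace_id": "a1", "service": "gateway", "event": "500 returned"},
]

# Group records from different services that belong to the same request.
by_trace = defaultdict(list)
for record in logs:
    by_trace[record["trace_id"]].append(record)

# Trace "a1" now shows the full path of one failing request across services.
failing_trace = [r["event"] for r in by_trace["a1"]]
```

With the records grouped per trace, the dependency between the gateway's 500 response and the orders service's database timeout becomes visible in one place.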

He points out developers are taking ownership of the entire software development lifecycle (SDLC) and good workflows for observability help create feedback loops that lead to more informed choices as developers build software.

Shah notes that integrating observability workflows into the early stages of the development lifecycle can help developers identify reliability issues early. For example, a developer who can say, "If the request is taking longer than it typically does, notify me so I can dig into it," can catch the issue in non-production environments and prevent it from ever reaching production.
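The "notify me if a request takes longer than it typically does" idea can be sketched as a rolling baseline check. The class below is a hedged illustration -- the window size, sigma threshold, and latencies are assumptions, not any particular product's implementation:

```python
# Sketch of latency anomaly detection against a rolling baseline:
# flag a sample that exceeds the recent mean by N standard deviations.
# Window size, sigma count, and latencies are illustrative assumptions.
import statistics
from collections import deque

class LatencyWatcher:
    def __init__(self, window=100, sigmas=3.0):
        self.samples = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, latency_ms):
        """Record a latency sample; return True if it is anomalously slow."""
        anomalous = False
        if len(self.samples) >= 10:  # require a minimal baseline first
            mean = statistics.mean(self.samples)
            stdev = statistics.pstdev(self.samples)
            anomalous = latency_ms > mean + self.sigmas * stdev
        self.samples.append(latency_ms)
        return anomalous

watcher = LatencyWatcher()
for ms in [20, 22, 19, 21, 20, 23, 18, 22, 21, 20]:
    watcher.observe(ms)       # build a ~20 ms baseline
slow = watcher.observe(500)   # a 500 ms request stands out from the baseline
```

In a real pipeline, a `True` result would feed an alerting channel rather than a local variable, but the feedback loop is the same: the developer learns about the slow request before a user reports it.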

Hume points out that in any operational environment, things will go wrong. “The goal is to detect and react to these before there is a material impact on the business or users,” he says. “The only way to do this is to have a deep understanding of what is happening in the complex systems that are behind every application, website or mobile system.”

From his perspective, observability and rapid detection of anomalous situations are core to preventing adverse impacts on the business.

“The tools, data and metrics are core to developers having this insight to see the impact of their changes on the full system and business,” he adds. “This allows them to respond to issues before they create negative situations.”

Reliability Starts with Good Design

“We use a lot of telemetry and metrics to help us operate a complex technical environment,” Hume explains. “We have developed application monitoring that helps us understand how information transmission, network connectivity and code performance impact the user experience.”

Given how easily users can see performance problems in the system, better systems must be developed that predict issues before users notice them.

“This is where we have used automated error rate analysis, and hardware state change monitoring to detect and announce issues internally before they show up in our user metrics,” he notes.
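The automated error-rate analysis Hume mentions can be illustrated with a sliding-window monitor. This is a minimal sketch under assumed parameters (window size, threshold), not a description of Kobiton's actual system:

```python
# Sketch of automated error-rate analysis: alert internally when the
# error rate over a sliding window of requests crosses a threshold,
# before the issue shows up in user-facing metrics.
# Window size and threshold are illustrative assumptions.
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=1000, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # 0 = success, 1 = error
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome; return True if the rate breaches the threshold."""
        self.outcomes.append(0 if ok else 1)
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold

monitor = ErrorRateMonitor(window=100, threshold=0.05)
for _ in range(95):
    monitor.record(True)                  # healthy traffic
alerts = [monitor.record(False) for _ in range(10)]  # a burst of failures
```

The monitor stays quiet through isolated failures and fires only once the burst pushes the windowed rate past 5%, which is what lets the team announce the issue internally before users feel it.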

Shah explains software reliability starts with good design, which means following mature SDLC processes is critical to ensuring reliability.

“For example, two-person peer reviews can go a long way to ensuring software quality,” he says.

In addition, having the right tooling that empowers developers to operate autonomously and catch issues early in the lifecycle can not only improve reliability but also drive down the overall cost of reliability.

About the Author(s)

Nathan Eddy

Freelance Writer

Nathan Eddy is a freelance writer for InformationWeek. He has written for Popular Mechanics, Sales & Marketing Management Magazine, FierceMarkets, and CRN, among others. In 2012 he made his first documentary film, The Absent Column. He currently lives in Berlin.
