The Right System Architecture Will Reduce Software Failures

Companies can minimize software program failures by following these best practices.

Vigneswaran Kennady, Senior Manager, Software Engineering, Capital One

September 23, 2022

Microservice architecture is the building block most often used when creating software applications. It breaks programs into smaller modules, each focusing on a different function of the application being constructed. It features loosely coupled software components that are designed to be independent, automatically deployable, and cohesive. Individual services can be easier to manage, although the high number of services within that structure can make end-to-end troubleshooting and debugging difficult. On the other hand, once a problem is traced to a particular service, the fault is easier to isolate.

This is a change from a monolithic architecture, where the entire application is single-tiered. A monolithic architecture combines multiple functions into one large, complex program without separate modules, with everything embedded within the same code. That structure can be more difficult to manage over time. One failed component in a monolithic design can knock down the entire “stack.”

The Impetus for Change

Modules that integrate with one another are well suited to use cases such as a payment management system, which involves multiple links. Customer information, for example, is stored and connected to other services within the software program. The microservice architecture model leaves each module with its own unique function, making it easier to identify where a software bug is located, and then to isolate and fix it. Within a monolithic structure, testing for a single issue can be more difficult because the faulty code is connected to the rest of the code within that program. That is why many companies are moving away from the monolithic architecture (still used for simpler use cases) to microservice modules.

A microservice failure can often be isolated to just that one service, avoiding the cascading failures, known as the “ripple effect,” that could cause an application to crash. Isolation is not automatic, however. Because each module is connected to others under the microservice approach, a failure in one module can still affect others in the chain. This means that before a software application is released, it should be function tested and load tested repeatedly to minimize any downtime.
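
As a rough illustration of keeping a failure inside one service, the sketch below wraps a call to a dependency so that an error produces a fallback value instead of crashing the whole request. It is a minimal example under stated assumptions, not a pattern the article prescribes: the helper `call_with_fallback` and the services `fetch_recommendations`, `fetch_cart_total`, and `render_checkout` are hypothetical names invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(service_call: Callable[[], T], fallback: T) -> T:
    """Run one service call; return the fallback value if that service fails."""
    try:
        return service_call()
    except Exception:
        # The failure stays inside this one dependency instead of
        # propagating upward and taking the whole request down with it.
        return fallback


# Hypothetical services behind a checkout page (illustrative names only).
def fetch_recommendations() -> list[str]:
    raise ConnectionError("recommendation service is down")


def fetch_cart_total() -> float:
    return 42.50


def render_checkout() -> dict:
    return {
        # The cart total is essential, so it is called directly.
        "total": fetch_cart_total(),
        # Recommendations are optional: if that service is down,
        # the page degrades to an empty list rather than failing.
        "recommendations": call_with_fallback(fetch_recommendations, fallback=[]),
    }


if __name__ == "__main__":
    print(render_checkout())
```

Which calls deserve a fallback is a design decision: here the optional feature degrades quietly, while a failure in the essential one would still surface as an error.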

Best Practices for Avoiding Software Failures

There are several reasons why software programs fail, and some basic best practices can be employed to minimize the likelihood of that happening. They include the following:

  • Implementing load balancing. Think “Black Friday” and what happened when websites were not equipped to handle shopper traffic. On an e-commerce website, when the number of users increases sharply to take advantage of an online offer, the surge can cause a crash that affects other features, such as access to the payment page or to the bank that customers hope to draw from when they check out. Avoid a single point of failure by load balancing system traffic across multiple server locations.

  • Applying program scaling. This is the ability of a program’s application nodes to automatically adjust and ramp up to handle increased traffic as metrics are analyzed in real time. Scheduled scaling can be employed during forecasted peak hours or for special sale events, such as Amazon Prime Day. At off-peak hours, those nodes can then be scaled down. Dynamic scaling adjusts capacity based on metrics including CPU utilization and memory. Predictive scaling entails understanding current and forecasted future needs, using machine learning models and system monitoring.

  • Using continuous load and stress testing to ensure reliability of the code. Build a software program with a high degree of availability in mind, accessible every day of the year with a minuscule period of downtime. Even one hour offline a year can be costly. Employ chaos engineering during the development and beta testing stages, introducing worst-case scenarios when it comes to the load on a system. Then write the program to overcome those issues without resorting to downtime.

  • Developing a backup plan and program for redundancy. It’s crucial to be able to replicate and recover data in the event of a crash. Instill this type of business ethic within the corporate structure.

  • Monitoring a system’s performance using metrics and observation. Note any variance from the norm and take immediate action where needed. A word of caution: the most common reason for software failure is the introduction of a change to the operating system in production.
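
Two of the practices above, load balancing and dynamic scaling, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than a production design: the class `RoundRobinBalancer`, the function `desired_node_count`, the 60% CPU target, and the node limits are all hypothetical choices made for the example. Real deployments typically rely on a managed load balancer and autoscaler.

```python
import math
from itertools import cycle


class RoundRobinBalancer:
    """Spread requests across servers in turn, skipping unhealthy ones."""

    def __init__(self, servers: list[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._healthy = set(servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server: str) -> None:
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self) -> str:
        # Look at each server at most once per call, so one failed
        # server does not become a single point of failure.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


def desired_node_count(
    current_nodes: int,
    avg_cpu_percent: float,
    target_cpu_percent: float = 60.0,  # assumed target, not from the article
    min_nodes: int = 2,
    max_nodes: int = 20,
) -> int:
    """Dynamic scaling: size the fleet so average CPU moves toward the target."""
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    wanted = math.ceil(current_nodes * avg_cpu_percent / target_cpu_percent)
    # Clamp so the fleet never shrinks below a redundant minimum
    # or grows past a cost ceiling.
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["us-east", "us-west", "eu-central"])
    balancer.mark_down("us-west")
    print([balancer.next_server() for _ in range(4)])
    print(desired_node_count(current_nodes=4, avg_cpu_percent=90.0))
```

The scaling rule is proportional: if four nodes average 90% CPU against a 60% target, it asks for six nodes, and at off-peak load it shrinks the fleet but never below the configured minimum.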

One Step at a Time

The first step in developing a software program is choosing the right type of architecture. Using the wrong type can lead to costly downtime and can discourage end users from returning for a second visit if other sites or apps offer the same products and services.

The second step is to incorporate key features, including the ability to scale as demand on the program peaks (perhaps a popular retail site having a sale), redundancy that allows a backup component to take over in case of a failure, and continuous system testing.

The final step is to establish standards of high availability and high expectations where downtime is not an option. Following these steps creates a template to design better system applications that are reliable in all but the rarest of circumstances.
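
To make an availability standard concrete, the arithmetic below converts between yearly downtime and an availability percentage. It is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes a 365-day year with no allowance for planned maintenance.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability_percent(downtime_hours_per_year: float) -> float:
    """Availability implied by a given amount of downtime per year."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    return (1 - availability_pct / 100.0) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    print(f"1 hour down per year = {availability_percent(1):.3f}% available")
    for target in (99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% target allows {minutes:.2f} minutes down per year")
```

By this arithmetic, a single hour offline in a year corresponds to roughly 99.989% availability, while a 99.9% target would allow about 8.76 hours of downtime per year.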

About the Author

Vigneswaran Kennady is a senior manager for software engineering at Capital One, based in Plano, Texas. He has over 14 years of experience with building, owning, and supporting cloud-native applications and platforms. That includes complex architectural patterns, building microservices, event streams, and high-throughput systems. For additional information, contact [email protected]
