3 Ways to Do AIOps Right in Cloud-Native Environments

For AIOps to deliver value, it must be fully automated, in context, and able to shift left for development and shift right for operations.

Software production deployments are growing exponentially. One survey from IT automation company Puppet predicts a 10x increase in deployments over the next year. Organizations must confront their old-school, manual approaches to troubleshooting and remediating software issues head on. AIOps is an automated solution that replaces time-consuming, tedious manual work with fast, precise answers about the performance and security of applications and infrastructure.

But many organizations still use older AIOps solutions, which rely on logs, metrics, and traces to find patterns and correlations and determine the root cause of performance and technical issues. ITOps, DevOps, and SRE teams are contending with complex multi-cloud, multi-cluster environments where production deployments happen in a matter of days -- and these older AIOps solutions just can’t keep up.

For AIOps to deliver value for these teams, it has to be done right -- fully automated, in context, and able to shift left for development and shift right for operations. Here are three use cases that demonstrate how to do AIOps the right way.

1. Ingest contextual data

Many organizations leverage tools like Azure DevOps, GitHub Actions, GitLab Pipelines, and Jenkins to automate their software delivery pipelines. Improved delivery automation is important, as it accelerates the rate at which DevOps and SREs can release high-quality code and ramp up their delivery pipelines’ output.

There are two ways AIOps can help accelerate delivery automation. The first is having the AIOps solution ingest deployment and configuration data. This involves linking events -- such as configuration changes, deployments, load-balancer changes, and service restarts -- to a specific monitored entity, like a container, application, or process. Examples include deploying a new iteration of an app into a testing environment, restarting a service in production, or rebalancing traffic across a production environment. The point is to feed more contextual data into the AIOps solution, so it goes beyond simple correlation and observes the direct link between behavioral changes and executed actions to determine root causes.

The second is immediate notification: DevOps and SREs learn right away whenever one of those behavioral changes negatively impacts the user experience. That immediacy, combined with root-cause determination, ensures the AIOps solution provides teams with fast, precise answers about the quality and scalability of their delivery pipeline.
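The event-linking idea above can be sketched in a few lines. This is a hypothetical, simplified model -- the entity IDs, event types, and in-memory `EventStore` are illustrative stand-ins for a real AIOps ingestion endpoint, not any vendor's actual API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PipelineEvent:
    """A CI/CD action linked to the monitored entity it affects (hypothetical schema)."""
    event_type: str   # e.g. "DEPLOYMENT", "CONFIG_CHANGE", "SERVICE_RESTART"
    entity_id: str    # monitored entity: a container, application, or process
    source: str       # tool that emitted the event, e.g. "github-actions", "jenkins"
    version: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class EventStore:
    """Minimal in-memory stand-in for an AIOps ingestion endpoint."""
    def __init__(self):
        self._events = {}

    def ingest(self, event: PipelineEvent) -> None:
        # Index events by entity so behavioral changes can be traced back
        # to the pipeline actions that immediately preceded them.
        self._events.setdefault(event.entity_id, []).append(event)

    def events_for(self, entity_id: str) -> list:
        return self._events.get(entity_id, [])

store = EventStore()
store.ingest(PipelineEvent("DEPLOYMENT", "app:checkout-service", "github-actions", "v1.4.2"))
store.ingest(PipelineEvent("SERVICE_RESTART", "app:checkout-service", "jenkins", "v1.4.2"))

# When the engine detects a behavioral change on this entity, it can look up
# which executed actions preceded it rather than relying on correlation alone.
print(len(store.events_for("app:checkout-service")))  # 2
```

Because each event carries the entity it touched, root-cause analysis becomes a lookup against concrete actions instead of a statistical guess.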

2. Leverage AIOps insights to support data-driven decision-making

Feeding new contextual and deployment data to the AIOps solution also makes it a fountain of information that better informs and automates decision-making at every stage of the DevOps life cycle, from design, development, and delivery to production monitoring and troubleshooting.

The AIOps solution generates performance data on individual software releases or tests, which teams can compare against baselines to detect regressions that occur during or between tests. This approach can be repeated over multiple tests and deployments. The open-source CNCF project Keptn offers another application of this approach: it automatically ingests data from multiple cloud-native sources and uses AI to calculate a single service-level objective (SLO) score. Rather than manually scouring AIOps reports and dashboards, teams can reference Keptn's SLO scores to optimize code faster, roll out higher-quality software releases, remediate issues before they reach the end user, and make the delivery pipeline a smoother, more automated process.
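To make the single-score idea concrete, here is a hedged sketch of rolling several service-level indicators up into one percentage, in the spirit of Keptn-style quality gates. The metric names, thresholds, and half-credit warning band are illustrative assumptions, not Keptn's actual scoring algorithm:

```python
def score_release(slis: dict, objectives: list) -> float:
    """Roll individual SLI results up into a single 0-100 score.

    slis: measured values for a candidate release.
    objectives: tuples of (metric name, pass predicate, warn predicate).
    """
    total = 0.0
    for metric, passes, warns in objectives:
        value = slis[metric]
        if passes(value):
            total += 1.0   # full credit: objective met
        elif warns(value):
            total += 0.5   # half credit: inside the warning band (assumed policy)
    return round(100 * total / len(objectives), 1)

# Illustrative objectives for a hypothetical service.
objectives = [
    ("response_time_p95_ms", lambda v: v <= 200, lambda v: v <= 300),
    ("error_rate_pct",       lambda v: v <= 1.0, lambda v: v <= 2.0),
    ("throughput_rps",       lambda v: v >= 100, lambda v: v >= 80),
]

candidate = {"response_time_p95_ms": 180, "error_rate_pct": 1.5, "throughput_rps": 120}
print(score_release(candidate, objectives))  # 83.3 -- one pass band missed, half credit
```

A release gate then reduces to a single comparison, e.g. promote only if the score clears 85, instead of eyeballing three dashboards per deployment.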

3. Shift AIOps left into pre-production to create proactive, test-driven operations in production

Rather than waiting to deploy remediation scripts until after a user has already had a negative experience, shifting AIOps left enables a more proactive posture where remediation code can be tested before it's deployed into production. One way of doing this is to create a chaos engineering experiment: orchestrate a pre-production environment monitored by your AIOps solution, load it with tests that inject chaos into the environment, then use the results to validate your auto-remediation code. This "test-driven operations" environment becomes a proving ground for both the remediation code and the AIOps solution: by battle testing the solution against simulated failures, you validate that it can trigger auto-remediation scripts when a real-world issue arises.

For SREs, this means no longer worrying that a new issue will box them into a corner and force them to script and deploy remediation code on the spot. Instead, when an issue arises and the user experience is affected, SREs can rely on an AIOps solution with proven, battle-tested experience identifying the issue and triggering the fix immediately.

Doing AIOps right means closing the gap between solutions and processes

Leveling up your AIOps strategy calls for more tightly integrating your AIOps solution into your DevOps and SRE practices, development processes, testing environments, and internal platforms to close the gap between internal processes and the AIOps solution itself. The more you narrow that gap, the better positioned you are to leverage AIOps for fast, precise answers and remediation in your software development pipeline.
