With massive increases in digital demand, IT teams are shifting into “hypercare” mode. The challenge: How do they stay there as demand keeps rising?

Guest Commentary

May 22, 2020


Digital dependency is at an all-time high. The effects of COVID-19 have caused us to rely on digital services more than ever to work and stay connected. As a result, many organizations and IT teams have been forced to shift into hypercare mode to meet the increased demand.

Hypercare is an elevated state of support where IT operations teams closely monitor customer service, data integrity and other functions to keep services running smoothly and meet customer expectations. It’s typically used immediately following a release, or during a known period of very heavy traffic (like Black Friday). It’s meant to be a temporary state. Yet, for many organizations, COVID-19 has extended this level of pressure, and hypercare with it, indefinitely.

The challenge today is, under this prolonged state of heightened demand, how do you continue operating at a high level? Thankfully, we have some data from teams, including ours at PagerDuty, who have managed to maintain hypercare and actually resolve incidents faster than before the pandemic. I’ve put together a few practices that the best teams use:

Prioritize customer response

First, establish your crisis response process and team, with clearly defined priorities, roles and procedures. You may already have an established and working incident response process or team -- in this case, ensure it’s staffed for the additional load. This team’s goal is to be available to address the most urgent issues quickly and effectively, dropping whatever they are doing if a customer is at risk.

In a crisis, every second counts, so this team will need both to work cohesively to respond to issues as they arise and to harden the infrastructure against future problems. This will protect the customer experience and business revenue.

Provide visibility

You need system-wide visibility so your team can quickly understand where and how failures originate. As the central nervous system for many organizations, Network Operations Centers (NOCs) typically play this role, as they allow IT support technicians to supervise, monitor and maintain networks and infrastructure. The challenge is that NOCs are typically housed in a central physical location, but now must be distributed and remote.

By virtualizing your NOC, you can maintain system-wide visibility and resolve issues as a distributed team. A single, shared point of real-time visibility minimizes the communication challenges of working dispersed.

Keep stakeholders informed

Managing incidents is stressful enough. The last thing an IT team needs while they’re focused on solving a critical issue is an executive asking for updates. Before something breaks, define standard ways that business leaders and other partners will be updated. Establish clear modes of communication -- ideally a dashboard where executives can follow incident resolution, rather than interrupt responders for updates. This will increase stakeholders’ confidence and reduce the panic they feel when they aren’t informed.

Automate to create efficiencies

While in hypercare, nothing is more important than getting the right information to the right people in real time. When you’re distributed and the work keeps piling up, breaking through the noise is nearly impossible.

Automation and machine learning help reduce duplicative efforts, surface urgent issues that need attention, and cut down response times. One way to automate is by grouping related alerts so responders avoid getting overloaded. Another is to apply machine learning to data on how a similar past incident unfolded and who worked on it, so you can move faster to resolve the current problem.
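To make the alert-grouping idea concrete, here is a minimal sketch in Python. It is not how PagerDuty or any particular product groups alerts; it simply assumes that alerts from the same service arriving within a few minutes of each other belong to the same incident. The names `Alert`, `AlertGroup` and `group_alerts`, and the five-minute window, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert shape: which service fired, what it said, and when.
    service: str
    message: str
    received_at: datetime


@dataclass
class AlertGroup:
    # A bundle of related alerts that a responder can treat as one item.
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> datetime:
        return self.alerts[-1].received_at


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts from the same service that arrive within `window`
    of the previous alert in that service's most recent group."""
    groups = []
    latest_group = {}  # service name -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.received_at):
        current = latest_group.get(alert.service)
        if current is not None and alert.received_at - current.last_seen <= window:
            # Close enough in time: fold into the existing group.
            current.alerts.append(alert)
        else:
            # First alert for this service, or the gap was too long.
            current = AlertGroup(service=alert.service, alerts=[alert])
            groups.append(current)
            latest_group[alert.service] = current
    return groups


if __name__ == "__main__":
    start = datetime(2020, 5, 22, 9, 0)
    sample = [
        Alert("checkout", "latency high", start),
        Alert("checkout", "error rate high", start + timedelta(minutes=2)),
        Alert("search", "timeout", start + timedelta(minutes=3)),
        Alert("checkout", "latency high", start + timedelta(minutes=30)),
    ]
    for group in group_alerts(sample):
        print(group.service, len(group.alerts))
```

With the sample data, four raw alerts collapse into three groups, so a responder sees the two early checkout alerts as a single item. A real system would likely group on richer signals than service name and time, but even a simple rule like this can cut the number of notifications a responder has to triage.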

Learn and improve your way out of hypercare

After an incident, run a blameless postmortem with the goal of mitigating similar issues in the future. Summarize the events leading up to resolution, identify contributing factors, document agreed-upon action items and share the final report internally.

Without a postmortem, you and your team miss out on the opportunity to learn what you’re doing right, where you can improve, and how to avoid the same issues in the future. You may find you need to make investments to reduce operational load and need to pause new innovations. These types of decisions are critical to ensure teams harden and scale their services to the new normal, so they can eventually improve their way out of hypercare. 

The fallout if you fall off

Again, hypercare is typically a temporary state of elevated support. But in this environment requiring extended hypercare, one serious risk is burnout. Even before the pandemic, 62% of IT professionals in North America reported spending more than 100 hours each year on disruptive, unplanned work. Burnout increases employee turnover and decreases employee productivity -- both costly side effects that will ultimately impact your customers. In this period of increased demand on your digital services, not implementing some of the hypercare protocols above will almost guarantee burnout. That is why it’s crucial both to operate efficiently and to make the operational improvements needed to help teams exit hypercare.

PagerDuty can help

Empowering your teams with the tools and processes they need to meet the increased demands on digital experiences while working from home is a new complexity for many organizations. PagerDuty provides incident response automation, proactive system-wide visibility, and best-practice knowledge so you can keep up with the current surge in demand and keep everything running smoothly. To learn more, visit PagerDuty.com.

Rachel Obstler is the VP of Product at PagerDuty, where she is responsible for the product direction, customer experience, and pricing. Prior to PagerDuty, Rachel served as the VP of Product at Keynote Systems, and later as the VP of Mobile Testing, overseeing the mobile testing product, sales, engineering, marketing and service organizations. Rachel has over 10 years of experience in SaaS and over 15 years in product management. Rachel holds a B.S. from MIT and an M.B.A. from Stanford University Graduate School of Business.

About the Author(s)

Guest Commentary


The InformationWeek community brings together IT practitioners and industry experts with IT advice, education, and opinions. We strive to highlight technology executives and subject matter experts and use their knowledge and experiences to help our audience of IT professionals in a meaningful way. We publish Guest Commentaries from IT practitioners, industry analysts, technology evangelists, and researchers in the field. We are focusing on four main topics: cloud computing; DevOps; data and analytics; and IT leadership and career development. We aim to offer objective, practical advice to our audience on those topics from people who have deep experience in these topics and know the ropes. Guest Commentaries must be vendor neutral. We don't publish articles that promote the writer's company or product.
