There’s rarely a dull moment in the life of a site reliability engineer. When applications and services are down, SREs get the call. If thousands of users or millions of dollars are on the line and the clock is ticking, all eyes turn to the SRE to save the day.
The downside to carrying this kind of responsibility: a huge amount of stress. Late nights, high pressure, and constant demands to swoop in and fix problems (even ones that don’t necessarily fall under the SRE role) are all common complaints. And the problem doesn’t seem to be improving.
Why is the SRE role so hard on the people doing these jobs? And what can we do to make it better?
Evolution of the SRE
The role of the SRE evolved in response to changing methods of building digital products and services. In recent years, as more companies have embraced agile software methodologies and DevOps, they’re moving faster than ever to push out new code. When things inevitably break, it’s often the SRE’s job to fix them regardless of whether they were involved in the development and rollout processes.
In principle, SREs are not supposed to be constantly putting out fires. Rather, as Google originally defined the job, they should spend a significant portion of their time on proactive, strategic tasks like increasing system reliability, optimizing capacity planning, and improving documentation. When an incident arises, SREs don’t just bring services back online. Ideally, they conduct extensive post-mortems. They identify why the issue arose, share knowledge about the incident, and build systems and automation to prevent it from happening again.
Unfortunately, many SREs say the reactive aspects of the job end up taking most of their time. That imbalance puts more pressure on SREs than they should be asked to bear. Worse, the steps that could reduce that stress -- increasing system reliability, automating problem resolution, and improving documentation -- are the very things that get pushed aside.
Navigating SRE challenges
Several factors contribute to the stress and frustration:
- Poorly defined job responsibilities: Because the SRE role is still relatively new, there’s a lot of variation -- and misunderstanding -- about what exactly the job entails. Too often, the lines between SREs and delivery and operations teams get blurred. As one SRE told us, “Because the SRE role changes from organization to organization, there can be confusion about the SRE role versus pre-existing operations roles. This creates extra work for SREs, as we end up having to do tasks that may not be under our scope or having to push back on requests from people who don’t understand our role.”
- Outsized focus on reactive incident remediation: Along those lines, many SREs see their roles effectively morph into “ultra sysadmin.” They spend so much time detecting and solving problems that there’s little bandwidth left for building systems that are more reliable, efficient, and automated.
- High-pressure scenarios: SREs often feel like the control-booth technician at a big conference. When a presenter’s slides won’t load, all eyes immediately turn to the booth. For every minute that goes by in silence, the anxiety grows. SREs tell us that while they appreciate being trusted with so much responsibility, what they’d really like is some empathy.
Reimagining SRE roles
Too many organizations struggle to maintain the well-being and job satisfaction of their SREs. If we’re going to realize the benefits that drove the creation of the SRE role in the first place -- if companies want to be able to scale up more quickly without sacrificing reliability -- we need to make this function work better. Here are two steps to consider:
- Implement firm timetables for the different parts of the SRE job: There’s no point in bringing in SREs if they end up spending all their time on troubleshooting and operations. Organizations have to consciously carve out time for SREs to devote to building systems and working on proactive initiatives, and then enforce those timetables. And to reduce the time SREs spend debugging and fixing problems, get them involved earlier in the development life cycle.
- Focus on the right metrics: A lot of companies collect data on how long it takes to resolve problems but don’t track how long it takes to detect them, or how long it takes before the business is affected. Those measures are just as important.
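To make the metrics point concrete, here is a minimal Python sketch of how the three measures could be computed from incident timestamps. The `Incident` record and all of its field names are hypothetical, not taken from any particular tool, and treating time to resolve as the gap between detection and resolution is an assumption; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Iterable, Optional


@dataclass
class Incident:
    # Hypothetical field names; map them to whatever your incident tracker records.
    started_at: datetime                    # when the fault actually began
    detected_at: datetime                   # when monitoring or a person noticed it
    resolved_at: datetime                   # when service was restored
    impacted_at: Optional[datetime] = None  # when the business was first affected, if ever


def _mean_minutes(deltas: Iterable) -> Optional[float]:
    """Average a series of timedeltas in minutes; None if there is nothing to average."""
    seconds = [d.total_seconds() for d in deltas]
    return mean(seconds) / 60 if seconds else None


def summarize(incidents: list) -> dict:
    """Report detection, resolution, and business-impact times side by side."""
    return {
        # Fault start -> detection
        "mean_time_to_detect_min": _mean_minutes(
            i.detected_at - i.started_at for i in incidents
        ),
        # Detection -> resolution (assumption: resolution clock starts at detection)
        "mean_time_to_resolve_min": _mean_minutes(
            i.resolved_at - i.detected_at for i in incidents
        ),
        # Fault start -> first business impact, only for incidents that had one
        "mean_time_to_impact_min": _mean_minutes(
            i.impacted_at - i.started_at
            for i in incidents
            if i.impacted_at is not None
        ),
    }
```

Tracking the three numbers together shows whether a team that looks fast at resolving incidents is in fact slow to notice them, or is noticing them only after the business has already been affected.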
It’s time to take better care of SREs
As the guardians of an organization’s critical services, SREs will always shoulder a big responsibility. That’s just the nature of the job. But there’s no reason the role has to come with so much stress and frustration. Organizations can do a better job of empathizing with SREs and making sure that everyone understands what their role is, and what it’s not. They can also make sure they’re giving SREs the time, tools, and visibility they need to be proactive in their jobs.
By taking these steps, organizations can help SREs detect and solve problems more quickly. That in turn creates more time for SREs to focus on proactive initiatives. Ultimately, we can transform the SRE role into a virtuous circle of ongoing improvement and automation. As we do, we’ll end up with a lot less stress and frustration -- among SREs, the broader company, customers, and end users.
Nithyanand Mehta is Executive Vice President, Technical Services and GM at Catchpoint. Mehta leads the global Catchpoint Technical Services teams, which include Professional Services, Sales Engineers, and Support.