Finding Balance in Dev vs. Ops for Site Reliability Engineers
Results from a recent survey show some organizations have pushed SREs in directions that underutilize and squander their talents.
The demands organizations put on site reliability engineers push them to devote more time to the operations side of their responsibilities rather than maintain an even balance. Catchpoint released its 2020 SRE Survey Report, which gathered responses from more than 600 site reliability engineers from around the world. The annual survey was conducted in two rounds, the first in February and the second in May. Those results, along with perspectives from experts at Volterra, point to how the role of SREs is being reshaped.
Though it has been posited that a 50-50 split between development and operations is ideal for SREs, the majority of the Catchpoint survey respondents indicated they spend 75% of their time on operations. That imbalance can affect job effectiveness, with 53% of the respondents saying they were brought in “too late” during the application lifecycle. This may be a sign that organizations should rethink how they utilize SREs as the role continues to evolve.
What companies expect out of their site reliability engineers can vary based on management’s understanding and intentions for the role. “A lot of organizations have put the word SRE in ops titles because it’s more fashionable,” says Mehdi Daoudi, CEO of Catchpoint. In such cases, he says, the engineers might not perform traditional SRE duties, which may include engineering, automation, and monitoring. “One of the biggest challenges we see this year is people are not taking full advantage of what a true SRE team can bring to the table,” Daoudi says.
When SREs have the bandwidth to fulfill their core duties, he says, they can improve scalability, resiliency, and monitoring, and maintain overall functionality. The imbalances in SRE job responsibilities shown in the survey responses tend to come from organizations that still have legacy applications and infrastructure, Daoudi says. “SREs are thrown into the fire to maintain things,” he says. Organizations with legacy technology that are also on a path to cloud, microservices, or containers tend to involve SRE teams in end-to-end platforms, Daoudi says.
Changes in the duties of SREs have been accelerated by migration to distributed cloud, says Jakub Pavlik, Volterra’s director of engineering. “Before, people just had datacenters that were all centralized.” The rise of hybrid cloud and DevOps made organizations want to move quickly and automate application deployment, he says.
The effects of COVID-19 further pushed the move to distributed cloud, which spurred the need to set up multiple locations, providers and edge computing, Pavlik says. That can put more pressure on SREs to focus on the operations side of their duties. “They don’t have as much time for some development activities because they are overburdened on making sure all the systems are running,” he says.
Successful implementations of SRE teams at disruptors such as Netflix and Google naturally have not always been matched by other enterprises, Pavlik says. Some companies simply renamed their operations team to SRE team, but he believes any current confusion will be cleared up over time. Pavlik says Volterra partially runs different workloads on different cloud providers and sees challenges in standardizing monitoring and observability. That makes finding staff to fill SRE roles vital, though a challenge in the current market. “Getting SRE people is not easy,” he says. “Even if you have unlimited budget, you will have a hard time getting so many talented people. It needs to be solved by right-tooling and automation.”
Catchpoint works largely with SRE organizations, and Daoudi says the companies that are most successful tend to take on new projects, designs, or initiatives in bite-size portions rather than tackle everything at once. Still, some organizations try to make moves in a hurry with monolithic systems that he says are not well-suited for such approaches.
Adapting SRE principles to the organization is essential, Daoudi says, rather than strictly following examples set by other enterprises. “Rewrite the [Google SRE] guidelines for your organization and system,” he says. “This SRE transition reminds me of agile 20 years ago, where you don’t just go overnight. There are baby steps that people need to adopt.”
Taking into account the nuances of what SREs can do rather than lumping them into operations may be a way for enterprises to better utilize their skills. Daoudi says some organizations specialize their SRE teams in areas such as CDN traffic, traffic engineering, and multicloud infrastructure. SRE organizations can also be a conduit for bringing observability to life, he says, which can drive an organization to achieve its objectives. “I think you’re going to see a lot of things made specialized when it comes to machine learning and being able to write algorithms to go through the vast amount of telemetry being collected.”