The role of site reliability engineer (SRE) has been around for about 15 years, but if you’re not sure what that is or if you’d be a good fit for the position, you’re in good company, because a lot of IT pros and companies are still learning about the role as well.
The SRE role was developed by Google employee Ben Treynor Sloss as a way of making sure the software, sites, and applications that Google was deploying were running nearly all the time (even Google breaks 0.01% of the time). He developed a new way of constructing operations teams, so operations and development would stop squabbling over who broke what and why. The SRE role is a combination of traditional system administrator tasks and coding.
Michael Kehoe, staff site reliability engineer for LinkedIn, describes the person best suited for the SRE role as someone who has a little bit of knowledge about everything.
“Hiring for SRE is difficult, in general, because you need to have individuals that have multifaceted skillsets,” says Kehoe. “Generally, you have someone with a systems background, generally in Linux. They also need to have a little bit of networking knowledge, and they also need to know how to code.” But if you don’t have every skill out of the gate and are willing to study, Kehoe adds that “some companies do hire and train.”
According to the book that Google wrote on the subject, an SRE should spend no more than 50% of their time on operations tasks, freeing up the other 50% of their time for coding and automation projects.
While Google has every right to dictate the exact way the role should work in its organization, many other companies will choose to tweak the position. This is one commonly misunderstood aspect of the SRE role, according to Kehoe.
“The SRE role is different company-to-company, partly based off of company business and cultural structure,” says Kehoe.
“For small startups, sometimes public cloud features perform enough of the operations work so that an SRE isn’t required. However, as you build out operations and infrastructures get larger, there is more of a need for [an] SRE to keep everything running smoothly,” says Kehoe.
If you’re at an enterprise and you’re thinking the role of SRE might be helpful in your technology department, don’t drive yourself into a panic thinking you’re behind the rest. Kehoe says, “Enterprise IT shops are still working out some of the details.” However, if your main business is finance or media, it’s time to get going with this new role.
“If you look at industries like finance and media that keep moving towards being completely digital, it’s more important for those companies to build reliable infrastructure while pushing code quickly, so SRE becomes a necessity,” says Kehoe.
Another myth or misunderstanding about SREs is that they’re at the beck and call of the development team.
“The SRE is here to help Product and Engineering deliver the best experience possible for users,” says Kehoe. One group does not dictate what the other does.
“DevOps is meant to be about breaking down the silos between groups and working together,” says Kehoe, and one way to make sure you’re working side by side, rather than trailing behind or reacting to one another, is to attend each other’s meetings.
Kehoe says that when he was an embedded SRE at LinkedIn and responsible for a product, he would attend the standup meetings with the engineers and their sprint planning meetings. “To make sure I knew what was coming down the pipeline, I knew what they need to achieve, to make sure I knew what things need to be looked after before it was too late,” says Kehoe.
A third myth about the SRE role is that they’re working to maintain 100% uptime.
“The goal of SRE is not to have zero outages, it’s to trade an ‘error budget’ that ensures maximum feature velocity,” says Kehoe.
If you’re unfamiliar with the concept of an error budget, the idea is that there’s a small amount of wiggle room for outages and errors. You decide with your team (yes, ops/SRE and dev) what that wiggle room will be, with an internal or external SLA or a handshake, depending on how serious your IT shop is, and agree that the site/app/software can be down for that amount of time. As long as the service is running within that budget, developers are allowed to push new code and features that may break something. But if for some reason the service is running in the red, developers have to freeze code pushes until it is back within the agreed-upon error budget.
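The arithmetic behind an error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Google, LinkedIn, or Atlassian implement it: the `ErrorBudget` class, its field names, the 99.9% target, and the 30-day window are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks allowed downtime for one measurement window."""

    slo_target: float      # agreed availability, e.g. 0.999 for 99.9% (assumed value)
    window_minutes: float  # length of the window the budget applies to

    @property
    def budget_minutes(self) -> float:
        # The budget is whatever the availability target leaves over.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Negative means the service is "in the red".
        return self.budget_minutes - downtime_minutes

    def can_deploy(self, downtime_minutes: float) -> bool:
        # Risky pushes are allowed only while some budget is left.
        return self.remaining_minutes(downtime_minutes) > 0


# Assumed example: a 99.9% target over a 30-day window.
budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)
print(f"Budget: {budget.budget_minutes:.1f} minutes")
print("10 minutes down, deploy allowed:", budget.can_deploy(10))
print("50 minutes down, deploy allowed:", budget.can_deploy(50))
```

With these assumed numbers the budget works out to roughly 43 minutes of downtime per 30 days. Ten minutes of outage leaves room to keep shipping, while 50 minutes puts the service in the red and triggers the code freeze.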
As Patrick Hill, site reliability engineer for Atlassian, said in a blog post, the beauty of the error budget is that it puts SREs and developers on the same page. “Both the SREs and developers have a strong incentive to work together to minimize the number of errors,” said Hill.
If you’re interested in learning more about this subject, Kehoe will be speaking at Interop ITX 2018 this spring in the session The Next Wave of Reliability Engineering. “The talk is...from an engineering point of view, what are the next things we’re looking to implement in our infrastructure to give a more reliable experience.”