Although site reliability engineer (SRE) has been an established role for 15 years, people are still pretty mystified about what an SRE does and how best to approach the role. Perhaps you have the title but you’re still not 100% sure if you’re “doing it right.”
Is it just a rebranded sysadmin role? Is it like a DevOps engineer? (Is DevOps engineer a real job?) Will it really deliver the promises of respect for the operations side of the house and a 50/50 work split of coding and operations?
Think of DevOps as a process, cloud native as an architecture, and SRE as an actual job that requires a production-first mentality, not a feature-development mentality, said Hirschfeld, who has been creating leading-edge infrastructure and cloud automation platforms for over 15 years.
He went on to describe what to expect in an SRE role and how to succeed with a DevOps system in place that includes SREs. Here are five important lessons that Hirschfeld shared.
1. Pay Equity Builds Partnerships
One of the longest-standing gripes in IT is that developers get all the love…and the money.
It’s an industry-wide problem, said Hirschfeld, that can translate into several issues.
Money is often perceived to show the degree to which a role is or is not respected within an organization. Pay inequity can build a battleground among team members (for example, operators/SREs and developers) when one group perceives its work to be more valuable and more respected within the organization.
In a recent SRE report released by Catchpoint, 416 SREs were asked if the SRE team is a well-respected and valued part of their organization. The responses by industry were a mixed bag: only 36% of the financial services SREs reported that they felt respected and only about half (52%) of media/entertainment SREs reported feeling respected, while a majority of the ecommerce/retail SREs (69%) and SaaS industry SREs (73%) felt respected within their organizations.
“The simplest way to fix a status and equity situation is to fix the pay,” said Hirschfeld.
And let's be clear, this is not about throwing money at a situation to solve a problem. This is about saying, your role is as important as the others, and this is a partnership where we work toward the same goal. It's not, we're just two disparate teams working toward our own goals and we happen to interact regularly.
At Interop ITX, Hirschfeld told the audience that even mighty Google once struggled to show their operators the love they deserved.
Prior to developing the SRE role, “The [Google] operators were being paid way less than the developers,” said Hirschfeld, adding that “this sent a very clear message that the operators were second-class citizens.”
In order to improve their production system and build a partnership between the two groups, Google decided that one of the first things they needed to do was fix pay equity for the newly anointed SREs, said Hirschfeld.
When SREs and developers are paid the same, IT leaders start to develop the mindset of: “I can’t let this expensive resource over here get whipped around.”
2. SREs Can’t Ignore Infrastructure
In a serverless and container-ed IT world, it may be tempting to think, “what infrastructure?” but SREs don’t get the luxury of pretending that infrastructure doesn’t exist, said Hirschfeld. "[The] SRE [role] is an acknowledgment of the system view of infrastructure."
“The cost of developers getting these performance-improving abstractions [like containers and serverless] is now SREs have to shoulder this infrastructure more on their own,” said Hirschfeld.
According to the Catchpoint SRE report, “65% of SREs have infrastructure fully or partially in the cloud and are deploying code at least once a day.”
With that in mind, Hirschfeld said, “It is important in the [SRE role] that you maintain a connection to actually running the infrastructure. If you separate those responsibilities again, you lose what goes into being an SRE.”
“Somebody somewhere cares about your infrastructure,” said Hirschfeld. “You might not have servers anymore, but you still have to look at it from your delivered products perspective.”
3. Tools Do Matter
SREs rely heavily on tools, and according to the Catchpoint SRE survey, availability and triage tools top the list. But depending on your budget or company size, you may not have access to a wide variety of tools. Only 52% of SREs at companies with fewer than 1,000 employees feel they have access to a wide range of tools, compared to 75% of SREs at companies with more than 1,000 employees.
A limited tool belt, combined with a role where you’re encouraged to spend 50% of your time operating and 50% of your time coding, may tempt you to start building all your tools from scratch.
Hirschfeld has one simple piece of advice: “If your team is going to build everything yourself... don’t do that,” adding that in these increasingly complex IT environments (hybrid cloud, multi-cloud), you’ll likely just end up adding to your technical debt.
The danger of the 50/50 rule is SREs become developers and they start writing tools, said Hirschfeld. When they do that, that becomes software for them to maintain, and then software maintenance takes away from time SREs need for operations or automating.
“You want to spend 50% of your time improving how your business works, and 50% operating your business, if you spend a lot of your time writing that software, you are probably wasting [it],” said Hirschfeld.
While DevOps tends to shy away from discussions about specific tools, SREs need to talk about them, said Hirschfeld.
So rather than build it all yourself, connect with other SREs to see how they’re managing their systems. Attend a conference. There are other ways to solve your tool conundrums than building everything yourself.
4. Disrupt Less
When was the last time you felt anxiety about not being on top of the latest technology in your field? Oh, every second of your day? Here’s some advice from Hirschfeld to SREs about that industry-wide compulsion to implement the latest tech.
The impulse to get into a technology, decide it’s the wrong thing, and switch, or the fear of being “late to shiny,” as Hirschfeld called it, prevents organizations from actually getting all the value from what they’re trying to accomplish. And if you’re an SRE, it’s likely digging into your 50/50 time split.
“You can’t get pulled into firefighting constantly,” said Hirschfeld, and that’s what you’re doing if you’re regularly ditching and switching.
Hirschfeld said to disrupt less and focus on improving the business more.
“If developers are going over their error budget by pushing new code, an SRE can push back and say the code is not ready for production,” said Hirschfeld.
While the Catchpoint report found that less than half (44%) of the companies surveyed do not strictly adhere to error budgets, larger companies in the survey are more likely to use them. Forty-four percent of companies with 5,000 or more employees said that they do strictly adhere to error budgets.
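For readers new to the term, an error budget is commonly understood as the amount of unreliability a service-level objective (SLO) tolerates over a given window. The short Python sketch below illustrates that arithmetic and the kind of push-back check described above. It is only an illustration under stated assumptions: the 99.9% target, the request counts, and the function names are hypothetical and do not come from Hirschfeld or the Catchpoint report.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for a (hypothetical) 99.9% SLO.
    """
    return round((1.0 - slo_target) * total_requests)


def release_allowed(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """True while the failures seen so far are still within the budget."""
    return failed_requests <= error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # Hypothetical month: one million requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(release_allowed(0.999, 1_000_000, failed_requests=800))
    print(release_allowed(0.999, 1_000_000, failed_requests=1_200))
```

In this sketch, a team that has already exceeded the budget would get `False` back, which is the point at which an SRE could argue that new code should wait.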
He also said that it would behoove production teams to think about the 80-20 rule and put a little more effort into the last 20% to make the perfectly fine tech that they do have, really deliver.
“If you’re constantly putting your thumb in the dyke to stop the crisis and moving on to the next fire, you won’t get the chance to make big improvements,” said Hirschfeld.
Perhaps no role is justifiable today unless you can tie your work to the bottom line. For SREs, it’s no different.
According to the Catchpoint SRE report, “the majority of SREs feel their job directly contributes to their organization’s business outcomes.”
The report also stated: “When asked about metrics used to measure success at the individual, team, or organizational level, 30% indicated revenue, which shows alignment with a common business outcome. A few write-in answers also reflect organizational alignment with metrics such as member growth and retention, cost, and user adoption.”
“Part of site reliability is measuring your environment,” said Hirschfeld. “If you’re not measuring it, then you’re not really doing the job that you should be doing."
Hirschfeld explained that the reliability part of site reliability is really about knowing you’ve improved the production cycle and not causing a disruption. Or in other words: “Trending toward invisibility,” said Hirschfeld.
“Without collecting data back to see if the changes you’ve made had an effect, you’re wasting your time; you’re not able to show the value.” If you can’t show that you’re being effective, Hirschfeld said your job could be the thing that disappears.
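One simple way to "collect data back" is to compare the same metric before and after a change. The Python sketch below is a minimal example of that idea, not a method Hirschfeld described; the latency samples and the `percent_change` helper are hypothetical.

```python
from statistics import mean


def percent_change(before: list[float], after: list[float]) -> float:
    """Relative change in the mean of a metric, in percent.

    A negative result means the metric went down after the change.
    """
    baseline = mean(before)
    return (mean(after) - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical response times (ms) sampled before and after a change.
    latency_before = [200, 220, 180]
    latency_after = [150, 160, 140]
    print(f"{percent_change(latency_before, latency_after):.1f}% change in mean latency")
```

A single average hides a lot, so a real comparison would likely need more samples and a look at the distribution, but even a number like this gives a team something concrete to show.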
Don't forget to measure performance data, said Hirschfeld, because it’ll help you troubleshoot problems that arise.
Ultimately, measuring your work will also help build your relationship with the development team. Hirschfeld said that the SRE mindset should be to bring information to the development team about what needs to be improved. “It’s much more of a partnership between the teams than traditional thinking,” said Hirschfeld.
If analytics aren’t your thing, Hirschfeld said to work with people who like that part.
“If you help collect the data, you will find people who love correlating and producing beautiful graphs and that type of thing,” he said.

Emily Johnson is the digital content editor for InformationWeek. Prior to this role, Emily worked within UBM America's technology group as an associate editor on their content marketing team.