Ihor
Feoktistov

Why and How to Hire a Site Reliability Engineer (SRE) [Salary and Job Requirements Included]

Is your company running an ever-growing infrastructure that has to support services and products while enabling seamless updates and new feature releases? Or are you tasked with ensuring uninterrupted operation of a disparate, heterogeneous infrastructure that supports mission-critical systems?

Either way, Site Reliability Engineering, or SRE, is precisely what you need to gradually improve the stability and performance of your infrastructure. However, SRE expertise is hard to come by, which is why we’re here to explain in detail how to hire a Site Reliability Engineer. Recommendations on salary and job requirements are included!

What is a Site Reliability Engineer?

SRE specialists should not be confused with DevOps engineers, although many sources use these two terms interchangeably. 

DevOps is the practice of automating repetitive IT operations to minimize human effort (and the risk of human error) while running your infrastructure. DevOps engineers focus on software development, deployment, and operating production environments.

SRE, on the other hand, is a paradigm of continuously analyzing the existing infrastructure from the reliability perspective. It is centered on removing performance bottlenecks and optimizing the infrastructure, the toolkit, and the workflows involved in running it. Born at Google, SRE is now the leading approach to ensuring the long-term sustainability and operational resilience of digital assets.

We’ve already covered various aspects of what Site Reliability Engineering is, so feel free to dive deeper into this topic by reading our article.

SRE job responsibilities

While Ops engineers have to run an infrastructure they’re given and put out fires all over it, and DevOps can automate various aspects of IT operations to reduce the number of incidents, SREs have to plan and design resilient infrastructure and workflows (and update them as needed).

The main job responsibilities of an SRE are:

  • Gathering project requirements from stakeholders along with BAs and PMs
  • Designing high-level schematics of the infrastructure, tools, and processes needed 
  • Performing an in-depth analysis of the possible risks and countermeasures for them
  • Calculating the potential cost of outages and planning for contingency
  • Monitoring the systems in production, analyzing their performance
  • Preparing input for infrastructure/tooling/workflow updates across the organization
  • Teaching the Dev and Ops (or DevOps) teams to follow the guidelines and procedures to minimize the number of errors and incidents

This list is far from exhaustive and depends significantly on the specifics of your organization. Naturally, the SRE tasks and approaches of a global organization running legacy mainframe systems will differ from those of an actively growing cloud-based app.

How SRE specialists work

SRE tasks can be grouped according to three major phases: design, implementation, and maintenance.

An SRE expert should be involved in all stages of any IT-related project of your organization. This includes discussing the concept of the next project, designing the infrastructure, toolset, and processes needed to deliver it, overseeing their implementation, monitoring the performance of a working system, and adjusting it if necessary. It also involves training your staff to follow the guidelines and procedures that minimize the daily toil for your IT department.

An SRE’s job never really ends; it’s a permanent effort aimed at improving your IT operations and educating your developers and Ops engineers on SRE best practices. This complex and multi-faceted approach requires having a set of important skills.

Key Site Reliability Engineering Skills

The core part of SRE responsibilities revolves around monitoring and analyzing the performance of your systems in production. Obviously, the particular set of tools SRE specialists have to use will differ based on the type of product or service your organization provides and the way it is developed, released, and run. 

However, there are important non-technical and essential technical skills every Site Reliability Engineer should have.

Non-technical SRE skills:

  • Business analysis
  • Teamwork
  • Problem-solving
  • Good performance under pressure
  • Great communication skills, both written and verbal
  • Fluency in technical language, since SRE experts must be able to pitch their ideas to project stakeholders to secure buy-in and funding for the project or update

Fundamental technical SRE skills:

  • In-depth knowledge of version control
  • Expert knowledge of Linux OS capabilities
  • Good understanding of DevOps concepts and best practices
  • CI/CD implementation expertise
  • Issue troubleshooting experience

Now, this person might sound like an IT rockstar and cost like one, too (we’ll discuss the SRE salary further), which raises the question of whether your organization actually needs an SRE specialist.

Do I really need to hire an SRE?

Just like with the cybersecurity of your web or mobile apps, the importance of an SRE might not seem obvious when everything works just fine. But it quickly becomes pressing when things start to go awry (and as we all know, when it rains, it pours).

Here are the four most pressing reasons for hiring an SRE:

  • To minimize or prevent downtimes of your products and services. It’s 2020; customers are used to their apps being online 100% of the time. Prolonged downtime of your offerings will mean huge financial and reputational losses.
  • To assess risks and mitigate them. A DDoS attack or a cybersecurity breach can be devastating for any business, so you should plan for contingency and prepare countermeasures in advance. The same goes for all other aspects of your IT operations.
  • To shorten development cycles. By automating software delivery and establishing CI/CD best practices, Site Reliability Engineers can reduce the development overhead and help deliver your products in a faster and more predictable way.
  • To increase revenue. With no wasted or idle resources and enough capacity to meet customer demand at peak times, you save money, and a dollar saved is a dollar earned. Add a significantly reduced risk of downtime, and you’ll see why SRE is one of the biggest revenue drivers for your business.

While you might not be Amazon, which reportedly lost about $100 million during an hour-long outage, every organization with customer-facing online systems can calculate the cost of its outages, and the figure is likely to be substantial. Thus, hiring an SRE is a vital step for future-proofing your business and ensuring its long-term survival. But this might not be as easy as hiring another Python engineer.
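As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the cost of an outage is simply lost revenue plus the labor spent on recovery. The class name, field names, and all figures are hypothetical and not taken from any real company, and reputational damage or SLA penalties are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Inputs for a rough outage-cost estimate (all values are illustrative)."""

    revenue_per_hour: float   # average revenue earned per hour while online
    outage_hours: float       # how long the service was unavailable
    engineers_involved: int   # people pulled in to restore the service
    hourly_rate: float        # loaded hourly cost per engineer

    def lost_revenue(self) -> float:
        # Revenue that was not earned while the service was down.
        return self.revenue_per_hour * self.outage_hours

    def recovery_cost(self) -> float:
        # Labor spent on restoring the service.
        return self.engineers_involved * self.hourly_rate * self.outage_hours

    def total_cost(self) -> float:
        return self.lost_revenue() + self.recovery_cost()


def yearly_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per 365-day year implied by an availability level."""
    return (1 - availability_percent / 100) * 365 * 24


if __name__ == "__main__":
    # Hypothetical inputs: 5,000 per hour in revenue, a 2-hour outage,
    # 4 engineers at 80 per hour.
    estimate = OutageEstimate(5_000.0, 2.0, 4, 80.0)
    print(f"Lost revenue:  {estimate.lost_revenue():,.2f}")
    print(f"Recovery cost: {estimate.recovery_cost():,.2f}")
    print(f"Total cost:    {estimate.total_cost():,.2f}")
    print(f"Downtime at 99.9% availability: "
          f"{yearly_downtime_hours(99.9):.2f} hours per year")
```

With these made-up inputs, the two-hour outage costs 10,640 in lost revenue and recovery labor combined. Swapping in your own numbers gives a first-order estimate to weigh against the cost of SRE expertise.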

Is it hard to find an SRE?

Naturally, this is a very competitive market, as global corporations are ready to pay six-figure salaries to avoid multi-million losses. Additionally, while any DevOps engineer can evolve into an SRE specialist with enough time and experience, truly talented SRE experts are in short supply, and most are employed either by industry leaders or by Managed Services Providers (MSPs).

Why so? Because boredom is a gruesome enemy. When an SRE expert has to cover all the needs of an organization (or establish an in-house SRE team and train it), the scope of the challenge is big and keeps them motivated. However, once the main pain points are dealt with, ongoing monitoring (while definitely needed) requires much less time and effort, and the level of SRE engagement inevitably decreases.

The solution is either working for a huge company, where SRE is an endless journey of transforming extensive infrastructure and workflows, or working for a Managed Services Provider that has multiple clients at various stages of SRE implementation. This way, SRE talent faces a constant influx of challenges and remains motivated to overcome them, gain experience, and grow professionally. Besides, MSP customers pay for SRE expertise only while they need it and gain an immediate return on their investment by optimizing their IT operations and workflows.

That said, partnering with MSPs like Relevant Software is a win-win decision for all: SRE talent gets a variety of projects, while startups and SMEs gain access to SRE expertise they wouldn’t be able to hire otherwise.

Site Reliability Engineer salary expectations

Based on Glassdoor, Statista, and other credible open data sources, here are the salary ranges for a Site Reliability Engineer.

You might be thinking, “Why is the cost of hiring an SRE in Eastern Europe two or three times lower than in the US?” The answer lies in two considerable factors: a significantly lower cost of living and a simplified taxation scheme for the IT industry, which together greatly decrease the cost of software engineering. You can still try to hire an SRE in your local area, of course.

Site Reliability Engineer job description example

Here’s what companies that actively try to hire SRE talent expect them to do:

Below are the job requirements. While these differ a bit from vacancy to vacancy, the general scope of tasks remains mostly the same: monitoring the infrastructure, designing improvements, communicating with peers and managers, etc.

Site Reliability Engineer requirements

As you can see, actual SRE job requirements and responsibilities largely fall within the framework described above: planning, implementing, and monitoring solutions that improve the resilience and performance of infrastructure based on the logs and metrics gathered in production. But how do you determine whether a candidate you’re interviewing meets these requirements?

Questions to ask an SRE in a job interview

The questions you can ask an SRE in the interview can be split into five categories:

  • Linux expertise. An SRE expert must be able to design an infrastructure from scratch, and an in-depth understanding of Linux OS capabilities is a must.
  • Cloud expertise. Whether your infrastructure is cloud-native or you only plan to move there, a thorough understanding of what the cloud is and how it works is vital for ensuring the scalability and resilience of your digital assets.
  • DevOps and CI/CD best practices. SRE is an advanced form of infrastructure management, so using DevOps best practices is the most reliable way to build resource-efficient workflows and processes.
  • Interpersonal communication skills. SRE specialists don’t work alone; they have to orchestrate the efforts of multiple team members and secure executive buy-in for their initiatives. This means the person has to be able to express their vision and prove its correctness.
  • Business analysis. All SRE initiatives should be focused on aligning IT efforts with business goals, so your SRE must demonstrate an understanding of how the implementation of a particular solution can help reach a certain business objective.

Feel free to take a look at the detailed list of SRE interview questions and answers to them. Naturally, the questions depend on the type of your organizational structure, the products and services you provide, and the approach to management, so adjust this list based on your needs.

Summary

Site Reliability Engineering is an essential aspect of successful business growth. Employing an SRE expert is vital if you wish to mitigate risks and ensure stable operations. However, hiring in-house SRE talent can pose a challenge, as the market demand for such specialists is quite high.

That said, outsourcing SRE tasks to a reliable MSP can be the best solution for companies and organizations that need results, not names in their employee roster. Finding a reliable MSP can be a challenge, yes, but at Relevant Software we can prove our professionalism and ensure the successful completion of your projects. Should you need trustworthy and effective DevOps and SRE services, contact us anytime. We’re always ready to help!
