Companies like Amazon, Google, or Microsoft lose substantial revenue for every minute their systems are down, so they had to find a way to ensure redundancy, fault tolerance, and an uninterrupted customer experience. The answer to this challenge is Site Reliability Engineering (SRE). What is SRE, you ask? Read on to get the gist of what SRE is about, how to implement it in your organization, and how to build a reliable product using SRE.
The idea of SRE was introduced by Ben Treynor, VP of Engineering at Google, and is described in the book “Site Reliability Engineering: How Google Runs Production Systems.” You’re welcome to read it (several times), but who has the time? So, we decided to provide you with the gist of the SRE methodology in this article and describe its basic principles and applications.
According to Mr. Treynor, “SRE is what happens when you ask a software engineer to design an operations function.” Here’s what he meant: SRE is a paradigm of building systems in a way that maximizes reliability, tracking the results, and constantly adjusting and improving the workflow. It is vital for decreasing the number of incidents in production and implementing prescriptions and procedures that result in measurable performance improvements.
By putting engineers in direct contact with their software in production, along with feedback from users and colleagues, SRE provides invaluable input and incentive for continuously improving the quality of software and the infrastructure that runs it.
If you still can’t wrap your head around what SRE is, let’s see how SRE is implemented at Google.
At Google, SRE is a set of practices, workflows, and policies aimed at setting service reliability goals, assessing efficiency, and improving services as needed. Now that sounds like a plan you can attach KPIs to, doesn’t it?
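To make "a plan you can attach KPIs to" a bit more concrete, here is a minimal sketch of how a service reliability goal could be turned into a number you can track. The `ReliabilityGoal` class, the 99.9% target, and the 30-day window are illustrative assumptions, not values taken from Google or from this article.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ReliabilityGoal:
    """A hypothetical availability target measured over a fixed window."""

    target: float     # e.g. 0.999 means 99.9% availability (assumed value)
    window_days: int  # length of the measurement window in days

    def allowed_downtime_minutes(self) -> float:
        """Total downtime the goal tolerates over the whole window."""
        return (1.0 - self.target) * self.window_days * MINUTES_PER_DAY

    def is_met(self, downtime_minutes: float) -> bool:
        """True if the observed downtime stays within the tolerated amount."""
        return downtime_minutes <= self.allowed_downtime_minutes()


def measured_availability(downtime_minutes: float, window_days: int) -> float:
    """Fraction of the window during which the service was up."""
    total_minutes = window_days * MINUTES_PER_DAY
    return (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    goal = ReliabilityGoal(target=0.999, window_days=30)
    print(f"Allowed downtime: {goal.allowed_downtime_minutes():.1f} minutes")
    print(f"Goal met with 30 minutes of downtime: {goal.is_met(30)}")
    print(f"Goal met with 60 minutes of downtime: {goal.is_met(60)}")
```

Under these assumptions, a team would compare the downtime it actually observed against the tolerated amount and decide whether the service needs improvement before more changes are shipped.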
The need for SRE arose because Google had to constantly update its vast array of products and services while ensuring their uninterrupted availability. Developers wanted to push updates to production, while Ops engineers wanted to have as few issues as possible. This created a conflict that resulted in everlasting debates and attempts to sneak around the processes.
This was when Mr. Treynor suggested a series of steps that later formed the basis of the SRE methodology:
These basics created a win-win incentive for both teams to write better code and deliver better services since they were the ones responsible for running them.
At this point, you might think SRE is like DevOps. Nope.
DevOps is the culture of automating repetitive software delivery processes to minimize the risk of human error and the effort needed to consistently provide products and services. It is also a mindset of collaboration between developers and the Ops team to make Ops engineers the final judges on all decisions (because they have to deal with the production environment most of the time). Implementing DevOps means shorter development time, fewer bugs, automated updates and rollbacks, and more.
Thus, while DevOps is centered around automating repetitive operations to minimize the routine and maximize the performance, SRE is centered around ironing out inconsistencies in the infrastructure and workflows to ensure the reliability of services.
SRE implementation provides significant benefits:
Now that you know the history of SRE and why it’s not the same as DevOps, let’s take a look at the basic SRE principles you need to put in place at the beginning of your SRE journey:
The list is not exhaustive, so feel free to add the things that work for you.
Seems too good to be true? Well, it worked for Google and many industry-leading companies across the globe, but you have to understand what SRE team composition you need based on your market niche and stage of operational maturity.
The typical roles in an SRE team are:
These roles are flexible and can grow into or replace one another based on your scope of tasks. The responsibilities SRE teams perform mostly depend on the stage of your SRE journey.
Here are the basic prerequisites for a productive SRE team:
While the beginner-level SRE scope of tasks might seem a bit overwhelming, all it actually takes is dedicating some time and effort to establishing procedures.
Intermediate SRE teams are more mature and begin to take a proactive approach, trying to solve issues before they arise. Here are their prerequisites:
At this stage, the results of the SRE project become tangible, and detailed monitoring is needed to lay the groundwork for lasting company-wide success.
These results can be observed in mature SRE teams that are done with infrastructure redesign and concentrate now on maximizing positive outcomes of all IT-related business processes. Here’s what makes an advanced SRE team:
There are three primary ways of SRE implementation, as you can see in the image below. The SRE team can work in tandem with product teams, be spread among them, or be a separate centralized unit.
Before you begin implementing SRE, allocate some time from all the parties involved. Gather and discuss the best approach to implementing SRE based on your business specifics. At least one SRE advocate has to be present and able to answer questions, or the project might not even take off.
This is an easy way to start your SRE journey without much investment and organizational change. Plus, it helps test out different SRE models and choose the best fit. On the other hand, it will take you some effort to free up time for many professionals to gather.
Speaking of SRE implementation models, here are six of them.
In this model, a single SRE team must cover all processes in the organization. It is the most widely used approach, and it allows the team to grow organically along with the business.
Pros:
Cons:
Best used: In smaller companies with a single or a couple of products and one or two customer journeys. In this case, the SRE needs are present, but the scope is not enough to justify more than a single dedicated SRE team. This is the approach taken by most technology providers like Relevant Software. It covers all customer needs and provides end-to-end SRE services.
Such SRE teams dedicate their effort to improving the reliability of a single mission-critical product or application at a time.
Pros:
Cons:
Best used: By large companies that cannot cover the needs of all their products/services with a single SRE team.
Just like the DevOps teams, the infrastructure SRE teams are centered around improving the job quality and performance of the rest of your business. Through automating repetitive actions and removing structural and procedural bottlenecks, such teams speed up software delivery.
Pros:
Cons:
Best used: In larger companies with several separate development teams, as they will need common standards to unify processes across the board. The DevOps team will handle CI/CD, testing automation, and product releases, while the SRE team should ensure reliability.
Such SRE teams mostly concentrate on creating tools and features that help their fellow developers be most productive. However, tool-centered SRE teams lack direct contact with customer-facing reliability issues and might begin solving irrelevant problems. So, they have to put in a lot of effort to stay in the loop.
This approach is very similar to the Infrastructure one, so the same pros and cons apply. However, this one does have a couple more drawbacks:
Best used: By any company in need of software tools not readily available through DevOps or SaaS platforms.
When SRE specialists are embedded within development teams, they usually perform hands-on work like changing environment configurations to ensure maximum performance at every step of the SDLC journey.
Pros:
Cons:
Best used: When starting an SRE journey to empower adoption and speed up transformation. However, this is a time-limited approach that must later be replaced with other models.
While being quite similar to the Embedded model, the Consulting SRE approach tends to avoid actively changing the existing code and infrastructure configuration. Instead, such specialists build tools that complement the existing processes.
Pros:
Cons:
Best used: Before beginning your SRE implementation to get a grip on SRE best practices. Alternatively, when your company is too large to cater to all its operational needs using only the in-house SRE potential.
There are two standard practices that have a significant impact, regardless of the SRE model chosen:
This way, issues can escalate to appropriate levels of SRE seniority and be solved quickly and efficiently.
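As a rough illustration of escalating by seniority, the sketch below picks who should be paged based on how long an incident has gone unacknowledged. The tier names, the time thresholds, and the `tier_to_page` helper are all hypothetical; the article does not prescribe a specific policy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationTier:
    """One hypothetical level of an escalation chain."""

    name: str
    after_minutes: int  # minutes unacknowledged before this tier is paged


# Assumed example policy; real thresholds depend on the organization.
DEFAULT_TIERS = (
    EscalationTier("on-call engineer", 0),
    EscalationTier("senior SRE", 15),
    EscalationTier("SRE lead", 45),
)


def tier_to_page(
    minutes_unacknowledged: int,
    tiers: Sequence[EscalationTier] = DEFAULT_TIERS,
) -> EscalationTier:
    """Return the most senior tier whose threshold has already been reached."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    eligible = [t for t in tiers if t.after_minutes <= minutes_unacknowledged]
    if not eligible:
        raise ValueError("no tier covers this incident age")
    return max(eligible, key=lambda t: t.after_minutes)


if __name__ == "__main__":
    for minutes in (0, 20, 60):
        print(f"{minutes} min unacknowledged: page {tier_to_page(minutes).name}")
```

Keeping the policy as plain data makes it easy to review and adjust the thresholds without touching the selection logic.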
We hope this article gave you an understanding of what SRE is, what SRE’s basic principles are, how you can implement SRE in your organization, and how to use it to build a reliable product or service. While initially developed by Google, these recommendations can be applied to any infrastructure and workflow.
Once you decide on the SRE model that fits your business needs best, you’re free to build an internal SRE team or outsource this task to a trustworthy technology partner like Relevant Software. We have ample expertise in supporting large-scale systems using DevOps and SRE best practices and would love to help you out. Contact us today!