A Case for SRE
Michael Doryumu
Posted on August 18, 2022
For the past eight weeks, I started on a journey to acquire knowledge and skills in Cloud Platform for my career in DevOps. I started with Google Cloud Platform (GCP) when I got accepted into The Google Africa Developer Scholarship powered by Andela and Pluralsight.
I have been privileged with courses like Foundations in GCP Infrastructure, Core Services, GCP architecting principles in Design and Process and Scaling and Automation, developing Google’s SRE culture, and not forgetting Security.
In my experience as an IT Systems Operations Engineer, I have worked on numerous projects ranging from infrastructure, network, software, and application systems. Furthermore, I have worked with different functional units such as other operations teams, marketing teams, development teams, infrastructure, and solution architects.
One thing I have observed is the conflicting priorities development and operations teams have in many organizations. Mostly stemming from no agreed standard of communication between these two teams. What's more, the organization’s culture and structure play a big role.
Development teams aim to work faster; agile, innovating, succeeding, and failing quickly while most operation teams work slower; keeping the system stable while focusing on reliability and consistency.
This difference in mindsets tends to put a lot of stress on the Ops teams because of increasing workloads and the effort to resolve service outages caused by some of these changes.
And that is where Site Reliability Engineering (SRE) comes in.
Site Reliability Engineering (SRE) is the practice of balancing the velocity of development features with the risk to reliability. Google’s Mission of SRE is “to protect, provide for, and progress software and systems with a consistent focus on availability, latency, performance, and capacity.”
SRE in DevOps promotes shared ownerships between Devs and Ops Teams. Effective communication is highly encouraged and maximized between both teams. Changes are rolled out effectively and productively ensuring high system reliability and assuring Service Level Objectives (SLOs) are achieved.
Developing a Google SRE culture incorporates and promotes the below technical and cultural practices within and between teams.
- Creating shared responsibility ownership using SLOs and error budgets.
- Failures are documented in blameless postmortems, which focus on systems and processes instead of people.
- Fostering psychologically safe environments is necessary for learning and innovation in organizations.
- Promoting automation of repetitive work (toil) as excessive toil is toxic to the SRE role.
Understanding SRE practices and norms will help you build a common language to use when speaking with your IT teams and support your organization’s adoption of SRE both in the short and long term.
For a comprehensive overview of the key differences between DevOps and SREs, this Spacelift article [https://spacelift.io/blog/sre-vs-devops] also explains what problems DevOps and SRE teams solve, ending with a quick list of tools used by both groups.
Posted on August 18, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.