SRE book notes: Eliminating Toil

bitmaybewise

Hercules Lemke Merscher

Posted on January 17, 2023


These are my notes on Chapter 5, Eliminating Toil, from the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series. The previous post can be seen here:


So what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Not every task deemed toil has all these attributes, but the more closely work matches one or more of the following descriptions, the more likely it is to be toil.

We don’t need to look far to find toil in our daily work. A very simple example of a task that can be considered toil is a (semi)manual step-by-step deployment process.

Who has never had to install a set of tools on their laptop to build and deploy a program, and then perform a series of actions in a predefined order?!

It’s the kind of thing that becomes more annoying as the company grows, and that nobody wants to do.
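To make the deployment example concrete, here is a minimal sketch of how such a predefined sequence of manual steps could be captured in a script. The step names and commands are hypothetical placeholders I made up for illustration (they only print a message); a real pipeline would call the actual build, test, and deploy tools.

```python
import subprocess
import sys

# Hypothetical steps: each entry is (name, command). The commands here
# only print a message; swap in the real build/test/deploy commands.
DEPLOY_STEPS = [
    ("build", [sys.executable, "-c", "print('building')"]),
    ("test", [sys.executable, "-c", "print('testing')"]),
    ("deploy", [sys.executable, "-c", "print('deploying')"]),
]


def run_steps(steps):
    """Run each step in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Fail fast so a broken build is never deployed.
            raise RuntimeError(f"step {name!r} failed: {result.stderr.strip()}")
        completed.append(name)
    return completed
```

Calling `run_steps(DEPLOY_STEPS)` runs the three steps in order. The point is not the script itself, but that the order of the steps lives in one place instead of in someone's memory.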


At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.

We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time.
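As a rough illustration of the 50% goal, the sketch below computes the share of time spent on toil from logged hours. The function names and the idea of splitting all time into just "toil" and "project" hours are my own simplifying assumptions, not something the book prescribes.

```python
# Assumption: the 50% goal is read as a cap on toil,
# i.e. toil should not exceed half of the tracked time.
TOIL_BUDGET = 0.5


def toil_fraction(toil_hours, project_hours):
    """Return the fraction of tracked time that was spent on toil."""
    total = toil_hours + project_hours
    if total == 0:
        return 0.0  # nothing tracked yet, so no toil to report
    return toil_hours / total


def over_budget(toil_hours, project_hours, budget=TOIL_BUDGET):
    """Return True when toil takes more than the allowed share of time."""
    return toil_fraction(toil_hours, project_hours) > budget
```

For example, 10 hours of toil and 30 hours of project work give a toil fraction of 0.25, which is within budget, while 25 hours of toil against 15 hours of project work is over it.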


Toil isn’t always and invariably bad… It’s fine in small doses, and if you’re happy with those small doses, toil is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly.


Your career progress will slow down or grind to a halt if you spend too little time on projects. Google rewards grungy work when it’s inevitable and has a big positive impact, but you can’t make a career out of grunge.

Watch out if you’re the kind of professional who happily takes on extra toil to help your colleagues. People around you will be glad that you’re handling the boring work for them, but you could be setting a trap for yourself: without anyone noticing, the toil ends up with you every time, to the point that you have no time left for the projects that would advance your career.


People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent.


Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual work and firefighting to roll out new features promptly.


Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.


New hires or transfers who joined SRE with the promise of project work will feel cheated, which is bad for morale.

Most likely, the best engineers will dodge the bullet before even joining the team by asking the right questions during the interview process, assuming you’re being frank with them.

Few engineers are motivated to jump into a big pile of toil and accept the challenge of dealing with and reducing it to acceptable levels.


This chapter resonated a lot with some experiences I had in the past. I wish I had read it earlier. At least now I have a piece of literature to recommend.

So, better to keep toil under acceptable thresholds!


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.


Photo by Tim Gouw on Unsplash
