On-call Manual: Measuring the quality of the on-call

moozzyk

Pawel Kadluczka

Posted on July 4, 2024

On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At the high level, monitoring can tell you if the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important to decide whether the current investment in keeping the on-call reasonable is sufficient.

At the more granular level, monitoring allows identifying areas that need attention the most, like:

  • noisy alerts
  • problematic dependencies
  • features causing customers’ complaints
  • repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by the customers, while backend teams may want to focus more on time spent on fixing broken builds or failing tests. Here are some metrics to consider:

  • outages of the products the team owns
  • external incidents impacting the products the team owns
  • the number of alerts, broken down by urgency
  • the number of alerts alerts acted on and ignored
  • the number of alerts outside the working hours
  • time to acknowledge alerts
  • the number of tickets opened by customers
  • the number of internal tasks
  • build breaks
  • test failures

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses quantitative (data) and qualitative metrics (opinions).

Qualitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

  • the number of alerts
  • the number of tasks
  • the number of alerts outside the working hours
  • the noisiest alerts, tracked by alert ID

Beeping, Beeping Everywhere

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

  • Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
  • Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
  • How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
  • How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
  • Additional comments (free flow)

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that it is not uncommon for a horrible shift for one person to be decent for someone else. Experienced on-calls usually rate their shifts easier than developers who just finished their first shift. This is understandable. We still treat all opinions equally—improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.


💙 If you liked this article...

I publish a weekly newsletter for software engineers who want to grow their careers. I share mistakes I’ve made and lessons I’ve learned over the past 20 years as a software engineer.

Sign up here to get articles like this delivered to your inbox:
https://www.growingdev.net/

💖 💪 🙅 🚩
moozzyk
Pawel Kadluczka

Posted on July 4, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related