How to track your product's SLO/ErrorBudget: A simple tool to keep track of things!

Roshan Shetty

Posted on April 12, 2021


Today, most organizations track their product SLOs to avoid being liable for a breach of their SLAs (service level agreements). In case of an SLO violation, they may be under an obligation to pay something in return for the breach of contract. Once the SLO for a product has been defined, a corresponding error budget is calculated from that number. For example, if the SLO is 99.99%, the error budget is 52.56 minutes per year (525,600 minutes in a year × 0.01%). That's the amount of downtime the product may have in a year without breaching the SLO.

Once companies agree on the SLO, they need to pick the most relevant SLIs (service level indicators). Any violation of these SLIs is counted as downtime, and the duration of the violation is deducted from the error budget. For example, a payment gateway product might have the following SLIs (a sketch of matching PromQL queries follows the list).

  • Latency at p95 for requests
  • Error rates
  • Payment failures, etc.
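
In Prometheus, SLIs like these are usually expressed as queries. A minimal sketch, assuming hypothetical http_request_duration_seconds and http_requests_total metrics with a status label (your metric and label names will differ):

    # p95 request latency over the last 5 minutes
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    # error rate: share of 5xx responses among all responses
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))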

Additional reading:

https://sre.google/workbook/implementing-slos/
https://sre.google/workbook/error-budget-policy/

Why is it challenging for many companies to track error budgets at the moment?

Usually, organizations use a mix of tools to monitor and track these SLIs (for example, latency-related SLIs are generally tracked in an APM such as New Relic, while other SLIs are tracked in monitoring tools such as Prometheus or Datadog). That makes it hard to keep track of the error budget in one centralized location.

Sometimes companies have a very short retention period (<6 months) for their metrics in Prometheus. Retaining metrics for a longer period may require setting up Thanos or Cortex, writing federation rules, and doing capacity planning for metrics storage.
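
For reference, local retention in Prometheus itself is controlled by a startup flag (the default is 15 days). A minimal sketch for keeping roughly six months of data:

    prometheus --storage.tsdb.retention.time=180d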

Next comes the problem of false positives. Even if you are tracking something in Prometheus, it's hard to flag an event as a false positive when the incident is not a genuine SLO violation. Building an efficient, battle-tested monitoring platform takes time, and initially teams might end up with a lot of false positives. You may want to mark some old violations as false positives to get those minutes back into your error budget.

What does the SLO tracker do?

This error budget tracker aims to provide a simple and effective way to keep track of the error budget and burn rate without the hassle of configuring and aggregating multiple data sources.

  • Users first set up their target SLO, and the error budget is allocated based on that.
  • It currently supports webhook integrations with a few monitoring tools (Prometheus, Pingdom, and New Relic); whenever it receives an incident from one of these tools, it deducts from the error budget. (A trimmed example of such a payload appears after this list.)
  • If a violation is not caught by your monitoring tool, or if this tool doesn't integrate with your monitoring tool, the incident can be reported manually through the user interface.
  • It provides some analytics on the SLO violation distribution (SLI distribution graph).
  • It doesn't require much storage space, since it only stores violations, not every metric.
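
For context, the Prometheus integration receives Alertmanager's standard webhook payload. A trimmed sketch of such a POST body (the values are illustrative, and several standard fields are omitted):

    {
      "status": "firing",
      "alerts": [
        {
          "status": "firing",
          "labels": { "alertname": "NginxLatencyHigh", "severity": "critical" },
          "annotations": { "summary": "Nginx latency high" },
          "startsAt": "2021-04-12T10:00:00Z",
          "endsAt": "0001-01-01T00:00:00Z"
        }
      ]
    }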

How to set this up?

  • Clone the repo.
  • The repo already has a docker-compose file, so just run docker-compose up -d and your setup is done!
  • The default credentials are admin:admin. They can be changed in docker-compose.yaml.
  • Now set an SLO target in the UI.
  • To integrate this tool with your monitoring tools, use the following webhook URLs (a curl smoke test is sketched at the end of this section):
    • For Prometheus: serverip:8080/webhook/prometheus
    • For New Relic: serverip:8080/webhook/newrelic
    • For Pingdom: serverip:8080/webhook/pingdom
  • Now set up rules to monitor your SLIs in your monitoring tool (let's see how this can be done in Prometheus). A Prometheus alerting rule to monitor an example SLI, Nginx p99 latency:
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Next, configure Alertmanager to route these alerts to the SLO tracker based on the tags set in your checks:


  global:
    resolve_timeout: 10m
  route:
    receiver: blackhole            # default receiver (required at the root)
    routes:
    - receiver: slo-tracker
      group_wait: 10s
      match_re:
        severity: critical
      continue: true               # keep evaluating routes after notifying slo-tracker
    - receiver: blackhole
  receivers:
  - name: 'slo-tracker'
    webhook_configs:
    - url: 'http://ENTERIP:8080/webhook/prometheus'
      send_resolved: true
  - name: 'blackhole'
  • Use different match labels if you don't want to route alerts based on the severity tag.
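
To verify the wiring before real alerts fire, you can also post a hand-crafted Alertmanager-style payload straight to the webhook endpoint. A sketch, assuming the tracker is reachable at ENTERIP:8080; exactly which fields the tracker reads is an assumption here:

    # send a fake "firing" alert to the Prometheus webhook endpoint
    curl -X POST http://ENTERIP:8080/webhook/prometheus \
      -H 'Content-Type: application/json' \
      -d '{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NginxLatencyHigh","severity":"critical"},"annotations":{"summary":"test violation"},"startsAt":"2021-04-12T10:00:00Z"}]}'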

What’s next:

  • Add a few more monitoring tool integrations
  • Track multiple products' SLOs
  • Add more graphs for analytics
  • Better visualizations to pinpoint problematic services

This project is open source. Feel free to open a PR or raise an issue :)

If you would like to see the dashboard, please check it out!
(The credentials are admin:admin. Also, please use a laptop to open this web app; it's not mobile-friendly yet.)
