Bringing reliability closer to you with Reliably and DataDog

Sylvain Hellegouarch

Posted on July 23, 2021

As engineers we care about our users, or at least we ought to :) They depend on us and on our services running just fine. This is reliability in a nutshell.

Site Reliability Engineering, or SRE if you're casual, has gained momentum as a way to codify this view on reliability. This article is not about detailing SRE but about how we can use one of its tools, Service Level Objectives (SLOs for short), to signal a loss of reliability as close to engineers as possible.

Let's say we have a web application like this one below:

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

async def homepage(request):
    return JSONResponse({'hello': 'world'})

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])

Nothing fancy about it, just a Hello World example. We run it as follows:

$ uvicorn --reload server:app

where server is the name of the Python module containing that code: server.py. The --reload flag lets uvicorn restart the server automatically whenever we change the code.
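
If you don't have them installed yet, Starlette and uvicorn are both available from PyPI; assuming a recent Python 3 environment, something like this will do:

$ pip install starlette uvicorn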

We can access this server as follows:

$ curl localhost:8000/
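
If everything is wired up, the response is the JSON payload built by homepage:

{"hello":"world"}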

Now, let's run a basic load against this server using hey:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:    20.0125 secs
  Slowest:  0.0164 secs
  Fastest:  0.0020 secs
  Average:  0.0046 secs
  Requests/sec: 29.9813

  Total data:   10200 bytes
  Size/request: 17 bytes

Response time histogram:
  0.002 [1] |
  0.003 [152]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [184]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [195]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.008 [53]    |■■■■■■■■■■■
  0.009 [11]    |■■
  0.011 [1] |
  0.012 [1] |
  0.014 [0] |
  0.015 [0] |
  0.016 [2] |


Latency distribution:
  10% in 0.0027 secs
  25% in 0.0033 secs
  50% in 0.0043 secs
  75% in 0.0055 secs
  90% in 0.0067 secs
  95% in 0.0074 secs
  99% in 0.0083 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0020 secs, 0.0164 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.0043 secs, 0.0017 secs, 0.0161 secs
  resp read:    0.0001 secs, 0.0001 secs, 0.0007 secs

Status code distribution:
  [200] 600 responses

This will gently load our server without going overboard.

We likely want to monitor this server, so why not use DataDog to do so, as follows:

from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


async def homepage(request):
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['service_name'] = 'my-test-service'

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])
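
This version additionally needs the ddtrace package, which is also installed from PyPI:

$ pip install ddtrace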

What differs is that we are importing DataDog's ddtrace to push request traces to the local DataDog agent (by default, ddtrace sends traces to the agent on localhost port 8126, the port exposed by the container below). The agent is started as follows in a different terminal:

$ export DD_API_KEY=...
$ export DD_SITE=datadoghq.eu

$ docker run --rm -it --name dd-agent \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-e DD_API_KEY=${DD_API_KEY} \
-e DD_SITE=${DD_SITE} \
-e DD_APM_ENABLED=true \
-e DD_APM_NON_LOCAL_TRAFFIC=true \
-p 8126:8126/tcp \
gcr.io/datadoghq/agent:latest

After a couple of minutes, you'll be able to search for metrics from this application on DataDog. Look for metrics with starlette in the name.

Could we now trick the application into raising odd errors to fake a faulty service? Why yes, of course! Simply by returning a 4xx or 5xx response at random from time to time:

import random


from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route


async def index(request: Request) -> JSONResponse:
    if random.random() > 0.91:
        return JSONResponse({'error': 'boom'}, status_code=500)
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['distributed_tracing'] = True
config.starlette['service_name'] = 'my-frontend-service'

app = Starlette(debug=True, routes=[
    Route('/', index),
])

With random.random() > 0.91, roughly 9% of requests should now fail with a 500. Let's see how this impacts our client; run the mild load again:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:    20.0120 secs
  Slowest:  0.0189 secs
  Fastest:  0.0018 secs
  Average:  0.0051 secs
  Requests/sec: 29.9820

  Total data:   10142 bytes
  Size/request: 16 bytes

Response time histogram:
  0.002 [1] |
  0.004 [146]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [193]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.007 [146]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.009 [101]   |■■■■■■■■■■■■■■■■■■■■■
  0.010 [10]    |■■
  0.012 [0] |
  0.014 [0] |
  0.016 [0] |
  0.017 [1] |
  0.019 [2] |


Latency distribution:
  10% in 0.0029 secs
  25% in 0.0036 secs
  50% in 0.0050 secs
  75% in 0.0065 secs
  90% in 0.0074 secs
  95% in 0.0079 secs
  99% in 0.0092 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0018 secs, 0.0189 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.0048 secs, 0.0017 secs, 0.0187 secs
  resp read:    0.0002 secs, 0.0001 secs, 0.0011 secs

Status code distribution:
  [200] 542 responses
  [500] 58 responses

Notice how the summary now shows that some responses were errors, as per our change above: 58 errors out of 600 requests, which is roughly the failure rate we injected. Yay, we broke something!

Can we now ask DataDog about these recorded errors? Yes we can:

# datadog info (change these to fit your own)
$ export DD_API_KEY=
$ export DD_APP_KEY=
$ export DD_SITE=datadoghq.eu

# your query data
$ export query="(sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count()) / (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())"
$ export from=$(date "+%s" -d "15 min ago")
$ export to=$(date "+%s")

$ curl -G -s -X GET "https://api.${DD_SITE}/api/v1/query" \
--data-urlencode "from=${from}" \
--data-urlencode "to=${to}" \
--data-urlencode "query=${query}" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: ${DD_API_KEY}" \
-H "DD-APPLICATION-KEY: ${DD_APP_KEY}" | jq .
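
The interesting part of the JSON that comes back is its series array, where each entry carries a pointlist of timestamp/value pairs for the query. Abbreviated, and with purely illustrative values, it looks something like this:

{
  "status": "ok",
  "series": [
    {
      "pointlist": [
        [1627030800000.0, 0.90],
        [1627030860000.0, 0.92]
      ]
    }
  ]
}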

The query we are running may look daunting but it is rather straightforward: we take the total number of requests and subtract the ones that were in error, then divide by the total again, which gives us the ratio of good requests.
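
To make that arithmetic concrete, here is the same computation applied to the numbers from the second load run, as a plain Python sketch:

# numbers taken from the hey status code distribution above
hits = 600    # total requests sent by hey
errors = 58   # responses that came back with a 500
good = hits - errors      # 542 good requests
ratio = good / hits
print(f"{ratio:.2%}")     # roughly 90.33%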

Great, we now have a query we can use to create a service level objective (SLO) that will tell us how our service is doing over time. Let's use Reliably for this.

$ reliably slo init
? What is the name of the service you want to declare SLOs for? my-frontend-service
| Paste your 'numerator' (good events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
| Paste your 'denominator' (total events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())
? What is your target for this SLO (in %)? 99
? What is your observation window for this SLO? custom
? Define your custom observation window PT5M

? What is the name of this SLO? 99% of frontend responses over last 5 minutes are 2xx
SLO '99% of frontend responses over last 5 minutes are 2xx' added to Service 'my-frontend-service'

? Do you want to add another SLO? No
Service 'my-frontend-service' added

? Do you want to add another Service? No

✓ Your manifest has been saved to ./reliably.yaml

In a nutshell, we created a file that contains the definition of the SLO:

apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: 99% of frontend responses over last 5 minutes are 2xx
    service: my-frontend-service
spec:
  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
  objectivePercent: 99
  window: 5m0s

Now we can let Reliably know about it:

$ reliably slo sync

Finally, while the application is still running with some load injected into it, start fetching data from DataDog using the queries we saw earlier, and let Reliably consolidate them over the window duration given in the objective:

$ reliably slo agent -i3

Now open a new terminal and run the following:

$ reliably slo report -w

This will show you the SLO report for your service as computed by Reliably.

So what happened exactly? Well, let's zoom in on a section of the SLO:

  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())

The indicatorSelector property is where the magic happens. Its entries are used for the following purposes:

  • giving the reliably slo agent command the means to know which provider to use, here DataDog, and therefore how to fetch the required datapoints, here via the two queries. These datapoints are stored under the name of indicators on Reliably
  • declaring how the objective and the indicators are mapped together

That second point is key. Indicators themselves are not declared as entities (or objects) the way objectives are. Instead, they are merely a stream of values consumed by Reliably when sent by a client (reliably slo agent or the API directly). Upon receiving an indicator, Reliably looks at its labels and matches them against the indicatorSelector of any objective (in the current organization). This tells us that objectives and indicators are loosely coupled: the fact that the reliably.yaml manifest contains the selector doesn't define the indicator, only how to match indicators to objectives.
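
Conceptually, you can picture that matching as a plain label comparison. The snippet below is only a simplified sketch of the idea, not Reliably's actual implementation:

def selector_matches(indicator_labels: dict, indicator_selector: dict) -> bool:
    # Simplified illustration: an incoming indicator is attached to an
    # objective when every key/value pair of the objective's
    # indicatorSelector is present in the indicator's labels.
    return all(indicator_labels.get(key) == value
               for key, value in indicator_selector.items())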

At this stage, you have a simple declaration of a service level objective that relies on DataDog's data to compute it. Since the SLO is just a file, you can store it alongside your code base and use it as part of your CI/CD pipeline to automate decisions about releasing. We'll see this in a future article using GitHub Actions.

The code for this article can be found on GitHub.
