Bringing reliability closer to you with Reliably and DataDog
Sylvain Hellegouarch
Posted on July 23, 2021
As engineers, we care about our users, or at least we ought to :) They depend on us and on our services running just fine. That is reliability in a nutshell.
Site Reliability Engineering, or SRE for short, has gained momentum as a way to codify this view of reliability. This article is not a deep dive into SRE; instead it focuses on how we can use one of its tools, Service Level Objectives (SLOs for short), to signal a loss of reliability as close to engineers as we can.
Let's say we have a web application like this one below:
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route
async def homepage(request):
    # A single endpoint returning a static JSON payload
    return JSONResponse({'hello': 'world'})

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])
Nothing fancy about it, just a Hello World example. We can run it as follows:
$ uvicorn --reload server:app
where server is the name of the Python module containing that code: server.py. The --reload flag lets us change the code and have uvicorn restart the server automatically.
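If you want to follow along, this assumes Starlette and uvicorn are installed in your environment, for instance with something like:
$ pip install starlette uvicorn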
We can access this server as follows:
$ curl localhost:8000/
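If everything is running, the response should simply be the JSON payload from our handler, something like:
{"hello":"world"}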
Now, let's run a basic load against this server using hey:
$ hey -c 3 -q 10 -z 20s http://localhost:8000/
Summary:
Total: 20.0125 secs
Slowest: 0.0164 secs
Fastest: 0.0020 secs
Average: 0.0046 secs
Requests/sec: 29.9813
Total data: 10200 bytes
Size/request: 17 bytes
Response time histogram:
0.002 [1] |
0.003 [152] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.005 [184] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.006 [195] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.008 [53] |■■■■■■■■■■■
0.009 [11] |■■
0.011 [1] |
0.012 [1] |
0.014 [0] |
0.015 [0] |
0.016 [2] |
Latency distribution:
10% in 0.0027 secs
25% in 0.0033 secs
50% in 0.0043 secs
75% in 0.0055 secs
90% in 0.0067 secs
95% in 0.0074 secs
99% in 0.0083 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0020 secs, 0.0164 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0005 secs
req write: 0.0000 secs, 0.0000 secs, 0.0002 secs
resp wait: 0.0043 secs, 0.0017 secs, 0.0161 secs
resp read: 0.0001 secs, 0.0001 secs, 0.0007 secs
Status code distribution:
[200] 600 responses
This will gently load our server without going overboard.
We likely want to monitor this server, so why not use DataDog to do so, as follows:
from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route
async def homepage(request):
    return JSONResponse({'hello': 'world'})

# Instrument Starlette so request traces are sent to the DataDog agent
patch(starlette=True)
config.starlette['service_name'] = 'my-test-service'

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])
What differs is that we are importing DataDog's ddtrace to push request traces to the local DataDog agent. The agent is started as follows in a different terminal:
$ export DD_API_KEY=...
$ export DD_SITE=datadoghq.eu
$ docker run --rm -it --name dd-agent \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-e DD_API_KEY=${DD_API_KEY} \
-e DD_SITE=${DD_SITE} \
-e DD_APM_ENABLED=true \
-e DD_APM_NON_LOCAL_TRAFFIC=true \
-p 8126:8126/tcp \
gcr.io/datadoghq/agent:latest
After a couple of minutes, you'll be able to search for metrics from this application on DataDog. Look for metrics with starlette in the name.
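One assumption worth calling out: the ddtrace package needs to be installed in the application's environment alongside Starlette and uvicorn, for instance with:
$ pip install ddtrace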
Could we now trick the application into raising errors to fake a faulty service? Why yes, of course! We simply return errors from the 4xx or 5xx classes at random from time to time:
import random
from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route
async def index(request: Request) -> JSONResponse:
    # Fail roughly 9% of the requests with a 500
    if random.random() > 0.91:
        return JSONResponse({'error': 'boom'}, status_code=500)
    return JSONResponse({'hello': 'world'})

patch(starlette=True)
config.starlette['distributed_tracing'] = True
config.starlette['service_name'] = 'my-frontend-service'

app = Starlette(debug=True, routes=[
    Route('/', index),
])
Let's see how this impacts our client now. Run our mild load again:
$ hey -c 3 -q 10 -z 20s http://localhost:8000/
Summary:
Total: 20.0120 secs
Slowest: 0.0189 secs
Fastest: 0.0018 secs
Average: 0.0051 secs
Requests/sec: 29.9820
Total data: 10142 bytes
Size/request: 16 bytes
Response time histogram:
0.002 [1] |
0.004 [146] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.005 [193] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.007 [146] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.009 [101] |■■■■■■■■■■■■■■■■■■■■■
0.010 [10] |■■
0.012 [0] |
0.014 [0] |
0.016 [0] |
0.017 [1] |
0.019 [2] |
Latency distribution:
10% in 0.0029 secs
25% in 0.0036 secs
50% in 0.0050 secs
75% in 0.0065 secs
90% in 0.0074 secs
95% in 0.0079 secs
99% in 0.0092 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0018 secs, 0.0189 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0005 secs
req write: 0.0000 secs, 0.0000 secs, 0.0002 secs
resp wait: 0.0048 secs, 0.0017 secs, 0.0187 secs
resp read: 0.0002 secs, 0.0001 secs, 0.0011 secs
Status code distribution:
[200] 542 responses
[500] 58 responses
Notice how the summary now shows that some responses were errors, as per our change above. Yay, we broke something!
Can we now ask DataDog about these recorded errors? Yes we can:
# DataDog info (change these to fit your own)
$ export DD_API_KEY=
$ export DD_APP_KEY=
$ export DD_SITE=datadoghq.eu
# your query data
$ export query="(sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count()) / (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())"
$ export from=$(date "+%s" -d "15 min ago")
$ export to=$(date "+%s")
$ curl -G -s -X GET "https://api.${DD_SITE}/api/v1/query" \
--data-urlencode "from=${from}" \
--data-urlencode "to=${to}" \
--data-urlencode "query=${query}" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: ${DD_API_KEY}" \
-H "DD-APPLICATION-KEY: ${DD_APP_KEY}" | jq .
The query we are running may look daunting but is rather straightforward: we take the total number of requests and subtract the ones that errored, then divide by the total again. This gives us the ratio of good requests as a percentage.
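For instance, plugging in the numbers from the load run above (542 good responses and 58 errors out of 600), a quick back-of-the-envelope check looks like this:
# Good-request ratio from the hey run above: (hits - errors) / hits
hits = 600    # total requests seen by the service
errors = 58   # responses in the 5xx class
ratio = (hits - errors) / hits
print(f"{ratio:.1%}")  # -> 90.3%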
Great, we now have a query we can use to create a service level objective (SLO) that will tell us how our service is doing over time. Let's use Reliably for this.
$ reliably slo init
? What is the name of the service you want to declare SLOs for? my-frontend-service
| Paste your 'numerator' (good events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
| Paste your 'denominator' (total events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())
? What is your target for this SLO (in %)? 99
? What is your observation window for this SLO? custom
? Define your custom observation window PT5M
? What is the name of this SLO? 99% of frontend responses over last 5 minutes are 2xx
SLO '99% of frontend responses over last 5 minutes are 2xx' added to Service 'my-frontend-service'
? Do you want to add another SLO? No
Service 'my-frontend-service' added
? Do you want to add another Service? No
✓ Your manifest has been saved to ./reliably.yaml
In a nutshell, we created a file that contains the definition of the SLO:
apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: 99% of frontend responses over last 5 minutes are 2xx
    service: my-frontend-service
spec:
  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
  objectivePercent: 99
  window: 5m0s
Now we can let Reliably know about it:
$ reliably slo sync
Finally, while the application is still running with some load injected into it, start fetching data from DataDog, using the queries we saw earlier, and let Reliably consolidate them over the window duration given in the objective:
$ reliably slo agent -i3
Now open a new terminal and run the following:
$ reliably slo report -w
This will show you the SLO report for your service as computed by Reliably.
So what happened exactly? Well, let's zoom in on a section of the SLO:
indicatorSelector:
  datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
  datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
The indicatorSelector property is where the magic happens. It serves two purposes:
- giving the reliably slo agent command the means to know which provider to use, here DataDog, and therefore how to fetch the required datapoints, here via the two queries. These datapoints are stored under the name of indicators on Reliably
- declaring how the objective and the indicators are mapped together
That second point is key. Indicators themselves are not declared as entities (or objects) the way objectives are. Instead, they are merely a stream of values consumed by Reliably when sent by a client (reliably slo agent or the API directly). Upon receiving an indicator, Reliably looks at its labels and matches them against the indicatorSelector of every objective (in the current organization). This tells us that objectives and indicators are loosely coupled: the fact that the reliably.yaml manifest contains the selector doesn't define the indicator, only how indicators are matched to objectives.
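To make that matching idea a bit more concrete, here is a purely illustrative sketch, in Python, of how an indicator's labels might be matched against an objective's indicatorSelector. This is not Reliably's actual code, just the gist of the label matching described above:
# Illustrative only: not Reliably's implementation, just the matching idea
def matches(indicator_labels: dict, indicator_selector: dict) -> bool:
    # An indicator feeds an objective when every key/value pair of the
    # objective's indicatorSelector is also present in the indicator's labels
    return all(indicator_labels.get(k) == v for k, v in indicator_selector.items())

# Queries shortened for readability
selector = {
    "datadog_numerator_query": "sum:trace.starlette.request.hits{...}",
    "datadog_denominator_query": "sum:trace.starlette.request.hits{...}",
}
indicator_labels = {**selector, "some_other_label": "anything"}

print(matches(indicator_labels, selector))  # True: this indicator maps to that objective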
At this stage, you have a simple declaration of a service level objective that relies on DataDog's data to compute it. Since the SLO is just a file, you can now store it alongside your code base and use it as part of your CI/CD pipeline to automate decisions about releasing. We'll see this in a future article using GitHub Actions.
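To give a rough idea of where this is going, a minimal sketch of such a workflow might look like the following. It assumes the reliably CLI is already installed and authenticated on the runner (for instance through repository secrets); the proper walkthrough will come in that future article:
# .github/workflows/slo.yaml -- hypothetical sketch, not a finished workflow
name: slo-report
on: [push]

jobs:
  slo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Installing and authenticating the reliably CLI is assumed to be
      # handled by your own setup (e.g. via repository secrets)
      - name: Display the current SLO report
        run: reliably slo report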
The code for this article can be found on GitHub.