Make your lambda fail! Use Chaos Engineering to improve the resiliency of your serverless application.
Davide de Paolis
Posted on November 12, 2020
Resilience is the ability to recover from or adjust easily to adversity or change.
Generally speaking, in software engineering resilience means that your system or application is able to withstand a failure in one of its components, keep working nevertheless, and eventually recover in a reasonable manner and within an acceptable time.
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. wikipedia
Ok... that does not add much to our understanding:
What does experimenting on a system mean? How do we experiment?
Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes.
Principle of chaos
That's really the key aspect. Each component might work very well on its own. We have unit tests, integration tests, E2E tests, but the truth is that, especially with a distributed system, it can be very hard to predict the outcome of a failing interaction.
Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
Principle of chaos
Ah! got it!
- It's very hard to anticipate disruptive events,
- distributed systems are complicated because of all those moving parts,
- it's inevitable that code gets messy over time
That's why it's Chaos Engineering!
Well... Not really!
What is Chaos Engineering really about?
Chaos engineering is all about asking: “What if?” (Emrah Samdan)
- What if the 3rd party API my Lambda is relying on is very slow or completely unreachable?
- What if our DynamoDB Table throughput is exceeded?
- What if one of our Lambdas gets throttled?
- What if some uncaught exception occurs?
If you have been in software engineering long enough, you know that ANYTHING CAN HAPPEN.
Therefore, we need to identify weaknesses before they manifest in system-wide, aberrant behaviors. We need to be prepared, we need to build confidence about what will happen and how our system will react.
Some may say that Chaos Engineering is about breaking things on purpose. And to some extent it is true. Because if we can add some chaos to the system and manage to break it, we can then find ways to fix it, and/or at least try to minimize the blast radius of any disruptive event.
What does blast radius even mean? It means: if something stops working - and we can't do much about it - at least handle the error gracefully, have a fallback ready, contain that event to the smallest amount of components possible. Don't let that event crash your entire application!
5 little steps to bring order to chaos
In order to do that we need to:
- define our application's steady state (how it is supposed to behave when everything works fine)
- formulate a hypothesis about that steady state in both the control and the experimental group
- inject realistic failures (throw in some chaos)
- observe the results (what happened, compared to what we thought was going to happen)
- adjust our code base/infrastructure as necessary (use the results to act on the system so that it becomes resilient) - a tiny sketch of one such experiment follows this list.
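To make those steps a bit more concrete, here is a minimal, hypothetical sketch of one experiment in Node.js: the steady state is "our API answers its health check in under two seconds", and we check that hypothesis while a failure is being injected. The endpoint and the threshold are made up for illustration.

// experiment.js - hypothetical steady-state check (URL and threshold are examples)
const assert = require('assert')
const https = require('https')

// measure how long a GET request takes, in milliseconds
const timedGet = (url) =>
  new Promise((resolve, reject) => {
    const start = Date.now()
    https
      .get(url, (res) => {
        res.resume() // drain the body, we only care about the timing
        res.on('end', () => resolve(Date.now() - start))
      })
      .on('error', reject)
  })

;(async () => {
  // hypothesis: even while latency is injected downstream, we stay under 2s
  const elapsed = await timedGet('https://api.example.com/health')
  assert(elapsed < 2000, `steady state violated: health check took ${elapsed}ms`)
  console.log(`steady state holds: ${elapsed}ms`)
})()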
There are a lot of tools out there to run Chaos Engineering experiments and monitor them - one of which is Gremlin - but in some cases, especially with serverless and Lambda, it is not so easy to orchestrate such tests.
What we started using in our applications is Failure-Lambda, an npm module that lets you easily inject some chaos into your lambda and allows you to test how your serverless system reacts to it.
We found it very easy to prove how our lambda behaves with failures that do not really depend on our code.
Because, sure, you might have handled all the edge cases and exceptions in your code: null pointers, DynamoDB conditional-write rejections, and any other business logic errors. But there is still a lot that is not really under your control: 3rd party APIs, latency, disk space, and timeouts.
These issues might be very rare, but they will inevitably happen. So we must be prepared.
Do we have a proper retry mechanism, do we have Dead Letter Queues? Is our Lambda idempotent?
With failure-lambda you can easily test how a failure in your Lambda could affect your system, and Gunnar has recently also published a repo that clearly shows how to use his module and what its effects are on a sample serverless application.
Usage is very simple: just wrap your handler with it and configure the type and frequency of chaos you want via an SSM parameter.
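Wrapping the handler is literally a one-liner. A minimal sketch (the business logic here is just a placeholder; as far as I remember, the module looks up the name of your SSM parameter in a FAILURE_INJECTION_PARAM environment variable on the function):

// handler.js - minimal sketch of a handler wrapped with failure-lambda
const failureLambda = require('failure-lambda')

exports.handler = failureLambda(async (event, context) => {
  // ...your actual business logic lives here...
  return { statusCode: 200, body: JSON.stringify({ ok: true }) }
})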
// in failure-config.json
{
  "isEnabled": true,
  "failureMode": "latency",
  "rate": 1,
  "minLatency": 3000,
  "maxLatency": 10000,
  "exceptionMsg": "Everything is broken",
  "statusCode": 404,
  "diskSpace": 100,
  "denylist": ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"]
}
// from the CLI
aws ssm put-parameter --name your-failure-injection-config \
  --value "$(cat failure-config.json)" --type String --overwrite
Do you want to block a specific 3rd party API? Add it to the denylist and set failureMode to denylist.
Do you want to throw in some random latency issues? Set failureMode to latency (and specify how long via minLatency and maxLatency).
Do you want to make everything break with a custom error? Set failureMode to exception and put your message in exceptionMsg.
Then watch hell break loose in your application (hopefully not).
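Switching between these modes is just a matter of rewriting the SSM parameter. If you have jq installed, something along these lines should do it (same parameter name as in the example above):

aws ssm put-parameter --name your-failure-injection-config --type String --overwrite \
  --value "$(aws ssm get-parameter --name your-failure-injection-config \
      --query Parameter.Value --output text | jq '.failureMode = "denylist"')"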
Some further ideas and improvements
This module is awesome and very useful, but we found a couple of limitations (for which, if I find the time, I might open a PR):
- configuring failure-lambda for local use: I wanted to write some integration tests dealing with failure scenarios, but I had to rely on the configuration in SSM, and that was not very handy. Being able to use a .env file instead of SSM would mean I could write integration tests that run on CI, or locally as a git hook when pushing.
- every single Lambda invocation requests the failure injection configuration from SSM; no caching is defined (which is somewhat good, because we don't want our Lambda to cache the activated failure for 15 minutes when we want to shut down the experiments!). But if you actually deploy the Lambda with the failure wrapper, you end up paying for all those SSM requests, or you have to redeploy the Lambda entirely without the wrapper. To avoid changing the code and redeploying every time, we resorted to wrapping the wrapper and adding it only if an environment variable on the Lambda is configured to run the failure injection:
const failureLambda = require('failure-lambda')

const failureWrapper = (fn) => {
  if (process.env.CHAOS_MODULE_AVAILABLE === 'true') {
    console.log(
      'FailureInjectionModule is available to inject some chaos - configuration comes from SSM MyFailureLambda Parameter'
    )
    return failureLambda(fn)
  }
  return fn
}

// handlerLogic is your actual handler function, defined elsewhere in this file
module.exports.handler = failureWrapper(handlerLogic)
We still need to edit the environment variable, but I find that quicker and safer than editing the main handler code and redeploying.
To edit the Lambda environment variable and activate the failure module (so that it reads from SSM), you can use the CLI:
aws lambda update-function-configuration --function-name my-lambda-under-test --environment <env-as-json>
But be careful: unfortunately, this command replaces all the environment variables, so you must always pass every single variable you have, not just the one you want to change.
Either use the CLI to get the current configuration, edit it and update the Lambda, or write a small Node script that combines all those steps (or just use this module).
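A rough sketch of what such a Node script could look like, using the aws-sdk v2 package (the function name and the flag match the examples in this post; the rest is illustrative, not a polished tool):

// toggle-chaos.js - sketch: merge the current env vars and flip only our flag
const AWS = require('aws-sdk')
const lambda = new AWS.Lambda()

const FUNCTION_NAME = 'my-lambda-under-test'

async function toggleChaos (enabled) {
  // fetch the full current configuration, so we don't lose any other variable
  const { Environment } = await lambda
    .getFunctionConfiguration({ FunctionName: FUNCTION_NAME })
    .promise()

  const Variables = {
    ...(Environment ? Environment.Variables : {}),
    CHAOS_MODULE_AVAILABLE: String(enabled)
  }

  // update-function-configuration replaces the whole Environment,
  // which is exactly why we merged first
  await lambda
    .updateFunctionConfiguration({
      FunctionName: FUNCTION_NAME,
      Environment: { Variables }
    })
    .promise()
}

toggleChaos(process.argv[2] === 'on').catch(console.error)

Run it with node toggle-chaos.js on (or off) before and after your experiments.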
For now, let's just edit the value of the variable in the console. Changing our flag CHAOS_MODULE_AVAILABLE from false to true will let us run our experiments - provided we configured our SSM parameter.
Also, one small thing to note about the rate configuration, especially with regard to denylist: the rate is applied to the Lambda invocations, not to the individual requests to the denied URL. (I was running some tests on a bunch of records from SQS and expected that only 50% of the requests to an API - one per record - would be blocked, but they always all worked, or all failed... facepalm.)
Also worth mentioning, in order to run some experiments and write tests that mock failing behaviour, is [mitm](https://www.npmjs.com/package/mitm), which is used internally by failure-lambda for the denylist feature. MITM stands for man-in-the-middle: it is a library that lets you intercept and mock outgoing TCP and HTTP connections, which adds extra potential to your testing capabilities.
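As a small example of what that enables (based on mitm's documented API; the URL below is just an illustration), you can make every outgoing HTTP call fail from inside a test:

// fail-outgoing-http.test.js - minimal sketch using the mitm package
const Mitm = require('mitm')
const http = require('http')

const mitm = Mitm() // from now on, outgoing connections are intercepted

// answer every intercepted HTTP request with a 503, as if the 3rd party were down
mitm.on('request', (req, res) => {
  res.statusCode = 503
  res.end('Service Unavailable')
})

// any code under test that calls out over HTTP now hits our fake failure
http.get('http://api.example.com/resource', (res) => {
  console.log('got status:', res.statusCode) // 503
  mitm.disable() // stop intercepting once the experiment is over
})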
Hope it helps
Photo by Moritz Mentges on Unsplash