Circuit Breaker Solution for AWS Lambda Functions
chgerkens
Posted on January 13, 2021
CloudWatch Metrics and Alarms can be used to add circuit breaker functionality to AWS Lambda functions that are triggered by SQS messages in a non-intrusive and cost-effective way.
You can protect overwhelmed downstream services without the need to make code changes, replay messages from dead letter queues or increase operating costs significantly.
Introduction
A serverless architecture frees you from the responsibility of ensuring that your application scales rapidly with increasing demand and stays available even when underlying infrastructure components fail. But as soon as your application calls external APIs, whether third-party services hosted elsewhere or managed (non-serverless) AWS services, that ideal world crumbles. You are confronted with increasing latency, long-running calls and rising error rates.
A couple of well-known stability patterns exist, such as Timeouts, Bulkheads, Decoupling Middleware and Circuit Breaker (published in Michael T. Nygard’s book Release It!). In the context of Serverless on AWS, you can configure timeouts for Lambda functions and decouple your application from external APIs by putting a message queue like SQS in front of your single-purpose Lambda functions (Bulkheads). But there is currently no straightforward approach to apply a circuit breaker to AWS Lambda functions.
If your message-processing Lambda functions start to fail repeatedly due to an incident in the downstream service, AWS Lambda will keep retrying the SQS messages against your function (respecting an optionally configured dead-letter queue and maximum receive count). Your function might get even more load, since AWS Lambda scales concurrent invocations based on the number of available messages. If you restrict the function concurrency, AWS Lambda might throttle invocations and fail to process messages.
Circuit Breaker Pattern
When a downstream service is in trouble, for instance due to very high load or failing underlying infrastructure components, the idea of the Circuit Breaker Pattern is to stop the upstream system from making further calls (open state). The downstream service gets the chance to recover, and the upstream system wastes neither time nor operating resources on calls that will probably fail anyway. After some time, the circuit breaker allows a small number of calls to find out whether the downstream service is operating normally again (half open state). If a threshold of successful calls is reached, the circuit breaker enables all calls to the downstream service again (closed state).
Three key aspects are important to implement a circuit breaker:
- Detect when a timeout or error threshold is exceeded
- Prevent calls to the downstream service for a certain time
- Allow some calls to pass periodically, to detect if the downstream service has recovered
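To make the three states concrete, here is a minimal in-memory sketch of the pattern in Python. It is not the solution described in this post and not taken from any library; the class name, the parameter names (`failure_threshold`, `recovery_timeout`, `success_threshold`) and the injectable `clock` are assumptions chosen for illustration.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative circuit breaker; in-memory only and not thread-safe."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.success_threshold = success_threshold  # trial successes to close
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_call(self):
        """Return True if a call to the downstream service may proceed."""
        if self.state is State.OPEN:
            if self._clock() - self._opened_at >= self.recovery_timeout:
                # Waited long enough: let trial calls through.
                self.state = State.HALF_OPEN
                self._successes = 0
                return True
            return False
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        elif self.state is State.CLOSED:
            self._failures = 0  # only consecutive failures count

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed trial call reopens the circuit
        elif self.state is State.CLOSED:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

A caller would check `allow_call()` before each downstream request and report the outcome with `record_success()` or `record_failure()`. Counting consecutive failures is only one possible detection rule; a ratio over a time window would work as well.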
Existing Approaches for AWS Lambda
A common approach is to implement a circuit breaker inside your function and use DynamoDB to store the circuit breaker state (as in Gunnar Grosch’s failure lambda Node.js implementation, and as Jeremy Daly outlines in his AWS Reference Architecture Pattern). The Lambda function fails before calling the third-party API once a failure threshold has been exceeded. This protects the downstream service, but it does not stop AWS Lambda from polling the upstream queue and invoking your function. You also have to change the Lambda function code, and those changes are specific to the particular Lambda runtime and programming language. Finally, the approach introduces a number of DynamoDB requests, which could significantly increase costs.
A Solution based on CloudWatch Alarms and Event Source Mapping
This solution relies on CloudWatch metrics and alarms to detect message processing issues caused by the downstream service.
- When the number of timeouts or errors exceeds a threshold, a CloudWatch alarm based on Lambda function metrics is triggered. To reduce false alarms, you should use a combination of ratio and sum metric thresholds. I recommend custom metrics with high-resolution alarms over AWS-provided function metrics to get a prompt response once a failure situation occurs. Log metric filters can detect errors and timeouts based on your function log streams.
- When the CloudWatch alarm is triggered, a Lambda function disables the event source mapping. From then on, AWS Lambda no longer polls the message queue. The circuit breaker is in state “open”. Once the alarm falls back to OK, an AWS Step Function takes over: it periodically tries to invoke the protected function with a message from the queue. The circuit breaker is in state “half open”.
- If a certain number of trial messages succeed, the step function enables the event source mapping again. AWS Lambda starts polling the queue again. The circuit breaker is back in state “closed”.
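The transitions above can be sketched in a few lines of Python. This is a hedged illustration, not the code from the GitHub implementation: the function names and return values are made up, `event` is assumed to be an EventBridge "CloudWatch Alarm State Change" event, and `lambda_client` is assumed to behave like `boto3.client("lambda")`, whose `update_event_source_mapping` call accepts `UUID` and `Enabled`.

```python
def on_alarm_state_change(event, lambda_client, mapping_uuid):
    """React to an alarm state change (hypothetical handler).

    Assumes the alarm state is found at event["detail"]["state"]["value"].
    """
    state = event["detail"]["state"]["value"]
    if state == "ALARM":
        # Open the circuit: AWS Lambda stops polling the queue.
        lambda_client.update_event_source_mapping(
            UUID=mapping_uuid, Enabled=False)
        return "open"
    if state == "OK":
        # The mapping stays disabled; the trial phase would start here.
        return "half_open"
    return "unchanged"  # e.g. INSUFFICIENT_DATA


def finish_trials(trial_results, required_successes,
                  lambda_client, mapping_uuid):
    """Close the circuit if enough trial invocations succeeded.

    `trial_results` is an iterable of booleans, one per trial message.
    """
    successes = sum(1 for ok in trial_results if ok)
    if successes >= required_successes:
        # Close the circuit: AWS Lambda resumes polling the queue.
        lambda_client.update_event_source_mapping(
            UUID=mapping_uuid, Enabled=True)
        return "closed"
    return "half_open"
```

Passing the client in as a parameter keeps the sketch testable without AWS credentials; in a deployed function you would typically create the client once at module level.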
This solution supports any Lambda runtime. No changes to your function code are required. Fixed monthly costs are incurred for CloudWatch alarms and metrics (the AWS Free Tier can be applied, except for high-resolution alarms). Costs for Step Functions transitions and Lambda function invocations are incurred only in the failure state. On the other hand, you save the costs of unnecessary queue service requests and Lambda invocations.
I designed the solution for SQS as the function trigger, but other services that AWS Lambda integrates via event source mappings, such as Amazon MQ, should work too.
Deploy the solution
You can find an implementation of this Circuit Breaker solution on GitHub.
Related Approaches
Jeremy Daly’s Lambda Orchestrator pattern goes a step further. It does not rely on AWS Lambda event source mappings at all to receive messages and invoke Lambda functions. Instead, a long-running Lambda function polls the queue and invokes the processing Lambda function, similar to the “half open” state of the solution described above. The Lambda Orchestrator pattern enables sophisticated ways to throttle third-party API calls, such as respecting API quotas.