Suspend any Lambda functions using a single CloudWatch Alarm
shimo
Posted on December 19, 2022
(Note: This post does not cover the CloudWatch alarm update released in December 2022.)
Motivation
With Lambda functions, there is a risk of far more invocations than intended, for example when a function calls itself in a loop or when a Lambda function URL is hit by a DDoS attack. A user in Japan reported an incident with 2.8 billion invocations in 10 days (Japanese blog).
So it would be nice to set thresholds that suspend runaway functions. CloudWatch can detect such behavior and trigger alarms. This works, but we usually need one alarm per Lambda function, and each alarm costs 0.1 USD per month. With many Lambda functions, that adds up.
In this post, I share an idea: a single CloudWatch Alarm that takes care of all the Lambda functions in a region. (So, 0.1 USD per month!)
How it works
The image below shows the steps of this architecture.
1. Suppose one or more Lambda functions are being invoked far too often. Here we don't care which functions or how many functions are involved. Only the sum of invocations is used as the metric.
2. When the sum of invocations reaches the first threshold, the CloudWatch Alarm fires. The alarm only says that the sum of invocations reached the threshold.
3. The alarm invokes Lambda-throttle. This function queries which functions were invoked at a high rate and how many times.
4. For each function whose invocation count is at or above the second threshold, Lambda-throttle sets its reserved concurrency to 0.
5. Lambda-throttle sends a message to the user via Amazon SNS.
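The two checks in this flow can be tried without any AWS calls. The sketch below is only an illustration of the logic: `functions_to_throttle` and the sample counts are hypothetical, and it assumes both thresholds use a greater-than-or-equal comparison.

```python
def functions_to_throttle(counts, alarm_threshold, stop_threshold):
    """Return the names of functions to suspend, busiest first.

    counts maps a function name to its invocations in the time window.
    """
    # Stage 1: the alarm only looks at the total over all functions.
    if sum(counts.values()) < alarm_threshold:
        return []
    # Stage 2: only functions at or above the second threshold are suspended.
    hot = [name for name, n in counts.items() if n >= stop_threshold]
    return sorted(hot, key=lambda name: counts[name], reverse=True)


# Hypothetical sample: one runaway function and two quiet ones.
sample = {"looping-fn": 950, "api-fn": 40, "cron-fn": 3}
print(functions_to_throttle(sample, alarm_threshold=100, stop_threshold=500))
```

Keeping the first check on the total is what allows a single alarm to cover every function; the per-function work only happens after the alarm has fired.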
Code
(Find complete CDK code in my repository.)
First, here is a snapshot of the CloudWatch alarm setting. It is an ordinary alarm on the AWS/Lambda namespace. The first threshold, for the sum of invocations, is set to 100 this time.
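For readers who prefer code to a console screenshot, a roughly equivalent alarm could be described as parameters for boto3's `put_metric_alarm`. This is a sketch under assumptions: only the namespace, the metric and the threshold of 100 come from this post. The alarm name, period, evaluation periods, comparison operator, missing-data handling and the action ARN are placeholders that you would adapt.

```python
def build_alarm_kwargs(threshold, action_arn, period_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    No Dimensions are given, so the metric is the Invocations total
    across all functions in the region.
    """
    return {
        "AlarmName": "lambda-invocations-region-total",  # hypothetical name
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Statistic": "Sum",
        "Period": period_seconds,          # assumption: 1-minute buckets
        "EvaluationPeriods": 1,            # assumption: fire on one breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # assumption
        "TreatMissingData": "notBreaching",  # assumption
        "AlarmActions": [action_arn],      # target that leads to Lambda-throttle
    }


# Placeholder ARN for illustration only.
kwargs = build_alarm_kwargs(100, "arn:aws:sns:us-east-1:123456789012:example-topic")
print(kwargs["Namespace"], kwargs["MetricName"], kwargs["Threshold"])

# To create the alarm for real (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```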
Next, let's see the Lambda code.
- The second threshold decides whether each function is suspended.
- The most fun part of this post is MetricDataQueries. In this query, Lambda-throttle uses Metrics Insights to get the sum of invocations for each Lambda function. (I write it twice because it's important: not the sum over all functions, but the sum for each function.)
- The query range is 5 minutes. Counts are returned for every 1 minute within this range.
- Option: Adding "LIMIT 10", for example, in the Expression narrows the result.
- After the metrics query, the sum of invocations for each function is compared with the second threshold (threshold_lambda_stop).
- To suspend a function, put_function_concurrency with ReservedConcurrentExecutions=0 works.
- (The "Failed to throttle." branch is just there for verbosity.)
import os
from datetime import datetime, timedelta

import boto3


def send_sns(message, subject):
    """Publish a notification to the SNS topic given by SNS_ARN."""
    client = boto3.client("sns")
    topic_arn = os.environ["SNS_ARN"]
    client.publish(TopicArn=topic_arn, Message=message, Subject=subject)


def get_invocation_top_functions():
    """
    Check which functions are invoked many times
    """
    range_minutes = 5
    end_time = datetime.now()
    cloud_watch = boto3.client("cloudwatch")
    response = cloud_watch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": "q1",
                # Metrics Insights query: one time series per function
                "Expression": """
                    SELECT SUM(Invocations)
                    FROM SCHEMA("AWS/Lambda", FunctionName)
                    GROUP BY FunctionName
                    ORDER BY SUM() DESC
                """,
                "Period": 60,
                "Label": "Invocation top",
            },
        ],
        StartTime=end_time - timedelta(minutes=range_minutes),
        EndTime=end_time,
    )
    return response


def handler(event, context):
    threshold_lambda_stop = int(os.environ["THRESHOLD_LAMBDA_STOP"])
    response = get_invocation_top_functions()
    client = boto3.client("lambda")

    # Count invocations in range_minutes for each function.
    # If the count is at or above the threshold, throttle the function.
    for fn in response["MetricDataResults"]:
        count = sum(fn["Values"])
        # Take the last word of the label as the function name
        fn_name = fn["Label"].split()[-1]
        if count >= threshold_lambda_stop:
            result = client.put_function_concurrency(
                FunctionName=fn_name, ReservedConcurrentExecutions=0
            )
            # Notify
            if result["ResponseMetadata"]["HTTPStatusCode"] == 200:
                message = f"Lambda: {fn_name} was throttled. Count in 5 minutes: {count}."
                subject = "Lambda throttled."
                send_sns(message, subject)
            else:  # Verbose
                message = f"Failed throttling Lambda: {fn_name}. Count in 5 minutes: {count}."
                subject = "Failed to throttle."
                send_sns(message, subject)
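To see what the handler does with the query result, the parsing step can be run offline against a hand-written response. This is a minimal sketch: `counts_by_function` and `fake_response` are hypothetical, the response is trimmed to the two fields the handler reads, and the label format is assumed to be the configured label followed by the function name, which is what `split()[-1]` in the handler implies.

```python
def counts_by_function(response):
    """Map each function name to its total invocations in the query window."""
    counts = {}
    for series in response["MetricDataResults"]:
        # Assumption: the function name is the last word of the label.
        name = series["Label"].split()[-1]
        # One value per 60-second period; add them up for the whole range.
        counts[name] = sum(series["Values"])
    return counts


# Hypothetical response, trimmed to the fields used above.
fake_response = {
    "MetricDataResults": [
        {"Label": "Invocation top looping-fn", "Values": [3.0, 4.0, 2.0]},
        {"Label": "Invocation top api-fn", "Values": [1.0]},
    ]
}

counts = counts_by_function(fake_response)
print(counts)
```

With a second threshold of 5, only `looping-fn` in this sample would be suspended.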
Try
For a test, I set the first threshold (in the CloudWatch alarm) to 10 and the second threshold to 5. Then I manually ran a Lambda function more than 10 times within a minute.
I received an SNS message like this.
Note
When testing, make sure not to suspend your critical Lambda functions.
As the resources are region specific, one alarm per region is required.
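A suspended function stays at zero concurrency until the setting is removed, so after a test you will probably want to undo it. The sketch below uses the Lambda API call `delete_function_concurrency` for that. `restore_functions` and `RecordingClient` are hypothetical helpers, with the stub standing in for `boto3.client("lambda")` so the example runs without credentials. Note that this call removes the reserved-concurrency setting entirely, so a function that had its own reserved value before being throttled would need that value set again.

```python
def restore_functions(lambda_client, function_names):
    """Remove reserved concurrency so the given functions can run again."""
    restored = []
    for name in function_names:
        lambda_client.delete_function_concurrency(FunctionName=name)
        restored.append(name)
    return restored


class RecordingClient:
    """Offline stand-in for boto3.client("lambda") that records calls."""

    def __init__(self):
        self.calls = []

    def delete_function_concurrency(self, FunctionName):
        self.calls.append(FunctionName)


client = RecordingClient()
print(restore_functions(client, ["my-test-fn"]))

# Against a real account (requires AWS credentials):
#   import boto3
#   restore_functions(boto3.client("lambda"), ["my-test-fn"])
```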
Summary
I have shared how to detect and throttle runaway Lambda functions with a single CloudWatch Alarm.
Setting the two thresholds properly is essential. One catches the sum of invocations across all functions, and the other decides whether each function is suspended. The right values depend on your system.
Appendix
Find complete CDK code in my repository.