Your AWS Lambda Function Failed, What Now?

Paul Singman

Posted on November 10, 2020

On the analytics team at Equinox Media, we invoke thousands of Lambda functions daily to perform a variety of data processing tasks. Examples range from the mundane shuffling of files around on S3, to the more stimulating generation of real-time fitness content recommendations on the Variis app.

Because of our reliance on Lambda, it’s critical to diagnose issues as quickly as possible.

Here’s a diagram of the process we’ve set up to do so:

Serverless error handling architecture

If you are also a user of Lambda, what does your error alerting look like? If you find yourself struggling to figure out why a failure occurred, or worse — unaware one happened at all — we hope sharing our solution will help you become a more effective serverless practitioner!

Step #1: Create An Error Metric-Based CloudWatch Alarm

After every single run of a Lambda function, AWS sends a few metrics to the CloudWatch service by default. Per AWS documentation:

Invocation metrics are binary indicators of the outcome of an invocation. For example, if the function returns an error, Lambda sends the Errors metric with a value of 1. To get a count of the number of function errors that occurred each minute, view the Sum of the Errors metric with a period of one minute.

To make us aware of any failures, we create a CloudWatch Alarm based on the Errors metric for a specific Lambda resource. The exact threshold of the alarm depends on how frequently a job runs and its criticality, but most commonly this value is set to trigger upon three* failures in a five-minute period.

*One for the original failure, plus two automatic retries.
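
For illustration, here is a minimal sketch of creating such an alarm with boto3; the function name, topic ARN, and alarm name are placeholders, and in practice you may prefer to define the alarm in your infrastructure-as-code templates:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder identifiers -- substitute your own function and SNS topic.
FUNCTION_NAME = "my-data-processing-lambda"
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:lambda-failure-alerts"

cloudwatch.put_metric_alarm(
    AlarmName=f"{FUNCTION_NAME}-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    Statistic="Sum",                  # count of errors per period
    Period=300,                       # five-minute window
    EvaluationPeriods=1,
    Threshold=3,                      # original failure + two automatic retries
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # no invocations is not a failure
    AlarmActions=[ALERT_TOPIC_ARN],   # notify the centralized SNS topic (Step #2)
)
```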

For some, generic alerting of this sort is sufficient, and notifications are simply directed to a work email or perhaps a PagerDuty Service tied to an on-call schedule.

However, we know in this scenario valuable information about the failed invocation is being ignored. To be most efficient, we strive to automate more of the debugging process.

Our journey, eager Lambda user, is only beginning.

Step #2: With A Little Help From An SNS Topic + Lambda Friends

Instead of sending straight to an alerting service, we send alarm notifications to a centralized SNS topic that handles failure events for all Lambda functions across our cloud data infrastructure.

Configuring a CloudWatch Alarm to send to SNS

What happens to an Alarm record sent to the topic? It triggers another Lambda function of course!

We call this special Lambda function the Alerting Lambda, and it performs three main steps:

  1. Sends a message to Slack with details about the failure.
  2. Creates an incident in PagerDuty, also populated with helpful details.
  3. Queries CloudWatch Logs for log messages related to the failure and, if found, sends them to Slack.

The first two steps are relatively straightforward so we’ll quickly cover how they work before diving into the third.

If you inspect the payload sent from CloudWatch Alarms to SNS, you’ll see it contains data related to the alarm itself like the name, trigger threshold, old and current alarm state, and relevant CloudWatch Metric.

The Alerting Lambda takes this data and parses it into a super-helpful Slack message (via a webhook) that looks like this:

Slack message from #data-alerts channel
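
Here is a minimal sketch of that parsing-and-posting step, assuming an incoming-webhook URL stored in an environment variable and a simplified message format; the field names come from the CloudWatch Alarm payload described above:

```python
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # incoming-webhook URL (placeholder)


def handler(event, context):
    # The SNS record wraps the CloudWatch Alarm payload as a JSON string.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])

    # Dimension keys are lowercase in the alarm payload delivered via SNS.
    function_name = alarm["Trigger"]["Dimensions"][0]["value"]
    text = (
        f":rotating_light: *{alarm['AlarmName']}* is {alarm['NewStateValue']} "
        f"(was {alarm['OldStateValue']})\n"
        f"Function: `{function_name}`\n"
        f"Reason: {alarm['NewStateReason']}"
    )

    # Post to Slack via the incoming webhook.
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```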

Similarly, using the pypd package we create a PagerDuty event populated with helpful custom details and a link to the AWS console:

PagerDuty Incident with Alarm data populated as Custom Events
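
And a sketch of the PagerDuty call, assuming pypd's Events API v2 interface and a service integration (routing) key in the environment; the custom details shown are illustrative:

```python
import os

import pypd


def create_incident(alarm):
    """Open a PagerDuty incident for a parsed CloudWatch Alarm payload."""
    pypd.EventV2.create(data={
        "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # service integration key
        "event_action": "trigger",
        "payload": {
            "summary": f"{alarm['AlarmName']} entered {alarm['NewStateValue']}",
            "source": "alerting-lambda",
            "severity": "error",
            # Alarm fields surface on the incident as custom details.
            "custom_details": {
                "reason": alarm["NewStateReason"],
                "metric": alarm["Trigger"]["MetricName"],
                "threshold": alarm["Trigger"]["Threshold"],
            },
        },
    })
```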

Both of these notifications help us instantly determine if an alert is legitimate or perhaps falls more into the “false alarm” category. When managing 100+ tasks, this provides a quality-of-life improvement for everyone on the team.

The third step of the Alerting Lambda was recently implemented (inspired by this post on effective Lambda logging) and has proven to be a beloved shortcut for Lambda debugging.

The output is a message in Slack containing log messages from recent Lambda failures that looks something like this:

CloudWatch logs automatically appear in Slack!

How does this work exactly?

The first step is to parse out the Lambda function name from the SNS event. This allows us to know which CloudWatch Log Group to query against for recent errors, shown in the code snippet below:
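
(The snippet below is a representative sketch rather than our exact production code; it uses boto3's CloudWatch Logs Insights API, an illustrative ERROR filter, and simple polling, since Insights queries run asynchronously.)

```python
import json
import time

import boto3

logs = boto3.client("logs")


def function_name_from_event(event):
    """Pull the failed function's name out of the SNS-wrapped alarm payload."""
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    return alarm["Trigger"]["Dimensions"][0]["value"]


def query_recent_errors(function_name, minutes=15):
    """Search the function's log group for recent error lines via Logs Insights."""
    end = int(time.time())
    start = end - minutes * 60

    query_id = logs.start_query(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=start,
        endTime=end,
        queryString=(
            "fields @timestamp, @requestId, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc "
            "| limit 20"
        ),
    )["queryId"]

    # Insights queries are asynchronous, so poll until this one finishes.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response["results"]
        time.sleep(1)
```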

And after parsing the query response for a requestId, we run a second Insights query filtered on that requestId, re-format the log messages returned in the response, and send the results to Slack.
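
A similarly hedged sketch of that second pass, reusing the `logs` client from above and a simplified formatting step:

```python
def logs_for_request(function_name, request_id, minutes=15):
    """Fetch every log line for one failed invocation and format it for Slack."""
    end = int(time.time())
    start = end - minutes * 60

    query_id = logs.start_query(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=start,
        endTime=end,
        queryString=(
            "fields @timestamp, @message "
            f'| filter @requestId = "{request_id}" '
            "| sort @timestamp asc"
        ),
    )["queryId"]

    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] == "Complete":
            break
        time.sleep(1)

    # Each result row is a list of {"field": ..., "value": ...} pairs.
    lines = []
    for row in response["results"]:
        fields = {item["field"]: item["value"] for item in row}
        lines.append(f"{fields.get('@timestamp', '')}  {fields.get('@message', '').rstrip()}")
    return "\n".join(lines)  # ready to drop into the Slack webhook payload
```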

Place code like this in your Alerting Lambda and before you know it, you'll be getting helpful log messages sent to Slack too!

Final Thoughts

Though this solution has proven effective for our needs, there is room for improvement. Notably, while we query CloudWatch Logs when a Lambda errors, we don't handle other Lambda failure modes (like timeouts or throttling).

The idea to run an Insights query when a Lambda fails didn't come to us in a "Eureka!" moment of inspiration, but rather from observing the consistent, predictable actions we perform that could be automated. Maintaining an awareness of these situations will serve any developer well in their career.

Another lesson for those getting started with serverless technologies is that you cannot be afraid of managing many, many cloud resources. Critically, the marginal cost of adding an additional Lambda function or SQS queue to your architecture should be near-zero.

The idea of spinning up an additional SNS topic and Lambda for error handling was a turn-off to some. We hope we've shown the benefits of growing past that limiting mindset. If you want to read more on this topic, check out our post on painlessly deploying Lambda functions.

One final thought: you may be wondering, if all other Lambdas are monitored by the Alerting Lambda, what then monitors the Alerting Lambda function?

Hmmm.
