Level up your Lambda Game with Canary Deployments using SST

Have you ever deployed a new version of your Lambda function to production and immediately regretted it? Have you heard people advocating "testing in production" and wondered how that is possible?

In this tutorial, you will learn how to use canary deployments in a serverless environment with SST to safely expose a small percentage of users to new versions. If sh*t happens, which it inevitably will, automatic rollbacks have got your back.

1. How it works

A canary deployment is a deployment strategy that releases a new application version to a small subset of users. Let's say that you deploy a new version of your application, and your new version includes a defect. If you send 100% of traffic to this version, you could break the application for all users.

Canary deployments aim to solve this problem by validating your new version against a small fraction of the traffic. You can, for example, opt to send 5% of traffic to the latest version for a set period while monitoring its behavior. If all metrics look good during this period, you can switch over 100% of traffic to the new version. But, if error metrics for the new version start rising, you can automatically roll back to the previous version.

The simplest way to implement canary deployments of Lambda functions is with Lambda aliases and AWS CodeDeploy.

1.1. Lambda Aliases

AWS Lambda lets you have multiple versions of a Lambda function deployed at the same time. A Lambda alias functions as a pointer to a specific version. An alias can also point to a second version and define the percentage of incoming events that should be routed to it, with the remainder going to the primary version. CodeDeploy uses this feature to control the traffic weighting of the alias during a deployment.

You can configure API Gateways to point to aliases instead of functions. This means that all traffic entering your API Gateway will be routed to the different versions based on the weights in the alias.
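
To make the weighting concrete, here is a minimal sketch of how a weighted alias could pick a version for each invocation. It is a simplified model, not the AWS API: the AliasRouting type, the routeInvocation function, and the version numbers are assumptions made up for illustration.

```typescript
// Simplified model of a weighted Lambda alias (hypothetical types, not the AWS API).
interface AliasRouting {
  primaryVersion: string;     // receives the remaining traffic
  additionalVersion?: string; // optional second version (the canary)
  additionalWeight?: number;  // share of traffic for the second version, 0..1
}

// Pick the version for one invocation. `roll` is a number in [0, 1);
// it is a parameter so the behavior can be made deterministic.
function routeInvocation(alias: AliasRouting, roll: number = Math.random()): string {
  const weight = alias.additionalWeight ?? 0;
  if (alias.additionalVersion !== undefined && roll < weight) {
    return alias.additionalVersion;
  }
  return alias.primaryVersion;
}

// An alias that sends 10% of traffic to version 2 and the rest to version 1.
const liveAlias: AliasRouting = {
  primaryVersion: "1",
  additionalVersion: "2",
  additionalWeight: 0.1,
};

console.log(routeInvocation(liveAlias, 0.05)); // roll below the weight: version "2"
console.log(routeInvocation(liveAlias, 0.5));  // roll above the weight: version "1"
```

During a deployment, CodeDeploy changes the equivalent of additionalWeight on the real alias, so callers never need to know which version served them.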

1.2. AWS CodeDeploy

AWS CodeDeploy is a service that helps you automate releases to AWS Lambda, Amazon ECS, and Amazon EC2. To use CodeDeploy with AWS Lambda, you need a few resources:

A CodeDeploy Application
A CodeDeploy Deployment Group
A CodeDeploy Deployment Configuration

CodeDeploy Application

An application is simply a container for deployments and deployment groups.

CodeDeploy Deployment Group

A deployment group defines the target of a deployment. For Lambda deployments, the target is a Lambda alias. It also defines the deployment configuration to use, as well as any alarms that should trigger a rollback.

CodeDeploy Deployment Configuration

A deployment configuration defines the deployment strategy. CodeDeploy comes with a few built-in strategies, but you can also create your own. The built-in strategies for Lambda functions are:

Linear: Shift traffic in equal increments with an equal number of minutes between each increment. For example, 10% every 10 minutes.
Canary: Shift traffic in two increments. For example, first 10% for 5 minutes and then 100% afterward.
All at once: Shift all traffic to the new version immediately.
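
The three strategies can be compared by writing out the traffic schedule each one produces. The following is a rough sketch under a simplified model in which the first increment happens at minute 0; the Strategy type and trafficSchedule function are hypothetical names, not part of CodeDeploy.

```typescript
// Hypothetical model of the three strategy families described above.
type Strategy =
  | { kind: "all-at-once" }
  | { kind: "canary"; percentage: number; intervalMinutes: number }
  | { kind: "linear"; percentage: number; intervalMinutes: number };

interface TrafficStep {
  minute: number;        // minutes since the deployment started
  newVersionPct: number; // share of traffic on the new version from this minute on
}

function trafficSchedule(strategy: Strategy): TrafficStep[] {
  switch (strategy.kind) {
    case "all-at-once":
      return [{ minute: 0, newVersionPct: 100 }];
    case "canary":
      // Two increments: the canary share first, then everything.
      return [
        { minute: 0, newVersionPct: strategy.percentage },
        { minute: strategy.intervalMinutes, newVersionPct: 100 },
      ];
    case "linear": {
      if (strategy.percentage <= 0) {
        throw new Error("percentage must be positive");
      }
      // Equal increments, spaced by the same interval, until 100% is reached.
      const steps: TrafficStep[] = [];
      let pct = 0;
      let minute = 0;
      while (pct < 100) {
        pct = Math.min(100, pct + strategy.percentage);
        steps.push({ minute, newVersionPct: pct });
        minute += strategy.intervalMinutes;
      }
      return steps;
    }
  }
}

console.log(trafficSchedule({ kind: "canary", percentage: 10, intervalMinutes: 5 }));
console.log(trafficSchedule({ kind: "linear", percentage: 10, intervalMinutes: 10 }));
```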

2. Tutorial

This tutorial assumes that your local environment is configured with AWS credentials. The tutorial uses SST to define and deploy infrastructure. You should be able to accomplish the same with any of the other CloudFormation-based tools (SAM, CDK, etc.).

2.1. Create a new SST Project

Initialize a new SST project. We use the standard/api template as a starting point for illustrative purposes.

$ npx create-sst@latest --template standard/api sst-lambda-canary

Open up the generated project in your favorite editor and open the stacks/MyStack.ts file. Start by removing everything from the stack so that you start from a clean slate:

import { StackContext } from 'sst/constructs';

export function API({ stack }: StackContext) {
  // empty stack
}

SST defaults to the us-east-1 region. You can specify another region in sst.config.ts.
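
As a sketch of where the region goes, here is roughly what sst.config.ts might look like for this project, assuming the file layout generated by the SST v2 template. The app name and the eu-west-1 region are example values; use whatever fits your setup.

```typescript
// sst.config.ts (sketch; name and region are example values)
import { SSTConfig } from "sst";
import { API } from "./stacks/MyStack";

export default {
  config(_input) {
    return {
      name: "sst-lambda-canary",
      // Overrides the default region for every stack in the app.
      region: "eu-west-1",
    };
  },
  stacks(app) {
    app.stack(API);
  },
} satisfies SSTConfig;
```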

2.2. Add a simple Lambda-backed API

Let's add a simple API backed by a Lambda function to the stack. The standard/api template you used comes pre-baked with a simple handler function in packages/functions/src/lambda.ts that you can use for the purpose of this tutorial. This handler will be invoked when a GET request hits the root path (/) of the API Gateway.

In stacks/MyStack.ts, add the following:

import { StackContext, Api, Function as SSTFunction } from 'sst/constructs';
// I usually import Function like above to keep linters happy about shadowing.

export function API({ stack }: StackContext) {
  const func = new SSTFunction(stack, 'MyFunc', {
    handler: 'packages/functions/src/lambda.handler',
  });

  const api = new Api(stack, 'MyApi', {
    routes: {
      'GET /': func,
    },
  });

  stack.addOutputs({
    ApiEndpoint: api.url,
  });
}

Deploy your application to your AWS account using the SST CLI:

$ npx sst deploy --stage prod

SST v2.24.25

➜  App:     sst-test
   Stage:   prod
   Region:  us-east-1
   Account: 123456789012

✔  Building...
...

✔  Deployed:
   API
   ApiEndpoint: https://YOUR_API_ID.execute-api.eu-west-1.amazonaws.com

Test your API to make sure everything is set up correctly:

$ curl https://YOUR_API_ID.execute-api.eu-west-1.amazonaws.com

Hello world. The time is 2023-09-07T19:18:24.360Z

2.3. Create a Lambda Alias

In stacks/MyStack.ts, add a Lambda alias to your Lambda function and update your API Gateway to route traffic to the alias instead:

const func = ...

const alias = func.addAlias('live');

const api = new Api(stack, "MyApi", {
  routes: {
    'GET /': {
      cdk: {
        function: alias
      }
    },
  },
});

Deploy again with npx sst deploy --stage prod to ensure the alias works.

2.4. Create an Alarm

CodeDeploy can automatically roll back a deployment in case one or more specified alarms get triggered during the deployment. To illustrate this, you will add an alarm that triggers when the new Lambda version produces any errors.

It is a bit annoying to have to specify the values of dimensionsMap manually. It would be great if CDK could infer this automatically when you use alias.metricErrors or func.currentVersion.metricErrors. This has been reported as a bug in a GitHub issue.

In the stack, add the following:

import { Alarm } from "aws-cdk-lib/aws-cloudwatch";
import { Duration } from 'aws-cdk-lib/core';
...

export function API({ stack }: StackContext) {
  ...

  const alarm = new Alarm(stack, 'MyAlarm', {
    alarmName: `${func.functionName}-${func.currentVersion.version}-errors`,
    metric: func.metricErrors({
      period: Duration.minutes(1),
      dimensionsMap: {
        FunctionName: func.functionName,
        Resource: `${func.functionName}:${alias.aliasName}`,
        ExecutedVersion: func.currentVersion.version,
      },
    }),
    threshold: 1,
    evaluationPeriods: 1,
  });
}

This alarm will check for any errors in the newly deployed Lambda version. If your current Lambda version is X, a new deploy will create a version X+1 alongside X and send a percentage of traffic to it. Using the dimensions above ensures only the X+1 version can trigger the alarm.

The alarm name includes the function name and version, which ensures that each deployment creates a new alarm instead of updating the existing one. This matters because updating the underlying metric of an alarm does not reset the alarm state: if the currently deployed version is experiencing errors, the alarm could already be in an alarm state when you deploy a new version, resulting in an instant rollback. With this workaround, every deployment gets a fresh alarm with the correct state and underlying metric.
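
To see why a threshold of 1 with a single evaluation period means "roll back on any error", here is a simplified sketch of the alarm evaluation. It assumes the comparison is "greater than or equal to the threshold" and ignores missing data, which real CloudWatch alarms handle separately; evaluateAlarm is a hypothetical helper, not a CloudWatch API.

```typescript
type AlarmState = "OK" | "ALARM";

// Simplified model: the alarm fires when each of the most recent
// `evaluationPeriods` datapoints is greater than or equal to the threshold.
function evaluateAlarm(
  errorsPerPeriod: number[],
  threshold: number,
  evaluationPeriods: number,
): AlarmState {
  if (evaluationPeriods < 1) {
    throw new Error("evaluationPeriods must be at least 1");
  }
  if (errorsPerPeriod.length < evaluationPeriods) {
    return "OK"; // not enough data yet in this simplified model
  }
  const recent = errorsPerPeriod.slice(-evaluationPeriods);
  return recent.every((errors) => errors >= threshold) ? "ALARM" : "OK";
}

// One-minute periods, threshold 1, one evaluation period, as in the stack above.
console.log(evaluateAlarm([0, 0, 0], 1, 1)); // "OK": no errors in the last minute
console.log(evaluateAlarm([0, 0, 1], 1, 1)); // "ALARM": a single error is enough
```

For a busy function you may prefer a higher threshold or more evaluation periods so that one stray error does not cancel a deployment.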

2.5. Add CodeDeploy Configuration

You must first create a CodeDeploy Application. In your stack, add:

import { LambdaApplication } from 'aws-cdk-lib/aws-codedeploy'
...

export function API({ stack }: StackContext) {
  ...

  const application = new LambdaApplication(stack, 'MyApplication');
}

Next, add a new Deployment Group to your CodeDeploy Application, referencing the Lambda alias and alarm you created earlier:

import {
  LambdaApplication,
  LambdaDeploymentConfig,
  LambdaDeploymentGroup,
} from 'aws-cdk-lib/aws-codedeploy';
...

export function API({ stack }: StackContext) {
  ...

  const deploymentGroup = new LambdaDeploymentGroup(stack, 'MyDeploymentGroup', {
    application,
    alias,
    deploymentConfig: LambdaDeploymentConfig.CANARY_10PERCENT_5MINUTES,
    alarms: [alarm],
  });
}

This creates a CodeDeploy Deployment Group that uses a built-in deployment strategy. The strategy sends 10% of the traffic to the new version and it keeps that weight for a duration of five minutes. If any specified alarm is triggered during this time window, CodeDeploy automatically rolls back the deployment and sends 100% of traffic to the old version. If this happens, CloudFormation will roll back the stack to the previous state.

CloudFormation will be stuck in the UPDATE_IN_PROGRESS state until the CodeDeploy deployment is complete.
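
The behavior described above can be sketched as a small simulation. This is a toy model that checks the alarm once per minute, not how CodeDeploy is implemented; CanaryPlan, CanaryResult, and simulateCanaryDeployment are hypothetical names.

```typescript
interface CanaryPlan {
  percentage: number;      // traffic share for the new version while it bakes
  intervalMinutes: number; // how long the bake period lasts
}

interface CanaryResult {
  outcome: "succeeded" | "rolled-back";
  weights: number[]; // new-version traffic share, one entry per simulated minute
}

// `alarmFiresAtMinute` is the minute at which an alarm triggers, if it ever does.
function simulateCanaryDeployment(plan: CanaryPlan, alarmFiresAtMinute?: number): CanaryResult {
  const weights: number[] = [];
  for (let minute = 0; minute < plan.intervalMinutes; minute++) {
    if (alarmFiresAtMinute !== undefined && alarmFiresAtMinute <= minute) {
      weights.push(0); // rollback: all traffic returns to the old version
      return { outcome: "rolled-back", weights };
    }
    weights.push(plan.percentage);
  }
  weights.push(100); // bake period passed without an alarm
  return { outcome: "succeeded", weights };
}

const tenPercentFiveMinutes: CanaryPlan = { percentage: 10, intervalMinutes: 5 };
console.log(simulateCanaryDeployment(tenPercentFiveMinutes));
console.log(simulateCanaryDeployment(tenPercentFiveMinutes, 2));
```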

2.6. Trying it out

Change the return value of your Lambda function (packages/functions/src/lambda.ts):

export const handler = ApiHandler(async (_evt) => {
  return {
    statusCode: 200,
    body: "Hello there. I'm a canary.",
  };
});

Deploy your stack again with npx sst deploy --stage prod. After some time you should notice the deployment getting stuck at API MyFunc/Aliaslive AWS::Lambda::Alias UPDATE_IN_PROGRESS. This means that the CodeDeploy deployment is in progress.

Hit your endpoint a few times with curl or similar. Most of the responses should be Hello world. The time is ... but you should see a few Hello there. I'm a canary. as well. After five minutes have passed, the deployment should be complete, and all future requests will return the new message.
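
If you would rather not eyeball curl output, a small script can sample the endpoint and count which version answered. This is a sketch: tallyResponses and sampleEndpoint are hypothetical helpers, the classification relies on the two response bodies used in this tutorial, and the fetcher is passed in so you can plug in the global fetch from Node 18 or later.

```typescript
interface SampledResponse {
  status: number;
  body: string;
}

interface VersionTally {
  stable: number; // responses from the old version
  canary: number; // responses from the new version
  failed: number; // 5xx responses
}

// Classify responses by status code and by the body text used in this tutorial.
function tallyResponses(responses: SampledResponse[]): VersionTally {
  const tally: VersionTally = { stable: 0, canary: 0, failed: 0 };
  for (const response of responses) {
    if (response.status >= 500) {
      tally.failed++;
    } else if (response.body.startsWith("Hello there")) {
      tally.canary++;
    } else {
      tally.stable++;
    }
  }
  return tally;
}

// Minimal shape of a fetch-like function, so no particular runtime is required.
type Fetcher = (url: string) => Promise<{ status: number; text(): Promise<string> }>;

async function sampleEndpoint(url: string, count: number, fetcher: Fetcher): Promise<VersionTally> {
  const responses: SampledResponse[] = [];
  for (let i = 0; i < count; i++) {
    const res = await fetcher(url);
    responses.push({ status: res.status, body: await res.text() });
  }
  return tallyResponses(responses);
}

// Usage (replace the URL with your own ApiEndpoint output):
// sampleEndpoint("https://YOUR_API_ID.execute-api.eu-west-1.amazonaws.com", 50, fetch).then(console.log);
```

With a 10% canary you should expect roughly one in ten responses to land in the canary bucket, though small samples will vary.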

Let's simulate a failed deployment. Inject an error in your handler:

export const handler = ApiHandler(async (_evt) => {
  // Simulate a defect: every invocation of this version throws.
  throw new Error("Oops!");
});

Deploy again and hit your endpoint while the CodeDeploy deployment is in progress. Some of the requests should hit the canary and return a 500 Internal Server Error. Shortly after this happens, the alarm will be triggered and the deployment will be rolled back. All requests you make should now return Hello there. I'm a canary. again. The CloudFormation stack will also be rolled back to its previous state.

2.7. Customizing the Deployment Strategy

CodeDeploy comes with a couple of built-in deployment strategies. If these do not fit your needs, you can create a custom deployment config. Imagine you want to send 50% of the traffic to the new version, and let it "bake" for 10 minutes. Simply create a new deployment configuration and update the deployment group:

import {
  LambdaApplication,
  LambdaDeploymentConfig,
  LambdaDeploymentGroup,
  TimeBasedCanaryTrafficRouting
} from 'aws-cdk-lib/aws-codedeploy';
...

export function API({ stack }: StackContext) {
  ...

  const deploymentConfig = new LambdaDeploymentConfig(stack, 'MyDeploymentConfig', {
    trafficRouting: new TimeBasedCanaryTrafficRouting({
      interval: Duration.minutes(10),
      percentage: 50,
    }),
  });

  const deploymentGroup = new LambdaDeploymentGroup(stack, 'MyDeploymentGroup', {
    application,
    alias,
    deploymentConfig,
    alarms: [alarm],
  });
}
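
When choosing the percentage and bake time, it helps to estimate how many requests a broken canary could affect before a rollback. A back-of-the-envelope sketch, assuming steady traffic and a defect that goes unnoticed for the whole bake period; expectedCanaryRequests and the 100 requests per minute figure are made up for illustration.

```typescript
// Expected number of requests served by the new version during the bake period,
// assuming a constant request rate.
function expectedCanaryRequests(
  requestsPerMinute: number,
  percentage: number,
  intervalMinutes: number,
): number {
  return (requestsPerMinute * percentage * intervalMinutes) / 100;
}

// Built-in canary: 10% for 5 minutes at 100 requests per minute.
console.log(expectedCanaryRequests(100, 10, 5)); // 50

// Custom config above: 50% for 10 minutes at the same rate.
console.log(expectedCanaryRequests(100, 50, 10)); // 500
```

A larger share and longer bake give the alarm more data to work with, at the cost of exposing more requests to a potential defect.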

2.8. Managing Different Environments

Currently, your application will use the same deployment strategy in all environments. This is probably not what you want. In ephemeral environments and when using SST's Live Lambda Development, you most likely want instant deploys without any canary. You can use the stage parameter to conditionally define the deployment configuration to use:

const deploymentGroup = new LambdaDeploymentGroup(stack, 'MyDeploymentGroup', {
  application,
  alias,
  deploymentConfig:
    stack.stage === 'prod'
      ? deploymentConfig
      : LambdaDeploymentConfig.ALL_AT_ONCE,
  alarms: [alarm],
});
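
If you end up with more than two kinds of stages, a lookup table can be easier to read than nested ternaries. A sketch of the idea using plain strings in place of the CDK objects; the staging stage and the strategy labels are assumptions for illustration.

```typescript
type StrategyName = "ALL_AT_ONCE" | "CANARY_10PERCENT_5MINUTES" | "CUSTOM_CANARY";

// Stages that should get a gradual rollout. Anything not listed here
// (ephemeral and personal development stages) deploys all at once.
const strategyByStage = new Map<string, StrategyName>([
  ["prod", "CUSTOM_CANARY"],
  ["staging", "CANARY_10PERCENT_5MINUTES"], // hypothetical extra stage
]);

function strategyForStage(stage: string): StrategyName {
  return strategyByStage.get(stage) ?? "ALL_AT_ONCE";
}

console.log(strategyForStage("prod"));       // "CUSTOM_CANARY"
console.log(strategyForStage("pr-1234"));    // "ALL_AT_ONCE"
```

In the stack itself you would map each label to the corresponding deployment configuration and pass stack.stage as the argument.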

3. Conclusion

In this tutorial, you have learned how to use Lambda aliases and CodeDeploy to do canary deployments of your Lambda functions. You have also learned how to create custom deployment strategies and how to use a different strategy for each environment.

With this knowledge, you can improve the robustness and resilience of your serverless architectures, and you are one step closer to testing in production.
