Scheduled Vertical Scaling of AWS Aurora with EventBridge & Lambda in CloudFormation


Mathew Chan

Posted on February 28, 2024


Aurora under heavy write workloads

The following shows the database load metrics of our Aurora cluster under peak load with auto-scaling enabled.

We adjusted the size of our instances after each test to see how much they would throttle under load.

[Image: database load of the Aurora cluster under peak traffic for each instance class tested]

db.t3.medium instances - extreme throttling
db.r6g.large instances - moderate throttling
db.r6g.xlarge instances - no throttling

In short, auto-scaling in Aurora does not handle heavy writes.

Note: even with throttling, the application may still see reasonable response times from the DB because the t3 family is burstable, but bursting consumes CPU credits and incurs extra cost.

Problem: handling periodical heavy writes

You can scale out an Amazon Aurora cluster by adding read replicas, or scale it up by using larger instances.

Auto-scaling on instance-based Aurora only supports scaling out, not scaling up. While extra read replicas help with increased read workloads, they don't help with increased write workloads.

Since there is only one writer endpoint (the master), the only way to handle heavy write workloads is to replace the master instance with a larger one. However, resizing the master instance in place causes unwanted downtime for the database.

This article presents an approach that uses EventBridge to schedule a scale-up of an Aurora writer instance with minimal downtime. Traffic is simulated with k6 load tests and verified against RDS performance metrics.

Note on Aurora Multi-Master and Aurora Serverless v2

While Aurora multi-master was once a feature of Aurora MySQL 5.6, it has since been deprecated, and if you are using Aurora for PostgreSQL like me, you're out of luck.

On the other hand, Aurora Serverless can scale for increasing write workloads. Serverless v1 wasn't suitable for always-on web services given its long wake-up time from idle (~30 seconds), and Serverless v2, while free of the cold-start issue, is prohibitively expensive.


From: AWS Aurora Serverless V2 — What’s new?

Each GB of Serverless V2 RAM is twice the price of V1 and more than 3 times the price of provisioned Aurora capacity (Sam Gibbons)

That is not to mention that

  • you can reserve instances in regular Aurora for even more cost savings.
  • RDS Proxy is more expensive on Serverless v2 unless you have half a dozen replicas.

As there's always an instance running, v2 is more a managed service than a serverless offering. It all feels like a marketing ploy to align it with pay-per-use services like Lambda.

Other AWS services like DynamoDB or Redis in cluster mode support write scaling by default, so it was quite surprising to me to learn that write scaling isn't a built-in feature of Aurora.

Scheduled Scale-up for Aurora

[Diagram: scheduled scale-up flow]

  1. Schedule a Lambda function to run before write peaks.
  2. The function adds a db.r6g.xlarge instance to the Aurora cluster.
  3. On instance creation, an RDS event triggers a second Lambda function.
  4. That function fails over the cluster to the newly created instance.

Scale up - Adding a new instance to the Aurora cluster

Lambda function for adding a db.r6g.xlarge instance to the cluster



const AWS = require('aws-sdk');
const rds = new AWS.RDS();

exports.handler = async (event) => {
    try {
        const params = {
            DBClusterIdentifier: process.env.DBClusterIdentifier,
            DBInstanceIdentifier: process.env.DBInstanceIdentifier, // identifier of the new, larger write-capable instance
            Engine: 'aurora-postgresql',
            DBInstanceClass: process.env.DBInstanceClass,
            EngineVersion: '14.6',
            PubliclyAccessible: false,
            AvailabilityZone: process.env.AvailabilityZone,
            MultiAZ: false,
            EnablePerformanceInsights: true,
            MonitoringInterval: 60,
            MonitoringRoleArn: process.env.MonitoringRoleArn
        };
        const data = await rds.createDBInstance(params).promise();
        console.log('Aurora instance created successfully:', data.DBInstance);
        return data.DBInstance;
    } catch (err) {
        console.error('Error creating Aurora instance:', err);
        throw err;
    }
};



CloudFormation Resources

Lambda function definition



  CreateRDSInstanceFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
      Runtime: nodejs16.x
      Timeout: 60
      Code:
        ZipFile: |
          const AWS = require('aws-sdk');
          const rds = new AWS.RDS();
          // ...
      Environment:
        Variables:
          DBClusterIdentifier:
            Ref: RDSCluster
          DBInstanceIdentifier: myapp-postgres-instance-3
          DBInstanceClass: db.r6g.xlarge
          AvailabilityZone: ap-northeast-1a
          MonitoringRoleArn: !Join
            - ""
            - - "arn:aws:iam::"
              - !Ref AWS::AccountId
              - ":role/rds-monitoring-role"



Scheduled rule to invoke the function before the peak (EventBridge cron expressions are evaluated in UTC)



  ScheduledCreateRDSInstance:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "cron(40 12 ? * MON-FRI *)"
      State: ENABLED
      Targets:
        - Arn: !GetAtt CreateRDSInstanceFunction.Arn
          Id: CreateRDSInstanceFunction



Scale up - Failing over to the newly created instance

Lambda function for failing over to new instance



const AWS = require('aws-sdk');
const rds = new AWS.RDS();
const logger = require('console');

const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));

const waitUntilRDSAvailable = async (dbInstanceIdentifier, maxAttempts = 90, pollingInterval = 10000) => {

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
          const response = await rds.describeDBInstances({ DBInstanceIdentifier: dbInstanceIdentifier }).promise();
          const dbInstance = response.DBInstances[0];
          const status = dbInstance.DBInstanceStatus;

          if (status === 'available') {
              console.log("RDS instance is available!");
              return;
          } else {
              console.log(`RDS instance is not yet available (Status: ${status}). Retrying...`);
              await wait(pollingInterval);
          }
      } catch (error) {
          console.error("Error describing RDS instance:", error);
          throw error; // Propagate the error to the caller
      }
    }

    console.error("Timeout: RDS instance did not become available within the specified time.");

    throw new Error("Timeout: RDS instance did not become available within the specified time.");
};

exports.handler = async (event, context) => {
  logger.log("Received event: ", JSON.stringify(event));
  const detail = event.detail;
  if (detail && detail.SourceType === "DB_INSTANCE" && detail.EventCategories.includes("creation") && detail.SourceIdentifier === process.env.DBInstanceIdentifier) {
      logger.log(`Received an instance creation event for ${process.env.DBInstanceIdentifier}`);
      const params = {
          DBClusterIdentifier: process.env.ClusterIdentifier,
          TargetDBInstanceIdentifier: process.env.DBInstanceIdentifier
      };
      try {
          await waitUntilRDSAvailable(process.env.DBInstanceIdentifier, 90, 10000); // 15 minutes
          try {
              await rds.failoverDBCluster(params).promise();
              console.log('Failover completed successfully');
          } catch (error) {
              console.error('Error during failover:', error);
              throw error;
          }
      } catch (error) {
          return { statusCode: 500, body: "Error waiting for RDS instance to become available: " + error.message };
      }
  } else {
      logger.log(`Received an event, but it is not an instance creation event for ${process.env.DBInstanceIdentifier}`);
  }
};



After an RDS creation event, the RDS instance isn't immediately available. This is why we poll the status of the RDS instance periodically to check whether it's available. Usually it becomes available after 2-5 minutes, but we set the Lambda timeout to 15 minutes (the maximum) just to be safe.

I've also tried using the "availability" event, but unfortunately it's reserved for shutdowns and restarts. It won't be triggered if the RDS instance wasn't deleted or stopped recently.
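
As an aside, the AWS SDK for JavaScript (v2) also ships built-in waiters that could replace the hand-rolled polling loop above. A minimal sketch, assuming the per-call $waiter override is available in your SDK version (the default waiter polls every 30 seconds for up to 60 attempts, which would exceed the 15-minute Lambda limit):


const AWS = require('aws-sdk');
const rds = new AWS.RDS();

// Sketch: use the SDK's built-in waiter for the "available" state instead of
// polling describeDBInstances manually. The $waiter override keeps the total
// wait below the 15-minute Lambda limit (10 s * 80 attempts ≈ 13 minutes).
const waitUntilRDSAvailable = (dbInstanceIdentifier) =>
  rds.waitFor('dBInstanceAvailable', {
    DBInstanceIdentifier: dbInstanceIdentifier,
    $waiter: { delay: 10, maxAttempts: 80 }
  }).promise();
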


CloudFormation resource

Lambda function definition



  RDSScaleUpFailoverFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
      Runtime: nodejs16.x
      Timeout: 900 # 15 minutes; wait for RDS instance to be available
      Code:
        ZipFile: |
          const AWS = require('aws-sdk');
          const rds = new AWS.RDS();
          // ...
      Environment:
        Variables:
          ClusterIdentifier:
            Ref: RDSCluster
          DBInstanceIdentifier: myapp-postgres-instance-3



EventBridge rule to trigger the Lambda function in the event of instance creation



  RDSCreateDBInstanceEventRule:
    Type: "AWS::Events::Rule"
    Properties:
      Description: After DB instance created, invoke Lambda to failover RDS cluster
      EventPattern:
        source:
          - aws.rds
        detail-type:
          - RDS DB Instance Event
        detail:
          SourceIdentifier:
            - myapp-postgres-instance-3
          EventCategories:
            - creation
      State: ENABLED
      Targets:
        - Arn: !GetAtt RDSScaleUpFailoverFunction.Arn
          Id: RDSScaleUpFailoverTarget


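For reference, the RDS event delivered to the Lambda looks roughly like the following (abridged and illustrative; SourceType, SourceIdentifier and EventCategories are the fields the rule and the handler above match on):


{
  "source": "aws.rds",
  "detail-type": "RDS DB Instance Event",
  "detail": {
    "SourceType": "DB_INSTANCE",
    "SourceIdentifier": "myapp-postgres-instance-3",
    "EventCategories": ["creation"],
    "Message": "DB instance created"
  }
}
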

Scheduled Scale-down for Aurora

[Diagram: scheduled scale-down flow]

  1. Schedule a Lambda function to run after write peaks.
  2. The function fails over the cluster to one of the smaller instances.
  3. On cluster failover, an RDS event triggers a second Lambda function.
  4. That function removes the large instance created during scale-up.

Scale down by failing over to a smaller instance

Lambda function to fail over to the smaller instance



const AWS = require('aws-sdk');
const rds = new AWS.RDS();

exports.handler = async () => {
  const params = {
      DBClusterIdentifier: process.env.ClusterIdentifier,
      TargetDBInstanceIdentifier: process.env.DBInstanceIdentifier
  };
  try {
      await rds.failoverDBCluster(params).promise();
      console.log('Failover completed successfully');
  } catch (error) {
      console.error('Error during failover:', error);
      throw error;
  }
};



CloudFormation

Lambda function definition



  RDSScaleDownFailoverFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
      Runtime: nodejs16.x
      Timeout: 60
      Code:
        ZipFile: |
          const AWS = require('aws-sdk');
          const rds = new AWS.RDS();
          // ...
      Environment:
        Variables:
          ClusterIdentifier:
            Ref: RDSCluster
          DBInstanceIdentifier: myapp-postgres-instance-1



Scheduled rule to invoke the function after the peak



  ScheduledRDSScaleDownFailover:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "cron(25 14 ? * MON-FRI *)"
      State: !FindInMap [EnvToParams, !Ref EnvironmentType, RDSDBScaleUpEnabled]
      Targets:
        - Arn: !GetAtt RDSScaleDownFailoverFunction.Arn
          Id: RDSScaleDownFailoverFunction


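The State value above comes from a Mappings section that isn't shown in this article. A minimal sketch of what such a mapping could look like (the environment names here are placeholders, not my actual template):


# Hypothetical mapping assumed by the !FindInMap above:
# enable the scheduled scale-down rule only where it makes sense.
Mappings:
  EnvToParams:
    production:
      RDSDBScaleUpEnabled: ENABLED
    staging:
      RDSDBScaleUpEnabled: DISABLED
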

Remove the created RDS instance in response to failover

Lambda function for removing the previously created instance



const AWS = require('aws-sdk');
const rds = new AWS.RDS();
const logger = require('console');

exports.handler = async (event, context) => {
  logger.log("Received event: ", JSON.stringify(event));
  const detail = event.detail;
  // ensure failover was for target instance
  if (detail && detail.SourceType === "CLUSTER" && detail.EventCategories.includes("failover") && detail.Message.includes(process.env.FailoverTargetDBInstanceIdentifier)) {
      logger.log(`Received a cluster failover event to ${process.env.FailoverTargetDBInstanceIdentifier}`);
      const params = {
          DBInstanceIdentifier: process.env.DBInstanceToRemoveIdentifier
      };
      try {
          await rds.deleteDBInstance(params).promise();
          console.log('Old replica instance deleted successfully');
      } catch (error) {
          console.error('Error during cleanup:', error);
          throw error;
      }
  } else {
      logger.log(`Received an event, but it is not a cluster failover event for ${process.env.FailoverTargetDBInstanceIdentifier}`);
  }
};



CloudFormation resources for removing RDS instance

Lambda Function definition



  RemoveRDSInstanceFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
      Runtime: nodejs16.x
      Timeout: 60
      Code:
        ZipFile: |
          const AWS = require('aws-sdk');
          const rds = new AWS.RDS();
          // ...
      Environment:
        Variables:
          DBInstanceToRemoveIdentifier: myapp-postgres-instance-3
          FailoverTargetDBInstanceIdentifier: myapp-postgres-instance-1



EventBridge rule to trigger the Lambda function in the event of cluster failover.



  RDSClusterFailoverEventRule:
    Type: "AWS::Events::Rule"
    Properties:
      Description: After RDS cluster failover to original reader, invoke Lambda to delete extra instance
      EventPattern:
        source:
          - aws.rds
        detail-type:
          - RDS DB Cluster Event
        detail:
          SourceIdentifier:
            - Ref: RDSCluster
          EventCategories:
            - failover
      State: ENABLED
      Targets:
        - Arn: !GetAtt RemoveRDSInstanceFunction.Arn
          Id: RDSScaleDownFailoverTarget



CloudFormation permissions and roles

Lambda Execution Role



  AuroraScaleUpLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaExecutionPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup # allow lambda to write to CloudWatch logs
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - iam:PassRole # allow lambda to pass rds-monitoring role to rds instance
                  - rds:DescribeDBInstances
                  - rds:FailoverDBCluster
                  - rds:CreateDBInstance
                  - rds:DeleteDBInstance
                Resource: "*"



Permissions for the EventBridge rules to invoke Lambda



  InvokeCreateRDSInstancePermission:
    Type: "AWS::Lambda::Permission"
    Properties:
      Action: "lambda:InvokeFunction"
      FunctionName: !Ref CreateRDSInstanceFunction
      Principal: "events.amazonaws.com"
      SourceArn: !GetAtt ScheduledCreateRDSInstance.Arn

  InvokeRDSScaleUpFailoverPermission:
    Type: "AWS::Lambda::Permission"
    Properties:
      Action: "lambda:InvokeFunction"
      FunctionName: !Ref RDSScaleUpFailoverFunction
      Principal: "events.amazonaws.com"
      SourceArn: !GetAtt RDSCreateDBInstanceEventRule.Arn

  InvokeRDSScaleDownFailoverPermission:
    Type: "AWS::Lambda::Permission"
    Properties:
      Action: "lambda:InvokeFunction"
      FunctionName: !Ref RDSScaleDownFailoverFunction
      Principal: "events.amazonaws.com"
      SourceArn: !GetAtt ScheduledRDSScaleDownFailover.Arn

  InvokeRemoveRDSInstancePermission:
    Type: "AWS::Lambda::Permission"
    Properties:
      Action: "lambda:InvokeFunction"
      FunctionName: !Ref RemoveRDSInstanceFunction
      Principal: "events.amazonaws.com"
      SourceArn: !GetAtt RDSClusterFailoverEventRule.Arn





Concerns

1. Downtime

The first worry with this approach is whether we will experience downtime with cluster failover. I ran some load testing, and it seems there's almost zero downtime with Aurora failover.

The load test consisted of 10,000 requests, with the failover executed midway through the test. All requests went through successfully.

[Image: k6 load test results showing all requests succeeding across the failover]

You can learn more about setting up your own load testing with k6 and EKS in my other blog post (Japanese only).
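
As a rough illustration, a minimal k6 sketch of this kind of test looks like the following (the endpoint and the request mix are placeholders, not my actual setup):


import http from 'k6/http';
import { check } from 'k6';

// Minimal sketch of the failover load test: send a fixed number of requests to a
// DB-backed endpoint while the failover is triggered midway through the run.
// The URL is a placeholder, not the real service under test.
export const options = {
  vus: 50,            // concurrent virtual users
  iterations: 10000,  // total requests shared across all VUs
};

export default function () {
  const res = http.get('https://example.com/api/items');
  check(res, { 'status is 200': (r) => r.status === 200 });
}
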

2. Excess replication lag

AWS recommends using the same instance size across your RDS instances. The reason is that a weaker reader instance might not be able to keep up with the volume of changes from a stronger writer instance. This can cause the read replica to shut down when it falls too far behind the writer.

My current setup is single-AZ with two db.t3.medium instances and one db.r6g.xlarge instance. I have monitored AuroraReplicaLag during spike tests, and it has never exceeded 85 ms.

Shutdown issues caused by replica lag may occur after 60 seconds, so be careful if you have a multi-AZ setup with hugely different instance sizes.

3. Data replication

One of my colleagues was concerned that instances might start up more slowly once data accumulates, and that 30 minutes might not be enough for a new instance to spin up.

Unlike regular RDS, Aurora has a separate storage layer that is synchronized across AZs.

[Diagram: Aurora instances sharing a distributed storage layer across AZs]

While there may be caches residing on the instances, there is no need to replicate data across AZs when spinning up new instances. Therefore, the time to spin up new instances in theory shouldn't be affected by the size of the data store.

After some testing with inflating the database from 300 MB to 128 GB, the startup time remained consistently around 8 minutes.

Notes on automatic scale up

If write workloads are unpredictable, you can replace the EventBridge scheduled events with CloudWatch alarm triggers, as sketched below. That being said, starting a new Aurora replica and failing over could take well over 10 minutes in my experience. With no AWS guarantee on how long it takes, this might not be reactive enough, and the recommended way would be to just use a larger instance, or to use other services like DynamoDB or Redis in cluster mode instead.
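
A rough sketch of what an alarm-driven trigger could look like, assuming a cluster-level WriteLatency alarm and the standard CloudWatch alarm state-change event (the metric, threshold and names are placeholders, and you would still need a matching AWS::Lambda::Permission as with the rules above):


  # Hypothetical alarm on cluster write latency; the threshold is a placeholder.
  HighWriteLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/RDS
      MetricName: WriteLatency
      Dimensions:
        - Name: DBClusterIdentifier
          Value: !Ref RDSCluster
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: 0.05 # seconds
      ComparisonOperator: GreaterThanThreshold

  # Invoke the scale-up Lambda when the alarm enters the ALARM state.
  WriteLatencyAlarmRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.cloudwatch
        detail-type:
          - CloudWatch Alarm State Change
        detail:
          alarmName:
            - !Ref HighWriteLatencyAlarm
          state:
            value:
              - ALARM
      State: ENABLED
      Targets:
        - Arn: !GetAtt CreateRDSInstanceFunction.Arn
          Id: CreateRDSInstanceOnAlarm
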
