Scheduled Vertical Scaling of AWS Aurora with EventBridge & Lambda in CloudFormation
Mathew Chan
Posted on February 28, 2024
Aurora under heavy write workloads
The following are the database load metrics of our Aurora cluster under peak load with auto-scaling enabled.
We adjusted the size of our instances after each test to see how much they would throttle under load.
db.t3.medium instances - extreme throttling
db.r6g.large instances - moderate throttling
db.r6g.xlarge instances - no throttling
In short, auto-scaling in Aurora does not handle heavy writes.
Note: even with throttling, the application may still see reasonable response times from the DB because the t3 family is burstable, but bursting consumes CPU credits and incurs extra cost.
Problem: handling periodic heavy writes
You can scale out an Amazon Aurora cluster by adding read replicas, or scale it up by using larger instances.
Auto-scaling on instance-based Aurora only supports scaling out, not scaling up. While extra read replicas help with increased read workloads, they don't help with increased write workloads.
Since there is only one writer endpoint (master), the only way to handle heavy write workloads is to replace the master instance with a larger one. However, changing the master instance's size in place causes unwanted downtime for the database.
This article will present an approach to using EventBridge to schedule a scale-up of an Aurora writer instance with minimal downtime. Traffic will be simulated with load tests using k6 and verified against RDS performance metrics.
Note on Aurora Multi-Master and Aurora Serverless v2
While Aurora multi-master was once a feature of Aurora MySQL 5.6, it has since been deprecated, and if you are using Aurora for PostgreSQL like me, you're out of luck.
On the other hand, Aurora Serverless can scale for increasing write workloads. Serverless v1 wasn't suitable for always-on web services given its long wake-up time from idle (~30 seconds), and Serverless v2, while free of the cold-start issue, is prohibitively expensive.
From: AWS Aurora Serverless V2 — What’s new?
Each GB of Serverless V2 RAM is twice the price of V1 and more than 3 times the price of provisioned Aurora capacity (Sam Gibbons)
Not to mention that:
- you can reserve instances in regular Aurora for even more cost savings.
- RDS Proxy is more expensive on Serverless v2 unless you have half a dozen replicas.
As there's always an instance running, v2 is more of a managed service than a serverless offering. It all feels like a marketing ploy to align it with pay-per-use services like Lambda.
Other AWS services like DynamoDB or Redis in cluster mode support write scaling by default, so I was quite surprised to learn that write scaling isn't a built-in feature of Aurora.
Scheduled Scale-up for Aurora
- Schedule a Lambda function to run before write peaks.
- The function launches a db.r6g.xlarge instance into the Aurora cluster.
- On instance creation, an RDS event triggers a Lambda function.
- The function fails over the cluster to the newly created instance.
Scale up - Launching a new instance to the Aurora cluster
Lambda function for launching a db.r6g.xlarge instance to the cluster
const AWS = require('aws-sdk');
const rds = new AWS.RDS();

exports.handler = async (event) => {
  try {
    const params = {
      DBClusterIdentifier: process.env.DBClusterIdentifier,
      DBInstanceIdentifier: process.env.DBInstanceIdentifier, // rds with higher write
      Engine: 'aurora-postgresql',
      DBInstanceClass: process.env.DBInstanceClass,
      EngineVersion: '14.6',
      PubliclyAccessible: false,
      AvailabilityZone: process.env.AvailabilityZone,
      MultiAZ: false,
      EnablePerformanceInsights: true,
      MonitoringInterval: 60,
      MonitoringRoleArn: process.env.MonitoringRoleArn
    };
    const data = await rds.createDBInstance(params).promise();
    console.log('Aurora instance created successfully:', data.DBInstance);
    return data.DBInstance;
  } catch (err) {
    console.error('Error creating Aurora instance:', err);
    throw err;
  }
};
CloudFormation Resources
Lambda function definition
CreateRDSInstanceFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
    Runtime: nodejs16.x
    Timeout: 60
    Code:
      ZipFile: |
        const AWS = require('aws-sdk');
        const rds = new AWS.RDS();
        // ...
    Environment:
      Variables:
        DBClusterIdentifier:
          Ref: RDSCluster
        DBInstanceIdentifier: myapp-postgres-instance-3
        DBInstanceClass: db.r6g.xlarge
        AvailabilityZone: ap-northeast-1a
        MonitoringRoleArn: !Join
          - ""
          - - "arn:aws:iam::"
            - !Ref AWS::AccountId
            - ":role/rds-monitoring-role"
Scheduled rule to invoke the function before the peak
ScheduledCreateRDSInstance:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(40 12 ? * MON-FRI *)"
    State: ENABLED
    Targets:
      - Arn: !GetAtt CreateRDSInstanceFunction.Arn
        Id: CreateRDSInstanceFunction
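EventBridge rule schedules are evaluated in UTC, so `cron(40 12 ? * MON-FRI *)` fires at 12:40 UTC rather than at 12:40 local time. Since this stack runs in ap-northeast-1, it can help to check what that means in Japan time. The helper below is a minimal sketch for that; `cronTimeInZone` is a hypothetical name, and it only understands cron expressions with fixed numeric minute and hour fields.

```javascript
// Minimal sketch: show the local time of day at which a simple EventBridge
// cron expression fires. Assumes fixed numeric minute and hour fields
// (no ranges, lists, or wildcards in those two positions).
function cronTimeInZone(expression, timeZone) {
  const match = /^cron\((\d+)\s+(\d+)\s/.exec(expression);
  if (!match) {
    throw new Error(`Unsupported cron expression: ${expression}`);
  }
  const [, minute, hour] = match;
  // The calendar date is arbitrary; only the time of day matters here.
  const utc = new Date(Date.UTC(2024, 0, 1, Number(hour), Number(minute)));
  const parts = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23',
  }).formatToParts(utc);
  const get = (type) => parts.find((p) => p.type === type).value;
  return `${get('hour')}:${get('minute')}`;
}

// Scale-up schedule from the rule above.
console.log(cronTimeInZone('cron(40 12 ? * MON-FRI *)', 'Asia/Tokyo')); // 21:40
```

Time zones with daylight saving would need a real date instead of the arbitrary one used here.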
Scale up - Failover to newly created instance
Lambda function for failing over to new instance
const AWS = require('aws-sdk');
const rds = new AWS.RDS();
const logger = require('console');

const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Poll the instance status until it is 'available' or the attempts run out.
const waitUntilRDSAvailable = async (dbInstanceIdentifier, maxAttempts = 90, pollingInterval = 10000) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await rds.describeDBInstances({ DBInstanceIdentifier: dbInstanceIdentifier }).promise();
      const dbInstance = response.DBInstances[0];
      const status = dbInstance.DBInstanceStatus;
      if (status === 'available') {
        console.log("RDS instance is available!");
        return;
      } else {
        console.log(`RDS instance is not yet available (Status: ${status}). Retrying...`);
        await wait(pollingInterval);
      }
    } catch (error) {
      console.error("Error describing RDS instance:", error);
      throw error; // Propagate the error to the caller
    }
  }
  console.error("Timeout: RDS instance did not become available within the specified time.");
  throw new Error("Timeout: RDS instance did not become available within the specified time.");
};

exports.handler = async (event, context) => {
  logger.log("Received event: ", JSON.stringify(event));
  const detail = event.detail;
  if (detail && detail.SourceType === "DB_INSTANCE" && detail.EventCategories.includes("creation") && detail.SourceIdentifier === process.env.DBInstanceIdentifier) {
    logger.log(`Received an instance creation event for ${process.env.DBInstanceIdentifier}`);
    const params = {
      DBClusterIdentifier: process.env.ClusterIdentifier,
      TargetDBInstanceIdentifier: process.env.DBInstanceIdentifier
    };
    // Step 1: wait for the new instance. Kept in its own try block so the
    // error message below only ever describes a failed wait.
    try {
      await waitUntilRDSAvailable(process.env.DBInstanceIdentifier, 90, 10000); // 15 minutes
    } catch (error) {
      return { statusCode: 500, body: "Error waiting for RDS instance to become available: " + error.message };
    }
    // Step 2: fail over. The call returns once the request is accepted;
    // the failover itself finishes asynchronously.
    try {
      await rds.failoverDBCluster(params).promise();
      console.log('Failover initiated successfully');
    } catch (error) {
      console.error('Error during failover:', error);
      throw error;
    }
  } else {
    logger.log(`Received an event, but it is not an instance creation event for ${process.env.DBInstanceIdentifier}`);
  }
};
After an RDS creation event, the RDS instance isn't immediately available, so we have to poll its status periodically until it is. It usually becomes available within 2-5 minutes, but we set the Lambda timeout to 15 minutes (the maximum) just to be safe.
I also tried using the "availability" event, but unfortunately it's reserved for shutdowns and restarts. It isn't triggered unless the RDS instance was recently shut down or restarted.
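One detail worth noting: 90 attempts at 10-second intervals adds up to the same 900 seconds as the Lambda timeout, so in the worst case the runtime would stop the function before the polling loop can report its own timeout. A minimal sketch of a deadline-aware variant follows. It uses `context.getRemainingTimeInMillis()`, which the Lambda Node.js runtime provides; `pollUntil`, `checkFn`, and `safetyMarginMs` are hypothetical names, not part of the original setup.

```javascript
// Minimal sketch: poll until checkFn() resolves to true, but give up while
// the Lambda invocation still has enough time left to log and fail cleanly.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollUntil(checkFn, context, { intervalMs = 10000, safetyMarginMs = 15000 } = {}) {
  let attempts = 0;
  while (true) {
    attempts += 1;
    if (await checkFn()) {
      return attempts; // number of checks it took
    }
    // Stop if another sleep would eat into the safety margin.
    if (context.getRemainingTimeInMillis() < intervalMs + safetyMarginMs) {
      throw new Error(`Gave up after ${attempts} attempts: not enough time left to poll again.`);
    }
    await wait(intervalMs);
  }
}

// In the handler, checkFn would wrap the describeDBInstances status check:
//   await pollUntil(async () => {
//     const res = await rds.describeDBInstances({ DBInstanceIdentifier: id }).promise();
//     return res.DBInstances[0].DBInstanceStatus === 'available';
//   }, context);
```

With this shape the function always fails with its own error message instead of a generic Lambda timeout.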
CloudFormation resource
Lambda function definition
RDSScaleUpFailoverFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
    Runtime: nodejs16.x
    Timeout: 900 # 15 minutes; wait for RDS instance to be available
    Code:
      ZipFile: |
        const AWS = require('aws-sdk');
        const rds = new AWS.RDS();
        // ...
    Environment:
      Variables:
        ClusterIdentifier:
          Ref: RDSCluster
        DBInstanceIdentifier: myapp-postgres-instance-3
EventBridge rule to trigger the lambda function in the event of instance creation
RDSCreateDBInstanceEventRule:
  Type: "AWS::Events::Rule"
  Properties:
    Description: After DB instance created, invoke Lambda to failover RDS cluster
    EventPattern:
      source:
        - aws.rds
      detail-type:
        - RDS DB Instance Event
      detail:
        SourceIdentifier:
          - myapp-postgres-instance-3
        EventCategories:
          - creation
    State: ENABLED
    Targets:
      - Arn: !GetAtt RDSScaleUpFailoverFunction.Arn
        Id: RDSScaleUpFailoverTarget
Scheduled Scale-down for Aurora
- Schedule a Lambda function to run after write peaks.
- The function fails over the cluster to one of the smaller instances.
- On cluster failover, an RDS event triggers a Lambda function.
- The function removes the large instance created during scale-up.
Scale down by failing over to smaller instance
Lambda function to fail over to smaller instance
const AWS = require('aws-sdk');
const rds = new AWS.RDS();

exports.handler = async () => {
  const params = {
    DBClusterIdentifier: process.env.ClusterIdentifier,
    TargetDBInstanceIdentifier: process.env.DBInstanceIdentifier
  };
  try {
    // The call returns once the failover request is accepted;
    // the failover itself finishes asynchronously.
    await rds.failoverDBCluster(params).promise();
    console.log('Failover initiated successfully');
  } catch (error) {
    console.error('Error during failover:', error);
    throw error;
  }
};
CloudFormation
Lambda function definition
RDSScaleDownFailoverFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
    Runtime: nodejs16.x
    Timeout: 60
    Code:
      ZipFile: |
        const AWS = require('aws-sdk');
        const rds = new AWS.RDS();
        // ...
    Environment:
      Variables:
        ClusterIdentifier:
          Ref: RDSCluster
        DBInstanceIdentifier: myapp-postgres-instance-1
Scheduled rule to invoke the function after the peak
ScheduledRDSScaleDownFailover:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(25 14 ? * MON-FRI *)"
    State: !FindInMap [EnvToParams, !Ref EnvironmentType, RDSDBScaleUpEnabled]
    Targets:
      - Arn: !GetAtt RDSScaleDownFailoverFunction.Arn
        Id: RDSScaleDownFailoverFunction
Remove created RDS in response to failover
Lambda function for removing the previously created instance
const AWS = require('aws-sdk');
const rds = new AWS.RDS();
const logger = require('console');

exports.handler = async (event, context) => {
  logger.log("Received event: ", JSON.stringify(event));
  const detail = event.detail;
  // ensure failover was for target instance
  if (detail && detail.SourceType === "CLUSTER" && detail.EventCategories.includes("failover") && detail.Message.includes(process.env.FailoverTargetDBInstanceIdentifier)) {
    logger.log(`Received a cluster failover event to ${process.env.FailoverTargetDBInstanceIdentifier}`);
    const params = {
      DBInstanceIdentifier: process.env.DBInstanceToRemoveIdentifier
    };
    try {
      await rds.deleteDBInstance(params).promise();
      console.log('Old replica instance deleted successfully');
    } catch (error) {
      console.error('Error during cleanup:', error);
      throw error;
    }
  } else {
    logger.log(`Received an event, but it is not a cluster failover event to ${process.env.FailoverTargetDBInstanceIdentifier}`);
  }
};
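The handler above decides whether to delete the large instance by checking that the failover event's message names the instance that became the writer. That check can be pulled out into a pure function and tried against a sample event, as in the rough sketch below. `isFailoverTo` is a hypothetical helper, and the sample event (including the cluster name `myapp-cluster` and the `Message` text) is illustrative rather than copied from real AWS output, so compare it with the events logged by your own cluster.

```javascript
// Minimal sketch: decide whether an EventBridge event is a cluster failover
// that names the expected instance. Field names follow the handler above.
function isFailoverTo(event, targetInstanceId) {
  const detail = event && event.detail;
  if (!detail || detail.SourceType !== 'CLUSTER') return false;
  const categories = Array.isArray(detail.EventCategories) ? detail.EventCategories : [];
  const message = typeof detail.Message === 'string' ? detail.Message : '';
  return categories.includes('failover') && message.includes(targetInstanceId);
}

// Illustrative event: the cluster name and Message text are assumptions.
const sampleEvent = {
  source: 'aws.rds',
  'detail-type': 'RDS DB Cluster Event',
  detail: {
    SourceType: 'CLUSTER',
    SourceIdentifier: 'myapp-cluster',
    EventCategories: ['failover'],
    Message: 'Completed failover to DB instance: myapp-postgres-instance-1',
  },
};

console.log(isFailoverTo(sampleEvent, 'myapp-postgres-instance-1')); // true
console.log(isFailoverTo(sampleEvent, 'myapp-postgres-instance-3')); // false
```

Because this is a substring match, an identifier such as `myapp-postgres-instance-1` would also match a message about `myapp-postgres-instance-10`; pick instance names with that in mind.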
CloudFormation resources for removing RDS instance
Lambda Function definition
RemoveRDSInstanceFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt AuroraScaleUpLambdaExecutionRole.Arn
    Runtime: nodejs16.x
    Timeout: 60
    Code:
      ZipFile: |
        const AWS = require('aws-sdk');
        const rds = new AWS.RDS();
        // ...
    Environment:
      Variables:
        DBInstanceToRemoveIdentifier: myapp-postgres-instance-3
        FailoverTargetDBInstanceIdentifier: myapp-postgres-instance-1
EventBridge rule to trigger the lambda function in the event of cluster failover.
RDSClusterFailoverEventRule:
  Type: "AWS::Events::Rule"
  Properties:
    Description: After RDS cluster failover to original reader, invoke Lambda to delete extra instance
    EventPattern:
      source:
        - aws.rds
      detail-type:
        - RDS DB Cluster Event
      detail:
        SourceIdentifier:
          - Ref: RDSCluster
        EventCategories:
          - failover
    State: ENABLED
    Targets:
      - Arn: !GetAtt RemoveRDSInstanceFunction.Arn
        Id: RDSScaleDownFailoverTarget
CloudFormation permissions and roles
Lambda Execution Role
AuroraScaleUpLambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: LambdaExecutionPolicy
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup # allow lambda to write to CloudWatch logs
                - logs:CreateLogStream
                - logs:PutLogEvents
                - iam:PassRole # allow lambda to pass rds-monitoring role to rds instance
                - rds:DescribeDBInstances
                - rds:FailoverDBCluster
                - rds:CreateDBInstance
                - rds:DeleteDBInstance
              Resource: "*"
Permissions for EventBridge Rule to invoke Lambda
InvokeCreateRDSInstancePermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    Action: "lambda:InvokeFunction"
    FunctionName: !Ref CreateRDSInstanceFunction
    Principal: "events.amazonaws.com"
    SourceArn: !GetAtt ScheduledCreateRDSInstance.Arn
InvokeRDSScaleUpFailoverPermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    Action: "lambda:InvokeFunction"
    FunctionName: !Ref RDSScaleUpFailoverFunction
    Principal: "events.amazonaws.com"
    SourceArn: !GetAtt RDSCreateDBInstanceEventRule.Arn
InvokeRDSScaleDownFailoverPermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    Action: "lambda:InvokeFunction"
    FunctionName: !Ref RDSScaleDownFailoverFunction
    Principal: "events.amazonaws.com"
    SourceArn: !GetAtt ScheduledRDSScaleDownFailover.Arn
InvokeRemoveRDSInstancePermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    Action: "lambda:InvokeFunction"
    FunctionName: !Ref RemoveRDSInstanceFunction
    Principal: "events.amazonaws.com"
    SourceArn: !GetAtt RDSClusterFailoverEventRule.Arn
Concerns
1. Downtime
The first worry with this approach is whether cluster failover causes downtime. I ran some load tests, and there seems to be almost zero downtime with Aurora failover.
The load test was conducted with 10,000 requests, with the failover executed midway through the test. All requests succeeded.
You can learn more about setting up your own load testing with k6 and EKS in my other blog post (Japanese only).
2. Excess replication lag
AWS recommends using the same size for all instances in a cluster. The reason is that a weaker reader instance might not be able to keep up with the volume of changes from a stronger writer instance. This can cause the read replica to shut down when it falls too far behind the writer.
My current setup is single-AZ with 2 db.t3.medium instances and 1 db.r6g.xlarge instance. I monitored AuroraReplicaLag during spike tests, and it never exceeded 85 ms.
Shutdown issues may occur once replica lag exceeds 60 s, so be careful if you have a multi-AZ setup with hugely different instance sizes.
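For anyone who wants to check replica lag from a script rather than the console, here is a minimal sketch that reads the `AuroraReplicaLag` metric (reported in milliseconds under the `AWS/RDS` namespace) from CloudWatch. `maxReplicaLagMs` is a hypothetical helper. It takes the CloudWatch client as a parameter (for example `new AWS.CloudWatch()` from the same aws-sdk v2 used in the Lambda functions), so it can be tried with a stub; the stub's datapoints are made up.

```javascript
// Minimal sketch: return the highest AuroraReplicaLag (in ms) that a replica
// reported over the last `minutes`, or null if there are no datapoints.
async function maxReplicaLagMs(cloudwatch, dbInstanceIdentifier, minutes = 60) {
  const end = new Date();
  const start = new Date(end.getTime() - minutes * 60 * 1000);
  const response = await cloudwatch.getMetricStatistics({
    Namespace: 'AWS/RDS',
    MetricName: 'AuroraReplicaLag',
    Dimensions: [{ Name: 'DBInstanceIdentifier', Value: dbInstanceIdentifier }],
    StartTime: start,
    EndTime: end,
    Period: 60, // one datapoint per minute
    Statistics: ['Maximum'],
  }).promise();
  const datapoints = response.Datapoints || [];
  if (datapoints.length === 0) return null;
  return Math.max(...datapoints.map((d) => d.Maximum));
}

// Stub client with made-up datapoints, so the sketch runs without AWS credentials.
const stubCloudWatch = {
  getMetricStatistics: () => ({
    promise: async () => ({ Datapoints: [{ Maximum: 12 }, { Maximum: 85 }, { Maximum: 40 }] }),
  }),
};

maxReplicaLagMs(stubCloudWatch, 'myapp-postgres-instance-1')
  .then((lag) => console.log(`Max replica lag: ${lag} ms`)); // Max replica lag: 85 ms
```

Run against the reader instances during a spike test, a value approaching the 60 s mark would be the signal to reconsider mixing instance sizes.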
3. Data replication
One of my colleagues was concerned that instances might start more slowly once data accumulates, and that 30 minutes might not be enough for a new instance to spin up.
Unlike regular RDS, Aurora has a separate storage layer that is synchronized across AZs.
While instances may hold cached data, there is no need to replicate data across AZs when spinning up new instances. Therefore, the time to spin up a new instance shouldn't, in theory, depend on the size of the data store.
After inflating the database from 300 MB to 128 GB in testing, the startup time remained consistently around 8 minutes.
Notes on automatic scale up
If write workloads are unpredictable, you can replace the EventBridge scheduled events with CloudWatch alarm triggers. That said, starting a new Aurora replica and failing over can take well over 10 minutes in my experience. With no AWS guarantee on how long it takes, this may be too slow to work as a reactive approach; in that case I'd recommend simply running a larger instance, or using another DB service such as DynamoDB or Redis in cluster mode instead.
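If you do try the alarm-driven variant, the create-instance function would receive a "CloudWatch Alarm State Change" event from EventBridge instead of a scheduled event, and should only act when the alarm transitions into the ALARM state. The following is a rough sketch of that guard; `shouldScaleUp` and the alarm name `aurora-write-load-high` are hypothetical, and which metric the alarm watches is left to you.

```javascript
// Minimal sketch: return true only when the expected alarm has just
// transitioned into ALARM, so repeated or unrelated events are ignored.
function shouldScaleUp(event, expectedAlarmName) {
  if (!event || event.source !== 'aws.cloudwatch') return false;
  if (event['detail-type'] !== 'CloudWatch Alarm State Change') return false;
  const detail = event.detail || {};
  const current = detail.state && detail.state.value;
  const previous = detail.previousState && detail.previousState.value;
  return detail.alarmName === expectedAlarmName && current === 'ALARM' && previous !== 'ALARM';
}

// Illustrative event with a hypothetical alarm name.
const alarmEvent = {
  source: 'aws.cloudwatch',
  'detail-type': 'CloudWatch Alarm State Change',
  detail: {
    alarmName: 'aurora-write-load-high',
    state: { value: 'ALARM' },
    previousState: { value: 'OK' },
  },
};

console.log(shouldScaleUp(alarmEvent, 'aurora-write-load-high')); // true
```

The same guard, checking for a transition back to OK, could drive the scale-down failover.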