Resizing EC2 instances without downtime
Chloe McAteer
Posted on June 12, 2023
As your project grows and user base expands, scaling up your compute resources becomes essential. In this blog I will explore the process of resizing AWS EC2 instances without incurring downtime and discuss some considerations to ensure a seamless transition.
Real-world walkthrough
When scaling your infrastructure, it is important to regularly evaluate whether your current instance sizes are meeting the demands of your project. Consider factors such as CPU utilisation and user growth to determine when resizing becomes necessary.
I have been working on a project that recently went live, and over the past few months its user base has been growing steadily. I have auto scaling policies in place to scale out when CPU utilisation reaches a certain threshold - however, there comes a point when you realise you need to scale up your instances instead of continuously scaling out.
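For context, a target tracking policy is one common way to express that kind of scale-out rule. The sketch below is illustrative rather than my exact configuration - the resource name and the 70% target are assumptions:

# Illustrative target tracking policy: add or remove instances to hold
# the group's average CPU utilisation around the target value
resource "aws_autoscaling_policy" "cpu_target_tracking" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.example_ec2_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70
  }
}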
I use Terraform for all of my AWS infrastructure as code, so this is where my EC2 configuration is defined. The config is set up so that there should always be at least one healthy instance.
resource "aws_autoscaling_group" "example_ec2_asg" {
name = "example-asg"
vpc_zone_identifier = [aws_subnet.private_subnet_1.id, aws_subnet.private_subnet_2.id]
launch_configuration = aws_launch_configuration.ecs_launch_config.name
force_delete = true
health_check_grace_period = 10
desired_capacity = 2
min_size = 1
max_size = var.max_size
lifecycle {
create_before_destroy = true
}
}
# EC2 launch configuration
resource "aws_launch_configuration" "example_ecs_launch_config" {
  name_prefix                 = "ecs_launch_config-"
  image_id                    = data.aws_ami.example_ami.id
  iam_instance_profile        = aws_iam_instance_profile.ecs_agent.name
  security_groups             = [aws_security_group.ecs_tasks_sg.id]
  instance_type               = var.ec2_instance_type // passing in as var
  associate_public_ip_address = false

  # Register the instance with the ECS cluster on boot
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
EOF

  depends_on = [aws_ecs_cluster.main]

  lifecycle {
    create_before_destroy = true
  }
}
In the code snippet above you can see the launch configuration for my EC2 instances. You will notice that I pass the instance type in as a variable - this is because I use different instance sizes depending on the deployment environment (e.g. QA, test, production).
You can see I have a lifecycle block set up, which specifies that a new resource should always be created before the old one is destroyed. I also have a desired capacity of 2 instances running at any time.
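For reference, here is a minimal sketch of how that instance type variable might be declared - the description and default are assumptions, not my exact code:

variable "ec2_instance_type" {
  description = "EC2 instance type for the ECS container instances"
  type        = string
  default     = "t2.small" # overridden per environment (QA, test, production)
}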
The Goal
My EC2 instances are currently t2.small and I want to update them to be t2.medium.
The Worry
For code deployments I have rolling deployments set up, but my worry was that this change is bigger than just a code update - it changes the actual compute the code is running on.
I was afraid that if I updated my Terraform configuration and applied it, it might try to replace all instances at once and cause downtime.
I decided to test this update in a test AWS environment before trying it in QA or production, so that I wouldn't cause any issues for the development team or end customers while getting to grips with the process.
The Plan
I updated the instance type variable for the environment to t2.medium and ran a terraform plan to view the pending changes:
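In practice that is a one-line change. Assuming the value lives in a per-environment tfvars file (the file name below is hypothetical), it looks something like this:

# test.tfvars - hypothetical per-environment variable file
ec2_instance_type = "t2.medium" # previously "t2.small"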
The terraform plan output showed that the change forces replacement of the launch configuration in order to update the instance type.
To test whether it would cause downtime, I confirmed the deployment in my test environment. Once the apply had finished, I went into the EC2 console and could see that my instances were both still t2.small.
I had thought that when the deployment completed, each instance would start updating to the new size. However, this is not the case - I had updated the EC2 size in the launch configuration, which only takes effect when a new instance is launched; existing instances keep their old type.
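As an aside, if you want the ASG to roll instances automatically when the launch configuration changes, the AWS provider supports an instance_refresh block on the autoscaling group. A rough sketch, with illustrative settings (I didn't use this here):

resource "aws_autoscaling_group" "example_ec2_asg" {
  # ... existing configuration from above ...

  # Start a rolling replacement of instances whenever the launch
  # configuration changes, keeping at least half the group healthy
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }
}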
The Test
I have two instances running, so I decided to stop one of them to test my theory and see what would happen. Within the AWS Console (again in my test environment), I selected one of my instances and chose to Stop it.
When I clicked this, I began hitting my API's health endpoint to verify that traffic was still being routed to the running instance, and when I checked back in the console I could see that a new medium-sized instance was automatically being created.
I hit the health endpoint a few more times to verify everything was still running as expected while the new instance was being created - and it was!
Once the new medium instance was in a ready state and I could see in ECS that the service was up and running, I stopped the second small instance. Again, once it was stopped, I could see a new medium instance being created in its place and registering in ECS.
I now have confidence that I can do this upgrade in my QA and production environments without any end user downtime.
Considerations
Utilise an infrastructure-as-code tool to help keep track of configuration updates.
Implement rolling deployments to ensure one deployment is successful before you start another.
Ensure you have a desired instance count higher than one, to enable a rolling update strategy and to help in the event that one of your instances becomes unhealthy.
Ensure you have a health endpoint set up, so you can easily test the service during the infrastructure update and make sure traffic is being routed as expected.
Have a rollback strategy in place in the event that something fails during your update.
Conclusion
Resizing EC2 instances without incurring downtime is an essential part of managing a growing project. By following best practices such as the ones above, you can seamlessly resize your instances to meet the changing demands of your application and confidently scale your infrastructure while providing a smooth user experience.