AWS Outpost and RDS: Reslotting Checklist

Overview

At Underdog, we use Amazon Web Services (AWS) Relational Database Service (RDS) and Elastic Kubernetes Service (EKS), among many other services, to power our sports betting applications. Most readers will be familiar with these services; if you aren’t, I will briefly explain most of the terms and technologies as we go. What makes this post different is a more recent AWS service we’ve been using in conjunction with RDS and EKS: AWS Outposts (Outpost for short in this post).

To comply with United States state gaming regulations, we use Outpost to satisfy data and application residency and security requirements in particular states where we operate. Outpost is great for delivering compute and storage services close to where we operate when that location lies outside Amazon’s many regional centers.

This post details some of the challenges we recently ran into while running and operating these services together. We hope the experience will enlighten you, dear reader, if you ever need to venture down this road. If you are already familiar with configuring and operating these services together, you might instead enjoy reading it with the pleasure and satisfaction of schadenfreude.

AWS Regions and Outpost

If you are unfamiliar with Outpost (as I was initially), you may wonder what the service is and why Underdog chose to use it. As you may already know, AWS regions are spread across the globe so that AWS’ cloud services sit close to major population centers and, therefore, to the most consumers. A country can host several regions; in the United States, for example, there are regions in Virginia (us-east-1), Ohio (us-east-2), and Oregon (us-west-2), among others.

What happens if Underdog wants to operate services inside a state that does not host an AWS region but nevertheless needs to colocate cloud services within that state’s borders? There are many options for independently hosting our own hardware and networking gear, but we wanted to continue to enjoy the automation and speed of deploying Infrastructure as a Service (IaaS) via Application Programming Interfaces (APIs) and so-called Infrastructure as Code (IaC). This is where the promise of AWS Outpost comes into play.

With AWS Outpost, customers like us can order capacity from AWS and deploy it into the hosting facility of our choice, connected back to AWS through AWS Direct Connect or VPN services, to provide AWS services in the particular location where we need to operate. While Outpost capacity is inherently limited and is not “instantly” deployable or provisioned like other services, at least we do not need to directly manage infrastructure, networking, security, or software ourselves. Also, any services that are provisioned inside the Outpost are managed by the same AWS APIs and dashboards that we are already using. Most importantly, the resources inside the Outpost can be shared by our staging and production AWS accounts and Virtual Private Clouds (VPCs) in a relatively seamless and integrated way, letting us operate from a location outside an established AWS region.

Getting Familiar With Outpost

As with all AWS services, it is important to know each service’s strengths, weaknesses, and limitations. It is also important to understand how one or more services may (or may not) interoperate! This is where the majority of our problems originated as we went into provisioning our applications in a remote environment. Provisioning Elastic Compute Cloud (EC2) instances to use as node groups in EKS is a relatively well-understood solution and works well with Outpost configurations. The RDS integration with Outpost was not as easy to understand or configure, and we ran into issues with basics like choosing instance sizes and using Outpost instances with RDS.

The first issue we encountered was choosing the correct capacity and instance sizes for both compute and RDS. If you are spoiled by AWS’ amazing depth and breadth of instance sizes, architectures, and variety, you will need to reorganize your thinking around a fixed set of capacity and architecture limitations for your Outpost racks and/or servers. As an example, let’s say that you have one rack with four m5.24xlarge units of “raw” capacity. You can subdivide that capacity as you see fit. Let’s say that you start conservatively (as we did) with instance sizes that are far too large: eight m5.12xlarge instances, spread across staging, production, RDS, and EKS as follows (please note all drawings are for illustrative purposes only and should not be relied upon for factual reference):

This seemed like a reasonable starting point for us and would allow us plenty of capacity and resources to operate without worrying about needing more vertical scaling capacity. With this setup, we were able to successfully launch services relatively quickly and simply, all options considered.
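The arithmetic behind that example rack can be sketched in a few lines of Python. This is only a rough model: it assumes capacity can be compared by vCPU count alone (using the published m5 vCPU figures), and the function names are made up for illustration. Real Outpost slotting is done per host by AWS and has more constraints than a simple sum.

```python
# vCPU counts for the m5 sizes discussed in this post (published EC2 specs).
M5_VCPUS = {
    "m5.4xlarge": 16,
    "m5.8xlarge": 32,
    "m5.12xlarge": 48,
    "m5.24xlarge": 96,
}


def total_vcpus(layout: dict[str, int]) -> int:
    """Sum the vCPUs consumed by a layout of {instance_size: count}."""
    return sum(M5_VCPUS[size] * count for size, count in layout.items())


def spare_vcpus(raw: dict[str, int], slotted: dict[str, int]) -> int:
    """vCPUs left unslotted; a negative result means the layout does not fit."""
    return total_vcpus(raw) - total_vcpus(slotted)


raw_capacity = {"m5.24xlarge": 4}    # the example rack from the text
initial_layout = {"m5.12xlarge": 8}  # the conservative first slotting

print(total_vcpus(raw_capacity))                  # 384
print(spare_vcpus(raw_capacity, initial_layout))  # 0
```

In this simplified model the eight m5.12xlarge slots consume the whole rack, which is exactly why there was no room left to maneuver later.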

Right Size Scaling

If you are familiar with AWS services, you may now be thinking to yourself, as we later did, “Whoa, this seems like a lot of capacity to use for RDS and EKS!” The truth is that if we had had more time, insight, and experience, we might have been able to come up with a much better provisioning scheme to right-size the capacity for each use case. The engineering issue isn’t so much the large over-allocation of capacity (which is a concern), but rather the downstream operational issues such as having spare capacity, being able to migrate or upgrade services and versions, and being able to scale horizontally (instead of vertically) as needed to meet demand.

After some post-launch issues were settled, we analyzed the data and came up with a much more reasonable allocation scheme that would suit our needs now and in the future. We settled on something that looked more like the following drawing.

You will notice that we have much more reasonably-sized (but still beefy) m5.8xlarge RDS databases. We also have enough capacity to add more RDS replicas (or new primaries even), and plenty of application-worthy EKS worker nodes for redundancy and failover. Not only that, but we now have way more free “spare” capacity for future needs as either smaller RDS databases or as beefier EKS workloads emerge.

Armed with this new information we let our AWS representatives know about our plans and future configuration. We had a pretty good idea of the plan of operations and had laid out a strategy for performing the changes “in place” with a reasonable amount of downtime during a maintenance window.

Best Laid Plans of Mice and Men

Experts in RDS and Outpost may already see the issue we were about to face and will be chuckling to themselves, but this was the original plan we were going to follow to migrate from the original configuration to the new capacity configuration. We had consciously chosen to keep the maintenance window as short as possible by re-slotting the entire Outpost rack in one window, without causing undue issues for either staging or production. We did not have multiple Outpost racks available to work with, but this is something we will definitely consider in the future.

See if you can spot the issue:

1. Initiate downtime maintenance window in the application by shutting down all application services and issuing maintenance page notifications
2. Temporarily stop staging RDS instances and EKS worker nodes
3. Take snapshots for disaster recovery purposes
4. Temporarily stop production RDS instances and EKS worker nodes
5. Take snapshots for disaster recovery purposes
6. AWS support will apply new Outpost reslotting configuration
7. Restart staging and production RDS instances with new sizes
8. Restart staging and production EKS worker nodes and join to clusters
9. Test and validate the applications
10. End the maintenance window and allow normal operations

If you spotted the issue in step 7 labeled “restart staging and production RDS instances”, congratulations! For everyone else, you can follow along and learn from our experience when you attempt to do this yourself.

You Can't Modify a Stopped DB Instance.

This statement from the AWS documentation should be tattooed on the forehead of every AWS RDS practitioner, in reverse so that they can read it in the mirror. Or, perhaps, both forward and reverse: one for people looking at the tattoo and one for the practitioner looking in the mirror. The issue we immediately faced when we tried to restart the databases was that the previous db.m5.12xlarge instance size was no longer available in our Outpost configuration, so we could not start the RDS instances. We also could not convert them to the new db.m5.8xlarge instance size that did exist in the new configuration, because the databases were shut down!
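The trap described above can be expressed as a small pre-flight check. The following is a minimal sketch in plain Python, not something we actually ran: the DbInstance type, the reslot_blockers function, and the instance names are all hypothetical. It simply flags any database that is stopped (so it cannot be modified) while its current class is missing from the planned configuration (so it cannot be started).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbInstance:
    identifier: str      # hypothetical instance name
    instance_class: str  # class the instance is currently configured with
    status: str          # "available" or "stopped"


def reslot_blockers(instances: list[DbInstance],
                    classes_after_reslot: set[str]) -> list[str]:
    """Return identifiers of instances that the reslot would strand.

    An instance is stranded when it is stopped (cannot be modified) and its
    current class will not exist after the reslot (cannot be started).
    """
    return [
        db.identifier
        for db in instances
        if db.status == "stopped"
        and db.instance_class not in classes_after_reslot
    ]


instances = [
    DbInstance("staging-db", "db.m5.12xlarge", "stopped"),
    DbInstance("prod-db", "db.m5.12xlarge", "stopped"),
]

# The original plan: only the new size exists after the reslot.
print(reslot_blockers(instances, {"db.m5.8xlarge"}))

# An interim configuration that keeps the old size alongside the new one.
print(reslot_blockers(instances, {"db.m5.8xlarge", "db.m5.12xlarge"}))
```

With the original plan both databases are reported as stranded; once the old size is kept around for the interim, the list comes back empty.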

I’m not exaggerating too much when I say that I briefly believed we had made a fatal mistake and were going to be down in production for hours doing a disaster recovery. It is very important in these situations not to panic but to stay calm, talk through your options, and decide on a safe course of corrective action.

Fortunately, we had the following in our favor, which you should also have at your disposal if you attempt anything like this. We made sure we had an AWS representative and AWS support people on the call while our maintenance window was active and our own team was engaged. This enabled us to get real-time feedback on the reslotting process, get answers to our RDS and Outpost questions from AWS, and also (critically) reconfigure the capacity online while we were in the midst of trying to salvage our operations. If you do not have enterprise support, you will most likely not be able to resolve a situation like this, and you should not even attempt something like this without it.

Failure is Not an Option

We quickly broke out our calculators, slide rules, and pocket pens to come up with an emergency configuration that would provide both the old db.m5.12xlarge capacity and the target db.m5.8xlarge capacity at the same time. It was like that scene in Apollo 13 where Ed Harris says that we’ve never lost a database in the cloud, and we were going to use everything at our disposal to figure out a solution. We were able to come up with the following configuration to solve our issue.

Fortunately, we had enough of the “spare” capacity to configure as RDS interim instances. Later on, AWS could then reslot the unused capacity back into our spare capacity as needed. There was a huge sigh of relief as the configuration was reslotted and the databases were started, modified to new instance sizes, then rebooted as the correct target instance size!
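For readers who script this kind of thing, the start, modify, and restart sequence might look roughly like the sketch below. It is written against the boto3 RDS client method names (start_db_instance, modify_db_instance, and the db_instance_available waiter), but the helper function, the DryRunRds stand-in, and the instance name are hypothetical, and this is not the exact procedure we followed on the call. The stand-in only records calls, so the sketch runs without AWS credentials.

```python
def recover_stopped_instance(rds, identifier: str, target_class: str) -> None:
    """Start a stopped DB instance, then move it to the target class.

    `rds` is expected to behave like a boto3 RDS client.
    """
    waiter = rds.get_waiter("db_instance_available")

    # 1. Start the instance on its ORIGINAL class; that capacity must exist.
    rds.start_db_instance(DBInstanceIdentifier=identifier)
    waiter.wait(DBInstanceIdentifier=identifier)

    # 2. Only an available instance can be modified. Applying a class change
    #    immediately restarts the database on the new class.
    rds.modify_db_instance(
        DBInstanceIdentifier=identifier,
        DBInstanceClass=target_class,
        ApplyImmediately=True,
    )
    # Caveat: the status can still read "available" for a short time after
    # the modify call, so a production script should also confirm the
    # reported instance class before moving on.
    waiter.wait(DBInstanceIdentifier=identifier)


class DryRunRds:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def get_waiter(self, name: str) -> "DryRunRds":
        self.calls.append(("get_waiter", name))
        return self

    def wait(self, **kwargs) -> None:
        self.calls.append(("wait", kwargs["DBInstanceIdentifier"]))

    def start_db_instance(self, **kwargs) -> None:
        self.calls.append(("start", kwargs["DBInstanceIdentifier"]))

    def modify_db_instance(self, **kwargs) -> None:
        self.calls.append(("modify", kwargs["DBInstanceClass"]))


client = DryRunRds()
recover_stopped_instance(client, "prod-db", "db.m5.8xlarge")
print([operation for operation, _ in client.calls])
```

Passing the client in as a parameter keeps the ordering logic testable on its own; swapping in a real boto3 client is then a one-line change, assuming the account and region are configured.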

Summary

In summary, we learned quite a bit and hope that you have too, if you have any plans for Outpost capacity in your future. In no particular order these lessons are:

Always plan, check your plan, recheck your plan, and have a backup plan
Always work closely with your AWS support and representatives to avoid problems like this if you can, and arrange in advance to have them available when you need them
Always stay calm and consider your options. Stick to the plan but react appropriately when circumstances change
Read your documentation and pay close attention to every detail as it impacts your planned path
When migrating capacity, ensure enough spare capacity is available before, during, and after your migration
Use multiple phases of the migration plan where possible; consider initial phases, interim phases, and final phases
Please, please, please give your AWS support people and representatives a big show of appreciation for their hard work, dedication, and help!
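To make the spare-capacity and phasing lessons concrete, here is a small sketch of the per-phase arithmetic. It makes a deliberately conservative assumption that every database needs its old-size slot and its new-size slot at the same time during the interim phase. The database count and the function names are hypothetical; the vCPU figures are those of the underlying m5 sizes.

```python
# vCPU counts for the two RDS classes named in this post.
VCPUS = {"db.m5.8xlarge": 32, "db.m5.12xlarge": 48}


def phase_demand(db_count: int, old_class: str, new_class: str) -> dict[str, int]:
    """vCPUs the databases need in each phase of a resize.

    Conservative assumption: during the interim phase each database needs
    both its old slot (to start) and its new slot (to be modified into).
    """
    old = db_count * VCPUS[old_class]
    new = db_count * VCPUS[new_class]
    return {"before": old, "interim": old + new, "after": new}


def peak_fits(demand: dict[str, int], available_vcpus: int) -> bool:
    """True when the most demanding phase fits in the available capacity."""
    return max(demand.values()) <= available_vcpus


demand = phase_demand(2, "db.m5.12xlarge", "db.m5.8xlarge")
print(demand)                  # {'before': 96, 'interim': 160, 'after': 64}
print(peak_fits(demand, 128))  # False
print(peak_fits(demand, 192))  # True
```

The interim phase is the expensive one: it needs more capacity than either the starting or the final layout, which is the part our original plan overlooked.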

About the Author

Regis is a staff platform engineer at Underdog. He has designed, built, and operated cloud-native architectures since 2015.

We're Hiring!

If you want to work on exciting projects like these with exciting people like me, please check out our hiring page

Image Credit

Photo by Torsten Dederichs on Unsplash
