AWS RDS Backup Dilemma: Why It Is Hard to Do Well on RTO and RPO Simultaneously

Objectives and Disaster Recovery

Because a disaster event can potentially take down a workload, your objective for Disaster Recovery should be bringing your workload back up or avoiding downtime altogether. A recovery strategy is most often built upon two objectives, RTO and RPO:

Recovery time objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
Recovery point objective (RPO) is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
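
To make the two definitions concrete, here is a minimal Python sketch that measures one incident against a pair of objectives. The timestamps and the objective values are made up for illustration, and the names Objectives and evaluate_incident are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_incident(objectives, last_recovery_point, outage_start, service_restored):
    """Return (downtime, data_loss_window, rto_met, rpo_met) for one incident."""
    downtime = service_restored - outage_start
    data_loss = outage_start - last_recovery_point
    return downtime, data_loss, downtime <= objectives.rto, data_loss <= objectives.rpo


# Illustrative incident: last recovery point at 09:15, outage at 10:00, back up at 10:25.
objectives = Objectives(rto=timedelta(minutes=30), rpo=timedelta(hours=1))
downtime, data_loss, rto_met, rpo_met = evaluate_incident(
    objectives,
    last_recovery_point=datetime(2024, 1, 1, 9, 15),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 25),
)
print(downtime, data_loss, rto_met, rpo_met)  # prints: 0:25:00 0:45:00 True True
```

Note that the RTO clock measures how long the service is down, while the RPO clock measures how old the restored data is at the moment of the outage.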

(Database) Backups

Needless to say, the vast majority of software projects still contain some kind of database in their setup. That persistent data store is often one of the main subjects of the disaster recovery strategy, and backups are the answer. In the context of AWS RDS, there are two options for database backups:

RDS DB snapshots: a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases.
RDS Recovery Point in Time (point-in-time recovery, which relies on RDS automated backups): this allows you to restore a DB instance to a specific point in time, creating a new DB instance.
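
For reference, each of the two restore paths maps to a single call on the boto3 RDS client. The sketch below is a minimal illustration, not a complete restore procedure: the identifiers are placeholders, the wrapper function names are hypothetical, and the client is passed in as an argument (in real use it would be boto3.client("rds")). Real restores usually need more parameters, such as instance class, subnet group and security groups.

```python
def restore_from_snapshot(rds, snapshot_id, new_instance_id):
    """Create a new DB instance from an existing DB snapshot."""
    return rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )


def restore_to_point_in_time(rds, source_instance_id, new_instance_id, restore_time=None):
    """Create a new DB instance restored to `restore_time` (a timezone-aware datetime).

    When no time is given, ask RDS for the latest restorable time instead.
    """
    params = {
        "SourceDBInstanceIdentifier": source_instance_id,
        "TargetDBInstanceIdentifier": new_instance_id,
    }
    if restore_time is None:
        params["UseLatestRestorableTime"] = True
    else:
        params["RestoreTime"] = restore_time
    return rds.restore_db_instance_to_point_in_time(**params)


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   rds = boto3.client("rds")
#   restore_from_snapshot(rds, "my-db-hourly-20240101-1000", "my-db-restored")
#   restore_to_point_in_time(rds, "my-db", "my-db-restored")
```

Both calls return as soon as the restore has been requested; the new instance still has to reach the available status before it can serve traffic, and that waiting time is what counts towards RTO.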

The Dilemma

So much for the introduction. To make my point, we need to set some objectives first. I don't want to push it to the extreme, so let's pick some reasonable figures for RTO and RPO:

RTO: 30 mins, preferably less
RPO: 1h, preferably less

Without question, the less data we lose when recovering, the better. With this in mind, "Recovery Point in Time" always seems the better option. But is it? Have you ever tried doing a "Point in Time recovery"? It offers a better RPO, but is there any difference regarding RTO? While not all that obvious, yes, there is! Point in Time recovery requires more time to restore your data.

"Recovery Point in Time" behind the scenes

To understand why a point in time restore is slower, we need to know how it works. To do its point-in-time magic, regular snapshots are taken from time to time. On top of that, to lose as little data as possible, a log containing all operations (the transaction log, such as the binary log on MySQL) is stored on AWS S3. So whenever you run a Point in Time recovery, the time to replay that log is added on top of the regular snapshot restore time. Of course, the extra time will depend on the age of the snapshot and the average number of write operations that you run on the database. But inevitably, it will take you extra time.
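
The reasoning above can be put into a back-of-the-envelope model: point-in-time restore time is roughly the snapshot restore time plus the time to replay everything written since that snapshot. The Python sketch below is only that, a rough model. All numbers are invented for illustration, the function name is hypothetical, and a constant replay rate is an assumption; real replay speed depends on engine, instance size and workload, so measure it in a restore test.

```python
from datetime import timedelta


def estimate_pitr_restore(snapshot_restore, snapshot_age, writes_per_second, replay_per_second):
    """Estimate point-in-time restore duration as snapshot restore plus log replay.

    snapshot_restore:  time to restore the underlying snapshot
    snapshot_age:      time between that snapshot and the requested restore point
    writes_per_second: average write operations logged per second (assumed constant)
    replay_per_second: operations the engine can replay per second (assumed constant)
    """
    if replay_per_second <= 0:
        raise ValueError("replay_per_second must be positive")
    operations_to_replay = writes_per_second * snapshot_age.total_seconds()
    replay_time = timedelta(seconds=operations_to_replay / replay_per_second)
    return snapshot_restore + replay_time


# Invented figures: 15 min snapshot restore, 12 h old snapshot,
# 50 writes/s on average, 2000 operations/s replay speed.
estimate = estimate_pitr_restore(
    snapshot_restore=timedelta(minutes=15),
    snapshot_age=timedelta(hours=12),
    writes_per_second=50,
    replay_per_second=2000,
)
print(estimate)  # prints: 0:33:00
```

With these made-up figures the replay adds 18 minutes, which pushes an otherwise comfortable 15-minute restore past a 30-minute RTO.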

So here's your dilemma:

"Do you choose regular DB snapshots, offering a likely worse RPO but a faster RTO, or do you choose Point in Time restore, offering a better RPO but a slower RTO?"

No, you can't have the best RPO without paying a price for it.

What have I learned?

In my case, it's okay to lose some data, and in fact, my service can live with an RPO of 1 hour. On the other hand, the better our RTO, the happier the business will be in case of a catastrophe.
Those objectives made us drop "Recovery Point in Time Restore" and the daily backup, and take hourly snapshots instead. This new approach will offer us a one-hour RPO with the best possible RTO.
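
An hourly snapshot routine could look roughly like the Python sketch below. It is an assumption about how such a job might be built, not the setup described here: the function names, the "-hourly-" naming scheme and the retention of 24 snapshots are all hypothetical, and the job is assumed to be triggered once per hour by a scheduler of your choice. The boto3 RDS client is passed in as an argument. Because manual snapshots are kept until you delete them, the sketch also prunes old ones.

```python
def hourly_snapshot_id(instance_id, now):
    """Build a snapshot identifier such as 'my-db-hourly-20240101-1000'."""
    return f"{instance_id}-hourly-{now:%Y%m%d-%H%M}"


def run_hourly_snapshot(rds, instance_id, now, keep=24):
    """Start one manual snapshot, then delete all but the `keep` newest finished ones."""
    snapshot_id = hourly_snapshot_id(instance_id, now)
    rds.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )

    # Pagination is ignored for brevity; a real job should follow the Marker.
    response = rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_id,
        SnapshotType="manual",
    )
    prefix = f"{instance_id}-hourly-"
    finished = [
        snapshot
        for snapshot in response["DBSnapshots"]
        if snapshot["DBSnapshotIdentifier"].startswith(prefix)
        and snapshot.get("Status") == "available"
    ]
    finished.sort(key=lambda snapshot: snapshot["SnapshotCreateTime"], reverse=True)

    deleted = []
    for old in finished[keep:]:
        rds.delete_db_snapshot(DBSnapshotIdentifier=old["DBSnapshotIdentifier"])
        deleted.append(old["DBSnapshotIdentifier"])
    return snapshot_id, deleted


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   from datetime import datetime, timezone
#   run_hourly_snapshot(boto3.client("rds"), "my-db", datetime.now(timezone.utc))
```

Only snapshots in the available status are counted and pruned, so a snapshot that is still being created is never deleted by accident.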

After all, testing the Disaster Recovery Plan to sort out all those assumptions taught me a lot 😉

A small update: DB Snapshot Restore uses lazy loading

After reading this blog, fellow AWS Hero Renato Losio pointed out another factor to consider regarding RTO when an RDS Snapshot restore is performed: lazy loading. It is something I was completely unaware of, although it’s all in the docs:

You can create a new DB instance by restoring from a DB snapshot. You can use the restored DB instance as soon as its status is available. However, the DB instance continues to load data in the background. This is known as lazy loading. Furthermore, if you access data that hasn't been loaded yet, the DB instance immediately downloads the requested data from Amazon S3, and then continues loading the rest of the data in the background.

There’s even a way to diminish these effects:

To help mitigate the effects of lazy loading on tables to which you require quick access, you can perform operations that involve full-table scans, such as SELECT *. This allows Amazon RDS to download all of the backed-up table data from S3.

From: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html
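
The full-table-scan advice from the docs can be scripted. The Python sketch below is one possible way to do it, under stated assumptions: warm_tables is a hypothetical helper, it works with any DB-API 2.0 connection, and the in-memory SQLite database only stands in for a restored RDS instance so the example runs anywhere. With a real MySQL or PostgreSQL driver you may want a server-side cursor so large tables are not buffered in client memory.

```python
import re
import sqlite3

# Accept plain identifiers only, since table names cannot be bound as query parameters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def warm_tables(connection, tables, batch_size=10_000):
    """Run a full-table scan on each table and return the number of rows read."""
    rows_read = {}
    for table in tables:
        if not _IDENTIFIER.fullmatch(table):
            raise ValueError(f"unexpected table name: {table!r}")
        cursor = connection.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        count = 0
        while True:
            # Fetch in batches and discard the rows; only the read matters here.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            count += len(batch)
        cursor.close()
        rows_read[table] = count
    return rows_read


# Stand-in for a freshly restored instance.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
connection.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,), (3.25,)])
print(warm_tables(connection, ["orders"]))  # prints: {'orders': 3}
```

Running this against the tables you need first shortens the period in which the restored instance looks available but still answers slowly, which matters when you count that period as part of your RTO.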

Kudos to Renato for making me aware of this!

Enjoy and until next time!
