Exposed By Time and Use

ferricoxide

Thomas H Jones II

Posted on September 20, 2018

Last year, a project with both high complexity and high urgency was dropped in my lap. Worse, the project was, to be charitable, not well specified. It was basically, "we need you to automate the deployment of these six services and we need to have something demo-able within thirty days".

Naturally, the six services were ones that I'd never worked with from the standpoint of installing or administering. Thus, there was a bit of a learning curve around how best to automate things, one that wasn't helped by the paucity of "here's the desired end-state" or other related details. All they really knew was:

  • Automate the deployment into AWS
  • Make sure it works on our customized/hardened build
  • Make sure that it backs itself up in case things blow up
  • GO!

I took my shot at the problem and met the deadline. Obviously, the results were not exactly "optimal", especially from the "turn it over to others to maintain" standpoint. Naturally, after I turned the initial capability over to the requester, they went radio silent for a couple of weeks.

When they finally replied, it was to let me know that the deadline had been extended by several months. So, I opted to use that time to make the automation a bit friendlier for the uninitiated to use. That's mostly irrelevant here — just more "background".

At any rate, we're now nearly a year-and-a-half removed from that initial rush-job. And, while I've improved the ease of use for the automation (it's been turned over to others for daily care-and-feeding), much of the underlying logic hasn't been revisited.

Over that kind of span, time tends to expose a given solution's shortcomings. Recently, they were attempting to do a parallel upgrade of one of the services and found that the data-move portion of the solution was resulting in build-timeouts. It turns out the size of the dataset being backed up (and recovered from as part of the automated migration process) had exploded. I'd set up the backups to operate incrementally, so the increase in raw transfer times had been hidden.

The incremental backups were only taking a few minutes; however, restores of the dataset were taking upwards of 35 minutes. The build-automation was set to time out at 15 minutes (early in the service-deployment, a similar operation took 3-7 minutes). So, step one was to adjust the automation's timeouts to make allowances for the new restore-time realities. The second step was to investigate why the restores were so slow.

The quick-n-dirty backup method I'd opted for was a simple `aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>`. It was a dead-simple way to "sweep" the contents of the directory /<APPLICATION_HOME_DIR>/ to S3. And, because the sync method only transfers new or changed files, the cron-managed backups were taking the same couple of minutes each day that they were a year-plus ago.
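For context, that cron-driven sweep amounted to something like the sketch below. The application-home path, bucket, and folder are placeholder names rather than the project's real values, and the actual job wraps in more logging and error handling.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical stand-ins for /<APPLICATION_HOME_DIR>/ and s3://<BUCKET>/<FOLDER>
APP_HOME="/opt/application"
S3_TARGET="s3://my-backup-bucket/app-backups"

# Sweep the application directory to S3. Because sync only transfers new or
# changed files, the nightly run stays quick; --delete prunes objects whose
# source files no longer exist, so the bucket mirrors the host.
# Typically invoked from cron, e.g.: 15 2 * * * /usr/local/bin/app-backup.sh
aws s3 sync --delete "${APP_HOME}/" "${S3_TARGET}/"
```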

Fun fact about S3 and its transfer performance: if the objects you're uploading have keys with high degrees of commonality, transfer performance will become abysmal.

You may be asking why I mention "keys" since I've not mentioned encryption. S3, being an object store rather than a filesystem, doesn't have the hierarchical layout of legacy, host-oriented storage. If I take a file from host-oriented storage and use the S3 CLI utility to copy it to S3 via its fully-qualified pathname, the resulting object will look like:

<FULLY>/<QUALIFIED>/<PATH>/<FILE>

Of the above, the entire string is the object's key, with "<FULLY>/<QUALIFIED>/<PATH>/" acting as the key prefix for that object. If you have a few thousand objects with the same or sufficiently-similar prefix values, you'll run into that "transfer performance will become abysmal" issue mentioned earlier.
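To make that concrete, here's a hypothetical loop (made-up paths and bucket) that mimics what a per-file sweep produces: every object's key carries the same long prefix, with only the final element varying.

```bash
# Hypothetical illustration only: pathnames and bucket are made up.
# Copying files by their fully-qualified local paths yields objects whose
# keys all share the prefix "opt/application/data/segments/"; only the
# final element differs from object to object.
for F in /opt/application/data/segments/segment-*.dat; do
  aws s3 cp "${F}" "s3://my-backup-bucket${F}"
done
```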

We very definitely did run into that problem. HARD. The dataset in question is (currently) a skosh less than 11GiB in size. The instance being backed up has an expected sustained network throughput of about 0.45Gbps. At that rate, we were expecting the dataset to take only a few minutes to transfer (11GiB at 0.45Gbps works out to roughly three and a half minutes). However, as noted above, it was taking 35+ minutes to do so.

So, how to fix it? One of the easier methods is to stream the backup into a single file. A quick series of benchmarking runs showed that doing so cut the transfer from over 35 minutes to under five. Alternatively, one can iterate over all the files in the dataset and copy them into S3 individually, using as the stored name either a randomized filename (with the "real" fully-qualified path set as an attribute/tag on the object) or the path-name reversed (something like `S3NAME=$( echo "${REAL_PATHNAME}" | perl -lne 'print join "/", reverse split /\//;' )`); either way, performance goes up dramatically.
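As a rough illustration of the single-file option, the backup and restore legs could look something like the sketch below. The application home, bucket, and folder names are placeholders; what keeps this to a single S3 object (instead of thousands under one prefix) is aws s3 cp's ability to stream to and from "-" (stdin/stdout).

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical names; the real application home, bucket, and folder differ.
APP_HOME="/opt/application"
S3_TARGET="s3://my-backup-bucket/app-backups"
STAMP="$(date +%Y%m%d)"

# Backup: tar the application tree and stream it straight into one S3 object,
# sidestepping the many-objects-under-one-prefix penalty.
tar -C "${APP_HOME}" -czf - . | aws s3 cp - "${S3_TARGET}/backup-${STAMP}.tgz"

# Restore: stream the object back down and unpack it in place.
aws s3 cp "${S3_TARGET}/backup-${STAMP}.tgz" - | tar -C "${APP_HOME}" -xzf -
```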

I'll likely end up going with one of those three methods, once I have enough contiguous time to allocate to re-engineering the backup and restore/rebuild processes. There are tradeoffs with each (especially around avoiding re-copying source data that already exists in the destination) that need to be balanced.
