Why You Can't (yet) Deprecate Write Sharding on DynamoDB
André
Posted on June 1, 2020
Disclaimer: I'm not affiliated with AWS and this is not technical advice.
Motivation
A couple of months ago, while revisiting DynamoDB's documentation, this piece caught my attention:
Immediately after that, I remembered the countless times I had watched Rick Houlihan's talks, especially the parts where he explains how to use write sharding in your DynamoDB table designs to maximize throughput by avoiding hot partitions.
It was clear to me that this piece of documentation described a feature that presumably addressed the same issue as write sharding (hot partitions), but in an adaptive way. Although the documentation doesn't tell us much, it does make a bold promise: to automatically repartition your items across as many nodes as needed, down to a single-item partition. It looked like a very handy feature, since hot partitions are a real concern for highly scalable applications and something you really need to consider before committing to a table design. Also, not having to deal with write sharding means not pushing it down the pipe and forcing the application layer to handle the bundled complexity.
Most importantly, write sharding strategies increase your throughput capacity consumption. Since this "auto-split" feature comes at no cost, making use of it means saving money.
Naturally, an adaptive solution at the service layer calls for a loosely coupled, fault-tolerant system (which is not necessarily the case for write sharding, where the splitting happens on your side, under your control). But then again, does that really matter? The problem we're trying to address concerns applications running at high scale, and those are most likely built on top of such best practices anyway.
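For reference, a classic write-sharding strategy looks roughly like the sketch below (Python with boto3; the table name, key schema and shard count are made up for illustration): writes get a random key suffix so they spread across N logical partition keys, and reads have to fan out over every suffix and merge the results, which is exactly the extra cost and complexity mentioned above.

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

SHARD_COUNT = 10  # illustrative; real shard counts depend on the workload
table = boto3.resource("dynamodb").Table("orders")  # hypothetical table name

def sharded_put(logical_pk: str, item: dict) -> None:
    # Append a random shard suffix so writes for one hot logical key
    # land on several partition keys instead of a single one.
    suffix = random.randint(0, SHARD_COUNT - 1)
    table.put_item(Item={**item, "PK": f"{logical_pk}.{suffix}"})

def sharded_query(logical_pk: str) -> list:
    # Reads must fan out across every shard suffix and merge the results,
    # which is the complexity (and extra RCU consumption) pushed onto the app.
    items = []
    for suffix in range(SHARD_COUNT):
        resp = table.query(
            KeyConditionExpression=Key("PK").eq(f"{logical_pk}.{suffix}")
        )
        items.extend(resp["Items"])
    return items
```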
Too good to be true?
This whole thing got me intrigued. How come this feature exists and no one talks about it? So I started asking questions:
Aside from Kirk Kirkconnell's answers, which confirmed that the feature existed and that an upcoming documentation overhaul would make things clearer, no one had an actual answer. In a recent tweet, I even tried to tease Rick Houlihan himself, but had no luck there.
edit: A while later, Rick attentively answered all my questions. Thanks, Rick!
At this point, I was already getting paranoid.
I needed to do something about it.
Tests
I finally dedicated some time to build a CDK app to test this. If you're interested, check out this blog post for a walkthrough.
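I won't reproduce the app's code here (the walkthrough post covers it), but to give an idea of the setup, a minimal On-Demand table in CDK (v1, Python) could look something like this; the stack and table names are made up:

```python
from aws_cdk import core
from aws_cdk import aws_dynamodb as dynamodb

class LoadTestStack(core.Stack):
    """Hypothetical stack: one On-Demand table with a generic PK/SK schema."""

    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # PAY_PER_REQUEST (On-Demand) mode, so the only throughput ceilings
        # are the table-level and per-partition limits discussed below.
        dynamodb.Table(
            self, "LoadTestTable",
            partition_key=dynamodb.Attribute(name="PK", type=dynamodb.AttributeType.STRING),
            sort_key=dynamodb.Attribute(name="SK", type=dynamodb.AttributeType.STRING),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
            removal_policy=core.RemovalPolicy.DESTROY,  # test infra, safe to tear down
        )
```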
Considerations
A few things to consider before load testing a DynamoDB table:
- There's a default limit of 40k WCUs and 40k RCUs.
- Tables running in On-Demand capacity mode have an initial "previous peak" value of 2k WCUs or 6k RCUs.
- The above means that the initial throughput limit for an On-Demand table is 4k WCUs or 12k RCUs (or a linear combination of the two, e.g. 0 WCUs and 12k RCUs).
- A new "previous peak" is established approximately 30 minutes after a new peak is reached, effectively doubling the throughput limit.
It's also relevant to remember exactly what we are testing: the hypothesis that we can perform read and write operations on a single table, using a single DynamoDB Partition Key, at a sustained throughput higher than the advertised per-partition limit (1k WCUs / 3k RCUs).
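To make the test shape concrete, each worker in the batches below boils down to something like this sketch (the table name, key schema and rate control are simplified assumptions, and in practice the requests would be issued concurrently rather than in a sequential loop):

```python
import time

import boto3

# Hypothetical names; every single request targets the same partition key.
table = boto3.resource("dynamodb").Table("load-test-table")
HOT_PK = "HOT#1"

def write_worker(items_per_batch: int, interval: float, duration: int) -> None:
    """Write small items under one partition key at a roughly fixed rate (~1 WCU per item up to 1 KB)."""
    deadline = time.time() + duration
    seq = 0
    while time.time() < deadline:
        start = time.time()
        # batch_writer() groups the puts into BatchWriteItem calls for us.
        with table.batch_writer() as batch:
            for _ in range(items_per_batch):
                batch.put_item(Item={"PK": HOT_PK, "SK": f"ITEM#{seq}", "payload": "x"})
                seq += 1
        # Sleep whatever is left of the interval to hold the target rate.
        time.sleep(max(0.0, interval - (time.time() - start)))

def read_worker(items_per_batch: int, interval: float, duration: int) -> None:
    """Issue eventually consistent point reads (~0.5 RCU per item up to 4 KB) against the same key."""
    deadline = time.time() + duration
    while time.time() < deadline:
        start = time.time()
        for i in range(items_per_batch):
            table.get_item(Key={"PK": HOT_PK, "SK": f"ITEM#{i}"})
        time.sleep(max(0.0, interval - (time.time() - start)))
```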
Smoke Testing
The first thing I wanted to test was whether this feature existed at all. And I figured that if it did exist, it would at least have to support reads.
In order to make read requests, I first needed to populate the table. So I triggered an insert batch, making sure the load stayed below the partition's throughput limit.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
3 units | 600 secs | 300 items | 1 sec | 900 WCU | 540k WCU |
With the items in place, I ran another batch, this time with a target throughput of about 6k RCUs, which is twice the partition limit:
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
20 units | 600 secs | 300 items | 0.8 sec | 6000 RCU | 3600k RCU |
As you can see, we could reach the target throughput. This is our custom CloudWatch Dashboard Widget after performing both steps of this test:
However, we did experience some throttling. DynamoDB metrics show that around 1.40% of the reads were throttled.
Since the node limit is 3k RCU, populating the table at around 900 WCU might have split our data into two nodes, allowing us to reach 6k RCUs.
Knowing how DynamoDB throughput limits work at the table level, I thought that maybe we had reached a new plateau at 6k RCUs. This would explain the marginal throttling rate.
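As a side note, you can estimate that throttling rate yourself from CloudWatch. A rough sketch follows; the table name is made up, and the ratio is an approximation that counts successful GetItem requests via the SuccessfulRequestLatency sample count:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
TABLE_NAME = "load-test-table"  # hypothetical
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=10)

def metric_total(name: str, stat: str, dimensions: list) -> float:
    """Sum a single statistic over the time window for one DynamoDB metric."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName=name,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=600,
        Statistics=[stat],
    )
    return sum(point[stat] for point in resp["Datapoints"])

table_dim = [{"Name": "TableName", "Value": TABLE_NAME}]
throttled = metric_total("ReadThrottleEvents", "Sum", table_dim)
reads = metric_total(
    "SuccessfulRequestLatency", "SampleCount",
    table_dim + [{"Name": "Operation", "Value": "GetItem"}],
)
total = throttled + reads
print(f"approx. throttled read rate: {100 * throttled / total:.2f}%" if total else "no data")
```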
Smoke Test: Higher Load
Then I ran another test, starting again with a new table populated within the partition limits. This time, though, I tried to reach something just below 10k RCUs. My goal here was to find another plateau and establish a pattern in the feature's behaviour.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
33 units | 600 secs | 300 items | 0.9 sec | 9900 RCU | 5940k RCU |
As suspected, we had indeed reached a plateau at 6k RCU. It took around 8 minutes for the auto-split to kick in and repartition our data, allowing us to reach another peak. We then ran for the remaining 2 minutes without throttling.
Now let's see what happens when trying to read at 12k RCUs on the same table. The idea is to test for a new plateau and see if the feature can handle a sustained throughput at around the previous peak without throttling.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
42 units | 600 secs | 300 items | 0.9 sec | 12600 RCU | 7200k RCU |
Just as we had guessed, whatever the "auto-split" feature did, it's still there. A relatively small throttle count happened when the throughput surpassed the 12k plateau, either because of the On-Demand initial table limit or because we had reached another plateau:
Testing the Plateau Pattern in a Real World Scenario
Now that we have a better hypothesis on how this feature works, let's try a real-world scenario. Let's imagine that an application needs to support the case where a new Partition Key suddenly starts to receive a large volume of reads and writes. Since it's a new partition, there's no chance it has already been re-partitioned.
Our write workload will target 2.5k WCUs, and our read workload will target 6k RCUs (a sketch of how the two run simultaneously follows the table):
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
5 units | 600 secs | 500 items | 1 sec | 2500 WCU | 1500k WCU |
20 units | 600 secs | 300 items | 0.9 sec | 6000 RCU | 3600k RCU |
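For illustration, running the two workloads at the same time could look like the sketch below, reusing the hypothetical read_worker and write_worker functions from the earlier sketch:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module holding the worker functions sketched earlier.
from load_workers import read_worker, write_worker

# 5 write workers (500 items every second) and 20 read workers
# (300 items every 0.9 seconds), all hammering the same new partition key.
with ThreadPoolExecutor(max_workers=25) as pool:
    for _ in range(5):
        pool.submit(write_worker, 500, 1.0, 600)
    for _ in range(20):
        pool.submit(read_worker, 300, 0.9, 600)
```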
Looking at the resulting graph, many bad things happened here. As we could have guessed, the re-partitioning isn't instantaneous, and it's unclear what exactly triggers it.
It's also interesting to see that the throughput capacity reaches new plateaus at seemingly random moments, as if the feature didn't follow a pattern over time.
Another Test Case
Now let's imagine an application that needs its partitions' throughput capacity to keep increasing over time. Based on the tests performed so far, I'll run a test that simultaneously reads and writes to a new partition at an initial velocity of 100 WCUs / 300 RCUs, accelerating by 15% every minute (a sketch of the ramp follows the table).
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
5 units | 1200 secs | 20-277 items | 1 sec | 100-1385 WCU | 600k WCU |
20 units | 1200 secs | 15 items | 1 sec | 300-4230 RCU | 1830k RCU |
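Here is a sketch of how that ramp could be generated; only the batch-size schedule is shown, with the worker plumbing assumed to be the same as in the earlier sketch:

```python
def ramp_schedule(initial_items: int, growth_per_minute: float, duration_secs: int):
    """Yield (minute, items_per_batch) pairs, compounding the batch size every minute."""
    items = float(initial_items)
    for minute in range(duration_secs // 60):
        yield minute, int(items)
        items *= 1 + growth_per_minute

# Example: 5 write workers starting at 20 items/sec each (~100 WCU total),
# growing 15% per minute over 20 minutes, roughly matching the ramp above.
for minute, items in ramp_schedule(20, 0.15, 1200):
    print(f"minute {minute:02d}: {items} items per worker per second")
```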
Looking at all the tests so far, besides the fact that the split seems to happen at random points during the test, we can see that the capacity always increases by 1k WCUs each time a new plateau is reached. This behaviour makes me think that, regardless of how much throttling is happening, the feature acts by adding a single node to the partition.
Running another batch on the same table and partition, now loading 6000 RCU + 2000 WCU steadily (which is actually 4 times the partition capacity, since the operations are running simultaneously), we can see that there's no throttling.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
5 units | 600 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
20 units | 600 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |
It looks like the repartitioning is somewhat persistent.
Settling down
In the next test, we'll apply the exact same load as the previous batch, which is again 4 times the initial partition capacity. But this time, we'll do it for 20 minutes against a new partition. The idea is to validate the behavior we inferred from the last test. We'll do it on the same table, so that the table limits don't mask the results.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
5 units | 1200 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
20 units | 1200 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |
Again, as you can see in the graph above, the time it takes for the partition to be split is pretty random, and it certainly depends on many variables we cannot control.
To make things even clearer, we'll run another batch on the same table against a new partition, this time with a load of 4k WCUs for over an hour.
Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
---|---|---|---|---|---|
10 units | 4800 secs | 400 items | 1 sec | 4000 WCU | 19200k WCU |
Test Conclusions
It looks like DynamoDB does, in fact, have a working auto-split feature for hot partitions. There are also reasons to believe that the split happens in response to high throughput capacity usage on a single partition, and that it always happens by adding a single node, so the capacity is increased by 1k WCUs / 3k RCUs each time. The "split" also appears to be persistent over time.
Final Thoughts
The "auto-split" feature seems to be a best effort to accommodate unpredicted loads that require more throughput capacity than available in the partition.
It's important to note that the feature is only briefly described in the official docs, and there's no promise about how it actually performs. If I had to guess, I would say this is something still being worked on. My instinct says it'll be improved and launched as an "auto-sharding" feature once it's mature enough.
In the meantime, though, since this clearly isn't a replacement for write sharding, and considering that the auto-split appears to be persistent over time, is there any other way we could make use of it?
For some applications that have to make use of write sharding to increase the throughput capacity of their partitions, here is an architecture suggestion:
For more details and hopefully a discussion about this suggestion, check this post.
Thanks for reading!
--
André