DocumentDB Vacuum Locks
Felipe Malaquias
Posted on March 23, 2024
Beware of possible locks on large updates/deletions
Historically, traditional databases handled writes by taking pessimistic locks on records to avoid inconsistency. The obvious drawback is poor concurrency: transactions touching the same records have to wait for each other and can fail.
MVCC (Multiversion Concurrency Control) solves this by creating a new version of a record on every update, which circumvents the need to lock records and allows concurrency (see this video from Cameron McKenzie for a nice and simple illustrated explanation).
However, to clean up the old versions, a vacuum process must run in the background. That process may lock an entire collection, create bottlenecks, and possibly cause unexpected downtime in your application.
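To make the idea concrete, here is a minimal toy sketch of MVCC in Python. It is not how DocumentDB implements it internally; the `MVCCStore` class, its methods, and the integer transaction counter are all made up for illustration. It only shows the core idea: a write appends a new version instead of overwriting, and a reader sees the newest version that existed when its snapshot was taken.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Version:
    value: Any
    created_at: int  # hypothetical transaction id that wrote this version


@dataclass
class MVCCStore:
    """Toy in-memory store; an illustration, not a real database engine."""
    versions: Dict[str, List[Version]] = field(default_factory=dict)
    clock: int = 0  # last committed transaction id

    def write(self, key: str, value: Any) -> int:
        # An update never overwrites: it appends a new version.
        self.clock += 1
        self.versions.setdefault(key, []).append(Version(value, self.clock))
        return self.clock

    def snapshot(self) -> int:
        # A reader remembers the transaction id at which it started.
        return self.clock

    def read(self, key: str, snapshot: int) -> Any:
        # Return the newest version that already existed at the snapshot.
        visible = [v for v in self.versions.get(key, []) if v.created_at <= snapshot]
        return visible[-1].value if visible else None


store = MVCCStore()
store.write("doc", {"status": "old"})
snap = store.snapshot()                # a reader starts here
store.write("doc", {"status": "new"})  # a concurrent update, no lock needed

print(store.read("doc", snap))              # {'status': 'old'}
print(store.read("doc", store.snapshot()))  # {'status': 'new'}
print(len(store.versions["doc"]))           # 2
```

Note that both versions stay in the store after the update. Something has to remove the old one eventually, and that cleanup is where the trouble starts.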
There is no permanent fix for this at the moment, but if you need to perform large updates or deletions, you may contact AWS support and ask them to disable the process that reclaims unused storage space. This will not negatively impact your workload, and space reclaimed by the garbage collector will continue to be recycled. However, the size of your collections will never decrease, even if a significant amount of data has been deleted.
The good news
AWS is currently working on a fix for it, which may be available at any time, so you should keep an eye on the DocumentDB release notes.
This is how we experienced it
On a lovely Tuesday morning, we reset one of our Kafka topics (~71 GB) to re-consume all our data for a particular domain and aggregate it with new fields in our database. All messages were successfully consumed and written to the primary DB instance in a few minutes, as expected:
What we did not expect, though, were the waves of increased latency in our workload that showed up hours after the records were consumed, initially without much of a pattern until approx. 5 pm:
Those were all caused by locks on a particular collection in the read replicas, as shown by the pink bars in the DocumentDB Performance Insights metrics below:
As you can see, the locks were gone after around 11 pm, which matches the point where the changes in the DocumentDB freeable memory metric end, shown below:
We reached out to AWS support, who investigated the issue. They confirmed it was caused by the reclaiming of unused space during garbage collection in the vacuum process. This process must be synchronized between the writer and the readers because the readers might still have in-flight transactions that can see the deleted data. In some rare circumstances, and in the presence of a large amount of reclaimable data, this synchronization can adversely affect the workload on the replicas.
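The reason in-flight readers matter can be sketched in a few lines of Python. This is a generic MVCC illustration under my own assumptions, not DocumentDB's actual vacuum logic: the function `reclaimable_versions` and the integer transaction ids are hypothetical. The idea is that an old version can only be reclaimed once no reader snapshot can still see it, so the cleanup has to know the oldest snapshot among all readers.

```python
from typing import Iterable, List


def reclaimable_versions(
    version_txns: List[int],
    reader_snapshots: Iterable[int],
    latest_txn: int,
) -> List[int]:
    """Return the transaction ids of versions of ONE record that are safe to drop.

    version_txns:     ids of the transactions that wrote each version
    reader_snapshots: snapshot ids of in-flight readers (hypothetical input)
    latest_txn:       id of the last committed transaction
    """
    # The oldest in-flight reader defines the cleanup horizon.
    # With no readers, everything up to the latest transaction is eligible.
    horizon = min(reader_snapshots, default=latest_txn)
    # Versions that existed at the horizon; the newest of them is still
    # what the oldest reader sees, so it has to stay.
    visible_at_horizon = [t for t in sorted(version_txns) if t <= horizon]
    return visible_at_horizon[:-1]


versions = [1, 2, 3, 4]  # one record, updated four times

# A replica still has a reader that started at transaction 2.
blocked = reclaimable_versions(versions, reader_snapshots=[2], latest_txn=4)
# No in-flight readers anywhere.
free = reclaimable_versions(versions, reader_snapshots=[], latest_txn=4)

print(blocked)  # [1]
print(free)     # [1, 2, 3]
```

In this toy model a single old reader keeps most of the dead versions alive, which is why the writer and the replicas have to agree on what is safe to remove before space can be reclaimed.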
Conclusion
Be aware of the MVCC strategy and check how it may affect your database (not only DocumentDB) when you run large updates or deletions as described above, and, probably most importantly, always test it in staging first ;)
Also, be aware of the known issue with the DocumentDB vacuum process and watch the release notes.