Deletes and MVCC in YugabyteDB
Franck Pachot
Posted on June 3, 2024
An application pattern that can cause issues with Multi-Version Concurrency Control (MVCC) databases is a queuing table with frequent insertions and deletions. Since all reads are based on a consistent snapshot, rows are not physically deleted right away: the system must be able to retrieve the state at any point since the start of the longest-running transaction. By default, YugabyteDB retains these intermediate versions for 15 minutes. A delete is simply a new version with a deletion marker, often called a tombstone.
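To make the retention rule concrete, here is a minimal Python sketch of what a compaction may keep or drop for one key under a 15-minute history window. This is illustrative only, not YugabyteDB's actual compaction code: the `compact` function, the `(timestamp, value)` tuples, and `None` as a tombstone marker are all assumptions of the model.

```python
# Toy model of MVCC history retention for a single key (illustrative,
# not YugabyteDB's actual compaction logic). Each version is a
# (timestamp, value) tuple; value None marks a delete (tombstone).

RETENTION = 15 * 60  # seconds, mirroring the 15-minute default

def compact(versions, now):
    """Return the versions a compaction must keep so that any read
    snapshot taken within the retention window can still be served."""
    versions = sorted(versions, reverse=True)       # newest first
    cutoff = now - RETENTION
    newest_ts, newest_value = versions[0]
    # If the newest version is a tombstone already outside the window,
    # no snapshot can see this key anymore: drop it entirely.
    if newest_value is None and newest_ts < cutoff:
        return []
    kept = [v for v in versions if v[0] >= cutoff]  # visible in window
    older = [v for v in versions if v[0] < cutoff]
    if older:
        kept.append(older[0])  # newest pre-window version, visible at cutoff
    return kept

# A row inserted and updated long ago, still live: only the newest survives.
print(compact([(0, "v1"), (100, "v2"), (900, "v3")], now=2000))
# → [(900, 'v3')]
# A row deleted long ago: the tombstone and everything under it go.
print(compact([(0, "v1"), (50, None)], now=2000))
# → []
```

Until that window elapses, though, the tombstone and the versions it shadows remain in the LSM tree and must be stepped over by scans.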
In some MVCC implementations, the dequeue job may need to scan previously deleted rows before finding one that is visible to the current read time. This can increase processing times.
In YugabyteDB, all versions of a key are stored together, with the Hybrid Logical Clock timestamp appended to the key when stored in the LSM Tree. The order of the key determines whether you start scanning from the oldest (First-In-First-Out), which may include deleted rows, or from the newest (Last-In-First-Out), which may include newly inserted rows.
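The effect of scan direction can be sketched with a small in-memory model. This is illustrative only: the `put`/`first_visible` helpers and the per-key version layout are assumptions of the model, not DocDB's real key encoding.

```python
# Toy model: each key holds a list of (timestamp, value) versions,
# newest first, roughly like versions stored together in an LSM tree.
# value None marks a tombstone. (Illustrative, not DocDB's encoding.)
from collections import defaultdict

versions = defaultdict(list)  # key -> [(ts, value), ...], newest first

def put(key, ts, value):
    versions[key].insert(0, (ts, value))

for i in range(1, 101):        # enqueue ids 1..100
    put(i, i, f"payload-{i}")
for i in range(1, 91):         # dequeue (delete) ids 1..90
    put(i, 200 + i, None)      # tombstone, retained for MVCC reads

def first_visible(reverse=False):
    """Scan keys in order; count stored entries touched before
    reaching a key whose newest version is live."""
    scanned = 0
    for key in sorted(versions, reverse=reverse):
        scanned += len(versions[key])
        if versions[key][0][1] is not None:
            return key, scanned
    return None, scanned

print(first_visible(reverse=False))  # FIFO: (91, 181) — steps over 90 tombstoned keys
print(first_visible(reverse=True))   # LIFO: (100, 1) — the newest key is live
```

The FIFO scan pays for every retained tombstone at the head of the queue, while the LIFO scan finds a live row immediately; that asymmetry is what the benchmark below measures.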
Here is a simple demonstration of the two cases. I use an identifier generated from a sequence to obtain an approximate ordering, but you can also use a timestamp.
Example
git clone git@github.com:FranckPachot/yb-metrics-lab.git
cd yb-metrics-lab
docker compose run --rm -it pgbench bash
I have a simple table that queues payloads for an account:
yugabyte=# \d job_queue
Table "public.job_queue"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+------------------------------
id | bigint | | not null | generated always as identity
account_id | integer | | not null |
payload | text | | |
I'll use the following scripts to enqueue (insert) and dequeue (select for update skip locked and delete) from this table:
cat > enqueue.sql <<'SQL'
insert into job_queue(account_id) values(0);
SQL
cat > dequeue.sql <<'SQL'
begin transaction;
select account_id, id, payload, 1 as found from job_queue
where account_id=0
for update skip locked limit 1
\gset
delete from job_queue
where account_id=:account_id and id=:id;
commit;
SQL
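As a rough mental model of what `for update skip locked` buys here, the following Python toy shows two concurrent workers claiming distinct rows instead of blocking on each other. The `claim`/`finish` helpers are hypothetical names for this sketch, not the actual locking protocol.

```python
# Toy model of SELECT ... FOR UPDATE SKIP LOCKED: each worker claims the
# first visible row not already locked by another worker, rather than
# waiting on it. (Hypothetical helpers, illustrative only.)

queue = [101, 102, 103, 104]   # visible job ids, in scan order
locked = set()                 # ids locked by in-flight transactions

def claim():
    """Return the first unlocked id and lock it, or None if every
    visible row is locked (pgbench's 'expected one row, got 0' case)."""
    for job_id in queue:
        if job_id not in locked:
            locked.add(job_id)
            return job_id
    return None

def finish(job_id):
    """Commit: the claimed row is deleted and its lock released."""
    queue.remove(job_id)
    locked.discard(job_id)

a = claim()            # worker A gets 101
b = claim()            # worker B skips A's locked row, gets 102
finish(a); finish(b)
print(a, b, queue)     # 101 102 [103, 104]
```

Whichever row is claimed, the delete leaves a tombstone behind, so the claim scan itself is where the FIFO and LIFO layouts diverge.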
First-In-First-Out (FIFO)
I created my table with the primary key in the same ascending order as the sequence-generated ID:
drop table if exists job_queue;
create table job_queue (
primary key (account_id, id asc)
, id bigint generated always as identity
, account_id int
, payload text
);
I run the enqueue from one session and, starting one minute later, the dequeue from ten sessions:
# enqueue job
pgbench --rate=100 -nf enqueue.sql -T 3600 & sleep 60
# dequeue job
pgbench --client=10 -nf dequeue.sql -T 3600 -P 60
Here is the pgbench progress report for the dequeue job, which slows down as latency increases:
pgbench (16.3 (Debian 16.3-1.pgdg120+1), server 11.2-YB-2.21.0.1-b0)
pgbench (16.3 (Debian 16.3-1.pgdg120+1), server 11.2-YB-2.21.0.1-b0)
progress: 60.0 s, 146.7 tps, lat 67.712 ms stddev 72.708, 0 failed
progress: 120.0 s, 77.5 tps, lat 129.086 ms stddev 133.242, 0 failed
progress: 180.0 s, 60.6 tps, lat 164.808 ms stddev 161.830, 0 failed
progress: 240.0 s, 51.4 tps, lat 194.510 ms stddev 181.549, 0 failed
progress: 300.0 s, 45.5 tps, lat 219.726 ms stddev 196.644, 0 failed
progress: 360.0 s, 41.4 tps, lat 241.760 ms stddev 206.806, 0 failed
progress: 420.0 s, 37.9 tps, lat 263.482 ms stddev 214.651, 0 failed
progress: 480.0 s, 35.3 tps, lat 283.004 ms stddev 219.732, 0 failed
progress: 540.0 s, 33.8 tps, lat 295.915 ms stddev 225.483, 0 failed
progress: 600.0 s, 31.9 tps, lat 313.742 ms stddev 225.922, 0 failed
progress: 660.0 s, 30.7 tps, lat 325.523 ms stddev 231.293, 0 failed
progress: 720.0 s, 29.4 tps, lat 340.384 ms stddev 234.496, 0 failed
progress: 780.0 s, 28.5 tps, lat 351.654 ms stddev 235.606, 0 failed
progress: 840.0 s, 27.3 tps, lat 366.428 ms stddev 235.558, 0 failed
progress: 900.0 s, 26.7 tps, lat 374.767 ms stddev 235.163, 0 failed
progress: 960.0 s, 25.6 tps, lat 389.821 ms stddev 238.707, 0 failed
progress: 1020.0 s, 25.4 tps, lat 394.482 ms stddev 238.515, 0 failed
progress: 1080.0 s, 24.5 tps, lat 408.922 ms stddev 239.427, 0 failed
progress: 1140.0 s, 23.9 tps, lat 418.647 ms stddev 236.840, 0 failed
progress: 1200.0 s, 23.6 tps, lat 422.421 ms stddev 237.729, 0 failed
progress: 1260.0 s, 23.1 tps, lat 433.824 ms stddev 238.868, 0 failed
progress: 1320.0 s, 22.7 tps, lat 441.515 ms stddev 238.022, 0 failed
progress: 1380.0 s, 22.0 tps, lat 454.995 ms stddev 236.985, 0 failed
progress: 1440.0 s, 19.5 tps, lat 514.327 ms stddev 229.819, 0 failed
progress: 1500.0 s, 19.2 tps, lat 519.461 ms stddev 234.298, 0 failed
progress: 1560.0 s, 19.0 tps, lat 528.162 ms stddev 229.675, 0 failed
progress: 1620.0 s, 18.7 tps, lat 534.814 ms stddev 230.564, 0 failed
progress: 1680.0 s, 18.2 tps, lat 548.424 ms stddev 227.151, 0 failed
progress: 1740.0 s, 18.1 tps, lat 553.188 ms stddev 228.198, 0 failed
progress: 1800.0 s, 17.7 tps, lat 565.126 ms stddev 229.877, 0 failed
progress: 1860.0 s, 17.5 tps, lat 569.887 ms stddev 229.138, 0 failed
progress: 1920.0 s, 17.6 tps, lat 569.976 ms stddev 232.376, 0 failed
progress: 1980.0 s, 17.2 tps, lat 580.405 ms stddev 232.279, 0 failed
progress: 2040.0 s, 17.4 tps, lat 575.567 ms stddev 229.492, 0 failed
progress: 2100.0 s, 17.1 tps, lat 585.600 ms stddev 227.470, 0 failed
progress: 2160.0 s, 17.0 tps, lat 588.600 ms stddev 226.274, 0 failed
progress: 2220.0 s, 16.8 tps, lat 597.306 ms stddev 230.114, 0 failed
progress: 2280.0 s, 16.7 tps, lat 599.443 ms stddev 224.922, 0 failed
progress: 2340.0 s, 16.4 tps, lat 612.358 ms stddev 221.710, 0 failed
progress: 2400.0 s, 16.5 tps, lat 607.770 ms stddev 222.975, 0 failed
progress: 2460.0 s, 16.3 tps, lat 613.478 ms stddev 223.859, 0 failed
progress: 2520.0 s, 16.2 tps, lat 619.936 ms stddev 227.354, 0 failed
progress: 2580.0 s, 16.2 tps, lat 615.391 ms stddev 221.524, 0 failed
progress: 2640.0 s, 15.9 tps, lat 628.707 ms stddev 215.744, 0 failed
progress: 2700.0 s, 15.8 tps, lat 633.731 ms stddev 209.020, 0 failed
progress: 2760.0 s, 15.4 tps, lat 646.383 ms stddev 209.972, 0 failed
progress: 2820.0 s, 15.5 tps, lat 645.852 ms stddev 211.241, 0 failed
progress: 2880.0 s, 15.3 tps, lat 655.629 ms stddev 205.603, 0 failed
progress: 2940.0 s, 15.0 tps, lat 665.360 ms stddev 207.632, 0 failed
progress: 3000.0 s, 15.3 tps, lat 655.442 ms stddev 204.249, 0 failed
progress: 3060.0 s, 14.8 tps, lat 672.804 ms stddev 202.524, 0 failed
progress: 3120.0 s, 14.6 tps, lat 683.751 ms stddev 190.039, 0 failed
progress: 3180.0 s, 14.7 tps, lat 681.479 ms stddev 198.928, 0 failed
progress: 3240.0 s, 14.4 tps, lat 696.753 ms stddev 199.672, 0 failed
progress: 3300.0 s, 14.0 tps, lat 708.999 ms stddev 195.875, 0 failed
progress: 3360.0 s, 14.0 tps, lat 717.781 ms stddev 198.667, 0 failed
progress: 3420.0 s, 13.9 tps, lat 719.463 ms stddev 201.397, 0 failed
progress: 3480.0 s, 13.5 tps, lat 736.315 ms stddev 199.684, 0 failed
progress: 3540.0 s, 13.7 tps, lat 728.964 ms stddev 208.056, 0 failed
progress: 3600.0 s, 14.1 tps, lat 707.956 ms stddev 195.490, 0 failed
The total statistics for enqueue:
transaction type: enqueue.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 360418
number of failed transactions: 0 (0.000%)
latency average = 9.583 ms
latency stddev = 18.288 ms
rate limit schedule lag: avg 4.668 (max 475.122) ms
initial connection time = 24.256 ms
tps = 100.116656 (without initial connection time)
and the total for dequeue:
transaction type: dequeue.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 89663
number of failed transactions: 0 (0.000%)
latency average = 401.490 ms
latency stddev = 290.233 ms
initial connection time = 235.231 ms
tps = 24.905485 (without initial connection time)
I gathered a few statistics from my Grafana dashboard (IntentsDB is shown in red/yellow, RegularDB in blue/purple):
Last-In-First-Out (LIFO)
I re-created my table with the primary key in descending order, which is the opposite of how it is generated by the sequence:
drop table if exists job_queue;
create table job_queue (
primary key (account_id, id desc)
, id bigint generated always as identity
, account_id int
, payload text
);
I run the same workload as before, except that I also rate-limit the dequeue job because I expect it to be faster and want to avoid pgbench failing with "expected one row, got 0":
# enqueue job
pgbench --rate=100 -nf enqueue.sql -T 3600 & sleep 60
# dequeue job
pgbench --rate=100 --client=10 -nf dequeue.sql -T 3600 -P 60
Here is the pgbench progress report for the dequeue job, which always keeps up with the enqueue rate, with acceptable latency:
pgbench (16.3 (Debian 16.3-1.pgdg120+1), server 11.2-YB-2.21.0.1-b0)
pgbench (16.3 (Debian 16.3-1.pgdg120+1), server 11.2-YB-2.21.0.1-b0)
progress: 60.0 s, 98.7 tps, lat 8.064 ms stddev 9.179, 0 failed, lag 0.127 ms
progress: 120.0 s, 95.9 tps, lat 9.373 ms stddev 8.582, 0 failed, lag 0.100 ms
progress: 180.0 s, 99.5 tps, lat 11.545 ms stddev 12.404, 0 failed, lag 0.099 ms
progress: 240.0 s, 100.5 tps, lat 13.966 ms stddev 17.003, 0 failed, lag 0.100 ms
progress: 300.0 s, 99.0 tps, lat 16.370 ms stddev 23.553, 0 failed, lag 0.115 ms
progress: 360.0 s, 98.6 tps, lat 20.362 ms stddev 34.356, 0 failed, lag 0.833 ms
progress: 420.0 s, 101.7 tps, lat 28.105 ms stddev 57.121, 0 failed, lag 1.556 ms
progress: 480.0 s, 101.0 tps, lat 27.464 ms stddev 48.073, 0 failed, lag 0.258 ms
progress: 540.0 s, 98.9 tps, lat 25.940 ms stddev 42.101, 0 failed, lag 0.141 ms
progress: 600.0 s, 102.1 tps, lat 63.286 ms stddev 125.027, 0 failed, lag 14.132 ms
progress: 660.0 s, 100.7 tps, lat 37.192 ms stddev 72.409, 0 failed, lag 1.475 ms
progress: 720.0 s, 99.1 tps, lat 39.608 ms stddev 77.520, 0 failed, lag 0.488 ms
progress: 780.0 s, 99.9 tps, lat 44.945 ms stddev 86.302, 0 failed, lag 1.237 ms
progress: 840.0 s, 98.5 tps, lat 43.786 ms stddev 79.889, 0 failed, lag 0.386 ms
progress: 900.0 s, 100.6 tps, lat 62.501 ms stddev 126.731, 0 failed, lag 1.846 ms
progress: 960.0 s, 101.6 tps, lat 72.515 ms stddev 151.758, 0 failed, lag 6.916 ms
progress: 1020.0 s, 98.7 tps, lat 66.117 ms stddev 130.873, 0 failed, lag 2.217 ms
progress: 1080.0 s, 99.6 tps, lat 70.434 ms stddev 143.450, 0 failed, lag 1.957 ms
progress: 1140.0 s, 99.7 tps, lat 82.825 ms stddev 176.040, 0 failed, lag 9.100 ms
progress: 1200.0 s, 100.5 tps, lat 76.087 ms stddev 162.944, 0 failed, lag 4.043 ms
progress: 1260.0 s, 100.7 tps, lat 94.963 ms stddev 204.738, 0 failed, lag 15.627 ms
progress: 1320.0 s, 99.9 tps, lat 82.451 ms stddev 170.476, 0 failed, lag 4.002 ms
progress: 1380.0 s, 99.1 tps, lat 79.145 ms stddev 170.828, 0 failed, lag 2.622 ms
progress: 1440.0 s, 99.8 tps, lat 84.359 ms stddev 180.998, 0 failed, lag 3.621 ms
progress: 1500.0 s, 99.8 tps, lat 84.317 ms stddev 183.958, 0 failed, lag 4.569 ms
progress: 1560.0 s, 101.2 tps, lat 82.745 ms stddev 183.122, 0 failed, lag 3.603 ms
progress: 1620.0 s, 98.4 tps, lat 82.990 ms stddev 181.972, 0 failed, lag 3.662 ms
progress: 1680.0 s, 99.8 tps, lat 86.107 ms stddev 192.065, 0 failed, lag 4.773 ms
progress: 1740.0 s, 99.9 tps, lat 88.515 ms stddev 198.305, 0 failed, lag 5.556 ms
progress: 1800.0 s, 100.9 tps, lat 96.506 ms stddev 231.509, 0 failed, lag 8.763 ms
progress: 1860.0 s, 100.6 tps, lat 100.779 ms stddev 222.100, 0 failed, lag 14.758 ms
progress: 1920.0 s, 99.0 tps, lat 95.487 ms stddev 233.232, 0 failed, lag 7.288 ms
progress: 1980.0 s, 102.2 tps, lat 113.076 ms stddev 266.697, 0 failed, lag 25.122 ms
progress: 2040.0 s, 99.9 tps, lat 93.044 ms stddev 225.485, 0 failed, lag 6.309 ms
progress: 2100.0 s, 99.2 tps, lat 93.029 ms stddev 220.702, 0 failed, lag 5.438 ms
progress: 2160.0 s, 99.0 tps, lat 97.459 ms stddev 234.508, 0 failed, lag 8.769 ms
progress: 2220.0 s, 99.6 tps, lat 96.281 ms stddev 227.267, 0 failed, lag 8.618 ms
progress: 2280.0 s, 99.9 tps, lat 92.199 ms stddev 222.713, 0 failed, lag 5.039 ms
progress: 2340.0 s, 101.5 tps, lat 102.342 ms stddev 260.968, 0 failed, lag 12.282 ms
progress: 2400.0 s, 102.8 tps, lat 101.874 ms stddev 279.425, 0 failed, lag 12.277 ms
progress: 2460.0 s, 100.0 tps, lat 101.566 ms stddev 261.376, 0 failed, lag 12.693 ms
progress: 2520.0 s, 101.0 tps, lat 146.805 ms stddev 329.037, 0 failed, lag 54.946 ms
progress: 2580.0 s, 98.5 tps, lat 186.116 ms stddev 369.730, 0 failed, lag 98.700 ms
progress: 2640.0 s, 100.2 tps, lat 62.095 ms stddev 119.576, 0 failed, lag 2.023 ms
progress: 2700.0 s, 99.5 tps, lat 71.383 ms stddev 143.084, 0 failed, lag 4.054 ms
progress: 2760.0 s, 100.0 tps, lat 73.186 ms stddev 148.095, 0 failed, lag 1.882 ms
progress: 2820.0 s, 100.3 tps, lat 80.258 ms stddev 169.544, 0 failed, lag 5.936 ms
progress: 2880.0 s, 98.6 tps, lat 75.789 ms stddev 161.448, 0 failed, lag 2.136 ms
progress: 2940.0 s, 103.7 tps, lat 203.515 ms stddev 293.579, 0 failed, lag 118.073 ms
progress: 3000.0 s, 101.9 tps, lat 149.085 ms stddev 263.196, 0 failed, lag 66.254 ms
progress: 3060.0 s, 99.3 tps, lat 78.714 ms stddev 165.887, 0 failed, lag 3.332 ms
progress: 3120.0 s, 100.6 tps, lat 83.259 ms stddev 182.955, 0 failed, lag 4.556 ms
progress: 3180.0 s, 98.1 tps, lat 82.884 ms stddev 181.662, 0 failed, lag 3.129 ms
progress: 3240.0 s, 99.9 tps, lat 83.095 ms stddev 183.558, 0 failed, lag 2.576 ms
progress: 3300.0 s, 100.0 tps, lat 85.494 ms stddev 194.323, 0 failed, lag 4.479 ms
progress: 3360.0 s, 98.2 tps, lat 84.920 ms stddev 193.003, 0 failed, lag 2.676 ms
progress: 3420.0 s, 101.4 tps, lat 86.418 ms stddev 204.487, 0 failed, lag 3.915 ms
progress: 3480.0 s, 98.9 tps, lat 91.951 ms stddev 213.238, 0 failed, lag 5.161 ms
progress: 3540.0 s, 101.9 tps, lat 98.697 ms stddev 234.471, 0 failed, lag 12.376 ms
progress: 3600.0 s, 13.6 tps, lat 19861.627 ms stddev 15905.049, 0 failed, lag 19125.271 ms
The total statistics for the enqueue job:
transaction type: enqueue.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 359955
number of failed transactions: 0 (0.000%)
latency average = 6.192 ms
latency stddev = 4.475 ms
rate limit schedule lag: avg 2.012 (max 73.152) ms
initial connection time = 21.283 ms
tps = 99.988027 (without initial connection time)
and for the dequeue job:
transaction type: dequeue.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 354818
number of failed transactions: 0 (0.000%)
latency average = 124.958 ms
latency stddev = 1261.227 ms
rate limit schedule lag: avg 55.943 (max 51818.490) ms
initial connection time = 207.699 ms
tps = 98.550957 (without initial connection time)
My Grafana dashboard shows that this higher throughput and lower latency didn't use more resources in terms of LSM Tree seek/next operations or memory:
It experienced fewer read restarts:
It's not easy to recommend a one-size-fits-all design for queuing tables in SQL databases because it depends on how the application enqueues and dequeues. The goal of this post was to demonstrate some patterns and show how to test them. One characteristic of MVCC databases is that they must keep deleted rows for a while, and you should avoid scanning through an increasing number of them.