Observing CPU/RAM/IO pressure in YugabyteDB with Linux PSI on AlmaLinux 8 (Pressure Stall Information)
Franck Pachot
Posted on July 31, 2024
Pressure Stall Information is a Linux kernel metric that accounts for when the OS tasks are waiting for CPU, RAM, or IO resources.
I run this example with a YugabyteDB cluster provisioned with the Yugabyte DBaaS "YugabyteDB Anywhere". It deploys on AlmaLinux release 8.9 with kernel 4.18.0-513. To enable Pressure Stall Information in the Linux kernel command line, I connect to each node as root and run the following:
cat /proc/cmdline | grep " psi=" ||
{ grubby --update-kernel ALL --args 'psi=1' &&
  echo "reboot node $(hostname) to enable PSI" ; }
In the list of nodes in YugabyteDB Anywhere, there's a "🔗 Connect" action that displays the ssh command. I changed yugabyte@ to ec2-user@ to connect as a sudoer user. To reboot, I use the YugabyteDB Anywhere "🔁 Initiate Rolling Restart" action.
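After the rolling restart, each node should show psi=1 on its kernel command line and expose the /proc/pressure files. A quick check to run on each node:
grep -o "psi=1" /proc/cmdline && ls /proc/pressure/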
It is easy to query the PSI values with tail /proc/pressure/*:
[ec2-user@ip-172-159-16-208 ~]$ tail /proc/pressure/*
==> /proc/pressure/cpu <==
some avg10=0.06 avg60=0.45 avg300=0.43 total=10181922
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=2629731
full avg10=0.00 avg60=0.00 avg300=0.00 total=2127499
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
In short, it provides the percentage of wall-clock time where at least one task was waiting (some) or all tasks were stalled (full), either waiting in the CPU run queue (cpu), waiting for RAM access (memory), or waiting on disk read/write operations (io), averaged over the last ten seconds (avg10), the last minute (avg60), and the last five minutes (avg300).
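These files are easy to parse in scripts. For example, a small awk one-liner (a sketch that relies on the field layout shown above) prints the 10-second some average for each resource:
awk '/^some/ {split($2,a,"="); print FILENAME, a[2]"%"}' /proc/pressure/*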
My lab is a small three-node cluster with two vCPUs and 4 GB of RAM per node, where I can easily create resource shortages by running PgBench workloads.
pgbench -i -s 200
I initialize the pgbench tables with ysql_bench -i:
alias ysql_bench="$(find / -name ysql_bench | tail -1)"
alias ysqlsh="$(find / -name ysqlsh | tail -1)"
export PGHOST=$(hostname) PGPASSWORD=Tablesp@c3
time ysql_bench -i -s 200 && tail /proc/pressure/*
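While the load runs, PSI can be watched from another session with a trivial loop (a sketch; the 10-second interval matches avg10):
while sleep 10; do date; tail /proc/pressure/*; done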
This ran for ten minutes, and tail /proc/pressure/* shows some pressure on the CPU:
==> /proc/pressure/cpu <==
some avg10=54.32 avg60=53.20 avg300=42.55 total=802267049
==> /proc/pressure/io <==
some avg10=1.91 avg60=1.53 avg300=1.26 total=36474251
full avg10=0.65 avg60=0.26 avg300=0.16 total=17523081
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=565109
full avg10=0.00 avg60=0.00 avg300=0.00 total=218065
The pressure increased: the one-minute average (avg60) is higher than the five-minute average (avg300).
As PSI is collected by node_exporter, which YugabyteDB Anywhere (YBA) gives access to, I can get the details in Prometheus:
irate(node_pressure_cpu_waiting_seconds_total[30s])*100
This is the CPU pressure and looks similar to CPU Usage statistics:
However, they measure different things. CPU Usage refers to what is currently running on the CPU, using the available CPU cycles, while CPU Pressure refers to what is waiting to run on the CPU, consuming response time without making progress. They look similar because processes wait in the run queue precisely when the CPU is busy. One is the consequence of the other, but PSI is closer to what you want to know: the impact of high CPU usage on response time.
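To plot both side by side, CPU usage can be derived from the standard node_cpu_seconds_total counters and compared with the pressure metric (a sketch using default node_exporter metric names):
100 - avg(irate(node_cpu_seconds_total{mode="idle"}[30s])) * 100
irate(node_pressure_cpu_waiting_seconds_total[30s]) * 100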
We have the same difference for other resources. For example, I can see some disk write activity, but I don't need to look at it because PSI tells me that there's no io pressure (no impact on response time):
It is the same for memory, which was used for the filesystem cache and seems to be sufficient, as PSI tells me that there's no memory pressure.
Here is the summary of PSI waiting percentages:
avg(100*irate({node_prefix="yb-15-fpachot-test", saved_name=~"node_pressure_cpu_waiting_seconds_total|node_pressure_memory_waiting_seconds_total|node_pressure_io_waiting_seconds_total"}[30s])) by (saved_name)
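The raw node_pressure_* counters behind these Prometheus queries come from node_exporter and can also be listed directly on a node (adjust the port to the one node_exporter listens on; 9100 is the upstream default, and YugabyteDB Anywhere deployments typically use 9300):
curl -s http://localhost:9300/metrics | grep ^node_pressure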
Once the pressure is identified, the other statistics may be useful. For example, at the same time (PSI was in UTC, but this one is in CET), I can see that the increase in CPU pressure came after tablet splitting and compaction:
After tablet splitting, the insert operations are distributed to more tablets, maintaining the throughput while using more CPUs. If this workload's pressure reaches 100%, you may consider running with more vCPUs. With YugabyteDB, it can be as simple as adding nodes, and the tablets will be rebalanced.
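To be notified when this happens, a sustained-pressure condition can be expressed with a Prometheus subquery (a sketch; the 80% threshold and 15-minute window are arbitrary values to adapt to your own service levels):
avg_over_time(irate(node_pressure_cpu_waiting_seconds_total[30s])[15m:30s]) * 100 > 80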
pgbench -i -s 2000
I run a similar data generation at a scale ten times larger to ensure the database size exceeds the memory capacity.
time ysql_bench -i -s 2000 && tail /proc/pressure/*
...
199800000 of 200000000 tuples (99%) done (elapsed 6570.54 s, remaining 6.58 s)
199900000 of 200000000 tuples (99%) done (elapsed 6575.12 s, remaining 3.29 s)
200000000 of 200000000 tuples (100%) done (elapsed 6579.70 s, remaining 0.00 s)
done.
real 110m0.836s
user 1m10.653s
sys 0m15.283s
==> /proc/pressure/cpu <==
some avg10=21.80 avg60=24.13 avg300=26.29 total=3472556285
==> /proc/pressure/io <==
some avg10=24.20 avg60=22.05 avg300=20.30 total=810389316
full avg10=22.01 avg60=19.61 avg300=17.73 total=647292638
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.09 avg300=0.25 total=18390387
full avg10=0.00 avg60=0.06 avg300=0.17 total=13384152
Here, there's some additional pressure on disk I/O, with full stalls around 20% of the time.
PSI shows the same CPU pressure as with a smaller scale, plus some IO pressure:
irate(node_pressure_cpu_waiting_seconds_total[30s])*100
irate(node_pressure_io_waiting_seconds_total[30s])*100
irate(node_pressure_io_stalled_seconds_total[30s])*100
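This io pressure can be cross-checked against per-device statistics, for example with iostat from the sysstat package (a quick sanity check; the 10-second interval and 3 samples are arbitrary):
iostat -x 10 3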
There is no memory pressure during this data ingest, but as the size doesn't fit in memory, reading randomly from all those rows should show some memory pressure.
pgbench -c 10 -S with 3x2vCPU
I run the select-only PgBench workload from ten connections for one hour on the whole scale so that the working set is larger than the available memory:
ysql_bench -c 10 -n -S -T 3600 && tail /proc/pressure/*
scaling factor: 2000
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 300 s
number of transactions actually processed: 1098436
maximum number of tries: 1
latency average = 2.731 ms
tps = 3661.328960 (including connections establishing)
tps = 3661.579417 (excluding connections establishing)
==> /proc/pressure/cpu <==
some avg10=60.92 avg60=67.65 avg300=47.41 total=328937398
==> /proc/pressure/io <==
some avg10=71.28 avg60=61.14 avg300=47.93 total=472784815
full avg10=16.94 avg60=11.32 avg300=10.86 total=234523603
==> /proc/pressure/memory <==
some avg10=29.97 avg60=22.82 avg300=11.92 total=92159505
full avg10=8.76 avg60=5.17 avg300=3.38 total=43135045
Here are all the PSI percentages (they add up to more than 100% because the some values concern different processes waiting at the same time):
avg(100*irate({node_prefix="yb-15-fpachot-test", saved_name=~"node_pressure_cpu_waiting_seconds_total|node_pressure_memory_waiting_seconds_total|node_pressure_memory_stalled_seconds_total|node_pressure_io_waiting_seconds_total|node_pressure_io_stalled_seconds_total"}[30s])) by (saved_name)
This suggests that scaling out would benefit such a workload by distributing the pressure across more nodes. A new node adds CPU, memory, and IO capacity.
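After adding nodes, the same metric can be grouped per node to check that the pressure actually spreads out (a sketch reusing the labels above; the node label, here instance, is an assumption and may differ in your Prometheus setup):
avg(100*irate({node_prefix="yb-15-fpachot-test", saved_name="node_pressure_cpu_waiting_seconds_total"}[30s])) by (instance)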
pgbench -c 10 -S with 6x2vCPU
I doubled the number of nodes and ran the same workload:
ysql_bench -c 10 -n -S -T 3600 && tail /proc/pressure/*
transaction type: <builtin: select only>
scaling factor: 2000
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 3600 s
number of transactions actually processed: 16326236
maximum number of tries: 1
latency average = 2.205 ms
tps = 4535.058165 (including connections establishing)
tps = 4535.085793 (excluding connections establishing)
==> /proc/pressure/cpu <==
some avg10=80.87 avg60=80.63 avg300=80.57 total=2957633146
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=22159835
full avg10=0.00 avg60=0.00 avg300=0.00 total=10339498
==> /proc/pressure/memory <==
some avg10=0.05 avg60=0.02 avg300=0.00 total=1311355
full avg10=0.00 avg60=0.00 avg300=0.00 total=435471
The throughput has increased, and there is no more pressure on IO or memory. Let's now scale up the vCPUs.
pgbench -c 10 -S with 6x4vCPU
Next, I scale up the vCPUs to lower the CPU pressure:
ysql_bench -c 10 -n -S -T 3600 && tail /proc/pressure/*
transaction type: <builtin: select only>
scaling factor: 2000
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 3600 s
number of transactions actually processed: 46462327
maximum number of tries: 1
latency average = 0.775 ms
tps = 12906.195380 (including connections establishing)
tps = 12906.264876 (excluding connections establishing)
==> /proc/pressure/cpu <==
some avg10=25.37 avg60=23.94 avg300=23.82 total=866880887
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=26769014
full avg10=0.00 avg60=0.00 avg300=0.00 total=15453692
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
The initial pressure on disk I/O results from the rolling restart during scaling up, which begins with a cold cache. When there's 25% CPU pressure on a 4 vCPU node, it means that, on average, only one task is waiting for CPU. The server is over-provisioned but offers the best performance (achieving 12906 transactions per second with a 0.775 ms latency) and can handle a higher workload.
pgbench -c 10 -S with 3x4vCPU
Finally, because increasing the size of the instances has increased both the RAM and the vCPUs, my working set fits again in the available memory. This means I can reduce the cluster to 3 nodes without putting more pressure on the disks. I stopped three nodes, one after the other at 15-minute intervals, while running PgBench:
ysql_bench -c 10 -n -S -T 3600 && tail /proc/pressure/*
transaction type: <builtin: select only>
scaling factor: 2000
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 3600 s
number of transactions actually processed: 47227455
maximum number of tries: 1
latency average = 0.762 ms
tps = 13118.729762 (including connections establishing)
tps = 13118.797145 (excluding connections establishing)
==> /proc/pressure/cpu <==
some avg10=29.01 avg60=29.10 avg300=28.82 total=1843235106
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.16 total=30467826
full avg10=0.00 avg60=0.00 avg300=0.00 total=16404190
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
The throughput didn't decrease. The pressure on the CPU increased a little, and there was only some short pressure on I/O when new replicas were bootstrapped on the remaining nodes:
Pressure Stall Information helps size the cluster to maximize resource usage without impacting response time and throughput. It also shows that the cluster can accept a higher workload when it is far from 100% pressure. In this example, 3x c5.xlarge was undersized for this workload, with pressure on CPU and IO; 6x c5.xlarge solved the IO pressure; and finally, 3x c5.2xlarge was the best configuration.