PostgreSQL double buffering: understand the cache size in a managed service (OCI)
Franck Pachot
Posted on November 27, 2023
It can be difficult to consider performance when you have limited information about the infrastructure of a managed service where your database is hosted. I have set up a PostgreSQL service on Oracle Cloud (OCI) to understand its cache size, for which we have very little information.
PostgreSQL relies on double buffering. Two levels of buffer caches are involved: PostgreSQL shared buffer cache and the OS filesystem buffer cache.
The PostgreSQL shared buffer cache is usually small and is used to manage concurrent reads and writes by multiple processes on the same table or index page. Increasing this cache results in fewer system calls and less overhead from full-page WAL writes, but it also reduces the memory available on the system for other caches and work areas.
If PostgreSQL can't find the required page in the shared buffers, which stores only the recently used pages, it reads and writes to the OS filesystem buffer cache, which may or may not result in disk access. To reduce the number of pages that need to be read and written to disk, it's helpful to increase the size of the filesystem cache. This also decreases the number of IO calls for consecutive pages because the OS can reorder and combine the pages to read and write. However, on Linux, you usually don't have direct control over the size of this cache because it is allocated by the kernel, based on the available memory, which can grow or shrink depending on the workload.
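Note that when PostgreSQL reports a "read", it only means the page was not in the shared buffers and was requested from the OS; it may still have been served from the filesystem cache. As a quick way to see this split, the cumulative counters of pg_stat_database can be queried (a generic example, nothing specific to the OCI service):
-- blks_hit: found in the PostgreSQL shared buffers
-- blks_read: requested from the OS, served by the filesystem cache or the disk
select datname, blks_hit, blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as hit_ratio_pct
from pg_stat_database
where datname = current_database();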
The PostgreSQL query planner tries to account for this double buffering with a simple setting that we can define, effective_cache_size, to give an idea of the total cache size. It doesn't allocate anything but is used to estimate the number of physical reads. It is recommended to set it to the sum of the allocated shared buffers and the memory estimated to be available for the filesystem cache. I have explained previously how this double buffering works with read-only and read-write workload examples.
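As an illustration of where these parameters live, here is how you would check the planner hint and, on a self-managed instance, change it (the 48GB value below is just an example; on a managed service this is typically set from the service console rather than with ALTER SYSTEM):
-- effective_cache_size is only a planner hint: nothing is allocated
show effective_cache_size;
-- on a self-managed instance, set it to shared_buffers + the expected filesystem cache
alter system set effective_cache_size = '48GB';
select pg_reload_conf();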
Trying to understand the memory allocation
I've configured a PostgreSQL DB System with 4 cores and 64GB of RAM:
The configuration is the default for this shape:
This can also be verified when connected:
postgres=> \dconfig shared_buffers
List of configuration parameters
Parameter | Value
----------------+-------
shared_buffers | 16GB
(1 row)
postgres=> \dconfig effective_cache_size
List of configuration parameters
Parameter | Value
----------------------+-------
effective_cache_size | 48GB
(1 row)
postgres=>
The shared_buffers parameter is read when PostgreSQL starts and defines the size of the shared buffer cache to allocate. The effective_cache_size is used by the query planner to account for the double buffering. From these settings, the instance was started with 16GB of shared buffers and assumes an additional 32GB of filesystem cache, so that 48GB can be cached in memory.
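Both parameters are exposed in pg_settings with a unit of 8kB blocks (assuming the default block size), so the same numbers can be double-checked with a query like this:
select name, setting, unit,
       pg_size_pretty(setting::bigint * 8192) as size
from pg_settings
where name in ('shared_buffers','effective_cache_size');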
We have a few metrics available in the cloud console. Memory Utilization is defined as: The total percentage of RAM that is currently being utilized (Used/Total). It is usually not a very useful metric because it doesn't show the available memory: memory can be counted as used while it actually holds the filesystem buffer cache and is reclaimable. Let's look at it and guess how RAM is allocated here:
This is 84% of the 64GB RAM instance, which means 54GB used and 10GB available. This doesn't fit the usual PostgreSQL deployments where a large part of the RAM should be available for the filesystem buffer cache. As the instance uses 16GB for its shared buffers, an additional 32GB is allocated by something else. It could be HugePages, but we don't need more huge pages than the shared buffers. Interestingly, this matches the 48GB effective_cache_size setting, as if those additional 32GB were used for the filesystem cache.
I've no idea about the underlying architecture, but it is possible that the managed PostgreSQL on OCI uses a filesystem that allocates memory for its cache, and of course ZFS ARC (Adaptive Replacement Cache) comes to mind. 32GB is half of the physical RAM, which is the default maximum ARC size for ZFS. With 16GB allocated by PostgreSQL, plus 32GB by ZFS, plus a few GB for the kernel and processes, we would see the same picture of 84% used. Another clue supporting this guess is that full_page_writes is disabled:
postgres=> \dconfig *full*
List of configuration parameters
Parameter | Value
------------------+-------
full_page_writes | off
(1 row)
And this is safe only when the filesystem guarantees write atomicity.
Testing IO by scanning multiple sizes
To understand the behavior, I can create a table, scan it, and measure the elapsed time. Knowing the number of blocks read, I can guess from the response time whether those I/Os come from cache or disk.
I'll run a process that adds 100MB on each iteration and then performs a sequential scan on the whole table. You might know that a sequential scan doesn't use the entire shared buffer cache, but rather a 256KB buffer ring. However, that's not applicable here because the data is already present in the shared buffer cache right after the insertion. To clear all doubts, I use explain (analyze, buffers) to know exactly the shared buffers hits and misses.
# every 2 minutes (up to 1000 times): add ~100MB to the table, show its size, and scan it 3 times
psql -e <<'SQL'
\timing on
drop table if exists large;
-- fillfactor=10 with ~1000-character rows puts one row per 8kB page
create table large ( filler text ) with (fillfactor=10);
-- insert 12800 rows of 1000 random printable characters: ~100MB (12800 x 8kB pages) per iteration
insert into large
select string_agg( chr((32+random()*94)::int) ,'' )
from generate_series(1,1000) c, generate_series(1,12800) r
group by r
\; select 'size', pg_table_size('large') , pg_size_pretty( pg_table_size('large') ) , now()
\; explain (costs off, analyze, buffers) select * from large
\; explain (costs off, analyze, buffers) select * from large
\; explain (costs off, analyze, buffers) select * from large
\watch i=120 c=1000
SQL
This is easy and you can run the same in your environment if you want to understand the usage of the OS filesystem buffer cache.
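If you want to verify directly how much of the table sits in the shared buffers between two scans, rather than inferring it from the hit counts, the pg_buffercache extension can be used, assuming it is available on the managed service (I haven't checked that on OCI):
-- requires the pg_buffercache extension and sufficient privileges to create it
create extension if not exists pg_buffercache;
-- count the shared buffers currently holding pages of the 'large' table
select count(*) as buffers,
       pg_size_pretty(count(*) * 8192) as in_shared_buffers
from pg_buffercache
where relfilenode = pg_relation_filenode('large')
  and reldatabase = (select oid from pg_database where datname = current_database());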
I'll show the results for the size ranges that exhibit different throughput:
- when all the working set fits in the PostgreSQL shared buffers
- when the working set is larger but still fits in the physical memory allocated for buffers, by PostgreSQL and by the OS.
- when the working set is larger than the memory available for file buffers in all caches.
Here is the big picture:
Working set fits in PostgreSQL shared buffers cache
The first iteration inserts 100MB and reads 12800 pages (which is 100MB with 8k pages) from the PostgreSQL shared buffers:
Sat 25 Nov 2023 09:33:22 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 104914944 | 100 MB | 2023-11-25 09:33:21.779638+00
(1 row)
Sat 25 Nov 2023 09:33:22 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on large (actual time=0.002..4.697 rows=12800 loops=1)
Buffers: shared hit=12800
Planning:
Buffers: shared hit=6
Planning Time: 0.051 ms
Execution Time: 5.207 ms
(6 rows)
I have read 12800/4.697 = 2725 pages per millisecond.
After a while, when the table is 14GB, still smaller than the 16GB of shared buffers, I see a similar pattern, reading 1894400 blocks from the shared buffers:
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 15523201024 | 14 GB | 2023-11-25 14:27:21.327071+00
(1 row)
Sat 25 Nov 2023 02:27:22 PM GMT (every 120s)
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on large (actual time=0.004..671.200 rows=1894400 loops=1)
Buffers: shared hit=1894400
Planning Time: 0.028 ms
Execution Time: 745.883 ms
(4 rows)
I have read 1894400/671.200 = 2822 pages per millisecond.
The throughput is similar when the read set is between 100MB and 14GB because those reads come from the same place: the working set is larger than the CPU L2 cache and smaller than the PostgreSQL shared buffers. This is the throughput when reading from RAM.
The last scan that shows 100% shared buffers hits is with a table of 16GB, the size of the shared buffers:
Sat 25 Nov 2023 02:57:22 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 17096450048 | 16 GB | 2023-11-25 14:57:21.280085+00
(1 row)
Sat 25 Nov 2023 02:57:22 PM GMT (every 120s)
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on large (actual time=0.005..753.932 rows=2086400 loops=1)
Buffers: shared hit=2086400
Planning Time: 0.020 ms
Execution Time: 835.249 ms
(4 rows)
This is exactly what I expected. The table fits in the shared buffers, and my only activity, besides the inserts used to grow the table, is reading this table. All this stays in userspace without any calls to the kernel (except a sync of the generated WAL to commit, but this is negligible here).
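This can also be double-checked from the per-table counters: in this test, where the rows were just inserted, heap_blks_read stays at zero while heap_blks_hit grows, as long as nothing is evicted from the shared buffers. This is an additional check on my side, not part of the test script:
select heap_blks_hit, heap_blks_read
from pg_statio_user_tables
where relname = 'large';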
Working set larger than the PostgreSQL shared buffer cache but the overflow fits in the filesystem cache
With an additional 100MB, I start to see some buffer reads, which means that those pages were not found in the PostgreSQL cache and had to be read through a system call to the OS:
Sat 25 Nov 2023 02:59:22 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 17201332224 | 16 GB | 2023-11-25 14:59:21.277573+00
(1 row)
Sat 25 Nov 2023 02:59:22 PM GMT (every 120s)
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on large (actual time=0.004..989.347 rows=2099200 loops=1)
Buffers: shared hit=2096226 read=2974
Planning Time: 0.021 ms
Execution Time: 1091.084 ms
(4 rows)
I have read (2096226+2974)/989.347 = 2122 pages per millisecond.
While adding more and more rows to the table, I see more and more reads, visible from the instance metrics with a decreasing Buffer Cache Hit Ratio:
Even if those are seen as "reads" by PostgreSQL, the OS doesn't do any disk reads until the table reaches 48GB:
Even if those reads from the filesystem cache also come from RAM, they are slower, and the scan throughput decreases linearly, down to (2097074+4200526)/10975.859 = 574 pages per millisecond when it reaches 48GB:
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 51603955712 | 48 GB | 2023-11-26 01:55:20.265367+00
(1 row)
Sun 26 Nov 2023 01:55:22 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=0.019..10975.859 rows=6297600 loops=1)
Buffers: shared hit=2097074 read=4200526
Planning Time: 0.027 ms
Execution Time: 11240.704 ms
(4 rows)
This is the last iteration reading more than 100 pages per millisecond.
16GB (2097074 pages of 8KB) were read from the shared buffers and 32GB (4200526 pages of 8KB) were read from the filesystem cache. As I have seen previously that I can read those 16GB from the shared buffers in 753.932 milliseconds, I estimate the rate of the 32GB read from the filesystem cache at 4200526/(10975.859-753.932) = 411 pages per millisecond.
Then, with just 100MB more, the response time becomes 10 times slower.
Working set larger than the total size of both caches
Here is the next iteration where we read only (2097074+4213326)/103981.422=61 pages per millisecond:
Sun 26 Nov 2023 01:57:22 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 51708837888 | 48 GB | 2023-11-26 01:57:20.261515+00
(1 row)
Sun 26 Nov 2023 01:57:22 AM GMT (every 120s)
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on large (actual time=0.605..103981.422 rows=6310400 loops=1)
Buffers: shared hit=2097074 read=4213326
Planning Time: 0.019 ms
Execution Time: 104339.804 ms
(4 rows)
At this point, each scan wipes out the whole cache. The math is simple from the explain (analyze, buffers) output:
- shared hit: 2097074*8/1024/1024=16GB is the size of the PostgreSQL shared buffers
- read: 4213326*8/1024/1024=32GB is the size of the OS filesystem cache

When we read a bit more than the total, 48GB, we blow out the entire filesystem cache and the reads are physical reads.
This observation confirms that I have a filesystem cache of 32GB and that the effective cache size is correctly set.
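As a quick sanity check, the same arithmetic can be done in SQL, with 8192 bytes per page:
select pg_size_pretty(2097074::bigint * 8192) as from_shared_buffers,
       pg_size_pretty(4213326::bigint * 8192) as from_filesystem_cache;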
Storage Performance and Instance Metrics
I do the same maths to get the throughput for those physical reads: 4213326/(103981.422-753.932)=41 pages per millisecond.
This is roughly what I see from the IOPS metric with 34K IOPS:
The I/O size is the page size, 8KB, but that doesn't match what the metrics show for Read Throughput: 284.699 KB/second. I think we do 1000 times more: 34000 IOPS * 8 KB = 272 MB/s. Either I missed something or the values displayed are wrong.
Note that when creating the instance, it was defined as "Performance tier: 300K IOPS":
As I'm scanning a single table from a single session, I don't reach this maximum but run at 35K IOPS with an average of 1 millisecond latency:
To summarize
I have conducted several tests where I added 100MB of data to a table and then scanned it. In order to compare the scan time with the number of block reads, I gathered the execution plan with EXPLAIN (ANALYZE, BUFFERS). During the tests, I monitored the hits in the PostgreSQL shared buffer cache and the reads from the OS filesystem. I also observed that the memory usage on the OS remained at 80% throughout the test, even when the table scan read more than 64GB (which is the physical memory limit):
The Buffer Cache Hit Ratio was 100% for tables smaller than 16GB, which is the size of shared_buffers. As the size of the table increased, the Buffer Cache Hit Ratio decreased proportionally to the additional size that had to be scanned. This resulted in slower response times, as the additional reads had the overhead of system calls to the filesystem cache. When the table size exceeded 48GB, inserts were slower and reads were 10 times slower. This suggests that the filesystem buffer cache size was 32GB (48GB - 16GB). We can also deduce that this filesystem cache is included in the 80% Memory Used, which is not typical for default Linux filesystems.
You can estimate the sizes of the caches by measuring the response time when scanning different working set sizes, even without having any information about the infrastructure. Although a real application is different from a single Seq Scan, the concept remains the same: your application has a frequently accessed working set, and you want to avoid storage access for it.