Future PostgreSQL: improvement to the replication protocol
Ronan Dunklau
Posted on January 13, 2022
Today's post is about a new PostgreSQL feature (coming in PostgreSQL 15) developed by Aiven to solve a reliability problem we sometimes run into with replication. The READ_REPLICATION_SLOT
command makes a physical replication client, particularly pg_receivewal
, more reliable.
At Aiven's OSPO office, we dedicate our time to the open source products Aiven offers or develops, and we were very happy to collaborate on this feature.
PostgreSQL replication surprises
This story starts with our own PgHoard, a PITR backup tool for PostgreSQL. PgHoard offers several methods to archive the WAL (Write Ahead Log), including pg_receivewal
, a small application shipping with PostgreSQL which connects to a PostgreSQL cluster using the physical replication protocol to stream WAL as they are produced, optionally keeping track of the position on the server using a replication slot.
We noticed that we could occasionally lose some WAL when PgHoard is restarted on another machine. This happens because pg_receivewal
determines its start position like this:
- Look at what segments are present in the archive folder, and resume from the latest archived one.
- If the archive folder is empty, start from the current server LSN position.
The problem is, in our case, since pg_receivewal
was started on another server, it didn't have access to the archive folder directly so it resumed from the current LSN (Log Sequence Number) position.
You can observe this in action by launching pg_receivewal
, stopping it after some WAL have been archived, then relaunching it with its directory empty:
❯ sudo -u postgres pg_receivewal -D /tmp/wals -v -s pg_receivewal_slot
pg_receivewal: starting log streaming at 47/E5000000 (timeline 1)
pg_receivewal: finished segment at 47/E6000000 (timeline 1)
pg_receivewal: received interrupt signal, exiting
pg_receivewal: not renaming "0000000100000047000000E6.partial", segment is not complete
pg_receivewal: received interrupt signal, exiting
❯ sudo rm /tmp/wals/* -rf
# Wait for some WAL to be generated on the server, and relaunch pg_receivewal
❯ sudo -u postgres pg_receivewal -D /tmp/wals -v -s pg_receivewal_slot
pg_receivewal: starting log streaming at 47/E9000000 (timeline 1)
We notice here that the segments 0000000100000047000000E7
and 0000000100000047000000E8
have not been archived at all, jumping from 47/E6000000
to 47/E9000000
What should happen here is resuming from the replication slot's restart_lsn
, which is kept on the server.
New feature: READ_REPLICATION_SLOT
command
Prior to version 15 of PostgreSQL, the only way to know a replication slot's position was to query the pg_replication_slots
view, which means regular SQL queries need to be issued. This is fine for logical replication, as logical replication connection are bound to a database and are allowed to perform SQL queries. But in the case of physical replication connections, we do not have this possibility
That meant that a physical replication client had no way to know the state of it's replication slot, except by opening a separate, non-replication connection, which is not something we can do with the pg_receivewal
application.
The READ_REPLICATION_SLOT
command aims to fulfill this gap, by allowing a replication connection to read the current restart_lsn
and timeline associated
to a replication slot:
❯ psql postgres -d "replication=1"
psql (15devel)
Type "help" for help.
postgres=# READ_REPLICATION_SLOT slot1;
slot_type | restart_lsn | restart_tli
-----------+-------------+-------------
physical | 0/1501000 | 1
(1 row)
With that new command, pg_receivewal
can now determine its starting LSN slightly differently:
- Look at what segments are present in the archive folder, and resume from the latest archived one.
- If the archive folder is empty and a replication slot is used, issue a
READ_REPLICATION_SLOT
command to read the state from the server. - Otherwise, start streaming from the server's current LSN.
This new command is currently only supported for physical replication slots, and returns limited information but that could evolve in future if needed.
Already the feature will help with the "surprises" we encountered when using pg_receivewal
, and will also benefit other replication clients by making them more reliable. It has been accepted into the project and will be available in the version 15 release.
Next steps
- Learn more about Aiven for PostgreSQL, or sign up for our free trial.
- Check out the PgHoard project on GitHub.
- Find out what else our OSPO (Open Source Program Office) is up to.
- Follow us on Twitter to keep up with more posts like this.
Posted on January 13, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 29, 2024