Dmitry Romanoff
Posted on September 21, 2022
1. Introduction to Migrating MySQL to PostgreSQL.
In this post I'll describe the process of migrating database(s) from MySQL to PostgreSQL.
It is essential to create a reliable and stable migration process with zero data loss and minimal downtime, capable of migrating DBs of tens or hundreds of GBs, while guaranteeing that existing applications can work transparently with the migrated DBs.
The MySQL2PG process is based on the open-source pgloader utility [https://github.com/dimitri/pgloader] to address this demand.
2. What are the components/topology to migrate DBs from MySQL to PostgreSQL?
The basic topology of the DB migration process includes the Source database (MySQL), the Destination database (PostgreSQL), and the Migration machine. The pgloader runs from the Migration machine, reading data from the Source database (MySQL) and writing it to the Destination database (PostgreSQL).
The Migration machine should have direct, reliable, and high throughput connectivity to the source and destination databases. The typical configuration of the Migration machine includes 16 CPUs and 64 GB RAM, network bandwidth of at least 10 Gbps, and storage of at least 500GB.
To guarantee compatibility of DB objects, their structure, and their correspondence to specific data types, the DB migration process first runs DDL scripts suitable for the particular type/version of a product. It then copies data into the pre-created target DB schema.
pgloader does not export data into a dump, place a dump on the Migration machine, or occupy storage space with the DBs' data. Instead, it creates a pipeline in the Migration machine's RAM, selecting data from the Source DB and copying it directly into the Destination DB.
3. What would be a typical OS for the Migration machine?
OS: Ubuntu 20.04
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
4. What is recommended to install on the Migration machine?
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install postgresql-13
sudo apt-get install -y pgloader
sudo apt-get install mysql-client
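Before running pgloader, it's worth verifying from the Migration machine that both databases are reachable. A minimal check, using the same placeholder hosts and credentials as the configuration examples below:

```shell
# Placeholders: replace hosts, users, and passwords with your real values.

# Check that the MySQL source is reachable and accepts logins
mysql -h src_db_host -u src_db_user -p -e "SELECT VERSION();"

# Check that the PostgreSQL destination is reachable
psql "postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name" -c "SELECT version();"
```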
5. How to run pgloader to migrate a DB from MySQL to PostgreSQL?
Create pgloader configuration file:
cat pgloader.conf
LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;
Run pgloader:
pgloader pgloader.conf
6. How to run MySQL2PG pgloader as a background process?
To avoid possible interruption of the MySQL2PG migration process, it's recommended to run pgloader as a background process. The procedure includes the following steps.
Create pgloader configuration file:
cat pgloader.conf
LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;
Create a shell script to run pgloader as a background process:
cat run_pgloader_background.sh
nohup pgloader pgloader.conf 2>&1 &
Make the script executable and run it:
chmod +x run_pgloader_background.sh
./run_pgloader_background.sh
The run trace will be written to the file:
nohup.out
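While the background run is in progress, the trace and the process itself can be checked with standard tools:

```shell
# Follow the pgloader trace as it is being written
tail -f nohup.out

# Verify the pgloader process is still alive
# (the [p] trick keeps grep from matching itself)
ps -ef | grep '[p]gloader'
```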
NOTES:
Usually:
– src_db_user = dest_db_user
– src_db_pwd = dest_db_pwd
– src_db_name = dest_db_name
7. What does a typical output trace of pgloader look like?
2022-01-08T20:47:55.046000+02:00 LOG report summary reset
table name errors read imported bytes total time read write
fetch meta data 0 278 278 0.808s
Create Schemas 0 0 0 0.018s
Create SQL Types 0 0 0 0.010s
Create tables 0 144 144 1.137s
Set Table OIDs 0 72 72 0.011s
my_db.my_table_number_1 0 319387354 319387354 38.0 GB 37m40.265s 37m40.242s 28m41.397s
my_db.my_table_number_2 0 15748659 15748659 1.9 GB 1m50.060s 1m50.039s 1m22.053s
my_db.my_table_number_3 0 3989089 3989089 336.5 MB 20.059s 20.009s 14.337s
(etc)
COPY Threads Completion 0 4 4 45m10.569s
Create Indexes 0 165 165 1h2m48.214s
Index Build Completion 0 165 165 6m12.345s
Reset Sequences 0 1 1 0.123s
Primary Keys 0 69 69 0.789s
Create Foreign Keys 0 41 41 1m23.456s
Create Triggers 0 0 0 0.005s
Set Search Path 0 1 1 0.012s
Install Comments 0 0 0 0.000s
Total import time ✓ 363230932 363230932 123.4 GB 1h23m45.678s
8. How to ensure the migration has finished successfully? How do I ensure zero data loss?
Verify the pgloader trace: it should report zero errors, and the read and imported values should match for every table.
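Beyond the trace, a simple sanity check is to compare per-table row counts between the source and the destination. A sketch, using the placeholder hosts, credentials, and table names from the examples in this post:

```shell
# Hypothetical table name -- repeat for each migrated table
TABLE=my_table_number_1

# Row count on the MySQL source (-N suppresses the column header)
mysql -h src_db_host -u src_db_user -p -N \
      -e "SELECT COUNT(*) FROM src_db_name.$TABLE"

# Row count on the PostgreSQL destination (-t prints the value only)
psql "postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name" \
     -t -c "SELECT COUNT(*) FROM $TABLE"
```

Note that if the source keeps receiving writes during the migration, counts taken at different moments may legitimately differ; compare them only once writes to the source have stopped.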
9. How do I exclude tables from the pgloader process?
In case the source DB contains tables that can be safely excluded from the migration process, this can be done via the “EXCLUDING TABLE NAMES MATCHING” configuration of pgloader.
Example:
LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name
EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three';
10. The “Heap exhausted, game over” issue
The “Heap exhausted, game over” message from pgloader may indicate that pgloader's default 4 GB of dedicated RAM is not enough for a given DB migration run.
The solution in such a case is to use a pgloader binary compiled with a larger DYNSIZE parameter. The DYNSIZE parameter modifies the default amount of memory the pgloader image will allow itself to use when running through data.
11. How to compile pgloader to use more RAM?
pgloader does not use all available RAM on the host machine; the amount of RAM it can use is defined at compilation time. To dedicate more RAM, a customized version of pgloader should be compiled. The following step-by-step guide illustrates how to compile pgloader with a customized DYNSIZE parameter, which modifies the amount of memory the pgloader image will allow itself to use when running through data.
pgloader compilation
Build pgloader from sources
(1) Check OS:
dima@dima-VirtualBox:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
dima@dima-VirtualBox:~$
(2) Install git:
sudo apt update
sudo apt install git
(3) Clone pgloader:
mkdir my_pgloader
cd my_pgloader/
git clone https://github.com/dimitri/pgloader.git
(4) Install the packages necessary for the build:
sudo apt-get install sbcl unzip libsqlite3-dev make curl gawk freetds-dev libzip-dev
(5) make build:
make DYNSIZE=10240 pgloader
(6) make outputs a ./build/bin/pgloader file for us to use.
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ pwd
/home/dima/my_pgloader/pgloader/build/bin
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 69160
-rw-rw-r-- 1 70 Jan 7 17:17 .gitignore
drwxrwxr-x 5 4096 Jan 7 17:29 ..
-rwxr-xr-x 1 41918800 Jan 7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan 7 17:29 pgloader
drwxrwxr-x 2 4096 Jan 7 17:29 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$
cp pgloader pgloader_dima_dynsize_10240
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 97396
-rw-rw-r-- 1 70 Jan 7 17:17 .gitignore
drwxrwxr-x 5 4096 Jan 7 17:29 ..
-rwxr-xr-x 1 41918800 Jan 7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan 7 17:29 pgloader
-rwxr-xr-x 1 29036008 Jan 7 17:33 pgloader_dima_dynsize_10240
drwxrwxr-x 2 4096 Jan 7 17:33 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ./pgloader_dima_dynsize_10240 --version
pgloader version "3.6.a94a0a3"
compiled with SBCL 2.0.1.debian
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$
12. How to manage the “Connection reset by peer” pgloader issue?
The “Connection reset by peer” error message during a pgloader run usually indicates timeouts in pgloader's connection to MySQL (the Source DB).
To manage this issue, we recommend adding the following parameters to the pgloader configuration script:
SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'
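For example, a complete configuration carrying these timeout settings (values are in seconds; hosts and credentials are the same placeholders as above) would look like:

```
LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name
SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000';
```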
13. How to improve the duration of the pgloader migration process?
To shorten the duration of the pgloader migration process, there are a few approaches:
– Scale up source and destination databases
– Ensure there is no load or heavy activity, and enough storage, RAM, connections, and resources on both the source and the destination databases
– Use pgloader compiled with customized parameter DYNSIZE
– Scale up the Migration machine to allow more RAM
– Run pgloader with the following parameters:
SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'
14. How to configure pgloader to run stably and reliably on big databases / big amounts of data?
– Use a pgloader binary compiled with a customized DYNSIZE parameter; it allows dedicating more RAM to the pgloader process
– Use the following pgloader configuration:
cat pgloader.conf
LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name
WITH
data only, create no indexes,
workers = 8, concurrency = 1,
multiple readers per thread, rows per range = 10000,
batch rows = 10000
SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'
SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'
EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'
ALTER SCHEMA 'db_schema_name' RENAME TO 'public';
15. What will a minimal pgloader configuration look like?
LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name
WITH
batch size = 2048 kB,
batch rows = 2000,
prefetch rows = 2000
SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'
SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'
EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'
ALTER SCHEMA 'db_schema_name' RENAME TO 'public';
16. Why is it important to monitor the source and destination databases and the Migration machine before, during, and after the process?
To guarantee stable, reliable, and optimal operation of pgloader, it's vital to monitor the source and destination databases before, during, and after migration. The DB machines should be powerful enough and have sufficient resources for pgloader's intensive activity. Neither the source nor the destination DB machine should be loaded with heavy activity, intensive CPU operations, or high reads and/or writes.
Note: any load or heavy activity on the Source or the Destination DB machine directly impacts the DB migration’s performance and success.
In addition, it’s important to monitor the CPU, IOPs, memory, and network of the Migration machine during the process.
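On an Ubuntu Migration machine, a few standard utilities cover these metrics (iostat and sar come from the sysstat package, which may need to be installed first):

```shell
# CPU, memory, and swap activity: 3 samples, 5 seconds apart
vmstat 5 3

# Current memory usage in human-readable form
free -h

# Per-device I/O statistics (requires the sysstat package)
iostat -x 5 3

# Network throughput per interface (also from sysstat)
sar -n DEV 5 3
```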
17. Importance of separating the Source DB, the Destination DB, and the Migration machines.
It's important to keep the Source DB instance, the Destination DB instance, and the Migration machine separate. Combining these components on one host will impact DB migration performance and stability.
18. Can the destination PostgreSQL DB be queried during the migration process? Will we see the progress of the DB migration by querying the copied tables “on the fly”?
Although we can see live sessions on the Destination PostgreSQL DB during the migration, the actual data is committed only at the very end of the migration.
Querying any populated tables in the middle of a pgloader run is useless: their data resides in dirty, not-yet-committed blocks. Only once pgloader has finished can we see the actual records on the target DB by querying the copied tables.
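The live pgloader sessions on the destination can still be observed via the pg_stat_activity view (connection values are the same placeholders as above):

```shell
# Show sessions connected to the destination DB, including the
# COPY statements pgloader is currently executing
psql "postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name" -c \
  "SELECT pid, usename, state, left(query, 60) AS query
     FROM pg_stat_activity
    WHERE datname = 'dest_db_name';"
```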