How to Migrate from MySQL to PostgreSQL RDBMS?

dm8ry

Dmitry Romanoff

Posted on September 21, 2022


1. Introduction to Migrating MySQL to PostgreSQL.

In this post, I'll describe how to migrate database(s) from MySQL to PostgreSQL.

It is essential to create a reliable, stable migration process with zero data loss and minimal downtime, one capable of migrating DBs of tens or hundreds of GBs while guaranteeing that existing applications can work transparently with the migrated DBs.

The MySQL2PG process is based on the open-source pgloader utility [https://github.com/dimitri/pgloader] to address this demand.

2. What are the components/topology to migrate DBs from MySQL to PostgreSQL?


The basic topology of the DB migration process includes the Source database (MySQL), the Destination database (PostgreSQL), and the Migration machine. The pgloader runs from the Migration machine, reading data from the Source database (MySQL) and writing it to the Destination database (PostgreSQL).

The Migration machine should have direct, reliable, and high throughput connectivity to the source and destination databases. The typical configuration of the Migration machine includes 16 CPUs and 64 GB RAM, network bandwidth of at least 10 Gbps, and storage of at least 500GB.

To guarantee compatibility of DB objects, their structure, and their mapping to specific data types, the DB migration process first runs DDL scripts suitable for the particular product type/version. It then copies the data into the prepared target DB schema.

pgloader does not export data to a dump file, so it neither places a dump on the Migration machine nor occupies storage space with the DBs' data. Instead, it creates a pipeline in the Migration machine's RAM, reading data from the Source DB and copying it into the Destination DB.

3. What would be a typical OS for the Migration machine?

OS: Ubuntu 20.04

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

4. What is recommended to install on the Migration machine?



sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install postgresql-13
sudo apt-get install -y pgloader
sudo apt-get install mysql-client


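After the packages are installed, a quick sanity check (a hypothetical convenience loop, not part of the original procedure) confirms that the required client tools are on the PATH of the Migration machine:

```shell
# Report whether each required client tool is installed on the Migration machine.
for tool in pgloader psql mysql; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: MISSING - install it before starting the migration"
  fi
done
```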

5. How to run pgloader to migrate a DB from MySQL to PostgreSQL?

Create pgloader configuration file:



cat pgloader.conf

LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;



Run pgloader:



pgloader pgloader.conf



6. How to run MySQL2PG pgloader as a background process?

To avoid possible interruption of the MySQL2PG migration process it’s recommended to run pgloader as a background process. The procedure includes the following steps.

Create pgloader configuration file:



cat pgloader.conf

LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;



Create a shell script to run pgloader as a background process:



cat run_pgloader_background.sh

nohup pgloader pgloader.conf 2>&1 &



Run the script:



sh run_pgloader_background.sh



The run trace will be written to the file:



nohup.out


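To check on the background run, a couple of shell commands are enough. This is a sketch; `nohup.out` is the default trace file produced by the step above:

```shell
# Show the tail of the trace and whether pgloader is still alive.
if [ -f nohup.out ]; then
  echo "--- last trace lines ---"
  tail -n 5 nohup.out
else
  echo "nohup.out not found - has the migration been started?"
fi

if pgrep -x pgloader >/dev/null 2>&1; then
  echo "pgloader is running"
else
  echo "pgloader is not running"
fi
```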

NOTES:

Usually



src_db_user = dest_db_user,
src_db_pwd = dest_db_pwd,
src_db_name = dest_db_name




7. What does a typical output trace of pgloader look like?




2022-01-08T20:47:55.046000+02:00 LOG report summary reset
                     table name     errors       read   imported      bytes      total time       read       write
                fetch meta data          0        278        278                     0.808s
                 Create Schemas          0          0          0                     0.018s
               Create SQL Types          0          0          0                     0.010s
                  Create tables          0        144        144                     1.137s
                 Set Table OIDs          0         72         72                     0.011s
        my_db.my_table_number_1          0  319387354  319387354    38.0 GB      37m40.265s  37m40.242s  28m41.397s
        my_db.my_table_number_2          0   15748659   15748659     1.9 GB       1m50.060s   1m50.039s   1m22.053s
        my_db.my_table_number_3          0    3989089    3989089   336.5 MB         20.059s     20.009s     14.337s
    (etc)
        COPY Threads Completion          0          4          4                 45m10.569s
                 Create Indexes          0        165        165                1h2m48.214s
         Index Build Completion          0        165        165                  6m12.345s
                Reset Sequences          0          1          1                     0.123s
                   Primary Keys          0         69         69                     0.789s
            Create Foreign Keys          0         41         41                  1m23.456s
                Create Triggers          0          0          0                     0.005s
                Set Search Path          0          1          1                     0.012s
               Install Comments          0          0          0                     0.000s
              Total import time          ✓  363230932  363230932   123.4 GB    1h23m45.678s



8. How to ensure the migration has finished successfully, with zero data loss?

Verify the pgloader trace: it should report zero errors, and the read and imported values should match for every table.
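Beyond the trace, a per-table row-count comparison between source and destination gives an extra zero-data-loss check. The sketch below is a hypothetical helper; the connection strings and table name in the comments are placeholders, not values from this post:

```shell
# Compare a source row count with a destination row count (sketch).
compare_counts() {
  table="$1"; src="$2"; dst="$3"
  if [ "$src" -eq "$dst" ]; then
    echo "$table: OK ($src rows on both sides)"
  else
    echo "$table: MISMATCH (source=$src, destination=$dst)"
  fi
}

# Fetching the counts with the client tools would look roughly like this
# (hosts, credentials, and table name are placeholders):
#   src=$(mysql -h src_db_host -u src_db_user -p -N -e \
#         "SELECT COUNT(*) FROM my_table_number_1" src_db_name)
#   dst=$(psql "postgresql://dest_db_user@dest_db_host/dest_db_name" -At -c \
#         "SELECT COUNT(*) FROM my_table_number_1")
#   compare_counts my_table_number_1 "$src" "$dst"

compare_counts demo_table 42 42   # → demo_table: OK (42 rows on both sides)
```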

9. How do I exclude tables from the pgloader process?

In case the source DB contains tables that can be safely excluded from the migration process, this can be done via the “EXCLUDING TABLE NAMES MATCHING” configuration of pgloader.

Example:



LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three';




10. The “Heap exhausted, game over” issue.

The “Heap exhausted, game over” message from pgloader may indicate that the default 4 GB of RAM dedicated to pgloader is not enough for a given DB migration run.

The solution in such a case is to use a pgloader binary compiled with a larger DYNSIZE parameter. The DYNSIZE parameter modifies the default amount of memory the pgloader image will allow itself to use when running through data.

11. How to compile the pgloader to use more RAM?

The pgloader utility does not use all the available RAM on the host machine; the amount of RAM pgloader can use is fixed at compilation time. To dedicate more RAM, a customized version of pgloader must be compiled. The following step-by-step guide shows how to compile pgloader with a customized DYNSIZE parameter, which sets the amount of memory the pgloader image will allow itself to use when running through data.


Build pgloader from sources:

(1) Check OS:

dima@dima-VirtualBox:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
dima@dima-VirtualBox:~$

(2) Install git:

sudo apt update

sudo apt install git

(3) Clone pgloader:

mkdir my_pgloader
cd my_pgloader/
git clone https://github.com/dimitri/pgloader.git

(4) Install the packages required for the build:

sudo apt-get install sbcl unzip libsqlite3-dev make curl gawk freetds-dev libzip-dev

(5) Build pgloader, setting DYNSIZE (in MB) to the desired RAM limit:

make DYNSIZE=10240 pgloader

(6) make outputs a ./build/bin/pgloader file for us to use.

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ pwd
/home/dima/my_pgloader/pgloader/build/bin
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 69160
-rw-rw-r-- 1 70 Jan 7 17:17 .gitignore
drwxrwxr-x 5 4096 Jan 7 17:29 ..
-rwxr-xr-x 1 41918800 Jan 7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan 7 17:29 pgloader
drwxrwxr-x 2 4096 Jan 7 17:29 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$

cp pgloader pgloader_dima_dynsize_10240

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 97396
-rw-rw-r-- 1 70 Jan 7 17:17 .gitignore
drwxrwxr-x 5 4096 Jan 7 17:29 ..
-rwxr-xr-x 1 41918800 Jan 7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan 7 17:29 pgloader
-rwxr-xr-x 1 29036008 Jan 7 17:33 pgloader_dima_dynsize_10240
drwxrwxr-x 2 4096 Jan 7 17:33 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ./pgloader_dima_dynsize_10240 --version
pgloader version "3.6.a94a0a3"
compiled with SBCL 2.0.1.debian
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$




12. How to manage the “Connection reset by peer” pgloader issue?

The “Connection reset by peer” error message during the pgloader run usually indicates timeouts of the pgloader connecting to MySQL [ the Source DB ].

To manage this issue, we recommend adding the following parameters to the pgloader configuration script:



SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'




13. How to shorten the duration of the pgloader migration process?

There are a few approaches to shortening the duration of the pgloader migration process:

– Scale up the source and destination databases
– Ensure there is no heavy load, and that there are enough storage, RAM, connections, and other resources on both the source and destination databases
– Use pgloader compiled with a customized DYNSIZE parameter
– Scale up the Migration machine to allow more RAM
– Run pgloader with the following parameters:



SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'




14. How to configure pgloader to run stably and reliably on big databases / big amounts of data?

– Use pgloader compiled with a customized DYNSIZE parameter; it allows dedicating more RAM to the pgloader process
– Use the following pgloader configuration:



cat pgloader.conf

LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name

WITH

data only, create no indexes,

workers = 8, concurrency = 1,
multiple readers per thread, rows per range = 10000,
batch rows = 10000

SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'

SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'

ALTER SCHEMA 'db_schema_name' RENAME TO 'public';




15. What does a minimal pgloader configuration look like?




LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name

with
batch size = 2048 kB,
batch rows = 2000,
prefetch rows = 2000

SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'

SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'

ALTER SCHEMA 'db_schema_name' RENAME TO 'public';




16. Why is it important to monitor the source database, the destination database, and the Migration machine before, during, and after the process?

To guarantee the stable, reliable, and optimal work of pgloader, it's vital to monitor the source and destination databases before, during, and after the migration. The DB machines should be powerful enough and have sufficient resources to sustain pgloader's intensive activity. Neither the source nor the destination DB machine should be loaded with heavy activity, intensive CPU operations, or heavy reads and/or writes.

Note: any load or heavy activity on the Source or the Destination DB machine directly impacts the DB migration’s performance and success.

In addition, it’s important to monitor the CPU, IOPs, memory, and network of the Migration machine during the process.
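On the Migration machine itself, even a minimal sampling loop helps spot resource exhaustion early. This is only a sketch: the interval, sample count, and chosen commands are arbitrary, and tools like vmstat or iostat (if installed) give richer CPU/IO detail:

```shell
# Take a few timestamped snapshots of storage headroom while pgloader runs.
for i in 1 2 3; do
  date                  # when the sample was taken
  df -h . | tail -n 1   # free storage on the working filesystem
  sleep 1
done
# Also worth watching: CPU and memory (e.g. vmstat 5) and network throughput.
```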

17. The importance of separating the Source DB, the Destination DB, and the Migration machines.

It's important to keep the Source DB instance, the Destination DB instance, and the Migration machine separate. Co-locating these components will degrade DB migration performance and stability.

18. Can the destination PostgreSQL DB be queried during the migration process? Will we see the progress of the DB migration by querying the copied tables “on the fly”?

Although we can see live sessions on the Destination PostgreSQL DB during the migration, the actual data is committed only at the very end of the migration.

Querying the populated tables in the middle of a pgloader run is useless: their data sits in dirty, not-yet-committed blocks. Only once pgloader has finished can we see the actual records in the target DB by querying “select * from <table_name>”.
