From Zero to Hero: Disaster Recovery for PostgreSQL with Streaming Replication in Kubernetes


Sergey Pronin

Posted on May 17, 2024


In today’s digital landscape, disaster recovery is essential for any business. As our dependence on data grows, the impact of system outages or data loss becomes more severe, leading to major business interruptions and financial setbacks.

Managing disaster recovery becomes even more complex with multi-cloud or multi-regional PostgreSQL deployments. Percona Operators offer a solution to simplify this process for PostgreSQL clusters running on Kubernetes. The Operator allows businesses to handle multi-cloud or hybrid-cloud PostgreSQL deployments effortlessly, ensuring that crucial data remains accessible and secure, no matter the circumstances.

This article will guide you through setting up disaster recovery using Percona Operator for PostgreSQL and streaming replication.

Design

The design is simple:

  • Two sites: Main and DR (disaster recovery).
    • These can be two regions, two data centers, or even two namespaces.
  • Each site runs its own Operator and a PostgreSQL cluster.
    • In the DR site the cluster runs in standby mode.
    • Streaming replication is set up between the two clusters.

Figure: two sites, Main and DR, each running an Operator and a PostgreSQL cluster, with streaming replication from the Main primary to the DR standby.

Set it up

All examples in this blog post are, as usual, available in the blog-data/pg-k8s-streaming-dr GitHub repository.

Prerequisites:

  • Kubernetes cluster or clusters (depending on your topology)
  • Percona Operator for PostgreSQL deployed.
    • See quickstart guides.
    • Or just use the bundle.yaml that I have in the repository above:


kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/pg-k8s-streaming-dr/bundle.yaml


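Before deploying any clusters, it is worth confirming that the Operator is up. A quick sanity check; the Deployment name below is what the bundle in my repository creates, so adjust it if you installed the Operator differently:


kubectl get deployment percona-postgresql-operator
kubectl wait --for=condition=Available deployment/percona-postgresql-operator --timeout=120s
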

Primary

The only thing specific to the Main cluster is that you need to expose it, so that the Standby cluster can connect to the primary node. To expose the primary node, use the spec.expose section:



spec:
  ...
  expose:
    type: ClusterIP



Use a Service type of your choice. In my case, the two clusters live in different namespaces of the same Kubernetes cluster, so ClusterIP is sufficient. Deploy the cluster as usual:



kubectl apply -f main-cr.yaml -n main-pg


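Give the cluster a few minutes to come up; the simplest way to wait is to watch the Pods until they are all Running:


kubectl get pods -n main-pg --watch
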

The Service that the Standby cluster should connect to is called <clustername>-ha (main-ha in my case).
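You can confirm it by listing the Services in the Main cluster's namespace (namespace name is from my setup):


kubectl get svc main-ha -n main-pg
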



main-ha          ClusterIP   10.118.227.214   <none>        5432/TCP   163m



Standby

TLS certificates

For replication to work, the Standby cluster needs to authenticate with the Main one. To make that possible, both clusters must have certificates signed by the same certificate authority (CA). The default replication user, _crunchyrepl, will be used.

In the simplest case, you can copy the certificates from the Main cluster. You need two Secrets:

  • main-cluster-cert
  • main-replication-cert

Copy them to the namespace where the DR cluster is going to run and reference them under spec.secrets (I renamed them, replacing main with dr):



spec:
  secrets:
    customTLSSecret:
      name: dr-cluster-cert
    customReplicationTLSSecret:
      name: dr-replication-cert


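If both namespaces live in the same Kubernetes cluster, one way to copy and rename the Secrets is a jq pipeline. This is a sketch, assuming jq is installed; it strips the metadata fields that are tied to the original object (including ownerReferences, so the copy is not garbage-collected with the Main cluster):


kubectl get secret main-cluster-cert -n main-pg -o json \
  | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion,
            .metadata.creationTimestamp, .metadata.ownerReferences)
        | .metadata.name = "dr-cluster-cert"' \
  | kubectl apply -n dr-pg -f -

kubectl get secret main-replication-cert -n main-pg -o json \
  | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion,
            .metadata.creationTimestamp, .metadata.ownerReferences)
        | .metadata.name = "dr-replication-cert"' \
  | kubectl apply -n dr-pg -f -
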

If you are generating your own certificates, just remember the following rules:

  • Certificates for both Main and Standby clusters must be signed by the same CA
  • customReplicationTLSSecret must have a Common Name (CN) that matches _crunchyrepl, the default replication user.

Read more about certificates in the documentation.

Configuration

Apart from setting the certificates correctly, you also need to add the standby configuration:



  standby:
    enabled: true
    host: main-ha.main-pg.svc


  • standby.enabled controls whether the cluster runs as a standby
  • standby.host must point to the primary node of the Main cluster. In my case it is the main-ha Service in another namespace, addressed as <service>.<namespace>.svc.

Deploy the DR cluster:



kubectl apply -f dr-cr.yaml -n dr-pg

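To watch the DR cluster come up, you can check the Custom Resource and the Pods. A quick sketch; perconapgcluster is the resource name the v2 CRD registers in my environment:


kubectl get perconapgcluster -n dr-pg
kubectl get pods -n dr-pg
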




Verify

Once both clusters are up, you can verify that replication is working.

  • Insert some data into the Main cluster
  • Connect to the DR cluster and check that the data is there

To connect to the DR cluster, use the same credentials that you used for the Main cluster; this also verifies that authentication is working. Whatever data you have in the Main cluster should appear in the Disaster Recovery one.
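Here is a minimal sketch of that check. The user Secret name (main-pguser-main), the user and database name (main), and the DR Service name (dr-ha) are assumptions based on my cluster names; adjust them to your setup:


# Grab the password for the default user from the Main cluster's user Secret
PGPASSWORD=$(kubectl get secret main-pguser-main -n main-pg \
  -o jsonpath='{.data.password}' | base64 -d)

# Write a row on the Main cluster
kubectl run pg-client -n main-pg --rm -it --restart=Never --image=postgres:16 \
  --env="PGPASSWORD=$PGPASSWORD" -- \
  psql -h main-ha.main-pg.svc -U main -d main \
  -c "CREATE TABLE IF NOT EXISTS dr_smoke (id int); INSERT INTO dr_smoke VALUES (1);"

# Read it back on the (read-only) DR cluster
kubectl run pg-client -n dr-pg --rm -it --restart=Never --image=postgres:16 \
  --env="PGPASSWORD=$PGPASSWORD" -- \
  psql -h dr-ha.dr-pg.svc -U main -d main -c "SELECT * FROM dr_smoke;"


If the SELECT returns the row you just inserted, streaming replication is working end to end. Keep in mind there can be a short replication delay before the row shows up on the standby.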

Conclusion

Disaster recovery is crucial for maintaining business continuity in today's data-driven environment. Implementing a robust disaster recovery strategy for multi-cloud or multi-regional PostgreSQL deployments can be complex. However, the Percona Operator for PostgreSQL simplifies this process by enabling seamless management of PostgreSQL clusters on Kubernetes. By following the steps outlined in this article, you can set up disaster recovery using Percona Operator and streaming replication, ensuring your critical data remains secure and accessible. This approach not only provides peace of mind but also safeguards against significant business disruptions and financial losses.

At Percona, we are aiming to provide the best open source databases and tooling possible. As the next level of simplification and user experience for databases on Kubernetes, we recently released Percona Everest (currently in Beta). It is a cloud native database platform with a slick UI. It deploys and manages databases on Kubernetes for you without the need to look into YAML manifests.

Try Percona Operator for PostgreSQL | Try Percona Everest
