Oracle to Postgres : Happily Ever After edition

mayur555b

mayur

Posted on October 19, 2024

Oracle to Postgres : Happily Ever After edition

Image description

Prelude :
The launch of the Oracle Autonomous Database ignited a lively debate about the future of the database administrator (DBA) role.

Image description

Oracle envisioned a world where the DBA would become obsolete with a snap of their fingers, just like Thanos.

However, Postgres said to Oracle “Hold my beer”.

Image description

Few years before that time, I made the leap from the Oracle world to Postgres, seeing Postgres as the Galactus of databases (The Eater of database worlds), ready to devour the competition.

Image description
PG : Eater of the database worlds.

Image description
Postgres Ecosystem : Credits to Vonng (Author of PIGSTY)

Image description
Credits : @dotnetschizo

The signs were everywhere. The two communities couldn’t have been more different. Oracle is flashy and glamorous, driven by the latest marketing buzz. In contrast, the Postgres community is a bubbling cauldron of innovation, with contributions flowing in from passionate developers around the globe. It was obvious to me that Postgres’s extreme extensibility would outshine any single company’s vision.

The ”Ever After” :

Image description

There’s been plenty of chatter about migrating from Oracle to Postgres, but most of the focus is on the planning and execution phases. Rarely do we hear about what happens afterward, like the happily-ever-after in a Disney movie.

Having been involved in numerous Oracle-to-Postgres migrations, I often find myself called in to address performance issues or to fix critical elements overlooked during the migration. So, I thought, why not have a blog or a talk dedicated to the “ever after” of these migrations? Let’s explore what happens once the dust settles!

Watch out for those Materialized Views

Image description

Oracle’s optimizer (planner), boasts a magical query rewrite option with materialized views. If the optimizer detects that any part of an SQL query matches a materialized view definition, it can seamlessly replace that portion of the SQL text with the materialized view. This clever trick saves significant time and resources typically spent on repetitive complex joins and aggregates on large tables.

Image description

Unfortunately, Postgres doesn’t have an equivalent to this Mview rewrite feature in its planner. This means that while migration tools can help convert code and schema, they can’t detect or provide alternatives to such internal rewrites happening within the database engine.

Imagine you’re celebrating a smooth Oracle to Postgres migration over the weekend, only to have your pager go off due to a critical incident in the weekly production batch. Oracle developers often leverage this shortcut for query optimization in batch processes or weekend-monthly jobs, where retrieving the most recent data isn’t a priority. Although daily OLTP loads might not benefit significantly from Mview query rewrites, less frequent or off-peak tasks often do. This is one reason why such issues can be hard to detect during pre-go-live smoke tests.

It’s always better to identify query rewrites while still on Oracle, before the migration, rather than scrambling to optimize a heavy batch process over the weekend post-migration to Postgres. You can easily enable or disable Mview rewrite, allowing you to establish a baseline during load testing that covers all application aspects with the rewrite on, and then conduct another test with the Mview rewrite disabled. This process will highlight any outliers, helping you identify which modules require performance optimization on Postgres. These optimizations can then be tackled in a stress-test environment, ensuring a smoother transition and fewer surprises post-migration.

Multi-node RAC <> Multi-node Patroni cluster

Migration tooling and planning often overlook this critical aspect of database architecture. A three-node Oracle RAC setup doesn’t directly translate to a three-node Postgres Patroni cluster, typically used as a high-availability (HA) solution. The main reason is that all three nodes in an Oracle RAC handle write traffic, whereas Postgres replicas are read-only.

Image description

This becomes a headache if your application heavily relies on global temporary tables, as Postgres replicas can’t drop, create, or truncate these tables. PGTT extension could help regarding reducing code conversion effort by mimicking Oracle’s global temporary table implementation in Postgres. However, it still creates a temporary table on the fly, the first time it is used in a session. Due to this even with PGTT you need primary for a code with temp table usage to work. Apart from being bottleneck in offloading some of the traffic to replica, frequent drop-create of temp tables causes Postgres catalog tables to bloat.

Oracle RAC implementations are usually designed to minimize global cache fusion (internode communication), meaning the application is logically modularized into multiple services, each interacting with a dedicated node. This setup is not just for high availability but also for load balancing.

Even if we employ load balancers on the Postgres side, there’s always a limit to how much load can be offloaded to replicas due to their read-only nature. Therefore, when converting a multi-node RAC to a Postgres HA cluster, it’s crucial to anticipate the additional load on the primary node and upgrade the hardware accordingly. This might mean better CPUs, higher throughput instance and higher limits in the connection pool. Moreover, concurrency-related issues might arise, necessitating thorough load testing before going live. So, brace for the load, test rigorously, and ensure your primary node is up to the challenge!

During the PG-Ibiza conference, someone asked why we don’t use logical replication to create a multi-active, multi-node cluster as a replacement for Oracle RAC. While it’s possible to use a flavor of Bi-Directional Replication (EDB-BDR,PgEdge,AWS Pg_Active) or a custom setup, you would need to manage conflicts, split-brain scenarios, logical replication bugs, and resynchronization issues. This blog by depesz’s highlights pain of logical replication at scale. This means transitioning from the relatively low-maintenance environment of Oracle RAC to a multi-active setup that requires significantly more attention and care. For obvious reasons, this approach is unlikely to receive immediate approval from CTOs or infrastructure managers at large companies. However, if you have a single cluster to manage, it is feasible to pursue this solution.

Scalar Subquery Caching

Oracle introduced the fascinating feature of scalar subquery caching starting with version 11g. This nifty feature helps mitigate developer mistakes by optimizing slow subqueries in production. As a result, suboptimal code in Oracle could go unnoticed for ages. But once you migrate to Postgres, it can hit you like a wrecking ball. If these issues arise in parts of the code triggered by specific datasets, they can be tough to detect during load testing.

Let me share an example. We had a tiny application with a small 150 MB database, running a single query most of the time. Despite its small size, it managed to bring an AWS c5.2xlarge instance (8 vCPUs, 16 GB RAM, SSD io1 storage) to its knees. There were frequent CPU spikes, the application encountered numerous timeout errors, and the server appeared overloaded. Developers blamed Postgres, pointing out that the same application had run smoothly on Oracle for months, even though the tables involved were only a few MBs in size.

Upon examining the problematic query, I found it to be a simple SELECT statement with a few Common Table Expressions (CTEs) referenced multiple times as scalar subqueries. The explain plan revealed that one particular CTE was the root cause. Thanks to Depesz’s tool, it was clear that most of the time was spent on the CTE scan. This happened because Postgres executed the CTE scan for every row produced by the main query. In Oracle, the same CTE scan was cached via scalar subquery caching, allowing it to narrowly escape detection for a long time.
Since this issue occurred on Postgres 13, we couldn’t use the enable_memoize option available in Postgres 14, which provides a similar feature. However, even with enable_memoize, we’ve noticed performance degradation in other databases and haven’t seen significant benefits in production systems. Also enable_memoize still has performance issues.

Solution was pretty simple, break large sql into tiny parts using temp table and index temp table of problematic CTE. There was further option to put this in an immutable function call and utilize caching in Postgres but developers seem happy with just temp table indexing.

I had to obfuscate plan, tables and query due to company security policy so below pictures are messy.

Offending query was select on a view and partial view definition:

Image description

You could see cte_policy_status being put in multiple scalar subqueries above.

Plan is big 300+ lines so I would just copy relevant part:

Image description

Image description

Problematic query took around 3–4 minutes per execution and there were multiple sessions concurrently executing it. You can see cte_policy_status being main bottleneck clearly, thanks to depesz’s tool.

Image description

Image description
Happily Ever After Version

Problematic query finished in milliseconds after splitting it using temp tables & indexing/analyzing temp tables, logical io reduced significantly and everyone lived happily ever after.

SQL Plan Baselines and Hints

Image description

Oracle’s dbms_spm guarantees SQL execution plan stability, a feature we lack in Postgres for locking down a specific execution plan for a SQL statement. Postgres does have pg_hint_plan, but it’s not quite the same as dbms_spm. After migration, SQLs with spm baselines could surprise you. However, most of these baselines are created to address missing statistics or incorrect timings for collecting stats. This can be resolved by reviewing the process flow and running ANALYZE on tables after any bulk data loads. For the remaining minor cases, pg_hint_plan can be a lifesaver. One example is the transactional outbox pattern. implementation in microservice architecture.

Image description

In this pattern, every transaction includes a corresponding insert or update in the outbox. The data volume continuously fluctuates due to deletions based on polling frequency, and the table is constantly read by a message relay. This is a perfect recipe for disaster as it generates many dead tuples. Autoanalyze may collect stats when the outbox table is empty, resulting in subsequent full scans. Tiny cloud instances provisioned for microservices can quickly hit I/O, throughput, and CPU limits. Increasing autovacuum frequency helps but at the cost of instance throughput and CPU capacity. Even so, with many such full scan queries running on a bloated outbox table, the database can become overloaded.

A simple index hint via pg_hint_plan on the outbox table’s queries can save DBAs from many sleepless nights. By proactively managing these potential pitfalls, you can ensure smoother operations and keep your database running efficiently.

Image description

Image description

To Err is human, To exception divine.

Image description

The reason for this distinction is evident in the PostgreSQL documentation, showcased in all its glory.

Oracle and PostgreSQL behave differently in their approach to exceptions. In Oracle, one can perform various mental gymnastics within the exception handling section, utilizing statement-level rollbacks and commits. While I won’t delve into the functional differences here, it’s worth noting that when developers attempt to replicate Oracle’s behavior in PostgreSQL, they often resort to savepoints or unnecessarily complicate the exception handling logic. Exception handling is internally via savepoint.

Using savepoints to mimic exception handling code can lead to catastrophic subtransaction SLRU lock wait events, potentially bringing your application to a complete halt.

The famous gitlab incident….

Image description

Their ruby code roughly translates to below flow.

Image description

Here it shows performance cliff caused by SLRU wait.

Image description

The good news is that starting with PostgreSQL 17, you can increase cache sizes related to SLRU, such as subtransaction_buffers, multixact_member_buffers, and notify_buffers, which helps mitigate the risk of such performance cliffs.

Additionally, you can use clever techniques to avoid using exceptions altogether.
For example: Instead of pk violation exception section you could use on conflict clause.

Image description

Added benefit, on conflict .. do nothing reduces dead tuples.

Last but not the least

At PGConf-EU 2023, I gave a lightning talk about the hidden objects that can cause portability and potential licensing issues when using cloud vendor-provided tools for Oracle (self-hosted) to Postgres (cloud) migration. I was surprised to find that many DBAs were unaware of these issues in cloud environments.

Image description

Image description

In just a year, the community has already come up with a solution. ExtensionMigrationAssistant is an open-source tool designed to detect cloud vendor-specific objects and functions in a migrated Postgres schema. Additionally, the creators have established a company, DataCloudGaze, to assist with end-to-end Oracle to Postgres migrations. This rapid response is a testament to Postgres’s vibrant and innovative community.

Image description

So, despite all the challenges moving out of Oracle, rest assured that Postgres’s dynamic community has got your back, constantly evolving and coming up with solutions to make your transition smoother and more efficient.

References :

https://blogs.oracle.com/infrastructure/post/how-the-autonomous-database-will-change-the-dba-rolefor-the-better

https://docs.oracle.com/en/database/oracle/oracle-database/19/racad/introduction-to-oracle-rac.html#GUID-D04AA2A7-2E68-4C5C-BD6E-36C62427B98E

https://explain.depesz.com/

https://www.datacloudgaze.com/post/extension-migration-assistance

https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/

https://microservices.io/patterns/data/transactional-outbox.html?source=post_page-----11a3c27c011f--------------------------------

https://www.depesz.com/2023/06/04/a-tale-about-incomplete-upgrade-from-postgresql-12-to-14/

💖 💪 🙅 🚩
mayur555b
mayur

Posted on October 19, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related