Troubleshooting PostgreSQL High CPU Usage - Part 2


Adam Furmanek

Posted on September 20, 2023


Last time we looked at the possible reasons for high CPU usage. In this blog post we are going to walk through troubleshooting scenarios and queries that can help us identify and fix the issue.

How to troubleshoot - overview

We need to keep in mind that high CPU usage can be caused by various unrelated things. It could be an expensive operation, memory issues, false sharing, or something else entirely. We saw multiple reasons in the previous part of this series.

We need to check multiple things and verify our assumptions at each stage. It’s worth thinking about recent deployments, weekly patterns, differences in the running configuration, locale-specific details, and background processes (CRON jobs, daemons, autovacuum).

In the rest of this post we are going to see various things that we can try.

Active and idle connections

The first thing to check is how many connections we have in the database.

Number of connections

We can use the following query:

WITH connections AS (
    SELECT
        COUNT(*) AS total_connections,
        SUM(CASE WHEN state != 'idle' THEN 1 ELSE 0 END) AS non_idle_connections
    FROM pg_stat_activity
),
max_connections AS (
    SELECT setting AS max_connections
    FROM pg_settings
    WHERE name = 'max_connections'
)
SELECT
    C.total_connections,
    C.non_idle_connections,
    MC.max_connections,
    ROUND((100 * C.total_connections::NUMERIC / MC.max_connections::NUMERIC), 2) AS connections_utilization_percentage
FROM connections AS C
CROSS JOIN max_connections AS MC

Example output:

+--------------------+-----------------------+------------------+------------------------------------+
| total_connections  | non_idle_connections  | max_connections  | connections_utilization_percentage |
+--------------------+-----------------------+------------------+------------------------------------+
|                 8  |                    1  |             100  |                                8.0 |
+--------------------+-----------------------+------------------+------------------------------------+

We can see the number of connections and how many of them are idle. You can also check the maximum number of available connections with the following query:

show max_connections;

This will give you the maximum number of connections, like below:

+-----------------+
| max_connections |
+-----------------+
|             100 |
+-----------------+

Typically, the default is one hundred connections. You can configure it to match your needs, but changing it requires restarting your PostgreSQL instance. The more connections you allow, the higher the memory usage, so try to keep the limit just above what you actually need.
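
If you do need to raise the limit, you can change the setting with ALTER SYSTEM and restart the instance. A minimal sketch, where the value of 200 is just an example:

-- Example only: raise the limit to 200 connections.
-- max_connections cannot be changed at runtime, so a restart is required afterwards.
ALTER SYSTEM SET max_connections = 200;
-- After the restart, verify the new value:
SHOW max_connections;

Many setups also put a connection pooler such as PgBouncer in front of the database instead of raising max_connections very high.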

Connections per database

The next thing to check is which databases the connections are using. You can use the following query:

SELECT
    datname AS database_name,
    COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname
ORDER BY 2

Example output:

+----------------+----------------------+
| database_name  | non_idle_connections |
+----------------+----------------------+
| demo           |                    1 |
+----------------+----------------------+

You can also get a similar result broken down by the executed queries:

SELECT
    datname AS database_name,
    query,
    COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname, query
ORDER BY 3 DESC

The query gives the following output:

+----------------+------------------------------+----------------------+
| database_name  |            query             | non_idle_connections |
+----------------+------------------------------+----------------------+
| demo           | SELECT * FROM demo.bookings  |                    1 |
+----------------+------------------------------+----------------------+

This gives you the running queries at a given point in time. However, the mere fact that a query is running may not be a problem. We probably want to find long-running queries, which we can do with the following query:

SELECT
    datname AS database_name,
    query,
    now() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 3 DESC

Example output:

+----------------+------------------------------+-----------------+
| database_name  |            query             |     runtime     |
+----------------+------------------------------+-----------------+
| demo           | SELECT * FROM demo.bookings  | 00:00:10.145824 |
+----------------+------------------------------+-----------------+

This clearly shows which queries have been running for a long time. A query that has been running for many seconds (or minutes, or hours) is probably worth investigating further.
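
If you also want long-running queries to show up in the server log, one option is log_min_duration_statement; the 5-second threshold below is just an example:

-- Log every statement that takes longer than 5000 ms (example threshold).
ALTER SYSTEM SET log_min_duration_statement = '5000';
-- This setting only needs a configuration reload, not a restart:
SELECT pg_reload_conf();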

Connections waiting for a lock

You can see connections that are waiting:

SELECT A.pid, A.query
FROM pg_stat_activity AS A
JOIN pg_locks AS L on L.pid = A.pid
WHERE L.granted = false
GROUP BY 1, 2

This gives the following result:

+--------+-----------------------------+
|  pid   |            query            |
+--------+-----------------------------+
| 27621  | SELECT * FROM demo.bookings |
+--------+-----------------------------+

By checking this you can figure out which queries are slowed down by other transactions.
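
To see which session actually holds the lock, you can combine pg_stat_activity with pg_blocking_pids() (available since PostgreSQL 9.6). A minimal sketch:

SELECT
    A.pid AS blocked_pid,
    A.query AS blocked_query,
    pg_blocking_pids(A.pid) AS blocking_pids
FROM pg_stat_activity AS A
WHERE CARDINALITY(pg_blocking_pids(A.pid)) > 0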

Queries

Now we can focus on queries to figure out which ones are slow.

Queries that run often

Let’s see how many times a given query has been executed:

SELECT
    SUBSTR(query, 1, 100) AS query,
    calls,
    total_exec_time,
    total_exec_time / calls AS call_time
FROM pg_stat_statements
ORDER BY 4 DESC

This gives the following sample output:


+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+
|                                                query                                                  | calls  |  total_exec_time    |     call_time      |
+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+
| EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE, COSTS, FORMAT JSON) SELECT "tconst", "titlet              |     2  |        3970.404005  |       1985.2020025 |
| SELECT "tconst", "titletype", "primarytitle", "originaltitle", "isadult", "startyear", "endyear", "r  |     2  | 3950.9494290000002  | 1975.4747145000001 |
+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+

The query uses statistics from the pg_stat_statements extension.
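
If pg_stat_statements is not enabled yet, it has to be preloaded and created as an extension; changing shared_preload_libraries requires a restart:

-- In postgresql.conf (or via ALTER SYSTEM), then restart the instance:
-- shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;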

Metis can show you queries with their statistics at a glance.


Long transactions

You can see long transactions with the following:

SELECT now() - xact_start AS transaction_time, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction','active')
ORDER BY 1 DESC

Output:

+-------------------+-----------------------------+
| transaction_time  |            query            |
+-------------------+-----------------------------+
| 00:00:07          | SELECT * FROM demo.bookings |
+-------------------+-----------------------------+

This shows which transactions have been running for a long time. They are likely contributing to high CPU usage because of the amount of work they need to do.
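
If a long transaction clearly should not be running anymore, you can stop it by the pid reported in pg_stat_activity; the pid below is just the example from earlier:

-- Cancel the currently running query; the connection itself stays open:
SELECT pg_cancel_backend(27621);
-- Or terminate the whole backend if cancelling is not enough:
SELECT pg_terminate_backend(27621);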

Metis can show you the runtime of each query you execute.


Database hygiene

Several things can lead to an ongoing degradation of the database. Let’s look at some of them.

Statistics

Outdated statistics can cause high CPU usage as well. The query planner may generate inefficient query plans that lead to much longer execution times. You can find tables with outdated statistics using this query (change the schema name to match your setup):

SELECT
    schemaname,
    relname,
    DATE_TRUNC('minute', last_analyze) last_analyze,
    DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY last_analyze DESC NULLS FIRST, last_autoanalyze DESC NULLS FIRST;

Output:

+-------------+-----------------+---------------+---------------------+
| schemaname  |    relname      | last_analyze  |  last_autoanalyze   |
+-------------+-----------------+---------------+---------------------+
| bookings    | aircrafts_data  | (null)        | (null)              |
| bookings    | ticket_flights  | (null)        | 2023-04-06 11:13:00 |
| bookings    | tickets         | (null)        | 2023-04-06 11:12:00 |
+-------------+-----------------+---------------+---------------------+
Enter fullscreen mode Exit fullscreen mode

You can update your statistics manually with this query:

ANALYZE TABLE_NAME
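
For example, using one of the tables from the output above, or the whole current database when no table is specified:

-- Refresh statistics for a single table:
ANALYZE bookings.tickets;
-- Or analyze every table in the current database:
ANALYZE;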

Dead rows

Deleted rows may stay on disk as dead tuples to be removed later on. PostgreSQL removes them during the vacuuming process. You can see how many rows are dead with the following query (change the schema name to match your setup):

SELECT
    schemaname,
    relname,  
    n_dead_tup,
    DATE_TRUNC('minute', last_vacuum) last_vacuum,
    DATE_TRUNC('minute', last_autovacuum) last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY n_dead_tup desc;

This gives the following output:

+-------------+------------------+-------------+--------------+---------------------+
| schemaname  |     relname      | n_dead_tup  | last_vacuum  |   last_autovacuum   |
+-------------+------------------+-------------+--------------+---------------------+
| bookings    | boarding_passes  |          0  | (null)       | 2023-04-06 11:11:00 |
| bookings    | tickets          |          0  | (null)       | 2023-04-06 11:11:00 |
| bookings    | airports_data    |          0  | (null)       | (null)              |
+-------------+------------------+-------------+--------------+---------------------+
Enter fullscreen mode Exit fullscreen mode

You can vacuum your table with the following query:

VACUUM TABLE_NAME
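
For example, using one of the tables from the output above; adding ANALYZE refreshes the statistics in the same pass and VERBOSE prints a progress report:

VACUUM (VERBOSE, ANALYZE) bookings.boarding_passes;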

Metis can show you the number of dead rows at a glance.


Indexes

Unused indexes lead to a higher number of operations: on every write the database needs to update the table and also all of its indexes.

You can see how indexes are used with the following query:

SELECT
    idstat.relname AS table_name,
    indexrelname AS index_name,
    idstat.idx_scan AS index_scans_count,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    tabstat.idx_scan AS table_reads_index_count,
    tabstat.seq_scan AS table_reads_seq_count,
    tabstat.seq_scan + tabstat.idx_scan AS table_reads_count,
    n_tup_upd + n_tup_ins + n_tup_del AS table_writes_count,
    pg_size_pretty(pg_relation_size(idstat.relid)) AS table_size
FROM pg_stat_user_indexes AS idstat
JOIN pg_indexes ON idstat.indexrelname = pg_indexes.indexname AND idstat.schemaname = pg_indexes.schemaname
JOIN pg_stat_user_tables AS tabstat ON idstat.relid = tabstat.relid
WHERE indexdef !~* 'unique'
ORDER BY idstat.idx_scan DESC, pg_relation_size(indexrelid) DESC

Output:

+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+
|      table_name        |    index_name      | index_scans_count  | index_size  | table_reads_index_count  | table_reads_seq_count  | table_reads_count  | table_writes_count  | table_size |
+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+
| title_ratings_indexed  | title_ratings_idx  |                 0  | 28 MB       |                       0  |                     2  |                 2  |            1297467  | 74 MB      |
+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+

Metis can show you unused indexes at a glance.


Before dropping an index, make sure it is really unused. It is possible that it’s not used in your primary database, but is used on a replica. Double-check all your databases.
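
Once you are confident an index is unused everywhere, you can drop it; the index name below is the example from the output above:

-- CONCURRENTLY avoids blocking writes to the table while the index is dropped;
-- note that it cannot run inside a transaction block.
DROP INDEX CONCURRENTLY title_ratings_idx;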

You may need to reset the index usage statistics to get meaningful results if an index was used in the past but isn’t anymore. You can do that with the following query:

SELECT pg_stat_reset()
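
Keep in mind that pg_stat_reset() clears all statistics counters for the current database. If you only want to reset the counters of a single table or index, a narrower option is pg_stat_reset_single_table_counters(); the index name below is the example from above:

-- Reset usage counters only for one index:
SELECT pg_stat_reset_single_table_counters('title_ratings_idx'::regclass);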

Summary

We have seen the reasons for high CPU usage and how to investigate them. We can also rely on automated tools like Metis to identify these issues for us. In the next parts we are going to see how Metis helped with issues in real-life scenarios.
