Counting faster with Postgres

rettx

Arctype Team

Posted on May 4, 2022

Counting faster with Postgres

Count queries are probably the most-used aggregate queries in a relational database. Counting is a fundamental operation required for many CRUD applications like sorting, pagination, and searching. But counting records can become terribly slow as your dataset grows. Fortunately, there are strategies for dealing with this problem. This article will show you a few approaches.

Data setup

Since we are just exploring count queries, we do not need an extensive data setup. We can just create a simple table with just one column. You can do that with the commands below:



CREATE TABLE sample_data (id int);
INSERT INTO sample_data SELECT generate_series(1, 200000);


Enter fullscreen mode Exit fullscreen mode

Sample Data Prep

Simple count query

Let's understand the query plan of a simple count query.

Simple Count Query Plan

This runs a rudimentary Seq Scan on the table. If you are not familiar with reading query plans, then I would highly recommend reading the article linked below:

To understand about Scans in general, you might also find this article helpful:

Parallel processing for counts

PostgreSQL does not use parallel processing since the row count is too small, and using parallel workers might slow the counting process. Let's insert more data into our sample table and enable PostgreSQL to use more workers.



INSERT INTO sample_data SELECT generate_series(200000, 4000000);
SET max_parallel_workers_per_gather = 4;


Enter fullscreen mode Exit fullscreen mode

And then try to analyze the plan, as shown below:

This uses three workers out of the four we have configured. Parallel queries are a no-brainer. They just throw hardware at the problem to enable faster query execution. But in a transactional database, we cannot simply rely on parallel queries - as you can see in this example, it still takes 342 milliseconds.

Why indexes don't help with plain count

The first thing that any database user would do is add an index to speed up a query. Let's try that here.



CREATE INDEX id_btree ON sample_data USING BTREE(id)


Enter fullscreen mode Exit fullscreen mode

Unlike other queries, the index does not help here.

Counting With An Index

This is because the count has to touch all of the rows of the table. An index helps if there is a where clause but otherwise, scanning the index will be slow. We can understand that when we turn off Seq scan.



SET enable_seqscan = OFF;


Enter fullscreen mode Exit fullscreen mode

Slow Index Only Scan

Counting nodes in a BTree data structure takes O(n), where n is the number of rows and also takes extra memory - O(h) where h is the height of the tree. There are also increased access costs by directly accessing the index, due to which a sequential scan over normal tuples/table data is preferable.

Counting with the Where clause

Once there is a where clause involved, then it is very easy for the planner to use the index since it cuts down a lot of rows from the result. Let's consider a full index and a partial index.

Full index

Since we already have a full index, a count query with a where clause on the id the column would be very fast.



SELECT COUNT(id) FROM sample_data WHERE id = 200000


Enter fullscreen mode Exit fullscreen mode

Partial Index

Partial index

If we are going to count rows only with a specific where clause, then we can use a partial index.



CREATE INDEX id_btree ON sample_data USING BTREE(id) WHERE id = 200000;


Enter fullscreen mode Exit fullscreen mode

Partial Index

Partial indexes are faster, easier to cache because of their size, and easier to maintain.

Distinct vs. duplicate counts

By default, count query counts everything, including duplicates. Let's touch upon distinct, which is often used alongside count.



SELECT DISTINCT(id) FROM sample_data


Enter fullscreen mode Exit fullscreen mode

Distinct

This command uses an index-only scan but still takes around 3.5 seconds. The speed depends on many factors, including cardinality, size of the table, and whether the index is cached. Now, let's try to count the number of unique rows in the table.



SELECT COUNT(DISTINCT(id)) FROM sample_data


Enter fullscreen mode Exit fullscreen mode

Count Distinct

This command does not use the index and resorts to sequential scan. As a planner, you'd be correct in choosing a sequential scan, since walking over an index completely is more costly, as we already saw above.

The rule of thumb in relational database systems is to never deal with distinct queries. We can never fully get away from distinct, but with proper database modeling (moving duplicate rows to a different table and using a one-to-many relationship, storing unique counts separately, etc.), a query with a where clause will be faster.

Approximate counting

In most real-world use cases, we would never want the exact row count of a table. We might want to count specific rows with a where clause (which can be solved with an index), but never a full row count. A typical OLTP workload does not add a million rows in a day. Maybe a few thousand rows split across different time windows. There are ways in PostgreSQL to get an approximate row count instead of an actual row count and still be ok with the business use case.

Getting approx counts with statistics

The cost-based query planner in PostgreSQL uses the statistics information for planning queries. We can utilize this information to give an approximate row count.



SELECT
    reltuples::bigint AS estimate
FROM
    pg_class
WHERE
    oid = 'public.sample_data' ::regclass;


Enter fullscreen mode Exit fullscreen mode

This will return the row count as 4000001. We can also run a full count query to verify that this row count is accurate. This might not be accurate in production workloads depending on when the VACCUM ANALYZE runs. We see that in the next section.

Keeping the statistics up to date

Discussing the internals of the Vaccuum process is beyond the scope of this blog, but for most OLTP workloads, the default Auto Vacuum settings work just fine. If there is any huge data load on the table, we can do manual Vacuum Analyze <table_name> if required. The Vacuum process does what it is named after. It cleans up and updates the statistics table with the latest and most accurate information.

Re-thinking the problem and external solutions

Keeping accurate table counts need not necessarily be a database problem alone. Various other ways can be used to keep a count. Let's go through a few simple ideas.

Using a cache at the application level

If the application is a simple two-tier app with only UI and some form of backend, then we can use a caching layer such as EH Cache or Cache tools to maintain the count of rows as and when we insert it. These caches can be backed by persistence so that the data is not lost. Caches are lightweight and pretty fast. Alternatively, one can store the count in the database itself. The key feature of this approach is that the trigger to update the count is the application's responsibility.

Use a variable with the help of triggers/hooks.

If you are not familiar with hooks or triggers, these articles will give you a good starting point on understanding them:

Using a trigger or a hook, we can maintain a variable in either a Postgres table or outside of the system. Choosing this strategy comes down to how your database is setup, and is generally suited for applications that have a lot of downstream systems for consumption. The trigger or hook is, in general, more reliable and suited for more complex applications.

A cache like Redis

In a microservice architecture where many services talk to the database, storing all row count-related information for many tables will be a bottleneck since there can be hundreds of microservices and a lot of databases. It can also result in synchronization-related issues as well.

Redis is a system that suits such kind of N-Tier architecture. It is:

  • Fast.
  • Thread-safe
  • Scalable (using sharding)
  • Persistent (can be enabled)

All individual services can call Redis based on the Saga pattern or API Composition pattern to update the values.

It is very widely used as a cache in the microservices world, but we must keep in mind that bringing one more external system into the picture leads to an increased learning curve, maintenance, and troubleshooting. If you do not have a complex N-Tier system, then you are better off using more simpler solutions.

Conclusion

We saw in detail how count & distinct queries operate underneath the surface and how they can be made faster. To summarize:

💖 💪 🙅 🚩
rettx
Arctype Team

Posted on May 4, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related