Boosting PostgreSQL Performance: Optimising Queries with the != Operator
Sasi Kumar T
Posted on September 25, 2024
Introduction:
PostgreSQL is known for its flexibility and efficiency in handling a variety of queries. However, certain query patterns, such as those involving the !=
operator, can lead to performance challenges.
In this post, we'll discuss how to optimize queries involving !=
and demonstrate the performance improvements achieved by creating a partial index.
The Performance Challenge with !=
Using the !=
operator can make it difficult for PostgreSQL to utilize indexes efficiently. This often results in slower query performance because PostgreSQL may need to scan a larger portion of the table or perform more complex operations.
Consider the query:
SELECT count(*)
FROM "ticket"
WHERE "assignedTo" = 'sasikumar' AND
"businessId" = 'B00154' AND
"statusId" != 2;
While this query seems simple, it presents a challenge to PostgreSQL’s query planner. The !=
operator cannot efficiently use traditional B-tree indexes because it involves excluding a value, rather than narrowing down to a specific match. When indexes can’t be used effectively, PostgreSQL may have to resort to a full table scan, which can severely degrade performance, especially on large datasets.
The execution plan of the above query is as follows
Aggregate (cost=34899.83..34899.84 rows=1 width=8) (actual time=21912.045..21912.047 rows=1 loops=1)
-> Index Only Scan using ticket_businessId_assignedto_statusid_idx on ticket (cost=0.56..34899.13 rows=282 width=0) (actual time=21912.039..21912.040 rows=0 loops=1)
Index Cond: (("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text))
Filter: ("statusId" <> 2)
Rows Removed by Filter: 25617
Heap Fetches: 23630
Planning Time: 0.394 ms
Execution Time: 230.119 ms
If you are not familiar with what’s happening dont worry, read below
Execution Plan Breakdown:
1. Aggregate
Aggregate (cost=34899.83..34899.84 rows=1 width=8) (actual time=21912.045..21912.047 rows=1 loops=1)
- Aggregate: This represents the final step where the aggregation (counting rows in this case) takes place.
-
Cost: The estimated cost to execute is
34899.83
to34899.84
. The cost is an internal PostgreSQL unit representing the estimated computational effort. - Rows: The planner estimates that 1 row will be output after aggregation.
- Width: The width of the output row is 8 bytes.
-
Actual Time: The actual time taken is
21912.045 ms
to21912.047 ms
, indicating the duration to perform the aggregation. - Loops: This is executed once, as it is the final aggregation step.
2. Index Only Scan
-> Index Only Scan using ticket_businessId_assignedto_statusid_idx on ticket (cost=0.56..34899.13 rows=282 width=0) (actual time=21912.039..21912.040 rows=0 loops=1)
- Index Only Scan: This indicates that an index scan is used to access the data. Since the scan is “only” on the index, it means no heap (table) access is needed initially.
-
Index Name:
ticket_businessId_assignedto_statusid_idx
is the index being used. -
Cost: The cost to perform the index scan ranges from
0.56
to34899.13
. This range estimates the computational effort required to access the index. - Rows: The planner estimates that 282 rows will match the index condition.
-
Actual Time: The actual time taken for the index scan is
21912.039 ms
to21912.040 ms
. -
Rows: Despite the estimate, no rows matched the index condition in reality (
rows=0
).
3. Index Conditions and Filters
Index Cond: (("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text))
Filter: ("statusId" <> 2)
Rows Removed by Filter: 25617
Heap Fetches: 23630
-
Index Cond: The index condition is that both
businessId
andassignedTo
match the given values. This is used to narrow down the potential rows from the index. -
Filter: After applying the index condition, the query filters out rows where
statusId
is not equal to 2. -
Rows Removed by Filter: 25,617 rows were removed because they did not meet the filter condition
statusId <> 2
. - Heap Fetches: Even though the scan is index-only, PostgreSQL sometimes needs to fetch additional data from the heap to validate rows or retrieve data not included in the index. Here, 23,630 heap fetches occurred.
Key Points
-
Index Usage: The query uses an index-only scan to access rows based on
businessId
andassignedTo
. This is efficient as it avoids accessing the table's heap initially. -
Filtering: The filter (
statusId <> 2
) is applied after retrieving rows from the index. Since the filter was not selective enough, a large number of rows (25,617) were filtered out. - Heap Fetches: The number of heap fetches (23,630) suggests that many rows needed additional data from the heap, despite the use of an index-only scan.
One thing I want you the reader to know here is that, the statusId
field has cardinality of 5, having values ranging from 0
to 4
Now, based on this information we do the following
Approach 1: Query Rewrite
SELECT count(*)
FROM "ticket"
WHERE "assignedTo" = 'sasikumar' AND
"businessId" = 'B00154' AND
("statusId" > 2 OR "statusId" < 2)
The filter is modified to to get statusId’s > 2 and statusId’s < 2 which is the mathematical equivalent of statusId != 2
.
The execution plan of the above query is as follows
Aggregate (cost=1142.07..1142.08 rows=1 width=8) (actual time=691.201..691.204 rows=1 loops=1)
-> Bitmap Heap Scan on ticket (cost=28.77..1141.37 rows=281 width=0) (actual time=691.195..691.197 rows=0 loops=1)
Recheck Cond: ((("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text) AND ("statusId" > 2)) OR (("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text) AND ("statusId" < 2)))
Heap Blocks: exact=574
-> BitmapOr (cost=28.77..28.77 rows=282 width=0) (actual time=23.966..23.968 rows=0 loops=1)
-> Bitmap Index Scan on ticket_businessId_assignedto_statusid_idx (cost=0.00..14.37 rows=145 width=0) (actual time=23.864..23.864 rows=628 loops=1)
Planning Time: 1.249 ms
Execution Time: 4.881 ms
Summary of Execution
-
Bitmap Index Scan: The query starts with scanning the index
ticket_businessId_assignedto_statusid_idx
to find rows where thebusinessId
andassignedTo
match, and then thestatusId
condition is applied. This index scan is efficient in finding rows but does not directly give the full row data. -
BitmapOr: This combines the results from multiple bitmap index scans. Here, it combines results where
statusId
is either greater than 2 or less than 2. -
Bitmap Heap Scan: Using the results from the
BitmapOr
, it reads the actual rows from the heap and applies the conditions again. This step verifies the rows from the index scan and ensures they meet thestatusId
conditions. - Aggregate: Finally, it aggregates the results from the heap scan, counting the number of rows that satisfy the conditions.
Congrats! You might have noticed that the first query returned 1 row, while the second one returned 0 rows. That's not a mistake—it's just because these queries were run on a live database, so the results are not the same.
Approach 2: Using Partial Index
A partial index is an index that only includes rows meeting a specific condition defined in the WHERE
clause of the CREATE INDEX
statement. This allows you to create an index that is smaller and more efficient by excluding rows that don't match the condition.
Characteristics of Partial Indexes:
-
Selective Indexing:
- A partial index is created with a condition, which means it only indexes the rows that satisfy this condition. For example, you might only index rows where a column value is greater than a certain threshold.
-
Reduced Index Size:
- Because it only indexes a subset of the table, a partial index is usually smaller and consumes less disk space than a full index. This can make index scans faster and reduce memory usage.
-
Improved Performance for Specific Queries:
- Partial indexes are beneficial for queries that frequently filter on specific conditions. They can speed up these queries by avoiding the overhead of indexing all rows in the table.
In our case, we need to create a index with fields (businessId, assignedTo) where statusId is ≠ 2. The index creation command is as follows
CREATE INDEX CONCURRENTLY IF NOT EXISTS
ticket_businessId_assignedto_statusid_ne_2_idx
ON ticket ("businessId", "assignedTo")
WHERE ticket."statusId" != 2;
SELECT count(*) FROM "ticket"
WHERE "assignedTo" = 'sasikumar'
AND "businessId" = 'B00154'
AND "statusId"!=2;
Aggregate (cost=599.68..599.69 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
-> Index Only Scan using ticket_businessId_assignedto_statusid_ne_2_idx on ticket (cost=0.42..598.97 rows=283 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text))
Heap Fetches: 0
Planning Time: 0.198 ms
Execution Time: 0.052 ms
Execution Plan Breakdown:
-
Aggregate:
-
Actual Time:
0.022..0.023 ms
— The actual time taken for the aggregate operation to complete is between 0.022 and 0.023 milliseconds. -
Rows:
1
— The aggregate operation returned 1 row. -
Loops:
1
— This aggregate operation ran once.
-
Actual Time:
-
Index Only Scan:
-
Actual Time:
0.019..0.019 ms
— The actual time taken for the index scan is 0.019 milliseconds. -
Rows:
283
— The planner estimates that 283 rows would be returned by the index scan. -
Width:
0
— The width of the rows (data returned) is 0, indicating that only the index entries are being scanned, not the table rows. -
Index Cond:
("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text)
— The condition used to filter rows in the index scan.
The index-only scan uses the
ticket_businessId_assignedto_statusid_ne_2_idx
index to locate rows wherebusinessId
andassignedTo
match the specified values. It retrieves the index entries without needing to access the actual table rows because all necessary data is available in the index. -
Actual Time:
-
Heap Fetches:
-
Heap Fetches:
0
— Indicates that no heap fetches were needed. Since the index contains all required data, PostgreSQL did not need to access the actual table rows (the heap) for additional information.
-
Heap Fetches:
-
Planning Time:
-
Planning Time:
0.198 ms
— The time taken by PostgreSQL to plan the query execution, including generating the execution plan.
-
Planning Time:
-
Execution Time:
-
Execution Time:
0.052 ms
— The total time taken to execute the query and retrieve the result.
-
Execution Time:
Originally, the query used a full index
ticket_businessId_assignedto_statusid_idx
, which was 727 MB. Now, it's using a partial index that's only 3.17 MB.
In database terms, this means faster index scans, reduced memory usage when loading the index, and improved query performance.
Conclusion:
In the world of PostgreSQL, optimizing queries can sometimes feel like navigating a maze, especially when dealing with operators like !=
. As we explored, such queries can often lead to performance hurdles due to inefficient index utilization and the need for table scans.
Key Takeaways:
-
Challenges with
!=
: Queries that use the!=
operator can be less efficient because they force PostgreSQL to perform broader scans and potentially filter out a significant number of rows. This was evident in our initial example, where the query needed to filter out many rows, leading to a high number of heap fetches and considerable execution time. -
Query Rewrite Efficiency: By reformulating the query to use
statusId > 2 OR statusId < 2
, we leveraged PostgreSQL's ability to optimize these conditions more effectively. This approach helped reduce the execution time significantly, demonstrating that a slight change in query logic can yield substantial performance improvements. -
The Power of Partial Indexes: The use of partial indexes can be a game-changer. By creating an index specifically for rows where
statusId != 2
, we reduced the index size and optimized the scan process. This specialized index not only sped up the query but also eliminated the need for additional heap fetches, showcasing how targeted indexing can enhance performance.
About Me
I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.
Feel free to connect with me on LinkedIn to learn more about my professional journey and projects.
Posted on September 25, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
September 25, 2024