10 Tips to Optimize PostgreSQL Queries in Your Django Project

Philippe

Posted on February 22, 2023

This article showcases optimization techniques that had great results at GitGuardian. You’ll find basic and advanced optimization techniques using Django and PostgreSQL, yet most of the principles we discuss are general enough to be applied to other frameworks or relational databases. However, a basic understanding of database transactions and how ORMs work is necessary to get the most out of it.

But don't just take my word for it, test them yourself! We provide a fully functional Django Playground where you can run all the queries shown in this article.

Here it is: https://github.com/GitGuardian/blog-playground

Optimization is a state of mind

Database theory provides a lot of good practices that we should always follow… at first. Because sometimes, as the usage of your app grows, those practices may prove insufficient or may not perform as expected.

Always remember, there is no magic recipe to get the best out of your database in all cases. As you'll read later in this article, some tips will be super effective in some cases and not perform well in others. As your application keeps living (and hopefully growing), you may move from one to another, even reverting an optimization that made you proud a few months ago. Don't see that as a failure; it was helpful for a time. What's important here is to continue to monitor, test, measure, and improve.

So, the first and most important tip is that optimization is not a one-off task. It’s a never-ending process.

A good method to iterate fast

Django ORM is a super handy tool to persist and query your business concepts using pythonic code instead of SQL. But you need to see the generated SQL to understand how your code will execute. Many tools provide this ability, but as you search for the best optimization, the closer you stay to your code, the better. Testing queries directly in the Django shell (or a notebook) may well be the fastest way to iterate and find the best solution.

Did you know that Django allows you to display all queries done by a code block?

Let’s consider the following code:

from books.models import Person
from django.db import connection, reset_queries 

reset_queries() 
qs = Person.objects.only("id") 
person = qs.first() 
print("SQL Query", qs[:10].query) 
print("PostgreSQL query: ", connection.queries[0]) # needs DEBUG=True 
print("pg explain analyze:", qs[:10].explain(ANALYZE=True))

And its output:

SQL Query SELECT "books_person"."id" FROM "books_person" LIMIT 10
PostgreSQL query:  {'sql': 'SELECT "books_person"."id" FROM "books_person" ORDER BY "books_person"."id" ASC LIMIT 1', 'time': '0.000'}
pg explain analyze: Limit  (cost=0.00..0.20 rows=10 width=8) (actual time=0.013..0.016 rows=10 loops=1)
  ->  Seq Scan on books_person  (cost=0.00..20251.18 rows=1000018 width=8) (actual time=0.012..0.014 rows=10 loops=1)
Planning Time: 0.100 ms
Execution Time: 0.034 ms

Django gives you access to the SQL query behind a QuerySet, but you can also list the queries executed at the connection level, and for the lucky PostgreSQL users, you can even get the execution plan for the query and its execution time. That's an excellent starting point for comparing ideas on how to optimize a query.

Be aware that production is not like your local machine

In the previous example, we extracted the execution plan used by PostgreSQL. But if you compare the same code executed on your local machine and in production, you may get very different results.

Why? As your tables grow, PostgreSQL optimizes them (or not!), and, depending on the database cluster's many settings, the planner may make different decisions. For example, if a table is small, the cost of using an index may not be lower than doing a full scan. So on your local machine with three users in your database, you can get different results than in your production instance with millions (let's think big!).

On your local machine, you should only be concerned with the number of queries you run and how big/complex they are. Leave the in-depth execution plan analysis for queries executed on the prod (or a prod-like) environment. Even there, be careful about your tables' statistics. If the tables change often, statistics may not be up to date and can lead to poor decisions from the planner (you can activate autovacuum to ensure your stats stay updated). For instance, with wrong statistics, the planner may "think" that your table is small and use a full scan, when in fact the table is full of new rows and using the index is the right choice.
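
If you suspect stale statistics on a prod-like environment, you can also trigger a manual refresh yourself. Here is a minimal sketch using Django's raw connection (the table name is the one used throughout this article):

from django.db import connection

with connection.cursor() as cursor:
    # Ask PostgreSQL to recompute the planner statistics for this table
    cursor.execute('ANALYZE "books_person";')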

Select only what you need

Now, let’s talk about some basic good practices that should always be followed. Running a query from your app can be broken down into 3 main tasks:

  • Sending the query to the database
  • Executing the query
  • Sending back the results

We often focus on the second part which can be memory and CPU bound, depending on the query and data, but the first and third parts, which are more network bound, are important too.

You may argue that the query will always be small… and maybe you are too confident about how ORMs work. You may also have forgotten the time you filtered using a list:

Person.objects.filter(email__in=all_population_of_north_america)

In this example, the generated SQL itself can be huge. Even if your table is small and no row matches the query, the time to send the query will be very long. In the worst case, it could even raise an error.
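
If the values you filter on already live in the database, one way to avoid shipping a huge list inside the SQL is to pass a queryset to __in, so it is rendered as a subquery. A sketch, where the continent field is purely hypothetical:

# Hypothetical field, used only for illustration
north_american_emails = Person.objects.filter(
    continent="north_america"
).values("email")

# The inner queryset becomes a subquery: the outer SQL stays small
Person.objects.filter(email__in=north_american_emails)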

Most of the time though, the result is a lot bigger than the query. So if you only need to work on a small set of fields, don't hesitate to use only() or defer():

Person.objects.filter(name='tolstoy').defer("bio")  # will not retrieve Tolstoy's bio
Person.objects.filter(name='tolstoy').only("name", "email")  # will only retrieve name and email

In the second example, be aware that if you want to access a field that is not listed in only(), Django will execute another query to retrieve it. If you do that in a loop, that’s many more queries. Dumping connection.queries can be useful to identify such cases and decide whether you should add a field to the only() clause.
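
A quick way to see this pitfall in the shell, reusing the query counter from earlier (DEBUG=True is assumed):

from django.db import connection, reset_queries

reset_queries()
person = Person.objects.only("name").first()
_ = person.name   # already loaded, no extra query
_ = person.email  # deferred field: Django silently runs a second query
print(len(connection.queries))  # 2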

After receiving the data, Django will instantiate a model for each row returned. If you only need some field values, use values() or values_list() to save a lot of time.
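
For instance, this sketch returns plain tuples instead of Person instances:

# Tuples of (name, email): no model instances are built
rows = Person.objects.filter(name="tolstoy").values_list("name", "email")

# Flat list of a single field
emails = Person.objects.filter(name="tolstoy").values_list("email", flat=True)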

To be complete on the subject: if you plan to serialize the result of your query, be aware that the more you retrieve, the more you will have to transform (and maybe send to some client). Pagination, when applicable, will make your queries run faster and will also significantly reduce the time needed to process the result of the query and send it to the client.
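
A minimal sketch with Django's built-in paginator (the page size of 100 is arbitrary):

from django.core.paginator import Paginator

paginator = Paginator(Person.objects.order_by("id").only("id", "email"), per_page=100)
page = paginator.page(1)  # fetches only 100 rows (plus a COUNT query for page validation)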

Index what you’re searching for

In databases, indexes use optimized data structures to avoid scanning every row of a table when searching for those matching a certain condition. Although indexes can significantly increase the size of your database, they are a cornerstone of database performance, saving a lot of time and CPU.

It's a well-known SQL best practice to index columns that are used in WHERE, ORDER BY, and GROUP BY clauses, and you can easily find where you should add indexes by looking at the costly "Seq Scans" (aka full scans) in an execution plan. At a minimum, always index your primary keys, foreign keys (Django does it automatically), and all the fields you often filter on.

That's not all. When creating an index, you should carefully choose its type; otherwise, it could be less effective than expected, or worse, PostgreSQL may not even use it! For example, HASH indexes can only handle = comparisons and do not help an ORDER BY, while B-TREE indexes are also used for LIKE, inequalities, and sorting operations. GIN indexes operate on complex data types like arrays or jsonb…
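
For instance, here is a sketch of a hypothetical model (not part of the playground) declaring both kinds of indexes through Django's Meta:

from django.contrib.postgres.indexes import GinIndex, HashIndex
from django.db import models


class Tag(models.Model):  # hypothetical model, only here to illustrate index types
    slug = models.CharField(max_length=64)
    payload = models.JSONField(default=dict)

    class Meta:
        indexes = [
            # HASH: equality lookups only, no help for ORDER BY or LIKE
            HashIndex(fields=["slug"], name="tag_slug_hash_idx"),
            # GIN: containment / key lookups on jsonb
            GinIndex(fields=["payload"], name="tag_payload_gin_idx"),
        ]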

Finally, be careful: functions can sabotage your indexes! Let’s take an example:

Let’s say we want to search Persons by name. At first, we naively think we’ll be fine indexing the name field…

class Person(models.Model):
    email = models.CharField(max_length=255)
    name = models.CharField(max_length=255, db_index=True)

But now imagine we want to do a case-insensitive search with this filter:

Person.objects.filter(name__iexact='tolstoy').only('email')

Which translates to:

SELECT email FROM books_person WHERE UPPER(name) = UPPER('tolstoy');
Execution Time: 4740.735 ms

In this case, our regular index is useless because we are applying the UPPER function to the indexed column! Instead, we should define the index as:

from django.db.models import Index
from django.db.models.functions import Upper


class Person(models.Model):
    email = models.CharField(max_length=255)
    name = models.CharField(max_length=255)

    class Meta:
        indexes = [
            Index(
                Upper('name'),
                name='books_person_name_upper_idx',
            ),
        ]

Or in SQL:

CREATE INDEX books_person_name_upper_idx ON books_person (UPPER(name));
Execution Time: 470.121 ms

Select_related and prefetch_related are not always the best match

When working on a Django model, you often need to access its relations. To demonstrate that, we create a second model, Book, which is related to Person through an author foreign key and a readers many-to-many relation (a person can read many books, and a book can be read by many persons; I hope it will be the case for this blog post).

class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ForeignKey(Person, on_delete=models.CASCADE, related_name="writings")
    readers = models.ManyToManyField(Person, related_name="readings")

If you have to iterate through Books and access information about their author, you could simply do:

for book in Book.objects.all():
    author = book.author

This is generally a bad idea as it will generate N+1 queries for N books. Instead, you should use select_related to minimize the number of queries.
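
For example, a minimal sketch using the models defined above:

for book in Book.objects.select_related("author"):
    author = book.author  # no extra query: the author row comes from the JOIN

This generates a single query with a JOIN: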

SELECT "books_book"."id", "books_book"."title", "books_book"."author_id", "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_book" INNER JOIN "books_person" ON ("books_book"."author_id" = "books_person"."id") 

prefetch_related is also a viable option for one-to-many or many-to-many relations:

Person.objects.prefetch_related("writings").all()

In this case, only two queries are generated:

SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person"
SELECT "books_book"."id", "books_book"."title", "books_book"."author_id" FROM "books_book" WHERE "books_book"."author_id" IN ( …)

Query duration: 1.0s
Total time: 8.12s

But note that in the second query, the IN clause contains the full list of all the Person ids selected in the first query, so the query can be huge. If you implement this solution on a big dataset, you may find the performance very disappointing.

from django.db import connection, reset_queries

reset_queries()
result = {}
for person in Person.objects.prefetch_related("writings")[:100_000]:
    result[person.email] = [book.title for book in person.writings.all()]
print("query duration", sum(float(query["time"]) for query in connection.queries))

Query duration 0.9s

The output of this code should tell you that the total query duration is small compared to the total execution time. There are two main reasons for that:

  • Django must fetch the complete list of persons in order to get the ids needed to build the second query
  • Each time you access a person's writings, Django has to look up the matching books returned by the second query

Django gives you a way to optimize the second point:

“When using to_attr the prefetched result is stored in a list. This can provide a significant speed improvement over traditional prefetch_related calls which store the cached result within a QuerySet instance”

from django.db.models import Prefetch

result = {
    person.email: [book.title for book in person.prefetched_writings]
    for person in Person.objects.prefetch_related(
        Prefetch("writings", to_attr="prefetched_writings")
    )
}

But if you only need to access some fields of the relation, performance can be significantly better with aggregation instead of prefetch_related:

from django.contrib.postgres.aggregates import ArrayAgg 

result = {
    person.email: person.writings_titles
    for person in Person.objects.annotate(
        writings_titles=ArrayAgg("writings__title")
    )
}

Which will generate only one query and Django’s work will be reduced to the minimum:

SELECT "users_user"."id", "users_user"."email", "users_user"."name", ARRAY_AGG("books_book"."title" ) AS "writings_titles" FROM "users_user" LEFT OUTER JOIN "books_book" ON ("users_user"."id" = "books_book"."author_id") GROUP BY "users_user"."id"

Query duration 0.6s
Total time: 1.65s

Limiting back & forths between your database and your Django backend is generally a good idea as your relational database will perform better than Django to join and aggregate data from several tables. You will save some network traffic and memory because there is no need to fetch and keep intermediate data for prefetch_related.

Still want to go the extra mile? As in the previous example, we only want emails and titles, meaning we don't need to instantiate full models. Here values_list() is just what we need, allowing us to work directly on plain Python objects:

result = {
    row.email: row.writings_titles
    for row in Person.objects.annotate(
        writings_titles=ArrayAgg("writings__title")
    ).values_list("email", "writings_titles", named=True)
}

Query duration 0.25s
Total time: 0.55s

Aggregations vs Subqueries

In the previous example, we saw that we could rely on the power of PostgreSQL to save our life and RAM (pick whichever matters more to you). But if there is one thing to take away from this article, it is that what works well in some cases may not perform well in others. Always keep an eye on how queries are performing and stay open-minded.

Let's take the case of subqueries. They are often seen as a bad practice because database planners are not always able to generate a single optimized execution plan from them. But in some cases, they can give surprisingly good results.

In our example, if we want to get all books published by our authors as well as their total count of readers, we will end up with this kind of query:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Count

writers_stats_qs = Person.objects.annotate(
    writings_title=ArrayAgg("writings__title"),
    readers_count=Count("writings__readers"),
).values_list("name", "bio", "writings_title", "readers_count")

Or in SQL:

SELECT "books_person"."email", ARRAY_AGG("books_book"."title" ) AS "writings_title", COUNT("books_book_readers"."person_id") AS "readers_count" FROM "books_person" LEFT OUTER JOIN "books_book" ON ("books_person"."id" = "books_book"."author_id") LEFT OUTER JOIN "books_book_readers" ON ("books_book"."id" = "books_book_readers"."book_id") GROUP BY "books_person"."id"

Query duration 68.92s

At some point, when a query needs to JOIN many tables or perform a GROUP BY, the database will need a lot of memory to create all the temporary structures required to resolve the query. It will also need to load all the relevant indexes, check for uniqueness on all temporary data, etc. So sometimes, using subqueries is worth a try, especially if they allow replacing one huge, complex query with smaller ones that are a lot easier to solve.

from django.contrib.postgres.expressions import ArraySubquery
from django.db.models import Count, OuterRef

writings_subquery = Book.objects.filter(author_id=OuterRef("id")).values(
    "title"
)
readers_subquery = (
    Book.objects.filter(author_id=OuterRef("id"))
    .values("author_id")
    .values(count=Count("readers__id"))[:1]
)
writers_stats_qs = Person.objects.annotate(
    writings_title=ArraySubquery(writings_subquery), readers_count=readers_subquery
).values_list("name", "bio", "writings_title", "readers_count")

This query will generate the SQL:

SELECT "books_person"."name", "books_person"."bio", ARRAY(SELECT U0."title" FROM "books_book" U0 WHERE U0."author_id" = ("books_person"."id")) AS "writings_title", (SELECT COUNT(U2."person_id") AS "count" FROM "books_book" U0 LEFT OUTER JOIN "books_book_readers" U2 ON (U0."id" = U2."book_id") WHERE U0."author_id" = ("books_person"."id") GROUP BY U0."author_id" LIMIT 1) AS "readers_count" FROM "books_person"

Query duration 18.5s

Big writes, but not too big

Until now, we talked about querying the database, but write operations may also lead to performance issues when done at a large scale.

In Django, bulk_create(), bulk_update(), and QuerySet.update() allow you to pack writes into grouped queries. This is far more efficient than iterating through a list of objects and saving each of them individually.

If you are working on a large dataset, it goes without saying that many issues can happen. First, the query size can be problematic. As mentioned earlier, a huge query will consume a lot of memory in your Django process and isn’t optimal on the database side, either. For that reason, bulk methods allow passing a batch_size to break the big query into smaller bits, meaning that many small batches of this size will be executed one after another.
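
For example, a minimal sketch (the batch size of 500 is arbitrary):

# Instead of one gigantic INSERT, Django issues one INSERT per batch of 500 objects
Book.objects.bulk_create(big_list_of_books, batch_size=500)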

Then, another possible bottleneck is the transaction. Transactions are great to ensure that the data stays consistent. But when inserting or updating a lot of data, the database will have to maintain all the transaction changes in a specific space, and may also lock tables to ensure no race condition is happening during the transaction. This is good for consistency but bad for concurrent queries waiting for the transaction to release the locks.

Therefore if atomicity is not mandatory for the whole data you’re writing, it could be a good idea to break your bulk writes into smaller transactions.

from itertools import islice
from django.db import transaction

CHUNK_SIZE = 500

def chunked(iterable, size):
    # itertools has no grouper(); this small helper yields lists of at most `size` items
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk

for books_chunk in chunked(big_list_of_books, CHUNK_SIZE):
    with transaction.atomic():
        Book.objects.bulk_create(books_chunk)

In this example, books are created in chunks of 500, and each bulk_create() is isolated in a transaction.

Save your RAM

We often think optimization has to do with CPU usage. But most of the time, optimization is more memory bound. Databases are more efficient when they can resolve a query without needing to read from or write to the disk. The cost of accessing big data structures is significant in Python too, even in memory. That's why we should avoid, as much as possible, creating lists from querysets. We have a much more memory-efficient option: iterators. The good news is that you can switch from one to the other almost for free:

def iter_over_persons(): 
    for person in Person.objects.all():
        pass

RAM consumption: 1770.63671875 MB

def iter_over_persons_with_iterator():
    for person in Person.objects.iterator():
        pass

RAM consumption: 114.70703125 MB

Note that using an iterator with prefetching wasn’t possible before Django 4.1. Since this version, the documentation states:

If you use iterator() to run the query, prefetch_related() calls will only be observed if a value for chunk_size is provided.
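
So, on Django 4.1 or later, something like this sketch keeps both the prefetch and the low memory footprint (the chunk size is arbitrary):

for person in Person.objects.prefetch_related("writings").iterator(chunk_size=2000):
    titles = [book.title for book in person.writings.all()]  # served from the prefetch cache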

Monitor your performance

As we repeated throughout this article, optimization is a long game. It is not the first thing you should think about when coding a new feature (there is even the well-known mantra: avoid premature optimization), but rather something you should prepare for. Observability is definitely a must-have to detect bottlenecks, timeouts, locks, and all those nasty problems that are inseparable from the success of your great product.

We have seen how to optimize at the query level, but you first need to identify queries you should work on.

At GitGuardian, we use the Silk profiler in our development environments, and we use a database APM scraping PostgreSQL slow queries to identify bad performers. We also monitor database locks and our instances' vitals (CPU, memory, IO, …).

With all this tooling, we can quickly spot issues and work on them. Django query optimization is only one way to resolve those issues. We sometimes change our models' architecture or add asynchronous processes to compute some values, like we did to manage permissions at scale. This is another story, yet the state of mind is the same: never stop improving performance, never stop improving the user experience.

To sum up our tips

  • Keep an open mind; optimization is a long game, and what is good one day may become bad a few months later
  • Create an environment where you can experiment easily and get immediate feedback
  • Be aware of the differences between your production and your test playground
  • Be careful of the amount of data you send or get from the database
  • Create indexes, and make sure they are used the way they are supposed to
  • Minimize the number of queries your code will generate and be aware of the overhead of processing on the Django side...
  • ...but sometimes several subqueries can perform better than a single complex query
  • Find the best chunk size for your writes to get short and efficient transactions
  • Measure RAM consumption on the Django side for better scaling
  • Use monitoring to detect performance bottlenecks at the earliest