Django Queries Optimization

valerybriz

Valery C. Briz

Posted on March 30, 2020


Have you ever asked yourself: is this the best way of writing this query, or is there a more efficient one?

When a project has one or more recurrent or long-running blocking queries, we need to keep in mind that those queries might slow down our application or even bring the database down.

Sometimes the trick is to use the options the ORM already offers, and sometimes the best way is to write a custom SQL query. But how do you know which is best in each case?

In this article I'll show you a series of best practices, recommended by the Django documentation and by experience, so you get an idea of the options available to optimize your code!

Before deciding what to use, keep in mind that you may want to optimize your queries for speed, for memory, or both, depending on your requirements; sometimes optimizing for one hurts the other. Also, work done by the database doesn't have the same resource cost as work done by a Python process. It is up to you to decide what your priorities are and where the balance must lie, so you can size your server resources accordingly.
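
As a tiny illustration of that difference (using the User model defined in the next section), counting rows in the database sends back a single integer, while counting in Python materializes every row first:

# Counted by the database: a single SELECT COUNT(*) query,
# only one integer comes back to Python.
users_with_email = User.objects.exclude(email__isnull=True).count()

# Counted in Python: every matching row is fetched and turned
# into a model instance before len() can run.
users_with_email = len(User.objects.exclude(email__isnull=True))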

Having the Following Schema

import uuid

from django.db import models


class User(models.Model):
  id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
  name = models.TextField()
  email = models.TextField()
  age = models.PositiveSmallIntegerField()
  # Team is declared below, so we reference it by name.
  team = models.ForeignKey('Team', on_delete=models.PROTECT, related_name='users', null=True)


class Team(models.Model):
  location = models.TextField()
  name = models.TextField()

First of all, you have to understand that Querysets are lazy, but what does that mean?

It means that the act of creating or declaring a Queryset doesn’t involve any database activity. So you can add filters and conditions line after line and the Queryset won’t be executed until you actually ask for the information, for example:

user_query = User.objects.filter(name="Ana") # Filter the users by the name "Ana"
user_query = user_query.filter(age__lte=50) # Then keep only the users whose age is 50 or less
user_query = user_query.exclude(email__isnull=True) # And exclude the users that don't have an email
print(user_query)

In these lines of code, the database is hit only once, at print(user_query). That's very useful when you want to add a lot of conditions before execution, but be careful: there are other operations that hit the database right away, like:

# Iteration
for user in User.objects.all():
  print(user.name)

# Slicing
User.objects.all()[:10]

# Pickling or Caching
import pickle
user_queryset = User.objects.all()
pickle_str = pickle.dumps(user_queryset)  # Evaluates the Queryset so it can be pickled
user_queryset = pickle.loads(pickle_str)
len(user_queryset._result_cache)  # The result cache is already populated

# Built-ins like repr(), len(), list(), bool()
user_list = list(User.objects.all())
# or a boolean condition like:
if User.objects.filter(name="Ana"):
    print("There is at least one user whose name is Ana")

It also means that you have to distinguish which results are cached, so the database isn't hit again, and which are not, meaning the query is repeated every time. For example:

# Fetch the user with email myemail@gmail.com
user = User.objects.get(email="myemail@gmail.com")
user.team # Hits the database and retrieves the related Team object
user.team # Cached version, no database access

# Fetch the team named tigers
team = Team.objects.get(name="tigers")
team.users.all() # Query performed
team.users.all() # Query performed again

So if you need to iterate over a huge dataset and perform some action with the values, a possible solution is to use iterator().

for user in User.objects.exclude(email__isnull=True).iterator():
   print(user.name)


It will hit the database just once, fetching the matching rows, but use iterator() with caution and make sure your code is organized to avoid repeated evaluation of the same huge Queryset.
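
For example, an iterator() result is not cached, so looping over it twice means two full trips to the database, while a plain Queryset caches its rows after the first loop. A small sketch of the difference, reusing the User model from above:

users = User.objects.exclude(email__isnull=True)

# Regular Queryset: the first loop runs the query and caches the rows,
# the second loop reuses that cache with no extra database hit.
for user in users:
    print(user.name)
for user in users:
    print(user.email)

# iterator(): nothing is cached, so each loop runs the query again.
for user in users.iterator():
    print(user.name)
for user in users.iterator():
    print(user.email)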

A way to avoid excessive memory usage when the dataset is huge is to run a boolean check before fetching the whole dataset.

user_queryset = User.objects.all()
# The first hit to the database confirms if at least one object exists.
if user_queryset.exists():
    # Another database hit to start fetching the rows in batches.
    for user in user_queryset.iterator():
        print(user.name)

But don't overuse the exists() and count() methods, because each call performs a new database hit.
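
If you are going to load the rows anyway, reuse the evaluated Queryset instead of asking the database again. A small sketch, reusing the User and Team models from above:

users = list(User.objects.filter(team__name="tigers"))  # One query

# Each of these lines would hit the database again:
# User.objects.filter(team__name="tigers").exists()
# User.objects.filter(team__name="tigers").count()

# No extra queries: the rows are already in memory.
if users:
    print(len(users), "users found in team tigers")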

We could also retrieve everything at once if we know we'll certainly need it. To do this we can use prefetch_related() or select_related(), which fetch the related objects together with the main objects in one go.

Team.objects.all().prefetch_related('users')

This will return a Queryset that will automatically retrieve, in a single batch, related objects for each of the specified lookups.
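
In practice this is what saves you from the classic N+1 pattern: without the prefetch, each call to team.users.all() inside a loop runs a separate query. A short sketch of the difference:

# N+1 queries: one for the teams plus one per team for its users.
for team in Team.objects.all():
    print(team.name, [user.name for user in team.users.all()])

# 2 queries: one for the teams and one batched query for all their users.
for team in Team.objects.all().prefetch_related('users'):
    print(team.name, [user.name for user in team.users.all()])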

Then there is select_related(), which returns a Queryset that follows foreign-key relationships, selecting the additional related-object data when it executes its query. This is a performance booster that results in a single, more complex query, but it means that later use of those foreign-key relationships won't require extra database queries. For example, if you need the team a user belongs to:

# Hits the database
user = User.objects.get(name="Ana")
# Hits the database again to get the related Team object
team = user.team

So rather than doing that you can do this:

# Hits the database
user = User.objects.select_related("team").get(name="Ana")
# The second line doesn't hit the database, because user.team has been prepopulated
team = user.team

Another way to make a query faster is to retrieve individual objects using a unique, indexed column. As in any other ORM, the lookup will be quicker thanks to the underlying database index. Also, get() runs much slower when multiple objects match the filter; a unique constraint on the column guarantees that never happens.

# id is the unique, indexed primary key of the User model
user = User.objects.get(id=some_user_id)

Don't order results if you don't care about the order. Sometimes Django orders the dataset even when we didn't ask for it (for example through a default ordering in the model's Meta), but we can disable that on a Queryset by calling order_by() with no parameters.

user_queryset = User.objects.all().order_by()

Use defer() and only() if you only need certain fields of a dataset, to avoid loading all of the fields. Note that if you do access the deferred fields later, the ORM will have to fetch them in a separate query, which makes this a pessimization if used inappropriately.

# This query retrieves all of the rows in the dataset but only the "name" field
user_queryset = User.objects.all().only("name")
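
defer() is the mirror image of only(): you list the fields to postpone instead of the fields to load. A small sketch, assuming the email column is rarely needed:

# Loads every field except "email"; accessing user.email later
# would trigger an extra query for that object.
user_queryset = User.objects.defer("email")
for user in user_queryset:
    print(user.name, user.age)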

Logging All SQL Queries

A pretty useful tool is the ORM logger, which lets you inspect all of the queries and see how long each one took. To enable it, you'll need to add a few things to the LOGGING configuration in your settings.py. First, you'll need a handler; the example below logs everything at DEBUG level to the console.

LOGGING = {
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
        },
    },
}

Next, you'll need a logger that sends django.db.backends output to this handler. Since the LOGGING setting is a single dictionary, the complete configuration looks like this:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console'],
        },
    },
}

Once the above is set up, you should see a stream of SQL queries in your console output every time you hit the database.
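
If you just need a quick look from a shell or a test, you can also inspect django.db.connection.queries, which Django populates while DEBUG=True:

from django.db import connection, reset_queries

reset_queries()                        # Clear previously recorded queries
list(User.objects.filter(name="Ana"))  # Force the Queryset to run
for query in connection.queries:       # Each entry holds the SQL and its duration
    print(query['time'], query['sql'])
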
Conclusion

There isn't a perfect query or a general answer for using the ORM. You should first make your code clear and then work on optimizing it, keeping in mind that some practices may speed up the database but use a lot of memory. So be careful and analyze what is best for your app and your resources.

Originally posted at https://icodemag.com/django-queries-optimization/
