Book Review: The Temple of Django Database Performance
Modasser Billah
Posted on August 5, 2020
I recently bought an interesting tech book themed as a fantasy adventure, and it was money well spent. The Temple of Django Database Performance was a great read.
The only way to master Django is to master database performance. The difference between junior, mid-level and senior Django engineers is usually a function of this mastery.
The author starts his book with the bold claim above. But I'd say the content applies beyond Django and helped me get more low-level with databases than I was before. I encourage Django devs to read this book. If you need a glimpse into it, here are my notes from the book:
Profiling: The Yardstick
- Before you start your ritual of performance tuning, you must be initiated in profiling. In human terms, profiling is all about measuring your results. Without profiling you have no way to measure if your tuning pleased the database deity.
- APMs, or Application Performance Monitoring tools, are your friends here. Think New Relic, Datadog, or more recently Sentry. The book goes into detail on New Relic, but I'll skip that part as it is too application-specific. You can get up to speed with any of these quickly.
- Slow query logs can help as well, but not at the level of APMs.
- For reproducing issues locally and gauging performance, use Django Debug Toolbar and/or Silk.
- Query Plan: A query plan details how the database intends to execute a query. Get comfortable reading these. The book is chock-full of examples of query plans (a quick way to peek at one from Django is sketched below).
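Django can show you a query plan without leaving Python. A minimal sketch, assuming a hypothetical app path for the `Event` model used later in this post, and Postgres (`analyze=True` is a Postgres-only option):

```python
# Print the query plan for a queryset (Django >= 2.1).
from myapp.models import Event  # hypothetical app path

plan = Event.objects.filter(name='conclave').explain(analyze=True)  # analyze=True: Postgres-only
print(plan)
```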
Indexes: The Magic Wand
- Without indexes, databases have to do sequential searches every time. O(n) every time, oh no!
- B-trees to the rescue! Indexes are generally B-trees. A B-tree maintains a tree whose leaf nodes are pointers to the disk locations of table records.
Fun fact: Engineers at Boeing invented the B-tree data structure, but no one knows what the B in B-tree stands for.
- Databases store data in units called pages. Page size is also used for index storage. In theory, the speed of the index is proportional to the page size, but it depends on external factors like the storage medium and the application. So, when in doubt or in the mood to tinker, profile!
Postgres Tip: You can see the indexes defined on a table by executing the `\d <table name>` command in the `psql` shell.
Q: Django ORM question: how do you add an index to your Django model?
A: Specify the `indexes` field inside the `Meta` class:
```python
class Event(models.Model):
    ...  # omitted for brevity
    name = models.CharField(max_length=255)

    class Meta:
        indexes = [
            models.Index(fields=['name'], name='my_first_index')
        ]
```
Note: The `name` attribute of the index is optional but helpful for later use.

Postgres Tip: Postgres locks the table against write operations by default during the indexing process. This can take a long time on large tables and cause headaches for your live system. The way around? Use `CREATE INDEX CONCURRENTLY`. This is called an online DDL operation (a migration sketch follows the definitions below).
- DDL: Data Definition Language => modifies the schema of the database.
- DML: Data Manipulation Language => acts on data
- In relational databases, both DDL and DML are just two tribes of SQL commands.
- MySQL's InnoDB engine doesn't lock by default.
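If you manage schema changes through Django migrations, Django 3.0+ ships a Postgres-specific operation for exactly this. A minimal sketch, with hypothetical app and index names:

```python
# A concurrent (online DDL) index build as a Django migration.
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    atomic = False  # required: CREATE INDEX CONCURRENTLY can't run inside a transaction

    dependencies = [('myapp', '0001_initial')]  # hypothetical app/migration names

    operations = [
        AddIndexConcurrently(
            'event',
            models.Index(fields=['name'], name='event_name_idx'),
        ),
    ]
```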
Postgres Tip: Force Postgres to update its statistics with `VACUUM ANALYZE`, then check performance with `EXPLAIN ANALYZE`.
Warning: In general, indexes slow down write operations.
- For write-heavy applications, you need LSM trees or SSTables instead of B-trees. Check out Cassandra and RocksDB for these. If you're using Django + MySQL, check out MyRocks from Facebook.
Most columns do well with B-tree indexes. However, if you are storing exotic data, you may benefit from alternative index types:
- GIN indexes for JSON data, Postgres
- GiST indexes for spatial data, Postgres
- BRIN indexes for very large tables, Postgres
- Hash indexes for in-memory tables, MySQL
- R-tree indexes for spatial data, MySQL
- GIN = Generalized Inverted Index
  - Works well with `contains` logic for JSON data and full-text search (see the sketch below)
  - Only available in Postgres
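Django exposes the Postgres-specific index types under `django.contrib.postgres.indexes`. A minimal sketch of a GIN index on a JSON column, using a hypothetical model:

```python
from django.contrib.postgres.indexes import GinIndex
from django.db import models

class Document(models.Model):  # hypothetical model
    data = models.JSONField()  # django.db.models.JSONField, Django >= 3.1

    class Meta:
        indexes = [
            GinIndex(fields=['data'], name='document_data_gin'),
        ]
```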
- Indexes are not a silver bullet (well, almost)
- The query planner may ignore an index and not use it!
Q: Why?
A: There can be many reasons. For appetisers, check your query plans after introducing an index to confirm whether the planner actually chose to use it.
- Tuning the database's parameters/configs can help, too. This article has some nice pointers on this for Postgres.
- If everything fails and indexes don't heal you, it may be a sign that you need to move some work out of the database and into your application, denormalise, or move it into an asynchronous task.
- Covering Indexes: An index is said to cover a query if the database can return all the data needed for the query directly from the index, without having to search through the table data.
Q: How do you create a Covering Index?
A: Multi-column indexes:

```sql
CREATE INDEX CONCURRENTLY le_index_name
ON le_table_name(le_attribute_1, le_attribute_2);
```
Note: The order of the attributes matters! An index on `(id, name)` is not the same as an index on `(name, id)`.
A: Using the `INCLUDE` parameter:

```sql
CREATE INDEX CONCURRENTLY le_index_name
ON le_table_name(le_attribute)
INCLUDE (le_another_attribute);
```
- Partial Indexes: An index with a `WHERE` clause is a partial index. It limits which rows in the table get added to the index (a Django ORM equivalent is sketched below).

```sql
CREATE INDEX le_index_name
ON le_table_name(le_attribute)
INCLUDE (le_another_attribute)
WHERE le_attribute = 'le_frequently_used_value';
```
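The Django ORM can express the same idea: `models.Index` accepts a `condition` argument (Django 2.2+) that becomes the `WHERE` clause of a partial index. A sketch mirroring the hypothetical names above:

```python
from django.db import models
from django.db.models import Q

class LeModel(models.Model):  # stands in for le_table_name
    le_attribute = models.CharField(max_length=255)

    class Meta:
        indexes = [
            models.Index(
                fields=['le_attribute'],
                name='le_partial_index',
                condition=Q(le_attribute='le_frequently_used_value'),
            ),
        ]
```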
Clustering
If the table changes infrequently and most of your queries filter on the same set of columns, you have another option: Clustering.
- Clustering restructures the table data on disk to match the ordering of the index.
- Clustering is a one-time operation. Postgres does not maintain the clustered structure, so the ordering degrades with new writes.
- It also requires an `ACCESS EXCLUSIVE` lock on the table, which blocks reads and writes.
- So, before you say "I do" to clustering, read the quote above again and decide whether you should.
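The ORM has no wrapper for `CLUSTER`, but you can run it over Django's database connection, e.g. from a management command. A sketch using the hypothetical names from the SQL examples above:

```python
from django.db import connection

# CLUSTER takes an ACCESS EXCLUSIVE lock; run it during a maintenance window.
with connection.cursor() as cursor:
    cursor.execute("CLUSTER le_table_name USING le_index_name;")
```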
MySQL Tip: Tables are automatically clustered by primary key in MySQL's InnoDB engine. And no, you cannot cluster by a secondary index.
Querying: The Weapons and the Ammo
- The `n+1` problem (sketched below):
  - 1 query fetches `n` records.
  - Then you run 1 query for each of those records to fetch some related data.
  - 1 (the original sin query) + (one additional query for each record, `1 * n`) = `n + 1` queries.
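What this looks like in practice, assuming the `Event` model from earlier has a `user` ForeignKey to Django's auth `User`:

```python
from myapp.models import Event  # hypothetical app path

events = Event.objects.all()    # query 1: fetch all n events
for event in events:
    print(event.user.username)  # lazy access fires 1 query per event: n more queries
```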
- Healing potion?
  - Django uses lazy fetching; that's why it doesn't fetch related models in the original query.
  - To force it not to be lazy, use `select_related`. `select_related` makes a JOIN to get the data for each field name you pass in.
```python
events = Event.objects.select_related('user').all()
```
Note: Joins can negatively impact performance. So, like any other superpower, use `select_related` with care and only when you need it.
- `select_related` only works with single-value relationships like `ForeignKey`.
Q: What about many-to-many and many-to-one relationships (i.e., reversing a `ForeignKey` relationship)?
A: `prefetch_related` to the rescue!
```python
User.objects.all().prefetch_related('events')
```
Relevant Resource: This article is a great place to start on `select_related` and `prefetch_related`.
- More ammunition:
  - Use `only()` to specify which columns/attributes of a table/model to fetch: `User.objects.only('name')`
  - `defer()` is the opposite of `only()`: defer fetching the columns you don't need immediately.
  - `values()` fetches data without instantiating any models; you get the data back as dicts.
  - Consider using `values()` when you're willing to trade Django models for more speed and lower memory usage.
  - For large sets of data, use `iterator()` to read data in chunks.
  - Use `update()` when:
    - Updating all objects in a queryset to the same set of values
    - The update does not need any signals or custom `save()` methods
  - Use `bulk_update()` when:
    - The queryset is large
    - Objects in the queryset are updated to different values
    - The update does not need any signals or custom `save()` methods
  - `F()` expressions allow you to refer to the value of a column in the database without querying for that data (see the sketch after this list).
  - `Func()`, `annotate()`, and `aggregate()` are some more advanced weaponry. I'll skip the details; just remember they exist and pick them up as needed.
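For example, `F()` lets you increment a counter atomically in a single `UPDATE`, without reading the row first. A minimal sketch with a hypothetical `Ticket` model that has an integer `views` field:

```python
from django.db.models import F
from myapp.models import Ticket  # hypothetical model

# One UPDATE ... SET views = views + 1; no SELECT, no read-modify-write race.
Ticket.objects.filter(pk=42).update(views=F('views') + 1)
```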
Relevant Resource: Haki Benita's blog has some great articles on databases, the Django ORM, and Python.
Pagination
- Pagination is often overlooked when building a new system. But it bites you eventually when the data grows in size.
- Offset pagination
- Keyset pagination, a.k.a. infinite scroll
- Offset Pagination
  - Django comes with a `Paginator` class that does offset pagination off the shelf (a sketch follows this list).
  - Flaws:
    - Results can be inconsistent, because new insertions while a user is viewing a page can shift records between pages.
    - For large sets of data, pages at the tail end can take longer to query.
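For reference, a minimal sketch of the built-in `Paginator` over a `User` queryset:

```python
from django.contrib.auth.models import User
from django.core.paginator import Paginator

users = User.objects.order_by('id')        # a stable ordering matters for paging
paginator = Paginator(users, per_page=25)  # 25 records per page
page = paginator.page(2)                   # roughly LIMIT 25 OFFSET 25 under the hood
for user in page.object_list:
    print(user.username)
```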
- Keyset Pagination
  - The client sends a keyset representing the last item in a page (instead of an offset and a limit). The next page is then defined as the records following that last item in sorted order.
  - A keyset is a set of columns that can be used to sort records in a stable order.
  - Make sure an index exists for the columns used in the keyset.
  - DRF has it built in: use `CursorPagination` (a sketch follows this list)!
  - `django-keyset-pagination` is another third-party library for this.
  - Flaws:
    - Lack of tooling support in vanilla Django
    - You cannot select arbitrary pages, only the next page or the previous page.
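A minimal sketch of DRF's `CursorPagination`, assuming a model with an indexed `created` timestamp column:

```python
from rest_framework.pagination import CursorPagination

class EventCursorPagination(CursorPagination):
    page_size = 50
    ordering = '-created'  # must be a stable, indexed ordering

# Then point a view at it:
# class EventList(generics.ListAPIView):
#     pagination_class = EventCursorPagination
```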
I enjoyed the technical tidbits as much as I enjoyed the accompanying fantasy adventure. If you're interested in buying the book, head over to the Spellbook Press website. Support indie publishers!
Cover Image Credit: Spellbook Press web page of the book
This article was originally published on my personal blog.