Django Postgres Text Search for 10M+ rows

avikki

AviKKi

Posted on August 11, 2020

Django Postgres Text Search for 10M+ rows

In a recent project I had to add a full text search functionality to an already existing Django project, below are notes of what challenges I encountered and how I solved them.

For easy reading I have listed down a brief walk-through and limitations I found, followed by a more detailed log.

Project Overview

This project involved loading a 30GB+ CSV into the database, that included information about books; and implementing full text search on those book's title, author, tags, categories.

Overall walk-through

  1. Adding in FTS with Postges is super easy -
    • Add django.contrib.postgres in installed_apps
    • perform search as following
Book.objects.filter(title__search='A little girl')
Enter fullscreen mode Exit fullscreen mode
  1. Indexing to increase performance
    • Add a SearchVectorField field to the model
  # for pre computed search vectors
  search_vector = SearchVectorField(null=True, blank=True)  
Enter fullscreen mode Exit fullscreen mode
  • Create Index
  class Meta(object):
      indexes = [ GinIndex(fields=['search_vector']) ]
Enter fullscreen mode Exit fullscreen mode
  1. Increase work_mem, default work_mem of Postgres is too low for M+ rows.
    edit work_mem in postgresql.config file and restart your db.
    A bit of sed command if you are using docker.

  2. Caching
    I cached whole webpage on a redis instance along with certain queries like result count(very heavy one), which will be repeat for every search page load, required overloading Paginator class and ListView class.

  3. Increase Shared Memory, generally not required but my docker container was running out of memory for some queries.

Limitations Found

  1. Complex queries would be really slow, example
    • sorting results based on similarity
    • sorting search results(aka ORDER BY) based on number of comments on an book, or any other non text column.
  2. A bit hard to tune
    • Doing a trade-off between relevant results and max possible results ( good for SEO ) requires complex queries which will take too loooong to process.

Detailed log TL;DR

Available options

There are two major ways of achieving this -

  • django haystack plugin

    With this you can integrate a search engine with your django application. You have several options like solr and elastic search for a search backend. These are really good at handing text search for a large amount of documents, but has overhead in form of server cost, development overhead etc.

  • Postgresql's full text search

    Postgres has a full text search feature, in sql you just have to add a WHERE clause and you have fully working text search, and on djangos side you can use .filter method. Although it is not a dedicated search application so has many shortcomings, for small applications it works great out of the box, but as database grows you'll have to do some tweaking.

Implementation

Config

add django.contrib.postgres to installed apps.

# settings.py
....
INSTALLED_APPS = [
    ...
    'django.contrib.postgres', # for fts search
    ...
]
...
Enter fullscreen mode Exit fullscreen mode

Model

from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Book(models.Model):
    title = models.CharField(max_length=300)
    poster_url = models.URLField()
    downloads = models.IntegerField()
    likes = models.IntegerField()
    comments_count = models.IntegerField()
    search_vector = SearchVectorField(null=True, blank=True)  # for pre computed serch vectors

    # tags, categories, authors remaining
    # raw data fields
    _tags = models.TextField(default="", blank=True)
    _categories = models.TextField(default="", blank=True)
    _authors = models.TextField(default="", blank=True)

    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]
Enter fullscreen mode Exit fullscreen mode

Above is a typical Django ORM model, search_vector contains vector representation of book's title, tags, categories and authors; Postgres converts both the search query and textfields into vectors then compares them for a match, by pre-computing the search vector and indexing it with a GinIndex we are improving the query speed.

search_vector can be computed with below python code,

Book.objects.update(search_vector=SearchVector('title', '_tags', '_categories', '_authors'))
Enter fullscreen mode Exit fullscreen mode

Using authors, tags and categories as TextField helps in loading the huge CSV file faster.

View

view was implemented with generic ListView

Profiling

After this I used Django's debugging toolbar to have a look at the queries being performed, there were 2 major issues.

  1. Count(*) was slow for queries with ~100K+ results

Count(*) is an notoriously expensive operation in sql, you basically have to scan through whole table to do this, there are some workarounds like storing count separately, partial indexes, but nothing is applicable to our use case.

I cached the queries for this

  1. Query time was drastically more after a certain increase in number of search results.
💖 💪 🙅 🚩
avikki
AviKKi

Posted on August 11, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related