PostgreSQL Faceted navigation and search

vmihailenco

Vladimir Mihailenco

Posted on July 28, 2023

PostgreSQL Faceted navigation and search

Faceted search or faceted navigation allows users to narrow down search results by applying multiple filters generated from some attributes or tags. In this article we will implement faceted search using PostgreSQL full text search and ts_stat function.

Creating a table

Let's start by creating books table with a name, tags (attributes), and a text search vector:

CREATE TABLE books (
  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  name varchar(1000),
  tags jsonb,
  tsv tsvector
);
Enter fullscreen mode Exit fullscreen mode

tsvector is a sorted list of distinct normalized words (lexemes) that are used for searching. You can create a tsvector using to_tsvector function:

SELECT to_tsvector('english', 'The Fat Rats');

   to_tsvector
-----------------
 'fat':2 'rat':3
Enter fullscreen mode Exit fullscreen mode

You can use to_tsvector when inserting rows to the table:

INSERT INTO books (name, tsv)
VALUES
  ('hello word', to_tsvector('english', 'hello world')),
  ('foo bar', to_tsvector('english', 'foo bar'))
RETURNING *;

 id |    name    | attrs |         tsv
----+------------+-------+---------------------
  1 | hello word |       | 'hello':1 'world':2
  2 | foo bar    |       | 'bar':2 'foo':1
Enter fullscreen mode Exit fullscreen mode

Once you have some data, you can search over books using a tsvector and a tsquery:

SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');

 id |    name    | tags |         tsv
----+------------+------+---------------------
  1 | hello word |      | 'hello':1 'world':2
Enter fullscreen mode Exit fullscreen mode

That query can be slow if your dataset is large, but you can make it faster by adding an inverted index on tsv column:

CREATE INDEX books_tsv_idx ON books USING GIN (tsv);
Enter fullscreen mode Exit fullscreen mode

And check that PostgreSQL uses the index:

EXPLAIN ANALYZE
SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');
Enter fullscreen mode Exit fullscreen mode

Creating facets from tags

We will be using the following dataset to test our queries:

- model: Book
  rows:
    - name: The Gods Themselves by Isaac Asimov
      tags:
        - moods:adventurous
        - moods:challenging
        - pace:medium
    - name: Legend by David Gemmell
      tags:
        - moods:adventurous
        - moods:emotional
        - pace:fast
    - name: Lord of Light by Roger Zelazny
      tags:
        - moods:adventurous
        - moods:challenging
        - pace:medium
    - name: The Name of the Wind by Patrick Rothfuss
      tags:
        - moods:adventurous
        - moods:mysterious
        - pace:medium
    - name: Hyperion by Dan Simmons
      tags:
        - moods:mysterious
        - moods:adventurous
        - pace:medium
Enter fullscreen mode Exit fullscreen mode

You can insert those books using the following query:

INSERT INTO "books" ("name", "tags", "tsv")
VALUES
  ('The Gods Themselves by Isaac Asimov', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  ('Legend by David Gemmell', '["moods:adventurous","moods:emotional","pace:fast"]', array_to_tsvector('{"moods:adventurous","moods:emotional","pace:fast"}')),
  ('Lord of Light by Roger Zelazny', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  ('The Name of the Wind by Patrick Rothfuss', '["moods:adventurous","moods:mysterious","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:mysterious","pace:medium"}')),
  ('Hyperion by Dan Simmons', '["moods:mysterious","moods:adventurous","pace:medium"]', array_to_tsvector('{"moods:mysterious","moods:adventurous","pace:medium"}'));
Enter fullscreen mode Exit fullscreen mode

And then filter books by tags:

SELECT name, tags FROM books WHERE tsv @@ 'moods\:mysterious'::tsquery;

                   name                   |                           tags
------------------------------------------+----------------------------------------------------------
 The Name of the Wind by Patrick Rothfuss | ["moods:adventurous", "moods:mysterious", "pace:medium"]
 Hyperion by Dan Simmons                  | ["moods:mysterious", "moods:adventurous", "pace:medium"]
Enter fullscreen mode Exit fullscreen mode

Constructing a facet

Let's start by defining a facet we are expecting to get in the end:

Attr Value Book count
moods adventurous 5
moods challenging 2
moods emotional 1
moods mysterious 2
pace fast 1
pace medium 4

We could easily achieve that result with the following query:

WITH tags AS (
  SELECT jsonb_array_elements_text(tags) AS tag
  FROM books
)
SELECT
  split_part(tag, ':', 1) AS attr,
  split_part(tag, ':', 2) AS value,
  count(*) AS count
FROM tags
GROUP by attr, value
ORDER BY attr, value, count DESC;

 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     5
 moods | challenging |     2
 moods | emotional   |     1
 moods | mysterious  |     2
 pace  | fast        |     1
 pace  | medium      |     4
Enter fullscreen mode Exit fullscreen mode

But it is rather slow and inefficient because we need to select all tags to build the facet. Can we do better? Yes, using ts_stat function to get the required data directly from the tsv column.

Retrieving document stats

The function ts_stat allows to retrieve document statistics that are maitained by PostgreSQL full text search engine in tsvector columns.

SELECT word, ndoc FROM ts_stat($$ SELECT tsv FROM books $$) ORDER BY word;

       word        | ndoc
-------------------+------
 moods:adventurous |    5
 moods:challenging |    2
 moods:emotional   |    1
 moods:mysterious  |    2
 pace:fast         |    1
 pace:medium       |    4
Enter fullscreen mode Exit fullscreen mode

As you can see, PostgreSQL already maintains the stats we need to build the facet only using the tsv column:

SELECT
  split_part(word, ':', 1) AS attr,
  split_part(word, ':', 2) AS value,
  ndoc AS count
FROM ts_stat($$ SELECT tsv FROM books $$)
ORDER BY word;

 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     5
 moods | challenging |     2
 moods | emotional   |     1
 moods | mysterious  |     2
 pace  | fast        |     1
 pace  | medium      |     4
Enter fullscreen mode Exit fullscreen mode

To build a refined facet, you can use a fast filter over the same tsv column that is covered by the index we created earlier:

SELECT
  split_part(word, ':', 1) AS attr,
  split_part(word, ':', 2) AS value,
  ndoc AS count
FROM ts_stat($$
  SELECT tsv FROM books
  WHERE tsv @@ 'pace\:fast'::tsquery
$$)
ORDER BY word;

 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     1
 moods | emotional   |     1
 pace  | fast        |     1
Enter fullscreen mode Exit fullscreen mode

Conclusion

PostgreSQL provides everything you need to build fast faceted search for datasets up to 1 million rows. With larger datasets the processing time becomes an issue and you may need to shard your database.

You can also check pg-faceted-search example that demonstrates how to implement faceted search using Go and Bun database client.

See also

💖 💪 🙅 🚩
vmihailenco
Vladimir Mihailenco

Posted on July 28, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related