Moritz Rieger
Posted on March 12, 2021
Google has taught us to just type whatever comes to mind when searching for something.
How can we achieve such a fuzzy search with a Postgres database?
There are basically three approaches.
- Pattern matching with ILIKE '%searchterm%' to find all occurrences of searchterm inside a string
- Postgres full-text search
- Postgres pg_trgm (trigram) extension
In this article, I will focus on the third approach, because I want a search so fuzzy that it is insensitive to typing errors and matches even if a word is incomplete. These two criteria dismiss the first two approaches.
So let's dive into the third approach, the Trigrams!
Trigrams
Let's see what the Postgres documentation tells us about trigrams:
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.
To get a grasp of what this means, check out the show_trgm()
function of the pg_trgm extension.
But first, we need to enable the trigram extension.
CREATE EXTENSION pg_trgm;
SELECT show_trgm('Hello');
show_trgm
---------------------------------
{" h"," he",ell,hel,llo,"lo "}
SELECT show_trgm('Hello World');
show_trgm
---------------------------------------------------------------
{" h"," w"," he"," wo",ell,hel,"ld ",llo,"lo ",orl,rld,wor}
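Note that pg_trgm lowercases its input and pads each word with two spaces in front and one behind before extracting trigrams, which is why entries like " h" and "lo " show up above. A quick way to see the case-insensitivity:

```sql
-- pg_trgm lowercases before extracting trigrams,
-- so 'Hello' and 'hello' produce identical trigram sets
SELECT show_trgm('Hello') = show_trgm('hello');
 ?column?
----------
 t
```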
similarity
The similarity function returns a number from 0 to 1, which indicates how similar two strings are. A value of 1 means the strings are identical.
In our case it is 0.5, since the first string shares all of its 6 trigrams with the second string, and the union of both trigram sets contains 12 trigrams.
SELECT similarity('Hello', 'Hello world');
similarity
------------
0.5
But if you search for a short string within a much larger string, the similarity score drops drastically, because many of the larger string's trigrams are not shared.
SELECT similarity('Hello', 'Hello world, you wonderful planet!');
similarity
------------
0.19354838
Therefore, Postgres provides two other functions that respect word boundaries in strings.
word similarity
The word similarity functions reflect the highest similarity the first string reaches against any substring of the second string (word_similarity) or against any whole word of it (strict_word_similarity).
Note: No matter how long the second string gets, the (strict) word similarity will always be 1, because Hello matches a whole word in the example exactly.
SELECT
word_similarity('Hello', 'Hello world, you wonderful planet!'),
strict_word_similarity('Hello', 'Hello world, you wonderful planet!');
word_similarity | strict_word_similarity
-----------------+------------------------
1 | 1
In other words, word_similarity is useful for matching parts of words, while strict_word_similarity is useful for matching whole words.
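To see the difference, compare both functions on a word fragment (a sketch; I deliberately omit the exact scores, which depend on the trigram sets):

```sql
-- 'wond' is only part of the word 'wonderful':
-- word_similarity matches it against the best-fitting substring,
-- while strict_word_similarity compares it against the whole word,
-- so the first score comes out noticeably higher than the second
SELECT
  word_similarity('wond', 'wonderful'),
  strict_word_similarity('wond', 'wonderful');
```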
The cool thing about trigram comparison is that even a misspelled word gets a relatively high similarity score.
SELECT
  similarity('wonderfull', 'wonderful'),
  strict_word_similarity('wonderfull', 'Hello World, you wonderful planet!'),
  word_similarity('wonderfull', 'Hello World, you wonderful planet!');
similarity | strict_word_similarity | word_similarity
------------+------------------------+-----------------
0.75 | 0.75 | 0.8181818
Example
So far so good, let's get our hands dirty.
Imagine you have a bookstore. Often customers ask you about a book whose title they only know fragments of, or they don't know how to spell the author's name correctly. Sounds like a perfect fit for our fuzzy search!
For the sake of simplicity, let us generate a dataset. Yeah, I know this is not the most beautiful sample data for a search, but I hope you get the point.
CREATE TABLE books (
id TEXT,
title TEXT,
abstract TEXT,
author TEXT,
PRIMARY KEY(id)
);
INSERT INTO books (
id, title, abstract, author
)
SELECT
left(md5(i::text), 10),
left(md5(random()::text), 15),
md5(random()::text),
concat_ws(' ', left(md5(random()::text), 10), left(md5(random()::text), 10))
FROM generate_series(1, 100000) s(i);
SELECT * FROM books LIMIT 5;
id | title | abstract | author
------------+-----------------+----------------------------------+-----------------------
c4ca4238a0 | 06629ea8b04aaa7 | 9f0ad3e3542ecf1efbdddd98cca507a6 | 4dceb23e53 0f01b57e71
c81e728d9d | 0a284b57f95d997 | fdd45a7d9eda64c9deb4882ccbb42296 | f3fbf4ed2a ef99e869d2
eccbc87e4b | b464a24deba866e | 0eda8641e61327719906b493080cd96f | fca64a442c db03ca6331
a87ff679a2 | 6275469f9e41990 | f3771fb3afdb463d302d7849c38d5641 | fbe106a816 b4313ad5c3
e4da3b7fbb | b2e9f3a8bad3aec | a9f4b8432c1b6655ae8775dd5b904498 | 6df1b43a13 87c7a303a4
But wait. How can we search over multiple columns?
Good catch. To search across multiple columns, we have to concatenate all columns that should be searched with the concat_ws function.
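One detail worth knowing: concat_ws skips NULL arguments instead of nullifying the whole result, which makes it a good fit when some columns may be empty. For example:

```sql
-- concat_ws(separator, ...) ignores NULL arguments,
-- so rows with a missing column still produce a searchable string
SELECT concat_ws(' ', 'Hello', NULL, 'World');
 concat_ws
-------------
 Hello World
```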
Note: The <% operator returns true if the word_similarity is above the pg_trgm.word_similarity_threshold parameter. The <% operator has a commutator %>.
So 'Hello' <% 'Hello World' = 'Hello World' %> 'Hello'
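You can verify this symmetry directly. With the default threshold of 0.6, both expressions return true here, since the word similarity of 'Hello' against 'Hello World' is 1:

```sql
-- both sides express the same word-similarity check
SELECT 'Hello' <% 'Hello World' AS op,
       'Hello World' %> 'Hello' AS commutator;
 op | commutator
----+------------
 t  | t
```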
SHOW pg_trgm.word_similarity_threshold;
pg_trgm.word_similarity_threshold
-----------------------------------
0.6
SELECT * FROM books WHERE '9c9f6' <% concat_ws(' ', title, abstract, author);
id | title | abstract | author
------------+-----------------+----------------------------------+-----------------------
2804d14b1b | 0152e58b57b94ff | 9c9f468fa58fc140427f7354f1a2b88e | b520b29fae ae1297ce25
c764c89e73 | 2f81a4878b09a36 | 667ddc5f403d9c96b43912628e1cb73c | ebab7d5c7c 9c9ffbfd82
5fa8101625 | 30071f6f0e9c9f6 | 5f3250393ac1b214340af81f239e7ca5 | ffec138ddb 3b9060d0c4
7cb394c278 | ce5ad25e38577fc | 805d0e2ca8081c6a8a178b1f0650c9f6 | 9c5332d434 cd79b0e802
9b39d07008 | a58ecb6f4e41c96 | 2df452e5bd0b1102733bbc89dd78e6ee | 9c9fb041fa 980becdc8f
(5 rows)
Time: 1245,955 ms (00:01,246)
Ouch, that took waaaaay too long. Now, it's time for GIN!
boost your search performance with GIN
Postgres has a special index type for this use case: GIN (Generalized Inverted Index).
"GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items."
Sounds like a perfect fit. But there is a little problem: we cannot create an index on the concat_ws function directly, since it is not immutable. We have to wrap it in an immutable function to create an index on the concatenated columns.
-- create immutable function wrapper for concat_ws
CREATE OR REPLACE FUNCTION f_immutable_concat_ws(t1 text, t2 text, t3 text)
RETURNS text AS
$func$
SELECT concat_ws(' ', t1, t2, t3)
$func$ LANGUAGE sql IMMUTABLE;
-- create a GIN index
CREATE INDEX search_gin_trgm_idx ON books
USING gin (f_immutable_concat_ws(title, abstract, author) gin_trgm_ops);
-- validate performance improvements
EXPLAIN ANALYZE SELECT * FROM books WHERE '9c9f6' <% f_immutable_concat_ws(title, abstract, author);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=21.68..152.10 rows=100 width=82) (actual time=1.937..2.394 rows=5 loops=1)
Filter: ('9c9f6'::text <% f_immutable_concat_ws(title, abstract, author))
Rows Removed by Filter: 13
Heap Blocks: exact=18
-> Bitmap Index Scan on search_gin_trgm_idx (cost=0.00..21.65 rows=100 width=0) (actual time=1.653..1.663 rows=18 loops=1)
Index Cond: (f_immutable_concat_ws(title, abstract, author) %> '9c9f6'::text)
Planning Time: 4.487 ms
Execution Time: 2.533 ms
(8 rows)
Time: 9,965 ms
Yes, this was 125 times faster! 🏎
Where to go from here?
You could try:
- strict_word_similarity with <<%
- set another similarity threshold value for the GUC parameters
- if your search string contains multiple words, use the WHERE condition from above for every word and combine them with AND
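The last two ideas could be sketched like this (a hypothetical two-word search against the books table from above; the threshold value is chosen arbitrarily):

```sql
-- lower the threshold to make the search fuzzier (session-local)
SET pg_trgm.word_similarity_threshold = 0.4;

-- combine one <% condition per search word with AND
SELECT * FROM books
WHERE 'first'  <% f_immutable_concat_ws(title, abstract, author)
  AND 'second' <% f_immutable_concat_ws(title, abstract, author);
```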
Syntactic similarity is not enough for you?
Word2vec could be an option for comparing the semantic similarity of words. Luckily, there is already a postgres-word2vec extension for this.
I would love to hear about your solution for a fuzzy search on Postgres in the comments.
Cheers!