A powerful full-text search in PostgreSQL in less than 20 lines
Leandro Proença
Posted on September 16, 2021
This blogpost will guide you to understand the fundamental pieces needed to implement a good enough full-text search using PostgreSQL.
Spoiler alert: for those curious people looking for a "okay, just show me a full-text search with ranking and fuzzy search in Postgres in less than 20 lines", so here you go:
SELECT
courses.id,
courses.title,
courses.description,
rank_title,
rank_description,
similarity
FROM
courses,
to_tsvector(courses.title || courses.description) document,
to_tsquery('sales') query,
NULLIF(ts_rank(to_tsvector(courses.title), query), 0) rank_title,
NULLIF(ts_rank(to_tsvector(courses.description), query), 0) rank_description,
SIMILARITY('sales', courses.title || courses.description) similarity
WHERE query @@ document OR similarity > 0
ORDER BY rank_title, rank_description, similarity DESC NULLS LAST
But if you need to understand what the heck is the above SQL statement doing, let me explain you a bit of context and FTS (Full-text search) fundamentals in PostgreSQL.
Context matters
A bunch of years ago I read this awesome blogpost called "Postgres full-text search is good enough". It's really worth reading, I could get many insights, since I was already using PostgreSQL as my standard database.
By the time, I was comfortable using ElasticSearch for text searching (and if we go even before that back to 2009, I have experience using Apache Lucene, from which ElasticSearch is based on).
However, managing ElasticSearch deployment is not easy. It requires a lot of patience and memory 🍪.
Then back to 2014 I wrote this article explaining the reasons why I decided to experiment on PG text search as well as showing a practical example in a Ruby application.
In this guide, I'll focus on a simpler yet powerful example using only SQL, so if you want to follow me in this adventure, make sure you have PostgreSQL installed.
That's the only requirement. No more tools to install or setup. Postgres solely.
Seeding data
In order to explain further the fundamentals of textual search, relevance and results ranking, we have to seed our database with real data and compare different search strategies.
Let's create a table called courses
containing only a title
and description
columns. Those columns will be our "searchable" columns in which we will perform a text search against:
CREATE TABLE courses
(id SERIAL PRIMARY KEY,
title VARCHAR(80) NOT NULL,
description VARCHAR(200) NOT NULL);
Next, we will populate the table with some dummy data:
INSERT INTO courses (title, description) VALUES
('Improve your sales skills', 'A complete course that will help you to improve your sales skills'),
('Intro to Computer Science', 'Understant how computers work'),
('Law 101', 'Have you ever wondered doing some Law?'),
('Natural Sciences the easy way', 'Your guide to understand the world'),
('Mathematics: a gentle introduction', 'Numbers are easy'),
('The crash course of Data Science', 'Be a data scientist in 5 weeks'),
('Sales crash course', 'Yet another course on Sales'),
('Java in a nutshell', 'Learn Java in 21 days'),
('Ruby programming language', 'DDH sales Ruby, but could you buy it?'),
('Sales matter', 'Really?'),
('History in 3 pages', 'Can you learn history in 3 pages?'),
('Mastering Git', 'Git history will no longer bother you'),
('Cooking like a boss', 'Be the next master chef'),
('Master Chef 3.0', 'Cooking revisited'),
('Functional Programming in a nutshell', 'Learn FP in 4 days');
Check the data was properly created:
SELECT * FROM courses;
Cool. Now, before going to "full-text search", let's perform a simple textual search as used in many SQL systems: pattern matching
Textual search using LIKE and ILIKE
Textual search using LIKE is pretty straightforward as doing:
SELECT
courses.id,
courses.title,
courses.description
FROM
courses
WHERE
courses.title LIKE '%java%' OR courses.description LIKE '%java%'
But it returned no results, since the LIKE
is case-sensitive, which means we have to specify the upcase letter as saved in the table:
...
courses.title LIKE '%Java%' OR courses.description LIKE '%Java%'
8 "Java in a nutshell" "Learn Java in 21 days"
We are lucky today, then let's use the ILIKE
which is case-insensitive, so there's no need to upcase as it will perform pattern matching on either capital and non-capital letters:
...
courses.title ILIKE '%java%' OR courses.description ILIKE '%java%'
8 "Java in a nutshell" "Learn Java in 21 days"
Considerations on LIKE/ILIKE
Many systems use the pattern matching feature to implement very simple text searches. It can be enough for many scenarios but the more the platform grows in demanding users, the more search needs to return better results, with a more accurate relevance and ranking.
According to the official Postgres documentation, the pattern matching LIKE | ILIKE
lacks essential properties required by modern systems:
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
They tend to be slow because there is no index support, so they must process all documents for every search.
There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words...
For a more practical example, and being that we give more relevance for the title over description, let's see in action the ILIKE
lacking such a requirement:
SELECT *
FROM courses
WHERE courses.title ILIKE '%sales%' OR courses.description ILIKE '%sales%'
We want the course 10, which contains the word "Sales" in its title, to appear before the course 9, which holds the word in its description.
Furthermore, which would be our order criteria? How about ordering by a "score" so we can build a rank of our results?
Full-text search for the rescue.
Full-text search in PostgreSQL
Full-text searching (FTS) allows documents to be preprocessed and an index saved for later rapid searching and ranking. Please refer to the official documentation which is quite complete and provides all the information needed to understand and implement a FTS.
The main building blocks for FTS in PG (Postgres) are:
-
tsvector
, which represents a searchable document -
tsquery
, which is the search query to perform against a document
tsvector
The to_tsvector
function parses an input text and converts it to the search type that represents a searchable document. For instance:
SELECT to_tsvector('Java in a nutshell')
...will give the following:
"'java':1 'nutshel':4"
- the result is a list of lexemes ready to be searched
- stop words ("in", "a", "the", etc) were removed
- the numbers are the position of the lexemes in the document:
java:1
starts at the 1st position whereasnutshell:4
starts at the 4th position
tsquery
The to_tsquery
function parses an input text and converts it to the search type that represents a query. For instance, the user wants to search "java in a nutshell":
SELECT to_tsquery('java & in & a & nutshell');
...will give the following:
"'java' & 'nutshel'"
- the result is a list of tokens ready to be queried
- stop words ("in", "a", "the", etc) were removed
So, how to match a query against a document?
The @@ operator
The @@
operator allows to match a query against a document and returns true or false. Simple as that.
/* true */
SELECT to_tsquery('java & in & a & nutshell') @@ to_tsvector('Java in a nutshell');
/* true */
SELECT to_tsquery('java') @@ to_tsvector('Java in a nutshell');
/* true */
SELECT to_tsquery('nutshell') @@ to_tsvector('Java in a nutshell');
/* false */
SELECT to_tsquery('batatas') @@ to_tsvector('Java in a nutshell');
Yay! As for now, we have the essential requirements to implement FTS on our courses table.
Search against courses
Let's perform the basic full-text search, looking for courses containing "java" in their titles:
SELECT *
FROM courses
WHERE to_tsquery('java') @@ to_tsvector(courses.title)
8 "Java in a nutshell" "Learn Java in 21 days"
Great. Let's perform the search "sales" against the title and description as well:
SELECT *
FROM courses
WHERE to_tsquery('sales') @@ to_tsvector(courses.title || courses.description)
or
SELECT *
FROM
courses,
to_tsvector(courses.title || courses.description) document
WHERE to_tsquery('sales') @@ document
At this moment, the results are similar to our ILIKE
version. Let's see where the FTS really shines.
ts_rank
The ts_rank
function takes the document and query as arguments attempts to measure how relevant are documents to a particular query.
SELECT
ts_rank(
to_tsvector('Java in a nutshell'),
to_tsquery('java')
)
"0.06079271"
Checking multiple variations:
/* 0.06079271 */
SELECT ts_rank(to_tsvector('Java in a nutshell'), to_tsquery('nutshell'))
/* 0 */
SELECT ts_rank(to_tsvector('Java in a nutshell'), to_tsquery('batatas'))
A more sophisticated simulating a rank of potential titles and descriptions:
SELECT
ts_rank(
to_tsvector('Java in a nutshell'),
to_tsquery('java')
) AS rank_title,
ts_rank(
to_tsvector('Learn in 21 days'),
to_tsquery('java')
) AS rank_description
Great! Now we have all the needed to implement a better textual search with proper ranking against our courses.
Search courses with ranking
Ranking results means that we have to split the document in different rankings so we can perform ordering accordingly.
We basically need to export the ranking fields:
SELECT
...
ts_rank(to_tsvector(courses.title), query) as rank_title,
ts_rank(to_tsvector(courses.description), query) as rank_description
...
...and perform the query against the whole document containing either title and description:
FROM
...
to_tsvector(courses.title || courses.description) document,
to_tsquery('sales') query
WHERE query @@ document
...
Then we are ready to perform the correct ordering:
...
ORDER BY rank_description, rank_title DESC
The implementation with ranking
So here we have a implementation of a full-text search with ranking in PostgreSQL in just 12 lines of SQL code:
SELECT
courses.id,
courses.title,
courses.description,
ts_rank(to_tsvector(courses.title), query) as rank_title,
ts_rank(to_tsvector(courses.description), query) as rank_description
FROM
courses,
to_tsvector(courses.title || courses.description) document,
to_tsquery('sales') query
WHERE query @@ document
ORDER BY rank_description, rank_title DESC
Is it fast?
Depending on the amount of data, such a query may face performance issues, since it needs to convert data to searchable documents on the fly.
Index for the rescue
Creating the proper index (a GIN index for text search), it can improve the performance by orders of magnitude.
In this project I made a proof of concept in which 12 millions of cities could be searched in just a few milliseconds. GIN index only. No materialized view needed.
Let's see how a GIN index could be created for our courses table:
CREATE INDEX courses_search_idx ON courses USING GIN (to_tsvector(courses.title || courses.description));
And we are good.
Just sit down, relax and enjoy a search across millions of courses with no performance issues.
How about fuzzy searching?
Fuzzy search, or "string approximation matching", is the technique used to calculate the approximation of two strings. It's commonly used to anticipate mispellings on the queries and so on.
Unfortunately, the Postgres built-in FTS doesn't support fuzzy searching, however, by using an extension, we can combine full-text search and fuzzy search in the same SQL query.
Let's create the extension:
CREATE EXTENSION pg_trgm
Now we can see the differences in action:
SELECT
to_tsquery('jova') @@ to_tsvector('Java in a nutshell') AS search,
SIMILARITY('jova', 'Java in a nutshell') as similarity
Note that searching for "jova" against a a text "Java in a nutshell", the full-text search match operator returns false whilst the SIMILARITY
function, provided by the pg_trgm
extension, returns a value 0.09
.
*In a scale from 0 to 1, similar strings tend to be close to 1. *
The final implementation, using ranking and fuzzy search
So here we have a implementation of a full-text search with ranking and fuzzy search in PostgreSQL in just 16 lines of SQL code:
SELECT
courses.id,
courses.title,
courses.description,
rank_title,
rank_description,
similarity
FROM
courses,
to_tsvector(courses.title || courses.description) document,
to_tsquery('curse') query,
NULLIF(ts_rank(to_tsvector(courses.title), query), 0) rank_title,
NULLIF(ts_rank(to_tsvector(courses.description), query), 0) rank_description,
SIMILARITY('curse', courses.title || courses.description) similarity
WHERE query @@ document OR similarity > 0
ORDER BY rank_title, rank_description, similarity DESC NULLS LAST
Highlights:
- we apply the similarity function against the title and description
- when the search has no match, we filter the results which have a similarity above 0
- added the
NULLIF
function so when therank_*
is zero, we cast the value toNULL
so the ordering can consider NULL values to be the last in ranking results
Conclusion
This guide was quite heavy but it covered only the very basics of full-text search in Postgres. In the official documentation you can see much more features and capabilities, such as highlight documents, weights, query tree, query rewrite, dictionaries, triggers and so on.
It's reliable and fast, which means that it can be used in a wide range of requirements, from simple search systems to complex ones. In case you already have PostgreSQL in your stack, it's worth considering to experiment on it before going to an external/expensive alternative which will demand more attention to operations complexity.
I hope you could enjoy this ride of implementing FTS in PostgreSQL. Happy searching!
Posted on September 16, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 30, 2024