Marcell Cruz
Posted on June 29, 2021
- Creating Fake Realistic Data
- Testing Before The Index
- Creating The Index And Testing
- Expression Indexes
The first thing we need to do is create a lot of fake data so we can test our indexes. It's also important to use real words, depending on the index that you're trying to use. There's a clever way of doing this on Linux; if you're not on Linux, you can download a list of words here
Creating Fake Realistic Data
The first thing, and maybe the most important part when dealing with indexes or any code for that matter, is being able to test it. You need to play around with it to figure out what the best index is for your specific query. To test indexes we need a lot of data, otherwise the query planner won't use our index. If you want to know more about indexes you can read this
List Of Words
In order to create realistic data we need a list of words. On Linux you can grab the word list generated by the spell checker utility in /usr/share/dict/; the name of the file varies depending on the distribution. If you're not on Linux, you can grab your list of words here
Postgres Data Directory
Now that we have a list of words, we need a way of reading this list into Postgres so we can generate our data. The function pg_read_file can read anything inside the data directory; you can find out where your data directory is by running
SHOW data_directory;
The result will probably be something like /var/lib/postgres/; this also depends on your OS. After getting the path, you just need to copy your list of words into this directory so we can use pg_read_file to read it and generate our data.
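Putting those two steps together, the copy might look something like this on a typical Linux setup. This is a sketch, not a definitive recipe: the dictionary path, the postgres user name, and the target filename words.list are all assumptions you should adapt to your system.

```shell
# Ask Postgres where its data directory lives (path varies by OS/distro)
DATADIR=$(psql -U postgres -tAc 'SHOW data_directory;')

# Copy the spell-checker word list there so pg_read_file can see it
# (/usr/share/dict/words is a common name; yours may differ)
sudo cp /usr/share/dict/words "$DATADIR/words.list"
```

The file must live inside the data directory because pg_read_file refuses, by default, to read paths outside of it.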
Generating The Data
Let's create a table called notes with two text fields, title and description.
CREATE TABLE notes(id serial PRIMARY KEY, title text, description text);
We can then create a function that will feed us the words
CREATE OR REPLACE FUNCTION getNArrayS(el text[], count int) RETURNS text AS $$
  SELECT string_agg(el[(random()*(array_length(el,1)-1))::int + 1], ' ')
  FROM generate_series(1,count) g(i)
$$
VOLATILE
LANGUAGE SQL;
Just copy and paste, press enter and things will work, I promise :P
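To sanity-check the function before generating the full dataset, you can call it on a tiny inline array (the words here are just placeholders):

```sql
-- Pick 5 random words from a 3-element array and join them with spaces
SELECT getNArrayS(ARRAY['alpha','beta','gamma'], 5);
```

Each call should return a different 5-word string, since the function is VOLATILE and re-rolls random() per word.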
Now we can create the data, it will take some time so go grab some tea meanwhile.
WITH t(ray) AS(
SELECT (string_to_array(pg_read_file('words.list')::text,E'\n'))
)
INSERT INTO notes(title, description)
SELECT getNArrayS(T.ray, 3), getNArrayS(T.ray, 3) FROM T, generate_series(1,1000000);
We're creating one million records [insert dr evil meme here] with real words in two fields, not bad.
We can increase the number of words in each field by changing the '3' passed to getNArrayS; 3 in this case is the number of words in each field. Now we can test our index variations and easily change them depending on what query and index we want to test.
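Once the insert finishes, a quick sanity check confirms the volume and shape of the data:

```sql
-- How many rows did we generate?
SELECT count(*) FROM notes;

-- Peek at a few of them
SELECT * FROM notes LIMIT 3;
```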
Testing Before The Index
We need to test our query now and see how fast it is, so we can take note and compare later on after we create our index. Unfortunately, as far as I know, Postgres doesn't have a way of forcing the query planner to use (or not use) a specific index, so we need to test first and compare the times later.
Run
\timing
Now every query will show the time it took to run.
First let's test a simple ILIKE. We also need to choose one word from our word list, of course; I chose asimov.
SELECT * FROM notes WHERE title ilike '%asi%';
Time: 615.921 ms
with an AND
SELECT * FROM notes WHERE title ILIKE '%asi%' AND description ILIKE '%gar%';
Time: 561.811 ms
with an OR
SELECT * FROM notes WHERE title ILIKE '%asi%' OR description ILIKE '%gar%';
Time: 1027.195 ms (00:01.027)
concatenating the two fields
SELECT * FROM notes WHERE coalesce(title, '') || ' ' || coalesce(description, '') ILIKE '%gar%';
Time: 810.602 ms
Ok, now let's create the index and compare the time afterwards for each query.
Creating The Index And Testing
Let's first create the trigram extension; more about this later.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
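The extension ships with a helper function, show_trgm, that shows exactly which trigrams it extracts from a value, which is handy for building an intuition about what gets indexed:

```sql
-- show_trgm comes with pg_trgm; it returns every trigram for the input,
-- including padded trigrams for the word boundaries
SELECT show_trgm('asimov');
```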
Now let's create an index just on the title field using the trigram extension
CREATE INDEX notes_title_idx ON notes USING gin (title gin_trgm_ops);
Searching for short fragments of words is very common, so Postgres has a trigram extension that speeds up exactly that: pg_trgm splits each value into trigrams (all the three-letter sequences it contains) and indexes those, which lets pattern searches like ILIKE '%asi%' use the index instead of scanning the whole table. Now let's test our index with the same query that we used before.
SELECT * FROM notes WHERE title ilike '%asi%';
Time: 211.155 ms
Almost three times faster, not bad. We can use EXPLAIN to see how the query planner is using our index
EXPLAIN SELECT * FROM notes WHERE title ilike '%asi%';
QUERY PLAN
------------------------------------------------------------------------------------
Bitmap Heap Scan on notes (cost=200.58..11682.82 rows=20204 width=58)
Recheck Cond: (title ~~* '%asi%'::text)
-> Bitmap Index Scan on notes_title_idx (cost=0.00..195.53 rows=20204 width=0)
Index Cond: (title ~~* '%asi%'::text)
We can see that the query is using our index notes_title_idx. Subsequent runs will return even faster results since the index is already in memory; the second run was 51.841 ms, more than 10x faster. Without the index, a second run wouldn't have made any difference.
The second query that we want to test has an AND using the description field. Let's first run the query without a description index to see what happens.
SELECT * FROM notes WHERE title ILIKE '%asi%' AND description ILIKE '%gar%';
Time: 51.215 ms
Even though we don't have an index on description, our results are already 10x faster. That's because the title condition is almost instant now, and the AND condition on description doesn't need to scan the whole table, just the result set of the title index. Before creating the index on the description field, let's run the query with the OR clause to see how it compares to the AND query
SELECT * FROM notes WHERE title ILIKE '%asi%' OR description ILIKE '%gar%';
Time: 1059 ms (00:01.059)
The same time as before. What happened? With an OR, rows that match only the description condition also have to be returned, and since description has no index yet, the whole table has to be scanned anyway, so the planner decides not to use the title index. If we run with an EXPLAIN we can see that
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..20295.83 rows=20302 width=58)
Workers Planned: 2
-> Parallel Seq Scan on notes (cost=0.00..17265.62 rows=8459 width=58)
Filter: ((title ~~* '%asi%'::text) OR (description ~~* '%gar%'::text))
The query is not using notes_title_idx because it would be even slower in this case: the index can only answer the title half of the OR, so Postgres would still have to scan the whole table for the description matches, and the index read would just be extra work on top of that.
Now let's create the description index and see how it compares
CREATE INDEX notes_description_idx ON notes USING gin (description gin_trgm_ops);
After running the OR query again we have a much better result, Time: 141.310 ms, almost 10x faster again. We can use EXPLAIN to see that both indexes are being used now
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on notes (cost=222.43..11745.76 rows=20302 width=58)
Recheck Cond: ((title ~~* '%asi%'::text) OR (description ~~* '%gar%'::text))
-> BitmapOr (cost=222.43..222.43 rows=20304 width=0)
-> Bitmap Index Scan on notes_title_idx (cost=0.00..195.53 rows=20204 width=0)
Index Cond: (title ~~* '%asi%'::text)
-> Bitmap Index Scan on notes_description_idx (cost=0.00..16.75 rows=100 width=0)
Index Cond: (description ~~* '%gar%'::text)
The last query that we need to improve is the one where we concatenate the two fields, title and description, and search them. We could use a multicolumn index on both fields together, or one index on each field like we did in the previous example. Let's do both and see which is faster.
To test the two separate indexes we just need to run the query again, since we already have two indexes, one on each field
SELECT * FROM notes WHERE coalesce(title, '') || ' ' || coalesce(description, '') ILIKE '%gar%';
Time: 849.965 ms
With the coalesce, the query planner doesn't use any of our indexes. Now let's create a multicolumn index on both fields and see what happens
CREATE INDEX notes_title_description_idx ON notes USING gin (title gin_trgm_ops, description gin_trgm_ops);
Running the query again we get the same result. To speed up a query like this, it's much better to either change the query to use an OR, or to use an expression index.
SELECT * FROM notes WHERE title ILIKE '%gar%' OR description ILIKE '%gar%';
With an OR we get 126 ms, and the query planner is using the multicolumn index notes_title_description_idx that we just created. The results of using a multicolumn index are better, but not by a lot, and depending on how many fields you have in your table the index won't even be used. Now let's try to create a concatenated expression index over the two fields that mimics the query we tried.
Expression Indexes
CREATE INDEX notes_coales_title_description_idx ON notes USING gin ((coalesce(title, '') || ' ' || coalesce(description, '')) gin_trgm_ops);
We need the extra parentheses around the expression. Now let's run the query again
SELECT * FROM notes WHERE coalesce(title, '') || ' ' || coalesce(description, '') ILIKE '%gar%';
Time: 103.706 ms
10x faster again. Postgres is smart enough to use our new index notes_coales_title_description_idx, so this was faster than using the two separate indexes, one on each field, with an OR, and faster than the multicolumn index with an OR.
The most important thing we can take from these tests is that it's hard to predict what Postgres will do, so you should always test your indexes with the exact query that you're going to use. Try different approaches to figure out what's fastest for your query, and don't be afraid of creating different indexes for different queries. Using indexes makes your queries a LOT faster, 10x faster for all the queries we tried here, so you should definitely use them.
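One caveat when keeping several indexes per table is that they all consume disk space and slow down writes. You can keep an eye on their sizes with a query like this sketch, which uses the standard pg_indexes view:

```sql
-- List every index on notes with its on-disk size
SELECT indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'notes';
```

If an index never shows up in your EXPLAIN plans, dropping it saves both space and write overhead.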