Millions record query - with and without index

docvominh

Pham Duc Minh

Posted on March 24, 2023

Millions record query - with and without index

Just found out the movie page imdb database on imdb
It's a large dataset that contains Movie information use PosgresSQL

Quickly import the name_basics table, curious about what it looks like

image

A simple table with People's names and some other information

Count the table return 12.382.531 records, but what makes me surprise is table has no primary key or index

Wow, it will take a long time to search...

Right away try to search for my favorite actor Jean Reno, think about taking a coffee in the waiting time (find one in 12 million)

SELECT *
    FROM name_basics
WHERE "primaryName" = 'Jean Reno'
  AND "primaryProfession" = 'actor';
Enter fullscreen mode Exit fullscreen mode

The result is another surprise

1 row retrieved starting from 1 in 923 ms (execution: 908 ms, fetching: 15 ms)

Is it because the table is simple? Or postgres has some cache?
Query a non-index text field in 12 million just take 1 second

What do we need index for? This is a really good performance

But I also want to see how performance go with index, so I add a simple index on primaryName field

create index "primaryName_index" on name_basics ("primaryName");
Enter fullscreen mode Exit fullscreen mode

And re-try the select query

1 row retrieved starting from 1 in 34 ms (execution: 6 ms, fetching: 28 ms)

image

It's a huge different 900ms vs 6ms (for SQL execute), 150 times faster without index

Conclusion

The Index is magic! but it is not free, when query for index size

SELECT i.relname "Table Name",indexrelname "Index Name",
 pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
 pg_size_pretty(pg_relation_size(relid)) as "Table Size",
 pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
 reltuples::bigint "Estimated table row count"
 FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
 WHERE i.relname='name_basics'
Enter fullscreen mode Exit fullscreen mode

image

It takes ~ 400MB storage (a single-column index)
and also memory when processing.

Anyway, the performance it brings is no doubt.

[UPDATE]
After some research, I found some tips about performance

What is a good latency for API?
Generally, APIs that are considered high-performing have an average response time between 0.1 and one second. At this speed, end users will likely not experience any interruption. At around one to two seconds, users begin to notice some delay.

So take one second only for query database is a need to improve

My machine: Apple M1, Postgres 11 on docker

💖 💪 🙅 🚩
docvominh
Pham Duc Minh

Posted on March 24, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related