Database indexes: how good are them?

caue_moyano

Caue Moyano

Posted on June 28, 2024

Database indexes: how good are them?

What are indexes

Database indexes play an essential role in terms of the performance of a relational database. Imagine trying to find a chapter of a book without looking at the table of contents - it would be an arduous task. Thank goodness we have a table of content that we can use and get what we want. Similarly, indexes act as shortcuts, allowing faster and more efficient data retrieval, avoiding the need to search the entire table for a value. But like everything in life, indexes have pros and cons. In this article, we will go over the positive and negative sides of database indexes, breaking down complex concepts into easy-to-understand examples. By the end, you'll have a clear understanding of when and how to use indexes to optimize your database performance.

Making your queries fast

Indexes are like shortcuts in your database, or in other words, a map to help you find what you're looking for. They provide a structured and organized way to locate data efficiently and quickly. There are many variations of indexes, the most common being B-tree, hash index, and composite, among others. In this article, we will focus on the B-tree index, which is very common in relational databases such as MySQL and PostgreSQL.

Let's see indexes in action!
The examples in this article were created in the DB Fiddle online editor using PostgreSQL v15.

First, let's create a simple table with four columns: SaleID, CustomerID, SaleDate, and Amount.

CREATE TABLE Sales (
    SaleID SERIAL PRIMARY KEY,
    CustomerID INT,
    SaleDate DATE,
    Amount DECIMAL
);
Enter fullscreen mode Exit fullscreen mode

Next, we will insert 1 million records into this table to simulate a table with a significant amount of data.

INSERT INTO Sales (CustomerID, SaleDate, Amount)
SELECT
    (RANDOM() * 1000)::INT, -- Random number between 0 and 999
    CURRENT_DATE - INTERVAL '1 day' * (RANDOM() * 365)::INT, -- Random date within the past 365 days
(RANDOM() * 1000)::DECIMAL -- Random decimal between 0 and 999.999
FROM
    generate_series(1, 1000000); -- Generates a set of numbers from 1 to 1,000,000

Enter fullscreen mode Exit fullscreen mode

Now, let's look for all sales that have CustomerID = 500.

EXPLAIN ANALYZE SELECT * FROM Sales WHERE CustomerID = 500;
Enter fullscreen mode Exit fullscreen mode

Let's analyze the query execution plan and find out how long it would take to execute.

Amongst other information, we have the key metric we are interested in:

...
Execution Time: 132.857 ms
Enter fullscreen mode Exit fullscreen mode

Now, let's add an index to the CustomerID column and redo the analysis.

CREATE INDEX idx_customerid ON Sales (CustomerID);
Enter fullscreen mode Exit fullscreen mode

Re-run the query:

EXPLAIN ANALYZE SELECT * FROM Sales WHERE CustomerID = 500;
Enter fullscreen mode Exit fullscreen mode

This time, the execution plan reveals:

...
Execution Time: 1.105 ms
Enter fullscreen mode Exit fullscreen mode

As we can see, the time for the second query dropped drastically from ~132ms to ~1.1ms! What a difference the index made to search performance.

Use with moderation

Although indexes can speed up data reading, they can negatively impact write operations. Most of the time, this shouldn't be a problem as long as the number of indexes on a table is small and manageable. However, let's explore how performance degrades when we create too many indexes and then perform insert, update, or delete operations.

Using the same example as before let's insert 10000 rows into the database and observe the performance:

EXPLAIN ANALYZE INSERT INTO Sales (CustomerID, SaleDate, Amount)
SELECT
    (RANDOM() * 1000)::INT,
    CURRENT_DATE - INTERVAL '1 day' * (RANDOM() * 365)::INT,
    (RANDOM() * 1000)::DECIMAL
FROM
    generate_series(1, 10000); -- Insert 10,000 rows
Enter fullscreen mode Exit fullscreen mode

This operation takes approximately 99ms to insert 10,000 rows into the table.

Now let's add a couple more of indexes to the Sales table on the SaleDate and Amount columns.

CREATE INDEX idx_saledate ON Sales (SaleDate);
CREATE INDEX idx_amount ON Sales (Amount);
Enter fullscreen mode Exit fullscreen mode

Let's insert another 10,000 rows into the table and see how long it takes to perform the operation now:

EXPLAIN ANALYZE INSERT INTO Sales (CustomerID, SaleDate, Amount)
SELECT
    (RANDOM() * 1000)::INT,
    CURRENT_DATE - INTERVAL '1 day' * (RANDOM() * 365)::INT,
    (RANDOM() * 1000)::DECIMAL
FROM
    generate_series(1, 10000); -- Insert 10,000 rows
Enter fullscreen mode Exit fullscreen mode

This operation now takes approximately 301ms to insert 10,000 rows into the table.

The insertion time has tripled just because we added two more indexes. Imagine the impact of multiple indexes on many columns in a table! Slower write operations are just one of the negative impacts caused by having too many indexes. Another important factor is storage space, which can increase significantly.

Wrapping up

Why does this happen? Well, indexes are a sort of maps that are created in memory or on disk, helping us locate the data we're looking for quickly during read operations. Every time we write, we also have to update these maps to reflect the current state of our data. This can involve rewriting a lot of data and updating many maps, leading to the performance decrease we observed when having many indexes. This also explain the extra storage usage.

Indexes should be used strategically. Here are some tips for using them:

  • Frequently Queried Columns: Index columns that are queried most often through the SELECT statement, primarily in the WHERE clause.
  • JOIN Operations: Index columns participating in the JOIN operation. This boosts the performance of queries because the desired rows can be located quickly to join the rows.
  • Unique Constraints: Unique constraints within columns can be well enforced with the help of the indexes. Data integrity is maintained along with faster access.
  • Sorting and Grouping: Index those columns used most often in ORDER BY or GROUP BY clauses. This would help to retrieve sorted data or results by grouping faster.

However, it's better not to index columns with:

  • Low-Selectivity Columns: Low-selectivity columns are not unique and, therefore, do not benefit much from indexing. A perfect example is a column that contains only a few distinct values, like "yes" or "no".
  • Frequent Updates: Unnecessary and frequent updating of indexes on columns can lead to massive index maintenance costs. If a column is frequently updated, its corresponding indexes must be updated as frequently. This is costly and will slow write operations down.
  • Small Tables: Indexes on small tables may not provide significant performance improvements. The overhead of maintaining the index might outweigh the benefits.

By strategically using indexes, we can optimize our database for both read and write operations, maintaining a balance that ensures efficient data retrieval without sacrificing performance on data modifications.

💖 💪 🙅 🚩
caue_moyano
Caue Moyano

Posted on June 28, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related