Optimising Database Performance: The Pitfalls of ‘SELECT * FROM’ Queries

Introduction
Database systems are the cornerstone of nearly every application, especially those handling substantial volumes of data. They store a myriad of user activities and details, encompassing everything from personal information to highly sensitive data such as credit card information, bookings, purchases, medical records, and many more. In today’s digital landscape, the ubiquity of database systems is undeniable, with many applications adopting polyglot persistence for enhanced functionality and scalability.
Within the domain of database querying, the “SELECT * FROM” statement is frequently used to retrieve all columns from a table. Despite its apparent simplicity, this approach can often introduce significant performance challenges, as we will soon explore. But before delving into these intricacies, let’s look at the underlying mechanisms of database systems and their storage engines.

Peeking Inside
Database systems organise data records into tables, collections, or relations, with each attribute or field belonging to a column typically storing data of the same type. For instance, in a table named “students”, all student matriculation numbers would be stored within a single column, sharing the same data type. A collection of values spread across multiple columns, yet logically connected to describe a real-world object or action, is referred to as a row. Rows are typically identifiable by a unique key or set of keys.
Databases can be categorised based on how they store data on disk, leading to two primary classifications: row-oriented databases and column-oriented databases.
Data within databases are stored in units known as pages, which are typically of fixed size. A page can keep either entire rows of records (characteristic of row-oriented databases) or whole columns of a table (typical of column-oriented databases). Row-oriented databases, such as Postgres and MySQL, are commonly used for Online Transaction Processing (OLTP). In contrast, column-oriented databases, like MonetDB, and Redshift are used for Online Analytical Processing (OLAP) tasks.

General Problems
Using “SELECT * FROM” queries effectively removes the database’s ability to optimise the query, as it cannot extract meaningful information from a query that retrieves all columns. Consequently, the database cannot leverage query optimisers to enhance performance, resulting in a full table scan. This holds irrespective of whether indexes are present on the table. For tables with many rows, this entails reading each page, decoding it, loading it into memory, and searching for the required data. This process continues until the query locates the desired data or examines all records in the table.
On-disk data is stored in binary format within pages, necessitating the database to interpret this binary data. Deciphering binary data is known as deserialisation or decoding (distinct from decryption). When executing a “SELECT * FROM” query, the database engine selects all columns of a specific row, leading to the decoding of the page where the row resides. This decoding process is CPU-intensive and may involve decoding large volumes of unnecessary data. Since a page can contain multiple rows, this can further escalate the volume of data that needs decoding into a usable format for the application.
Additionally, network transmission poses another challenge. The server and the database are often not located on the same network or server, potentially even in different geographical locations. They communicate over a network using protocols such as HTTP/S or TCP-based protocols. Upon decoding data from the disk, the database must encode it into a format compatible with the communication protocol for transmission between the database and the server. This encoding process incurs further CPU overhead. The larger the volume of data generated during encoding, the more segments or packets are transferred over the network. This can lead to various networking issues, including packet loss necessitating retransmission, network bandwidth consumption, and network congestion.
Upon reaching the server, the transmitted data requires decoding, which poses yet another CPU-intensive task. Furthermore, it unnecessarily consumes system memory, as not all the data is needed by the application but remains in memory.

Conclusion
In conclusion, executing a “SELECT * FROM” query entails numerous intricate processes, making it imperative to exercise caution and select only the fields essential for your task to avoid unnecessary overhead. While it’s true that for tables with few columns and simple data types, the impact of a “SELECT * FROM” query might be minimal, it remains a best practice to be judicious in selecting columns retrieved in queries.

The process involves decoding binary data, navigating through pages on disk, and transmitting data over networks, each contributing to CPU-intensive operations and potential network congestion. By being selective about the columns retrieved, unnecessary processing and data transmission can be minimised, leading to improved performance and resource utilisation.

While “SELECT * FROM” queries may seem convenient, their potential performance implications warrant thoughtful consideration. Prioritising selectivity in querying optimises database performance and contributes to overall system efficiency.

References
Database Internals: A Deep Dive into How Distributed Data Systems Work by Alex Petrov 1st Edition

7 Reasons Why Using SELECT * FROM TABLE in SQL Query Is a Bad Idea

Reasons why SELECT * is bad for SQL performance

Why SELECT * in Postgres is a bad choice

Blog

Optimising Database Performance: The Pitfalls of ‘SELECT * FROM’ Queries

Amadu Jallow

Join Our Newsletter. No Spam, Only the good stuff.

Related