Essential Database Optimizations Before Re-Architecting Your Backend
Nir Tayeb
Posted on February 23, 2024
In the last post, I covered six ways ORMs make it easy to write poorly performant code. But that is only part of the story.
Ensuring your applications run efficiently is essential. Before diving into complex solutions like caching, microservices, or rewriting code in different languages, it’s vital to solidify the fundamentals. By focusing on six key pillars, you can significantly boost your application’s performance, achieving up to an 80–90% improvement. Let’s explore these often-overlooked foundational techniques that can dramatically enhance the performance of your database-driven applications.
Side notes:
- I may refer to RDBMS and SQL in the post, but these principles apply to NoSQL systems and beyond.
- This post was also published in my newsletter, where the six pillars are covered in more depth.
Let's start.
Database version
New major versions ship with new features and often include performance improvements. This is especially true for PostgreSQL.
A benchmark by EDB shows a 47% performance improvement between PostgreSQL v9.6 and v15, and a 10.4% improvement from v12 to v15. V16 adds some further minor improvements; you can read more in this post by benchant.com.
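Before planning an upgrade, it helps to confirm which version you are actually running. Here is a minimal sketch using psycopg2; the connection string is a placeholder for your own environment:

```python
import psycopg2

# Placeholder connection parameters; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app_user host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 16.1 on x86_64-pc-linux-gnu ..."
```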
Hardware
The database is the heart of your application. If it doesn't have enough computing resources, your whole application ends up waiting for those resources to become available.
CPU
For a small-to-medium workload, start with a minimum of 4 cores for parallelism of your workloads, connection concurrency, and background jobs performed by the DB (e.g., journaling). Scale up with these considerations in mind:
- Workload Characteristics: CPU-intensive workloads (e.g., complex calculations and heavy transactions) require more cores for better performance.
- Concurrency: Higher concurrent connections and transactions can increase CPU requirements.
- Database Size and Complexity: Larger databases or those with complex schemas and queries may benefit from additional CPU resources to handle the increased processing demands.
Memory
RAM impacts the database's ability to cache data, work with indexes, and process queries efficiently, significantly affecting overall performance. For small-to-medium workloads, provision at least 4 GB of RAM and scale up based on these considerations (a rough sizing sketch follows the list):
- Database Working Set Size: Ideally, your server should have enough RAM to hold the frequently accessed ("hot") portion of your database (tables and indexes) in memory to reduce disk I/O.
- Connection Overhead: PostgreSQL uses a multi-process architecture, where each connection has its dedicated process. More connections imply more RAM for connection overhead. As a rough estimate, plan for about 10 MB (or more) per connection.
- Workload Type: Read-heavy workloads benefit from more RAM for caching, reducing read operations from disk. Write-heavy workloads may benefit less from caching but require sufficient RAM for sorting and transaction logs.
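To make the sizing advice concrete, here is a rough back-of-the-envelope calculation. All the numbers (working-set size, connection count) are illustrative assumptions, not measurements:

```python
# Rough RAM estimate based on the considerations above (illustrative numbers only).
hot_data_gb = 6            # frequently accessed tables + indexes (the "working set")
max_connections = 200      # expected concurrent connections
per_connection_mb = 10     # rough per-connection overhead mentioned above

connection_overhead_gb = max_connections * per_connection_mb / 1024
total_gb = hot_data_gb + connection_overhead_gb

print(f"~{connection_overhead_gb:.1f} GB for connections, "
      f"~{total_gb:.1f} GB total before OS and maintenance overhead")
```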
Storage
The storage size varies from one application to another, but what is important here is the type of storage and its IO performance:
- Storage Type: SSDs should be your default choice.
- Provisioned IOPS: Consider using "provisioned IOPS" storage for high-performance workloads to ensure consistent and fast disk I/O.
- Storage Auto-scaling: Enable storage auto-scaling so capacity adjusts automatically as your database grows.
Scaling up your system usually costs more money. Monitor your current resource usage closely and look for ways to reduce it through query and structure optimizations (RAM/CPU/storage) before adding more resources.
Network proximity / Geography
It's best to keep your database and application server as close to each other as possible, network-wise, so they can communicate efficiently. One way to do this is to provision them in the same region of your cloud provider and place them within the same virtual private network. It's also a good idea to use private IP addresses for communication between them rather than public IPs, which reduces the number of network hops and speeds up data transfer.
Connections Management
Opening a connection to the database is an expensive operation. It includes DNS resolution, the TCP handshake, the TLS handshake, authentication, and setting up internal resources such as session objects and watchdogs. So be deliberate about opening connections to the DB and reuse them as much as possible.
The way to achieve this is connection pooling: the pool maintains a set of open connections and opens new ones only up to a configured maximum, so your application code almost always has a connection ready to use.
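As a sketch, here is what connection pooling looks like with SQLAlchemy; the connection URL, pool sizes, and the users table are placeholders:

```python
from sqlalchemy import create_engine, text

# Keep up to 5 persistent connections, allowing 10 extra under burst load.
# The URL is a placeholder for your own database.
engine = create_engine(
    "postgresql+psycopg2://app_user:secret@db.internal/app",
    pool_size=5,
    max_overflow=10,
    pool_pre_ping=True,   # transparently replace connections that have gone stale
)

# Each block borrows a connection from the pool and returns it when done,
# instead of paying the full connect/handshake cost on every request.
with engine.connect() as conn:
    rows = conn.execute(text("SELECT id, name FROM users LIMIT 10")).fetchall()
```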
Number of queries in a single request/transaction
Whether or not you use connection pooling, your database/pool has a limit on the maximum number of open connections. Keep an eye on the number of queries per request and their processing time, so you never reach the point where your code waits for a connection to become available instead of running your business logic.
What should you look for?
- The N+1 problem
- Querying too much unnecessary data:
  - Filtering fetched data in code instead of in your queries.
  - Aggregating fetched data in code instead of in a query (count/sum/average/min/max/etc.).
  - Fetching fields without using them (watch out for big text fields or tables with many columns).
- Multiple separate queries where a single join would do (see the sketch after this list).
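Here is a minimal sketch of the N+1 anti-pattern and the single join-and-aggregate query that replaces it, assuming hypothetical users and orders tables and a placeholder DSN:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

# N+1 anti-pattern: one query for the users, then one extra query per user.
cur.execute("SELECT id FROM users")
order_counts = {}
for (user_id,) in cur.fetchall():
    cur.execute("SELECT count(*) FROM orders WHERE user_id = %s", (user_id,))
    order_counts[user_id] = cur.fetchone()[0]

# Better: one query that joins and aggregates in the database.
cur.execute("""
    SELECT u.id, count(o.id)
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
""")
order_counts = dict(cur.fetchall())
```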
Optimize query processing time
Fine-tune condition evaluation with indexes
Imagine you have a users table containing ID, name, and birth date, and your app has a search feature that finds all the users whose birthday is today. Without an index, the database scans the table row by row and filters the records that match. Now imagine this table has 1M records.
Indexes are like a sidecar to your tables: a "table of contents" kept in various data structures, so when you query data, the records can be reached without scanning the entire table. The index holds the data of one or more of your columns, but instead of a plain list it uses structures such as trees or hash tables, organized so lookups are faster. A full table scan is O(n), while an index scan is usually O(log n) or O(1), depending on the index type.
So now, instead of scanning the entire table to find users with a birthday, the database checks the index, gets pointers to the matching records (or, more accurately, to the exact pages in the filesystem), and then retrieves the data.
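As a minimal sketch (assuming a users table with a birth_date column of type date and a placeholder DSN), creating a plain B-tree index and checking the plan with EXPLAIN looks like this; matching only the month and day of the birthday would need an expression index instead:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

# Without an index, filtering on birth_date forces a sequential scan of all rows.
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_birth_date ON users (birth_date);")
conn.commit()

# Ask the planner how it will execute the query; with the index in place
# it should report an index scan instead of "Seq Scan on users".
cur.execute("EXPLAIN SELECT id, name FROM users WHERE birth_date = %s", ("1990-02-23",))
for (line,) in cur.fetchall():
    print(line)
```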
To understand why indexing is crucial, what to index, and what kind of indexes are available, you can read here.
Fine-tune data fetching with Partitioning
If you deal with big data and need more than indexing, partitioning the data lets you query only the relevant portion. For example, if you partition the data by year, a query for 2024 touches only the 2024 partition; the 2023 data is never read from storage unless you explicitly ask for it (each partition behaves like a separate storage unit). If most of your queries filter on 2024, this dramatically improves query performance.
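Here is a minimal sketch of declarative range partitioning by year in PostgreSQL; the events table, its columns, and the DSN are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

# One logical table, physically split into one partition per year.
cur.execute("""
    CREATE TABLE events (
        id         bigserial,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2023 PARTITION OF events
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
    CREATE TABLE events_2024 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
""")
conn.commit()

# Thanks to partition pruning, this query only touches the events_2024 partition.
cur.execute("SELECT count(*) FROM events WHERE created_at >= '2024-01-01'")
print(cur.fetchone()[0])
```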
Partitioning has maintenance overhead, and it can lead to a spike in lock-manager waits that hurts query performance: many small tables, each with multiple indexes and partitions, can generate a large number of locks per query, causing contention problems.
Based on my experience, focusing on these six pillars can improve application performance by 80-90%, even before implementing additional solutions such as caching, rewriting specific parts of the application in more performant languages, splitting into microservices, or using NoSQL.