Sadeq Dousti
Posted on December 14, 2023
Today at PostgreSQL Conference Europe 2023, some key people addressed the future of PostgreSQL:
- Building for the Future and Betting Big on Postgres by Kevin Dallas, CEO of EDB.
- All the Postgres Things at Microsoft by Charles Feddersen, Director of Product Management - Postgres on Azure.
They both talked about how influential AI is, and how PostgreSQL can help. One major hot topic is natively storing and processing vector data, which AI heavily depends on.
There's nothing wrong with this approach; in fact, it's quite a welcome change. However, I see a gap here. While all the talk is about how PostgreSQL can help AI, the other way around is equally important: how AI can help PostgreSQL.
The rest of this post entertains that idea. I'll give one example that is so basic it can hardly be deemed AI, but I hope you get the idea and develop it further!
Interlude: The highest-scored question on StackOverflow
Many years ago, I thought of looking at the most-voted question on StackOverflow:
This question was, and still is (as of this writing), at the top of the list. I read through the Q&A and learned a lot from it. I'll summarize it below, but make sure to check it out.
Consider this C snippet:
```c
unsigned long sumConditionally(unsigned char data[], int arraySize) {
    unsigned long sum = 0;
    for (int i = 0; i < arraySize; i++)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}
```
It traverses the whole `data` array and sums up the entries that are 128 or greater.
Does it matter if I shuffle `data`? From the functionality standpoint, no: addition is commutative, so order does not matter. What about the performance standpoint? We expect no change in performance either: regardless of the order, the loop repeats the same number of times, the comparison is performed once per iteration, and the number of entries for which the comparison holds is independent of the order of `data`. So the performance must be the same, right?
Wrong! The original poster observed a 6x improvement if the data is sorted. How on earth is that possible? Well, now you get why it's the highest-scored question.
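If you'd like to reproduce the effect, here's a minimal benchmark sketch. The array size, repetition count, and `clock()`-based timing are my own choices for illustration (they're not from the original question), and the effect is compiler-dependent: at higher optimization levels, the compiler may replace the branch with a conditional move and the gap disappears.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    32768
#define REPS 1000

unsigned long sumConditionally(unsigned char data[], int arraySize) {
    unsigned long sum = 0;
    for (int i = 0; i < arraySize; i++)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}

static int cmpBytes(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double timeLoop(unsigned char data[]) {
    volatile unsigned long sink = 0;  /* keeps the loop from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        sink += sumConditionally(data, N);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    static unsigned char data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;  /* random order: branch outcome is unpredictable */

    double shuffled = timeLoop(data);
    qsort(data, N, 1, cmpBytes); /* sorted order: branch outcome becomes predictable */
    double sorted = timeLoop(data);

    printf("shuffled: %.3f s, sorted: %.3f s\n", shuffled, sorted);
    return 0;
}
```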
The reasoning is quite simple: CPUs have a lot of optimizations in place. One such optimization is called "branch prediction", which essentially boils down to predicting which branch of an `if` statement is going to be taken. When the array is sorted, the CPU's predictions turn out to be correct most of the time: initially, the CPU makes a random guess. If the guess is correct, it continues guessing the same; otherwise, it changes its guess. As the image below depicts, for a sorted array, the guess is correct most of the time, and the CPU has to change it only once. (Of course, I'm oversimplifying!)
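Real predictors are far more sophisticated (two-bit saturating counters, branch history tables, and so on), but even a toy simulation of the "keep the guess until it fails" rule makes the gap visible. The sketch below is my own illustration, not a model of any particular CPU:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 32768

/* A 1-bit predictor for the branch "data[i] >= 128":
   predict whatever the branch did last time, and flip the guess on a miss. */
static long countMispredictions(const unsigned char data[], int n) {
    int prediction = 0;  /* initial guess: branch not taken */
    long misses = 0;
    for (int i = 0; i < n; i++) {
        int taken = (data[i] >= 128);
        if (taken != prediction) {
            misses++;
            prediction = taken;  /* change the guess only after a miss */
        }
    }
    return misses;
}

static int cmpBytes(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

int main(void) {
    static unsigned char data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    printf("shuffled: %ld misses\n", countMispredictions(data, N)); /* about N/2 */
    qsort(data, N, 1, cmpBytes);
    printf("sorted:   %ld misses\n", countMispredictions(data, N)); /* just 1 */
    return 0;
}
```

On the shuffled array, the predictor is wrong about half the time; on the sorted array, it is wrong exactly once, which is the "change the guess only once" behavior described above.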
This is reinforcement learning in its most basic form (though it's so basic that some may argue it's not an AI technique, since the learning algorithm is prescribed).
Me: Define reinforcement learning like I'm 5.
ChatGPT: Reinforcement learning is like teaching a computer how to do something by giving it rewards when it does things right and punishments when it does things wrong. Just like how you learn to ride a bike better by getting a treat when you balance well and a small scold when you fall. Over time, the computer gets better at the task because it wants to get more rewards and avoid punishments. It's like a game where the computer tries to figure out the best way to do something by trial and error, learning from its mistakes.
What does this have to do with PostgreSQL?
The PostgreSQL optimizer uses a cost-based model to determine the least costly plan for executing a query. To do this, it relies on some basic statistics about the data. If those statistics are not good enough, or the required statistics are missing, PostgreSQL often makes a bad guess, and thus picks a suboptimal plan.
The issue, IMHO, is that PostgreSQL never corrects its guess. Run the same query 1000 times (assuming the underlying data and statistics do not change), and it always results in the same plan. For instance, the planner may guess that the number of returned rows is 6 and therefore apply a nested loop join, while during execution it actually fetches 1200 rows. That is awful for the prescribed plan: a nested loop join re-scans the inner side once per outer row, so a 200x underestimate means roughly 200x more work than the planner budgeted for.
To see what I mean, read Franck Pachot's post, where he details the issue and suggests using extended statistics or a PostgreSQL extension to give the planner hints.
That kept me thinking for a while, and I believe a more futuristic approach is to equip the PostgreSQL planner with AI. In the most basic scenario, it could use something similar to the CPU's branch prediction algorithm: record the guesses, compare them against the actual rows fetched, and improve the guesses over time. There are also many more advanced machine learning algorithms that could be used here; it's just a matter of picking one efficient enough that the planning time does not skyrocket and the implementation does not bloat the PostgreSQL core. Alternatively, it could be added as a PostgreSQL extension.
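To make this concrete, here's a minimal sketch of such a feedback loop. Everything in it is hypothetical and invented for illustration (the `QueryFeedback` struct, the `learn()` function, the exponential moving average); it is not how the PostgreSQL planner works today, and a real implementation would have to key the feedback on plan nodes, handle parameterized queries, and much more.

```c
#include <stdio.h>

/* Hypothetical per-query feedback record: not part of PostgreSQL,
   just a sketch of the "branch prediction" idea for row estimates. */
typedef struct {
    double correction;  /* learned multiplier applied to the planner's estimate */
} QueryFeedback;

/* Planner side: scale the cost-model estimate by what we've learned so far. */
static double predict_rows(const QueryFeedback *fb, double planner_estimate) {
    return planner_estimate * fb->correction;
}

/* Executor side: after the query runs, nudge the correction factor toward
   the observed ratio (exponential moving average; alpha is the learning rate). */
static void learn(QueryFeedback *fb, double planner_estimate, double actual_rows) {
    const double alpha = 0.5;
    double observed = actual_rows / planner_estimate;
    fb->correction = (1 - alpha) * fb->correction + alpha * observed;
}

int main(void) {
    QueryFeedback fb = { .correction = 1.0 };
    double estimate = 6.0;    /* what the cost model keeps guessing */
    double actual = 1200.0;   /* what execution keeps observing */

    for (int run = 1; run <= 5; run++) {
        double predicted = predict_rows(&fb, estimate);
        printf("run %d: predicted %.0f rows (actual %.0f)\n", run, predicted, actual);
        learn(&fb, estimate, actual);
    }
    return 0;
}
```

Within a few runs, the prediction converges toward the observed 1200 rows, just as the CPU's predictor converges on a sorted array; at that point, the planner would have the information it needs to prefer, say, a hash join.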
I'm sure using AI inside PostgreSQL is not a new suggestion, and if I had searched, I would have found many others detailing it. However, I felt like writing about it myself 🙂