Antonin J. (they/them)
Posted on February 1, 2020
I love SQL. It's one of the more powerful and beautiful languages in my opinion. It's been around for quite a while and I'm still learning new things. Ironically, my SQL course in college was the only useful CS course I've ever taken and the knowledge I gained there I still use today.
To fast forward, what was the problem? Eh...I can't talk specifics so I'll make up a story that is pretty close to reality.
The Premise
Let's imagine that you're writing a task manager and you just dumped 1 000 tasks into one of your projects. 1 000 tasks is a lot and unmanageable so you decide to split it into groups of 100. It'd be pretty easy to just write a Node script that connects to the database...but oh wait, that means you need to connect to the db somehow. And you don't want to create a new project, committing code to the existing codebase feels dirty, and your UI can't handle dealing with 1 000 tasks simultaneously.
But...but here's PgAdmin staring at you, already connected to the database. Can I do this with SQL? you ask yourself and because it's 4pm, you want to head home, and SQL is your favorite language, you do it.
Note: In the real world, there were 100 000 "tasks" so yeah, I can't imagine any non-paging UI handling 100 000 tasks like it's easy peasy.
Also! This was a collaborative effort with another engineer so when I say "I had to write it", it was actually me and my colleague :) Thanks, buddy!
The Final Query
DO $FN$
DECLARE
project_id integer
BEGIN
FOR counter IN 1..10 LOOP
RAISE NOTICE 'Counter: %', counter;
EXECUTE $$ INSERT INTO projects(name) VALUES ('Group ' || $1) RETURNING id $$
INTO project_id
USING counter;
EXECUTE $$ UPDATE tasks
SET project_id = $1
WHERE tasks.id IN (
SELECT tasks.id
FROM tasks
INNER JOIN projects ON projects.id = tasks.project_id
WHERE projects.name = $2
ORDER BY tasks.id ASC
LIMIT 100) $$
USING project_id, 'Client project';
RAISE NOTICE 'Finished Counter: %', counter;
END LOOP;
END;$FN$
Explanation
I don't want to go through the query line-by-line but I'd like to definitely explain the various "blocks" of the code so it makes more sense and you can see how SQL works.
Note that a lot of this is postgres-specific meaning that MySQL, MSSQL, or other SQL databases may not have access to parts of this syntax.
The DO $FN$
block
DO $FN$
-- code goes here
$FN$
Essentially, this is a "do" block like any other. The documentation for the DO block compares it to a self-executing anonymous function that returns void and takes no arguments.
The $FN$
is what's called a dollar-quoted string constant. In Postgres, there are several types of quotes:
- A single quote (
'string goes here'
) which means the contents are a string - A double quote (
"column_name"
) which lets us quote a problematic column or table name and use it safely - The dollar-quoted string (
$$ SELECT * FROM projects WHERE name = 'test'; $$
) which is just like a single quote string except you can use the other types of quotes within
What if you want to nest multiple quotes within each other? You can always escape them with \
BUT Postgres is amazing so it lets you designate custom tags within those dollars and nest as much as you want to:
$outer_quote$
$inner_quote$
$$ SELECT * FROM projects where name = 'test' $$
$inner_quote$
$outer_quote$
You don't have to declare these custom tags anywhere, they just work. Simply shove a word between the dollar signs.
Declare variables
The next block (within our dollar-quoted custom tag string) is the declare block:
-- dollar quote start
DECLARE
project_id integer;
-- rest of the code
This lets us declare variables that we'll want to use in the rest of the function. And yes, you have to declare them beforehand and yes, declare the type. I couldn't find the specific documentation for this statement (a DECLARE
statement outside of a DO
block has to do with cursors which is not what we're doing here).
Function body
DECLARE
-- variables
BEGIN
-- more code
END;
Essentially, the rest of the code up till this point was us preparing to run some code. We told Postgres we want to execute something via the DO
block and then we declared the variables we want to use via DECLARE
.
Everything that goes between BEGIN
and END
runs regular SQL statements.
For loop
FOR counter IN 1..10 LOOP
-- more code
END LOOP;
Yeah, you can do for loops in SQL! :) There is extensive loop and control documentation that actually lists this example!
The counter
is a variable that gets updated and we get access to inside the loop. The 1..10
creates the range (loop 1 through 10).
Note A seasoned Postgres and SQL veteran will tell me that doing the 1-10 math is unnecessary because...and get this...you can substitute 1..10
with a SQL query, creating a "for each" type of loop. eg. FOR task IN SELECT * FROM tasks LOOP
would loop through every task in the database.
Raise notice!
RAISE NOTICE 'Counter: %', counter;
Quick note about this, it's a log. That's what it is. Doing RAISE NOTICE
won't throw any errors. It works like a formatted string where %
will be substitute with the variable after the string. Just like in other languages, you can pass in multiple %
signs and then pass multiple variables after the end of the string to substitute.
It's not necessary, I just like seeing the logs :)
EXECUTE...INTO...USING (creating a new project)
EXECUTE $$ INSERT INTO projects(name) VALUES ('Group ' || $1) RETURNING id $$
INTO project_id
USING counter;
This statement will use the counter
variable to create a brand new project with the name Group
followed by the counter number and store the resulting project id in the project_id
variable.
I'll break this statement down a little more so as not to overwhelm. It was overwhelming for me at first and in fact, the original query was way hackier so I could avoid understanding the EXECUTE
statement in its entirety.
The EXECUTE
statement will execute any string it's given. We already know that $$
can be used in lieu of quotes. So INSERT INTO .... RETURNING id
is a string of SQL we want to run.
A couple of things here:
- The
$1
is for substitution (more on that when I cover theUSING
part). In our example, we use||
operator which is a string concatenation operator in Postgres. - the
RETURNING id
is a Postgres thing. You can insert a record (or bulk insert many records) and get theid
back right away. You can even do aRETURNING *
to get the entire new record.
The INTO
statement will store the result of the query in a variable we declared earlier. I didn't know this and kept trying to do project_id := EXECUTE $$ statement $$
which does not work :(. In our case, we store the returning id into project_id
.
The USING
statement is used for substitution. You can pass in a list of variables here. The $1
will be substituted by the first variable/value you provide the USING
statement. The second variable would be accessible via $2
and so on.
...phew.
The documentation for EXECUTE...USING calls this operation executing a dynamic statement. What's confusing is that...this is different from the EXECUTE
command...which executes prepared statements.
move up to 100 tasks to the new project
EXECUTE $$ UPDATE tasks
SET project_id = $1
WHERE tasks.id IN (
SELECT tasks.id
FROM documents
INNER JOIN projects ON projects.id = tasks.project_id
WHERE projects.name = $2
ORDER BY tasks.id ASC
LIMIT 100) $$
USING project_id, 'Client project';
I'm glad I got to explain the EXECUTE...INTO...USING
statement with the previous query because this one is way hairier.
First, let's look at the simplified query:
UPDATE tasks SET project_id = $1 WHERE tasks.id IN
(-- subquery)
We set the new project id to the project_id
we got in the previous EXECUTE statement on the tasks with IDs from the result is of the subquery.
Next the subquery:
SELECT tasks.id
FROM documents
INNER JOIN projects ON projects.id = tasks.project_id
WHERE projects.name = $2
ORDER BY tasks.id ASC
LIMIT 100
In the subquery, we look up a 100 tasks belonging to that one massive project which needs breaking up. We get the next 100 tasks in that project. We use $2
which gets substituted by the project name of the massive original project. We sort by tasks.id ASC
so that the tasks don't get thrown out of order during the move.
Lastly, the execute statement that wraps it all:
EXECUTE $$
-- query
$$
USING project_id, 'Client project';
As before, we execute a statement but this time we don't store the results of the query. We substitute 2 different variables which are accessible via $1
and $2
within the query.
Another raise notice
RAISE NOTICE 'Finished Counter: %', counter;
Same as before, this will let us know that a loop is done. I just like logs.
Do I need to know this?
Oh god no. Unless you're in the DB day in and day out, there's no reason for anyone to know this off the top of their head but it's good to know what's possible. I apply that philosophy when learning or using any technology. I knew I could do all of this with SQL because of my past experiences with SQL, I just needed some time with the Postgres documentation to learn how to make this exact task happen. And I think that kind of knowledge is super valuable.
Posted on February 1, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.