SQL for the Potterhead: Left Joins
kamya
Posted on February 18, 2020
The Fat Lady's portrait is updating her knowledge to make sure only authorized students have access to the Gryffindor dorm. Can we write a query that will inform us of all the students and pets that are allowed in the Gryffindor dormitory?
Let's break down the requirements. We need:
- All the students who are allowed in the Gryffindor dormitory
- Any pets that belong to students from Gryffindor
Let's start by using a set diagram to understand what data we need:
It looks like we need all the data in the left hand side table, along with any common data. Thinking about the join's available to us, this sounds a lot like a LEFT JOIN
. Let's write some SQL.
1.Start by extracting the data we need from the wizards
table. We want to filter the data to the students who are in the Gryffindor house. We can do that with a WHERE
clause.
postgres=# `SELECT * FROM wizard WHERE house = 'Gryffindor'`
id | name | house
----|--------------------|------------
1 | Neville Longbottom | Gryffindor
2 | Ronald Weasley | Gryffindor
3 | Harry Potter | Gryffindor
5 | Seamus Finnigan | Gryffindor
6 | Hermione Granger | Gryffindor
The query returned every student in the table except Malfoy, who is in Slytherin.
2.Now, let's write a JOIN to combine this data with that of the pet
table:
postgres=#
SELECT * FROM wizard LEFT JOIN pet ON wizard.id = pet.owner_id WHERE house = 'Gryffindor';
id | name | house | id | name | species | owner_id
----+--------------------+------------+----+-------------+---------+----------
1 | Neville Longbottom | Gryffindor | 2 | Trevor | toad | 1
2 | Ronald Weasley | Gryffindor | 1 | Scabbers | rat | 2
3 | Harry Potter | Gryffindor | 3 | Hedwig | owl | 3
6 | Hermione Granger | Gryffindor | 4 | Crookshanks | cat | 6
5 | Seamus Finnigan | Gryffindor | | | |
(5 rows)
I see all the student's in Gryffindor, and all the appropriate pets. But, we have too many columns.
3.Let's filter down to only select the data we need.
postgres= #
SELECT wizard.name, pet.name FROM wizard LEFT JOIN pet ON wizard.id = pet.owner_id WHERE house = 'Gryffindor';
name | name
--------------------+-------------
Neville Longbottom | Trevor
Ronald Weasley | Scabbers
Harry Potter | Hedwig
Hermione Granger | Crookshanks
Seamus Finnigan |
(5 rows)
This looks good, let's step through this query.
-
SELECT wizard.name, pet.name
: The SELECT clause specifies the fields we want to see in our results. We're only interested in the wizard's name, and the pet's name -
FROM wizard
: The first table, or the table on the left for our join -
LEFT JOIN pet
: the LEFT JOIN clause -
ON wizard.id = pet.owner_id
: We specify what we are joining ON. This is generally a field present in both tables, that has a foreign-key-esque relationship -
WHERE house = 'Gryffindor';
: We filter down our results to only include rows where the wizard's house is in Gryffindor. We don't need to specifywizard.house
, because thepet
table does not include that column.
Our table has very few rows, so we probably won't run into any performance issues. However, I generally like looking at the query planner and look at the execution plan. We can use postgres's EXPLAIN ANALYZE
clause for this.
postgres=#
EXPLAIN ANALYZE SELECT wizard.name, pet.name FROM wizard
LEFT JOIN pet ON wizard.id = pet.owner_id WHERE wizard.house = 'Gryffindor';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
1. Hash Right Join (cost=14.90..29.05 rows=3 width=236) (actual time=0.045..0.056 rows=5 loops=1)
2. Hash Cond: (pet.owner_id = wizard.id)
3. -> Seq Scan on pet (cost=0.00..13.00 rows=300 width=122) (actual time=0.006..0.008 rows=6 loops=1)
4. -> Hash (cost=14.88..14.88 rows=2 width=122) (actual time=0.030..0.030 rows=5 loops=1)
5. Buckets: 1024 Batches: 1 Memory Usage: 9kB
6. -> Seq Scan on wizard (cost=0.00..14.88 rows=2 width=122) (actual time=0.020..0.023 rows=5 loops=1)
7. Filter: ((house)::text = 'Gryffindor'::text)
8. Rows Removed by Filter: 1
Planning Time: 0.122 ms
Execution Time: 0.091 ms
(10 rows)
In general, the most indented piece of the EXPLAIN
output is executed first. Towards the bottom of the output, we see that we performed a scan on the wizards table, and filtered out all rows whose owners were not in Gryffindor. This means that we won't be scanning and joining unnecessary data.
You might see the Hash Right Join
at the start output, and be confused about the discrepency between our SQL syntax, and how the query planner wants to operate.
The answer can be found in Postgres's documentation.
hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.
This doesn't impact the data we will see, just how we will get it. Lines 1-4 specify the type of join Postgres is running (a hash join), and lists how the query planner will access the data. There is timing information in the output, which we will ignore for now.
- View this on github
- 1. Postgres docs for planner optimizer
Posted on February 18, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.