SQL for the Potterhead: Left Joins

shethiakamya

kamya

Posted on February 18, 2020

SQL for the Potterhead: Left Joins

The Fat Lady's portrait is updating her knowledge to make sure only authorized students have access to the Gryffindor dorm. Can we write a query that will inform us of all the students and pets that are allowed in the Gryffindor dormitory?

Let's break down the requirements. We need:

  1. All the students who are allowed in the Gryffindor dormitory
  2. Any pets that belong to students from Gryffindor

Let's start by using a set diagram to understand what data we need:
Venn diagram for join

It looks like we need all the data in the left hand side table, along with any common data. Thinking about the join's available to us, this sounds a lot like a LEFT JOIN. Let's write some SQL.

1.Start by extracting the data we need from the wizards table. We want to filter the data to the students who are in the Gryffindor house. We can do that with a WHERE clause.


postgres=# `SELECT * FROM wizard WHERE house = 'Gryffindor'` 

id |        name        |   house    
----|--------------------|------------
  1 | Neville Longbottom | Gryffindor
  2 | Ronald Weasley     | Gryffindor
  3 | Harry Potter       | Gryffindor
  5 | Seamus Finnigan    | Gryffindor
  6 | Hermione Granger   | Gryffindor

The query returned every student in the table except Malfoy, who is in Slytherin.

2.Now, let's write a JOIN to combine this data with that of the pet table:


postgres=#
SELECT * FROM wizard LEFT JOIN pet ON wizard.id = pet.owner_id WHERE house = 'Gryffindor'; 
 id |        name        |   house    | id |    name     | species | owner_id 
----+--------------------+------------+----+-------------+---------+----------
  1 | Neville Longbottom | Gryffindor |  2 | Trevor      | toad    |        1
  2 | Ronald Weasley     | Gryffindor |  1 | Scabbers    | rat     |        2
  3 | Harry Potter       | Gryffindor |  3 | Hedwig      | owl     |        3
  6 | Hermione Granger   | Gryffindor |  4 | Crookshanks | cat     |        6
  5 | Seamus Finnigan    | Gryffindor |    |             |         |         
(5 rows)

I see all the student's in Gryffindor, and all the appropriate pets. But, we have too many columns.

3.Let's filter down to only select the data we need.


postgres= #
SELECT wizard.name, pet.name FROM wizard LEFT JOIN pet ON wizard.id = pet.owner_id WHERE house = 'Gryffindor';
        name        |    name     
--------------------+-------------
 Neville Longbottom | Trevor
 Ronald Weasley     | Scabbers
 Harry Potter       | Hedwig
 Hermione Granger   | Crookshanks
 Seamus Finnigan    | 
(5 rows)

This looks good, let's step through this query.

  1. SELECT wizard.name, pet.name: The SELECT clause specifies the fields we want to see in our results. We're only interested in the wizard's name, and the pet's name
  2. FROM wizard : The first table, or the table on the left for our join
  3. LEFT JOIN pet : the LEFT JOIN clause
  4. ON wizard.id = pet.owner_id : We specify what we are joining ON. This is generally a field present in both tables, that has a foreign-key-esque relationship
  5. WHERE house = 'Gryffindor'; : We filter down our results to only include rows where the wizard's house is in Gryffindor. We don't need to specify wizard.house, because the pet table does not include that column.

Our table has very few rows, so we probably won't run into any performance issues. However, I generally like looking at the query planner and look at the execution plan. We can use postgres's EXPLAIN ANALYZE clause for this.

postgres=#
EXPLAIN ANALYZE SELECT wizard.name, pet.name FROM wizard
LEFT JOIN pet ON wizard.id = pet.owner_id WHERE wizard.house = 'Gryffindor';
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 1. Hash Right Join  (cost=14.90..29.05 rows=3 width=236) (actual time=0.045..0.056 rows=5 loops=1)
 2.   Hash Cond: (pet.owner_id = wizard.id)
 3.   ->  Seq Scan on pet  (cost=0.00..13.00 rows=300 width=122) (actual time=0.006..0.008 rows=6 loops=1)
 4.   ->  Hash  (cost=14.88..14.88 rows=2 width=122) (actual time=0.030..0.030 rows=5 loops=1)
 5.         Buckets: 1024  Batches: 1  Memory Usage: 9kB
 6.         ->  Seq Scan on wizard  (cost=0.00..14.88 rows=2 width=122) (actual time=0.020..0.023 rows=5 loops=1)
 7.              Filter: ((house)::text = 'Gryffindor'::text)
 8.              Rows Removed by Filter: 1
 Planning Time: 0.122 ms
 Execution Time: 0.091 ms
(10 rows)

In general, the most indented piece of the EXPLAIN output is executed first. Towards the bottom of the output, we see that we performed a scan on the wizards table, and filtered out all rows whose owners were not in Gryffindor. This means that we won't be scanning and joining unnecessary data.

You might see the Hash Right Join at the start output, and be confused about the discrepency between our SQL syntax, and how the query planner wants to operate.

The answer can be found in Postgres's documentation.

hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

This doesn't impact the data we will see, just how we will get it. Lines 1-4 specify the type of join Postgres is running (a hash join), and lists how the query planner will access the data. There is timing information in the output, which we will ignore for now.


  1. View this on github
  2. 1. Postgres docs for planner optimizer
💖 💪 🙅 🚩
shethiakamya
kamya

Posted on February 18, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

What was your win this week?
weeklyretro What was your win this week?

November 29, 2024

Where GitOps Meets ClickOps
devops Where GitOps Meets ClickOps

November 29, 2024

How to Use KitOps with MLflow
beginners How to Use KitOps with MLflow

November 29, 2024