This is sort of a continuation of this

SQL Subqueries in action

Chidiebere Ogujeiofor ・ Dec 13 '19 ・ 8 min read

You can generate the data I used in this blogpost from this GitHub Gist.

In that post, I talked about subqueries and how they work. How you can use them to answer complex questions. In this post, I would extend on that and talk about Common Table Expression(CTEs).

What are they

According to PostgreSQL Tutorial:

A common table expression is a temporary result set which you can reference within another SQL statement including SELECT, INSERT, UPDATE or DELETE. Common Table Expressions are temporary in the sense that they only exist during the execution of the query.

Once created the CTE can then be treated as a table thus you can perform JOINs, put them in a FROM, etc.

Compared to nested subqueries, CTEs have the following advantages:

Since they store data temporarily, they can improve performance than subqueries
They are more readable than nested subqueries

Single CTE in a query syntax

CTEs are created using the WITH keyword and are immediately followed by the main query. Below is an overall syntax for CTEs:

WITH <cte-name> AS (
   <the CTE query> 

)
<Final query>

For example, in the post on subqueries, we answered the question

In which courses did the student with an ID of 9 scores above his overall average? What was his score in those courses?

We can answer that question, using CTEs as following:

WITH student_average AS (

   SELECT ROUND(AVG(score),2) AS avg FROM results
     WHERE student_id=9
)

SELECT course, score, student_average.avg as average
    FROM results
JOIN student_average
    on TRUE
WHERE student_id=9 AND score > student_average.avg

ORDER BY course;

course	score	average
Biology	84	82.08
Computer Science	93	82.08
History	84	82.08
Physics	85	82.08
Statistics	85	82.08

With similar results as when we used subqueries in the previous post.

Multiple CTEs

It is possible to declare multiple CTEs such that one CTE may or may not call other CTEs. Again this technique helps make queries faster and readable as parts of the query are named.

Below is the overall structure of using multiple CTEs:

WITH cte1 AS (
   <SOME-sql-statement>
),

cte2 AS (
   <some-sql-that-may-or-may-not-use-cte1>
),

.
.
.,
cteN AS (
   <some-sql-that-may-or-may-not-previous-ctes>

)
<main/final-sql-statement>
;

For example, say we want to answer the following question:

What was the class average and which students' average was higher than the class average? What was each of those students' average?

We can answer that question by retrieving a table with student_ids, their average, the class average and filter only when the student_average is more than the class average. We can do this using the following query:

WITH 

each_student_avg  AS (
 SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
        FROM results 
        GROUP BY results.student_id

), 
class_average AS (
 SELECT ROUND(AVG(score),2) as class_avg
        FROM results 
)

SELECT student_id, student_overall_avg, cls.class_avg
FROM each_student_avg AS esa
JOIN class_average  AS cls
    ON TRUE
WHERE student_overall_avg > class_avg  
;

This gives the following result:

student_id	student_overall_avg	class_avg
4	65.62	65.46
18	66.23	65.46
8	66.92	65.46
13	68.62	65.46
16	68.69	65.46
2	69.62	65.46
5	69.62	65.46
10	70.38	65.46
19	71.77	65.46
17	72.00	65.46
7	75.77	65.46
14	75.85	65.46
9	82.08	65.46
24	86.00	65.46

With this ResultSet we can see that the class average is 65.46 and the ids of the students whose average is above the class average.

After each CTE is declared (class_average and each_student_avg), their result is stored in memory. Thus, the calls to class_average and each_student_avg in the main query performs operations on the ResultSet stored in each of the CTE and not the whole results table.

Readability powers

Without CTEs here's how the previous query would look like:

SELECT student_id, student_overall_avg, cls.class_avg
FROM (
 SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
        FROM results 
        GROUP BY results.student_id

) AS esa
JOIN (
 SELECT ROUND(AVG(score),2) as class_avg
        FROM results 
)  AS cls
    ON TRUE
WHERE student_overall_avg > class_avg;

You can agree that the query with CTEs improves the readability of the query.

A more complex question

CTEs are particularly handy in answering questions that are very complex and involve a lot of aggregation/reshaping of the data.

For example, say our school has a policy of placing students on Probation or Withdrawing their admission based on poor performance.

When a student gets below 60 in any course in a particular semester and they are not on probation from the previous term, then they are placed on probation for the next semester.

Also, if they get below 60 in any course and they were on probation from the previous term then their admission is withdrawn. In summary:

When a student gets below 60 in any course in the current semester and he got below 60 in any course in the previous semester then the admission of such a student should be Withdrawn

When a student gets below 60 in any course in the current semester and passed all his courses in the previous semester then such a student should be placed on probation for the next semester

Assuming that Semester 2 started from '2019-08-01' and ended in '2019-10-30' while Semester 1 started from '2019-05-01' and ended in '2019-07-30':

Which students that should be placed on placed on probation and which should have their admission withdrawn at the end of Semester 2? Which courses did they fail? What was their score in those courses and when did they take them?

We can answer the above question using a table that has student ID, a status that can be either 'Withdrawal' or 'Probation' and a column that contains all the courses the student got below 60 in both semesters, the date the course was taken and their score in a comma-separated list. Something like:

student_id	status	course_info
{id-1}	Probation	{score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN}
{id-2}	Withdrawal	{score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN}
.	.	.
.	.	.
.	.	.

Using CTEs, we begin with the WITH keyword:

WITH

Then we get all those that scored below 60 in semester 1 like so

semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 

),

Then we get all those that scored below 60 in semester 2 like so


semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 

),

We can now get all those due for withdrawals as those who failed courses in both semester 1 and 2

withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
    SELECT s1.student_id
    FROM semester_1_lt_60 as s1
    JOIN semester_2_lt_60 AS s2
        ON s1.student_id = s2.student_id

),

Next, we can retrieve our Probation candidates as those that scored below 60 in semester 2 that are not in the withdrawal CTE:


probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in  semester_2_lt_60 that are not in withdrawals 
    SELECT student_id
        FROM semester_2_lt_60
    WHERE student_id NOT IN
    (
        SELECT student_id 
        FROM withdrawals
    )
),

Now we create our final CTE which joins both Probation and Withdrawal CTEs:


id_with_status AS (
--Adds retrieves both data with the student status for the term
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM withdrawals as w
    FULL JOIN probation as p
        ON p.student_id = w.student_id 
    ORDER BY status

)

Finally, our main query simply queries the results table JOINs id_with_status and performs a string aggregation that returns all the courses that each of the students failed in both semesters with the name of the course, their score and the date the course ended


SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN id_with_status AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;

Here is how it looks all together


WITH 


semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 

),

semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 

),

withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
    SELECT s1.student_id
    FROM semester_1_lt_60 as s1
    JOIN semester_2_lt_60 AS s2
        ON s1.student_id = s2.student_id

),

probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in  semester_2_lt_60 that are not in withdrawals 
    SELECT student_id
        FROM semester_2_lt_60
    WHERE student_id NOT IN
    (
        SELECT student_id 
        FROM withdrawals
    )
),

id_with_status AS (
--Adds retrieves both data with the student status for the term
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM withdrawals as w
    FULL JOIN probation as p
        ON p.student_id = w.student_id 
    ORDER BY status

)

SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN id_with_status AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;

Below is the final result:

student_id	status	course_info
6	Probation	53 Statistics 2019-10-05
13	Probation	58 Statistics 2019-10-05
15	Probation	57 Statistics 2019-10-05
18	Probation	0 History 2019-10-05
20	Probation	40 Chemistry 2019-10-05,54 Statistics 2019-10-05,0 History 2019-10-05
23	Probation	36 Chemistry 2019-10-05,57 Statistics 2019-10-05
1	Withdrawal	0 Chemistry 2019-10-05,0 Mathematics 2019-05-09,14 Statistics 2019-10-05
12	Withdrawal	52 Statistics 2019-10-05,40 Computer Science 2019-05-09,44 Commerce 2019-05-09,47 Mathematics 2019-05-09
21	Withdrawal	0 Chemistry 2019-10-05,35 Computer Science 2019-05-09,48 Statistics 2019-10-05
22	Withdrawal	55 Commerce 2019-05-09,49 Statistics 2019-10-05,41 Chemistry 2019-10-05

Looking at the SQL query above you can easily see how we are storing data in memory and re-using them in subsequent CTEs and the main query thus breaking down the problem into smaller bits until we have enough info to generate/return our final ResultSet. This greatly improves performance as opposed to running through the table several times.

The query is also quite readable as we are naming parts of the query while using the CTE names.

I would like to make one last attempt to show you how readable CTEs are by showing you the nested subquery equivalent( Brace yourself, comrades!⛈⚡️🎇):



SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN (
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM (
        SELECT s1.student_id
        FROM (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 
        ) as s1
        JOIN (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
        ) AS s2
            ON s1.student_id = s2.student_id

    ) as w
    FULL JOIN (
        SELECT student_id
        FROM (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 
        ) AS lt_60_in_semester_1
        WHERE student_id NOT IN
        (
            SELECT student_id 
            FROM (
                SELECT s1.student_id
                FROM (
                    SELECT student_id, course_end_date,score FROM results
                    WHERE  
                        score < 60 AND 
                        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 
                ) as s1
                JOIN (
                    SELECT student_id, course_end_date,score FROM results
                    WHERE  
                        score < 60 AND 
                        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 
                ) AS s2
                    ON s1.student_id = s2.student_id
            ) AS  withdrawal
        )

     )as p
        ON p.student_id = w.student_id 
    ORDER BY status
) AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;

I believe you now get the point! 😉

Conclusion

In this blogpost:

we have looked at CTEs what they are and how they work
we have seen that CTEs are more readable than nested subqueries
we have also seen that CTEs can improve performance.
finally, that CTEs can be used to efficiently answer complex questions

I would like to conclude guys by urging you to look at the query that used CTEs to answer the question in the Complex Question section. How would you make it more efficient?