Daniel Fitzpatrick
Posted on March 14, 2022
SQL might be an unparalleled success story of narcissistic design, but it's a robust apparatus. Consequently, you've probably brushed against some rough edges.
The problem
Have you ever noticed that UNION
breaks DRY principles?
SELECT x.foo, x.bar, '' as baz FROM x
UNION
SELECT '' as foo, '' as bar, y.baz FROM y;
The code above requires more references to foo
, bar
, and baz
than necessary. Adding additional queries to the union is an exercise in masochism because this unnecessary code duplication increases exponentially.
If each query introduces just a single column, and you have 5 of them, that's 20 times need to repeat yourself! We can use HoneySQL to put this unruly child in timeout. 1
TL;DR
If you just want the code and don't want to follow along, I have uploaded a gist which includes a solution for the problem mentioned at the end of this post.
Spread some honey...sql
Translating the previous query into HoneySQL makes a good beginning.
(require '[honeysql.helpers :as hh]
'[honeysql.core :as hsql])
{:union
[(-> (hh/select :foo :bar ["" :baz])
(hh/from :x))
(-> (hh/select ["" :foo] ["" :bar] :baz)
(hh/from :y))]}
;=>
{:union
[{:select (:foo :bar ["" :baz]), :from (:x)}
{:select (["" :foo] ["" :bar] :baz), :from (:y)}]}
Where is the SQL? You can view it from the repl with (hsql/format *1)
.
HoneySQL displayed its first big win in the previous output. Did you spot it? It emits regular Clojure data structures, which we can easily manipulate with the standard library. It also allows us to describe our desired output succinctly.
{:union [{:select [:foo]}
{:select [:bar]}]}
;=>
{:union [{:select [:foo [nil :bar]]}
{:select [[nil :foo] :bar]}]}
As in the example above, we will supply nil
as the value for "back-filled" columns.
Columns in HoneySQL will commonly look like one of three patterns.
:column
:table.column
[:processing-stuff :column]
A function that parses out the column name we need for back-filling would look like this.
(defn get-column-name [col]
(cond
(vector? col) (second col)
(and (keyword? col)
(str/includes? (name col) ".")) (-> col name
(str/split #"\.")
second keyword)
(keyword? col) col))
A quick test proves that this works.
(get-column-name :baz)
;=> :baz
(get-column-name :bar.baz)
;=>:baz
(get-column-name [:foo.bar :baz])
;=>:baz
Next, we should contrive some way to get an ordered collection of all our column names. That will simplify everything else. Ironically this will require another set theory UNION, but we want to preserve element order.
(defn ordered-union
[cmp coll1 coll2]
(loop [xs [] coll1 coll1 coll2 coll2]
(cond
(and (empty? coll1) (empty? coll2)) xs
(empty? coll1) (vec (concat xs coll2))
(empty? coll2) (vec (concat xs coll1))
(cmp (first coll1) (first coll2)) (recur (conj xs (first coll2))
(rest coll1) (rest coll2))
:else (recur (conj xs (first coll1)) (rest coll1) coll2))))
Has it been a while since you last saw loop/recur?
The function above takes a comparator cmp
and two collections. It assumes that coll1
"comes before" coll2
- an assumption made visible in the last branch. The first three branches are obvious exit conditions. The fourth branch ensures that collisions are not duplicated in the resultset, thus enforcing the set theory aspect of our function.
We can employ the usual tricks to support more than two inputs. In this case, it's just a reduction over the inputs. Because ordered-union
is generally helpful outside of the scope of this article, feel free to add a [cmp coll1 coll2 & colls]
arity and repurpose.
We will apply this trick to get our complete ordered list of columns.
(defn get-all-columns-in-union [m]
(reduce
(fn [a b]
(ordered-union = a (map get-column-name b)))
[] (map :select (:union m))))
We now have something worth testing.
(get-all-columns-in-union
{:union [{:select [["first" :key] :a :b :c]}
{:select [["second" :key] :d :e :f]}
{:select [["third" :key] :g :h :i]}]})
;=> [:key :a :b :c :d :e :f :g :h :i]
A crucial insight is that we can use ordered-union
to rewrite the SELECT clauses. Check this out.
(defn format-nil-column [column] [nil column])
(ordered-union
#(= (get-column-name %1) (get-column-name %2))
(mapv format-nil-column [:key :a :b :c :d :e :f :g :h :i])
[["first" :key] :a :b :c])
;=> [["first" :key] :a :b :c [nil :d] [nil :e] [nil :f] [nil :g] [nil :h] [nil :i]]
(ordered-union
#(= (get-column-name %1) (get-column-name %2))
(mapv format-nil-column [:key :a :b :c :d :e :f :g :h :i])
[["second" :key] :d :e :f])
;=> [["second" :key] [nil :a] [nil :b] [nil :c] :d :e :f [nil :g] [nil :h] [nil :i]]
(ordered-union
#(= (get-column-name %1) (get-column-name %2))
(mapv format-nil-column [:key :a :b :c :d :e :f :g :h :i])
[["third" :key] :g :h :i])
;=> [["third" :key] [nil :a] [nil :b] [nil :c] [nil :d] [nil :e] [nil :f] :g :h :i]
I plugged our previous three SELECT statements into ordered-union
, and it returned what we wanted in each case.
Put it all together
We are now in the red zone. So, finally, let's wrap things up.
(defn add-columns [nil-columns provided-columns]
(ordered-union
#(= (get-column-name %1) (get-column-name %2))
nil-columns provided-columns))
(defn normalize-union
([m]
(let [cols (get-all-columns-in-union m)]
(update m :union (partial normalize-union cols))))
([columns queries]
(let [nil-columns (mapv format-nil-column columns)]
(mapv #(update % :select (partial add-columns nil-columns)) queries))))
To test
(normalize-union
{:union [{:select [["first" :key] :a :b :c]}
{:select [["second" :key] :d :e :f]}
{:select [["third" :key] :g :h :i]}]})
{:union
[{:select
[["first" :key]
:a
:b
:c
[nil :d]
[nil :e]
[nil :f]
[nil :g]
[nil :h]
[nil :i]]}
{:select
[["second" :key]
[nil :a]
[nil :b]
[nil :c]
:d
:e
:f
[nil :g]
[nil :h]
[nil :i]]}
{:select
[["third" :key]
[nil :a]
[nil :b]
[nil :c]
[nil :d]
[nil :e]
[nil :f]
:g
:h
:i]}]}
The devil in the details
It's time for me to confess a problem that I have been hesitant to mention.
Types.
Postgres will treat nil
as type TEXT for a UNION with more than two queries, and it forces us to care about this because column foo
must have a matching type across every row in the resultset.
I will offer some clues about resolving this but won't go into an in-depth solution. The solution is less work than you probably think. It would be best to start by allowing normalize-union
to accept an optional type-map
. It should look like this.
{:thing1 :integer
:thing2 :uuid
:thing3 :timestamp}
If you pass that map into format-nil-column
, then you can leverage honeysql.core/call
to force nil
to be the type you want. Something like this should do the trick.
(honeysql.core/call :cast nil (get type-map column))
Do it right, and you might even be able to provide limited support for wildcards. I hope this has been a fun trip down the dark side of SQL.
-
I was reluctant to embrace HoneySQL when someone first introduced me, but its core feature has won me over with time: eschewing cleverness. For example, other ORM users talk about "fighting the query generator" to win their desired performance target - a story foreign to my user experience. ↩
Posted on March 14, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.