Stephen Margheim
Posted on September 12, 2023
One of the reasons people hesitate to use SQLite in their Ruby on Rails applications, in my opinion, is a fear that they will miss certain features they are accustomed to from PostgeSQL or MySQL. As discussed in an earlier post, we can load SQLite extensions into our Rails applications to enhance the functionality of SQLite. Moreover, today I want to show you that it is possible to build on top of SQLite's primitives to provide matching behavior for one of my favorite features of Postgres—array columns.
When working with relational data, you often come across one particular dilemma: what do I do with this small amount of associated data? Do I build a whole new table with a foreign key, which keeps my schema highly normalized but also means I have to accept a JOIN
everytime I need to access this data. Do I simply stuff the data into a JSON
column on my primary table, which removes the need for the JOIN
but also bloats my primary table and opens up a possibility for stuffing unstructured data into that column.
Postgres offers a nice compromise here with their implementation of array columns. Instead of an amorphous JSON blob, your column is and will always simply be an array of values. This matches the effective behavior of a simple two-column associated table (foreign key plus value column), without the need for the JOIN
.
For my favorite example of the utility of this tool, consider Nate Hopkin's implementation of a tagging system, built on top of Postgres' array columns. To save you a click and demonstrate just how elegant this solution is, I will provide the code examples from the README
here:
# db/migrate/TIMESTAMP_add_groups_to_user.rb
class AddGroupsToUser < ActiveRecord::Migration[5.0]
def change
add_column :users, :groups, :string, array: true, default: [], null: false
add_index :users, :groups, using: "gin"
end
end
# app/models/user.rb
class User < ApplicationRecord
include TagColumns
tag_columns :groups
end
user = User.find(1)
# assigning tags
user.groups << :reader
user.groups << :writer
user.save
# checking tags
is_writer = user.has_group?(:writer)
is_reader_or_writer = user.has_any_groups?(:reader, :writer)
is_reader_and_writer = user.has_all_groups?(:reader, :writer)
# finding tagged records
assigned = User.with_groups
unassigned = User.without_groups
writers = User.with_any_groups(:writer)
non_writers = User.without_any_groups(:writer)
readers_or_writers = User.with_any_groups(:reader, :writer)
readers_and_writers = User.with_all_groups(:reader, :writer)
non_readers_and_writers = User.without_all_groups(:reader, :writer)
# find unique tags across all users
User.unique_groups
# find unique tags for users with the last name 'Smith'
User.unique_groups(last_name: "Smith")
With one array column on your model, you get a full suite of the core "tagging-style" functionality. I love solutions like this. The entire gem is no more than one file that defines the TagColumns
concern, and that file is only 105 lines (89 lines of code). Elegance and simplicity are what SQLite is all about, so, how can we accomplish the same result without an array column primitive?
Let's start with how we can add a column to a table that can only be an array. SQLite does support a wide variety of JSON functionality. It also supports standard column check constraints. This gives us everything we need. We will define a JSON
column and then add a constraint to ensure that the column is only an array
JSON type. With Rails' migration DSL, if you are creating the column as you create the table it looks like:
create_table :posts, force: true do |t|
t.json :tags, null: false, default: []
t.check_constraint "JSON_TYPE(tags) = 'array'", name: 'post_tags_is_array'
end
If you are simply adding the column to an existing table, the migration would look like:
add_column :posts, :tags, :json, default: [], null: false
add_check_constraint "JSON_TYPE(tags) = 'array'", name: 'post_tags_is_array'
Note: SQLite does not support
GIN
indexes. In order to provide an index for aJSON
column, the recommended pattern in SQLite is to define a generated column and then index that column. This blog post provides a good overview of the approach. Unfortunately for us Rails developers, theActiveRecord
adapter for SQLite doesn't yet support generated attributes, so we would have to drop down to running raw SQL. Support for Postgres generated columns was recently added, and I plan to open a similar pull request for the SQLite adapter in the near future. For the time being, therefore, I will not dig into indexing an "array" column in our SQLite database. Since SQLite doesn't need to eat the network latency cost of a query, even unindexed queries can be sufficiently fast. However, of course, we would prefer to be able to ensure our SQLite implementation of "array columns" can be indexed. Once I have improved Rails' support, I will write a new post detailing how to work with SQLite generated columns and indexing them.
This gives us a JSON
column that will only ever be an array of values. Without a schema setup, let's turn to the "taggable" functionality that we want to support. tag_columns
supports 11 methods:1
Model.unique_column_name()
Model.column_name_cloud()
Model.with_column_name()
Model.without_column_name()
Model.with_any_column_name(*items)
Model.with_all_column_name(*items)
Model.without_any_column_name(*items)
Model.without_all_column_name(*items)
model.has_any_column_name(*items)
model.has_all_column_name(*items)
model.has_column_name(*items)
We need SQL queries to back each one, and then the ActiveRecord method calls to generate those queries. As I don't want to derail this post with the process of coming up with each query and ActiveRecord method, I will summarize. At the heart of our implementation is the use of the JSON_EACH
function that SQLite provides, which will treat each value in our array as if it were a row in a table. Each virtual row will have a value
column that you can select from. So, to get the unique set of values for the tags
column in our example posts
table, we simply need this SQL query:
SELECT DISTINCT value
FROM "posts",
JSON_EACH("posts"."tags");
Readable and succinct. Similarly, in order to find those posts
that are tagged with draft
, we could use this query:
SELECT "posts".*
FROM "posts"
WHERE EXISTS (
SELECT 1
FROM JSON_EACH("posts"."tags")
WHERE value IN ('draft')
LIMIT 1
)
This query gets more complicated. In order to find all of the posts
with the tag, we need to isolate our query for selecting the posts and all of their attributes from the query to check for whether the tag is present or not. This is a perfect use-case for a nested query. Our inner query does a check for whether or not the specific tag is present.2 We use SELECT 1
because we only need to return a boolean for the WHERE EXISTS
check in the outer query; we use LIMIT 1
to optimize the inner query a bit, as we only need to know if draft
is present in the tags
array at least once, we don't care about duplicates. This shape of a query will drive all of the any_*
methods.
In order to support the all_*
methods, we need a query that ensures that returned posts
included each of the provided values. For example, Post.with_all_tags('draft', 'sqlite')
must only return those posts
who are tagged with both the draft
tag and the sqlite
tag; any post only tagged with draft
is ignored. Here is the shape for that query:
SELECT "posts".*
FROM "posts"
WHERE (
SELECT COUNT(DISTINCT value)
FROM JSON_EACH("posts"."tags")
WHERE value IN ('draft', 'sqlite')
) = 2;
Instead of a basic WHERE EXISTS
check, our outer query is now checking whether the number of matching tags for the post matches the number of queried tags. Remember that JSON_EACH
effectively converts our array column into a virtual table with rows; so, SELECT COUNT(*) FROM JSON_EACH() WHERE ...
will count the number of entries in our array column that match the where condition, returning that as an integer. We can use that integer returned from the inner query to ensure that the outer query only returns posts
with the total number of tags provided. In order to handle the possibilities of duplicate tags
, we ensure that we COUNT
only DISTINCT value
s. Were we to use SELECT COUNT(*)
or SELECT COUNT(value)
, our integer returned from the inner query could be larger than 2 (the size of the array of tags we are querying against). To ensure that the inner query only ever returns an integer as large or smaller than the array of tags, we need to count only distinct values.
However, those three basic queries form the foundation of our entire implementation. with_*
scopes use =
, while without_*
use !=
, but the shapes are all the same. So, the final piece to our puzzle is generating these queries in ActiveRecord.
Again, I'm not going to get bogged down in process. We don't want to use raw SQL strings if we can avoid it. Raw SQL strings are brittle in ActiveRecord usage. And we want to provide a model concern, so robustness is of particular value. This means we need to dip down and use Arel
. This is the relational algebra library that sits at the foundation of ActiveRecord, with ActiveRecord's query interface built on top of Arel. For a good intro on working with Arel
directly, check out this post. I won't review those details in this post.
As we stated earlier, every single query we need uses JSON_EACH
at its heart. So, we need to be able to generate this function in Ruby. Arel
provides an interface for functions that we can use like so:
# JSON_EACH("{table}"."{column}")
json_each = Arel::Nodes::NamedFunction.new("JSON_EACH", [arel_table[column_name]])
Note:
arel_table
is available to us as we will be executing this code in the context of an ActiveRecord model concern.
With our json_each
expression object ready, we could built the .unique_tags
method like so:
# SELECT DISTINCT value FROM "{table}", JSON_EACH("{table}"."{column}")
define_singleton_method :"unique_#{method_name}" do |conditions = "true"|
select('value')
.from([arel_table, json_each])
.distinct
.pluck('value')
.sort
end
In order to setup our .with_any_tags
scope, we simply need a method builder like this:
# SELECT "{table}".* FROM "{table}" WHERE EXISTS (SELECT 1 FROM JSON_EACH("{table}"."{column}") WHERE value IN ({values}) LIMIT 1)
scope :"with_any_#{method_name}", ->(*items) {
values = array_columns_sanitize_list(items)
overlap = Arel::SelectManager.new(json_each)
.project(1)
.where(Arel.sql('value').in(values))
.take(1)
.exists
where overlap
}
And the corresponding .with_all_tags
scope looks like this:
# SELECT "{table}".* FROM "{table}" WHERE (SELECT COUNT(DISTINCT value) FROM JSON_EACH("{table}"."{column}") WHERE value IN ({values})) = {values.size};
scope :"with_all_#{method_name}", ->(*items) {
values = array_columns_sanitize_list(items)
count = Arel::SelectManager.new(json_each)
.project(Arel.sql('value').count(distinct = true))
.where(Arel.sql('value').in(values))
contains = Arel::Nodes::Equality.new(count, values.size)
where contains
}
I won't paste each method here. You can find them in the Gist I have provided to accompany this post. The idea is to demonstrate how we can map the SQL queries we need to Arel
-based Ruby code.
We wrap all of this in an ArrayColumns
model concern and we are good to go. With a well-written schema migration and a single model concern, we have the ability to define "array column" types in our SQLite database, as well as query them as if they were an associated table, without the cost of a JOIN
.
Hopefully, this demonstrates the power and flexibility available in SQLite. Even without all of the native features and data types provided by Postgres, a little bit of ingenuity can provide equivalent functionality.
You can find the full code for the model concern detailed in this Gist. Check out the full script to see the full set of test cases as well.
-
To understand what each method does more precisely, consider the test suite that I wrote. ↩
-
We use
IN
, even with a single value, to allow the query to accommodate both singular and plural values easily. ↩
Posted on September 12, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.