Unit Testing and TDD With PostgreSQL is Easy
vbilopav
Posted on July 3, 2024
Note: this is a repost from my personal blog: Unit Testing and TDD With PostgreSQL is Easy
I believe this subject matter is important and it deserves a bigger audience.
I keep hearing that PostgreSQL, as well as all other databases, is not testable. That is, of course, completely wrong. Not only is it possible, but I have found it to be even easier and faster than traditional methods.
And no, I'm not talking about some integration testing or some Docker magic. Just plain-old PostgreSQL, that's all.
So, let's do some TDD with PostgreSQL, shall we?
Note: this article is neither an endorsement nor a criticism of TDD; it is merely a demonstration of how easy it is to do such things with PostgreSQL.
The Problem
Suppose we have a schema:
create table devices (
device_id int generated always as identity primary key,
name text
);
create table measurements (
timestamp timestamp not null,
device_id int not null references devices(device_id),
value numeric not null,
primary key (timestamp, device_id)
);
We want to write functionality that takes the following parameters:
- Period (starting and ending timestamps).
- Time interval.
- Device.
The result will be the cumulative sum of a device's measurements, computed for each interval-sized period between the start and end timestamps.
Sounds good?
Let's totally do it, it will be fun, I promise.
Database Setup
First, we need a little schema setup to make it a bit more testable.
Unit tests, by definition, must not interfere with each other in any way, and they must be able to run in parallel. That means each test runs in its own transaction, and any test data it inserts is rolled back at the end of the test.
However, inserting test data into a relational database can be a bit tricky. Usually, tables reference other tables that reference yet other tables, and before you know it - in order to insert a few test records - we must insert data into all the tables in the database. That is certainly possible, but still inconvenient and tedious, so we don't want to do that. Luckily, PostgreSQL offers a simple solution to this.
In our example, we have the table measurements that references the devices table. This reference is checked immediately: the moment we insert a new measurement, PostgreSQL checks whether that device even exists in the database, to keep data integrity in check. We can change that behavior so the check is performed at the end of the transaction instead. And since the transaction in our unit tests will be rolled back anyhow, that allows us to insert some test data safely, without also inserting into a dozen other tables that our tests aren't concerned with.
To enable this deferred check, a reference has to be created with deferrable initially deferred
declaration:
create table measurements (
timestamp timestamp not null,
device_id int not null references devices(device_id) deferrable initially deferred,
value numeric not null,
primary key (timestamp, device_id)
);
Note, there is another option too: deferrable initially immediate, or simply deferrable. With those, the check still runs immediately by default, but we can tell the running transaction to defer all deferrable checks until the end of the transaction with the set constraints all deferred statement (docs).
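Either way, the effect inside a transaction looks like this (a minimal sketch; device 0 cannot exist because identity values start at 1 by default):
begin;
set constraints all deferred; -- only needed when the constraint is not initially deferred
-- device 0 does not exist, yet this succeeds, because the check is deferred
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:00:00', 0, 1.0);
rollback; -- and after a rollback, the check never fires at all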
If your table is already created, then you'll have to recreate the constraint:
begin;
alter table only measurements drop constraint measurements_device_id_fkey;
alter table only measurements add constraint measurements_device_id_fkey
foreign key (device_id) references devices(device_id) deferrable initially deferred;
end;
Or, if you want to do that for the entire database because you had an architect or ORM who wasn't aware of this feature, you can simply execute this script:
do $$
declare
_table text;
_fk text;
_def text;
begin
for _table, _fk, _def in (
select conrelid::regclass, conname, pg_get_constraintdef(oid)
from
pg_constraint
where
contype = 'f'
and condeferrable is false
and connamespace = 'public'::regnamespace
)
loop
raise info 'setting fk % on table % to deferrable', _fk, _table;
execute format('alter table only %s drop constraint %I', _table, _fk);
execute format('alter table only %s add constraint %I %s deferrable initially deferred', _table, _fk, _def);
end loop;
end;
$$;
That will do it. And one more little thing: I usually like to create a special schema for unit tests, simply called test.
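Creating it is a one-liner:
create schema test;
And now we're ready.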
TDD
Ok, let's first create a fresh SQL file, add an empty test and then execute it immediately:
create or replace procedure test.cumulative_sum() language plpgsql as $$
begin
-- arrange
-- act
-- assert
rollback;
end;
$$;
call test.cumulative_sum();
With this approach, when we execute the file in the editor, our changes to the test are applied and the test is immediately executed. This allows for an extremely fast test loop.
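The same file can be run from a terminal too; for example with psql (the database and file names here are just placeholders):
psql -d mydb -f cumulative_sum_test.sql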
Now, first, let's arrange some data:
create or replace procedure test.cumulative_sum() language plpgsql as $$
begin
-- arrange
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:02:00', 0, 0.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:03:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:07:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:08:00', 0, 3.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:11:00', 0, 3.0);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:12:00', 0, 4.0);
-- act
-- assert
rollback;
end;
$$;
call test.cumulative_sum();
This will add some measurements for a non-existent device (id = 0; identity values start at 1 by default, so no real device will ever have it). Thanks to the deferred foreign key, the inserts succeed, and because the test ends with a rollback, the check never fires.
If we execute our hypothetical calculation for this device, for timestamps between 2021-01-01 00:00:00 and 2021-01-01 00:15:00 in 5-minute intervals, we should get the following results:
timestamp | sum
---|---
2021-01-01 00:05:00 | 3
2021-01-01 00:10:00 | 9
2021-01-01 00:15:00 | 16
You can use Excel or a calculator to verify the validity of these cumulative sum calculations.
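Or, to spell it out: the first bucket, (00:00, 00:05], contains 0.5 + 2.5 = 3; the second adds 2.5 + 3.5 = 6 for a running total of 9; the third adds 3.0 + 4.0 = 7 for a total of 16.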
Fine, now that we know what we must get, we can add the act and assert parts. First, the act part, in which we call our not-yet-existing function. Since we need to assert the results multiple times (the row count, plus each individual row), we can put the results into a temporary table:
create or replace procedure test.cumulative_sum() language plpgsql as $$
begin
-- arrange
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:02:00', 0, 0.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:03:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:07:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:08:00', 0, 3.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:11:00', 0, 3.0);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:12:00', 0, 4.0);
-- act
create temp table result on commit drop as
select * from cumulative_sum('2021-01-01 00:00:00', '2021-01-01 00:15:00', '5 minutes', 0);
-- assert
rollback;
end;
$$;
call test.cumulative_sum();
This, of course, will fail, because we haven't written the cumulative_sum function yet. But before we do that, let's also add the assert part to verify our results. Luckily for us, PL/pgSQL supports the assert statement:
create or replace procedure test.cumulative_sum() language plpgsql as $$
begin
-- arrange
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:02:00', 0, 0.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:03:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:07:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:08:00', 0, 3.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:11:00', 0, 3.0);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:12:00', 0, 4.0);
-- act
create temp table result on commit drop as
select * from cumulative_sum('2021-01-01 00:00:00', '2021-01-01 00:15:00', '5 minutes', 0);
-- assert
assert (select count(*) from result) = 3,
'Expected 3 rows, got ' || (select count(*) from result)::text;
assert (select sum from result where timestamp = '2021-01-01 00:05:00') = 3,
'Expected 3, got ' || (select sum from result where timestamp = '2021-01-01 00:05:00')::text;
assert (select sum from result where timestamp = '2021-01-01 00:10:00') = 9,
'Expected 9, got ' || (select sum from result where timestamp = '2021-01-01 00:10:00')::text;
assert (select sum from result where timestamp = '2021-01-01 00:15:00') = 16,
'Expected 16, got ' || (select sum from result where timestamp = '2021-01-01 00:15:00')::text;
rollback;
end;
$$;
call test.cumulative_sum();
That looks a bit ugly but, to be fair, most of those assertions were autocompleted by Copilot.
This will still fail because we haven't written the cumulative_sum function yet. But at least we can now extract the input and output data contracts for the function.
- The parameters will be these:
from timestamp,
to timestamp,
interval interval,
device_id int
- The resulting table will be this:
table (
"timestamp" timestamp,
sum numeric
)
So now, we know what our function should look like. We can add the first prototype, just above our failing tests:
create or replace function cumulative_sum(
_from timestamp,
_to timestamp,
_interval interval,
_device_id int
)
returns table (
"timestamp" timestamp,
sum numeric
)
language sql
as $$
select null::timestamp, null::numeric;
$$;
Again, we've placed this create or replace function cumulative_sum above our failing tests, and again we execute the entire file, which gives us an extremely fast TDD red-green-refactor loop.
However, our tests are still red since, obviously, our newly created function returns nonsense. So, let's refactor it:
create or replace function cumulative_sum(
_from timestamp,
_to timestamp,
_interval interval,
_device_id int
)
returns table (
"timestamp" timestamp,
sum numeric
)
language sql
as $$
select
p.period_to,
-- running total over the buckets; the order by makes the frame deterministic
sum(coalesce(m.sum, 0)) over (order by p.period_to rows unbounded preceding)
from (
-- one row per interval-sized bucket between _from and _to
select
series as period_from,
series + _interval as period_to
from generate_series(_from, _to - _interval, _interval) series
) p
left join lateral (
-- this device's measurements that fall into the current bucket
select
sum(coalesce(value, 0)) as sum
from measurements
where
timestamp > p.period_from
and timestamp <= p.period_to
and device_id = _device_id
) m on true
$$;
And now, this seems to be correct, and our tests are green. We can keep improving and optimizing this function, while the tests stay green, as much as we want, without any fear that something will break.
Here is the final content of our work in a single file:
create or replace function cumulative_sum(
_from timestamp,
_to timestamp,
_interval interval,
_device_id int
)
returns table (
"timestamp" timestamp,
sum numeric
)
language sql
as $$
select
p.period_to,
-- running total over the buckets; the order by makes the frame deterministic
sum(coalesce(m.sum, 0)) over (order by p.period_to rows unbounded preceding)
from (
-- one row per interval-sized bucket between _from and _to
select
series as period_from,
series + _interval as period_to
from generate_series(_from, _to - _interval, _interval) series
) p
left join lateral (
-- this device's measurements that fall into the current bucket
select
sum(coalesce(value, 0)) as sum
from measurements
where
timestamp > p.period_from
and timestamp <= p.period_to
and device_id = _device_id
) m on true
$$;
create or replace procedure test.cumulative_sum() language plpgsql as $$
begin
-- arrange
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:02:00', 0, 0.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:03:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:07:00', 0, 2.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:08:00', 0, 3.5);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:11:00', 0, 3.0);
insert into measurements (timestamp, device_id, value) values ('2021-01-01 00:12:00', 0, 4.0);
-- act
create temp table result on commit drop as
select * from cumulative_sum('2021-01-01 00:00:00', '2021-01-01 00:15:00', '5 minutes', 0);
-- assert
assert (select count(*) from result) = 3,
'Expected 3 rows, got ' || (select count(*) from result)::text;
assert (select sum from result where timestamp = '2021-01-01 00:05:00') = 3,
'Expected 3, got ' || (select sum from result where timestamp = '2021-01-01 00:05:00')::text;
assert (select sum from result where timestamp = '2021-01-01 00:10:00') = 9,
'Expected 9, got ' || (select sum from result where timestamp = '2021-01-01 00:10:00')::text;
assert (select sum from result where timestamp = '2021-01-01 00:15:00') = 16,
'Expected 16, got ' || (select sum from result where timestamp = '2021-01-01 00:15:00')::text;
rollback;
end;
$$;
call test.cumulative_sum();
Conclusion
The arguments I hear all the time are as follows:
- Testing in a database is impossible. No, it isn't. I just showed you how. It may be in other databases, but PostgreSQL isn't that other database.
- Testing in a database is hard. No, it isn't. SQL may be hard if you haven't honed your skills. So, start learning.
- Testing in a database is slow. No, it isn't. It is way faster than anything out there. Again, hone your skills.
But what about running all the tests in parallel, perhaps in a CI/CD pipeline, you might ask?
Well, it's just a matter of a runner that executes all those parameterless procedures on parallel connections, picked by some criteria. On my setup, that is the test schema. So: if it is a procedure, it is in the test schema, and it has no parameters - it's a test.
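A minimal sketch of the discovery query such a runner could use (assuming PostgreSQL 11 or later, where prokind distinguishes procedures from functions):
select p.proname
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where n.nspname = 'test' -- lives in the test schema
and p.prokind = 'p' -- is a procedure
and p.pronargs = 0; -- takes no parameters
The runner then opens one connection per procedure, executes call test.procedure_name() on each, and reports any assertion failures.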
Writing such a test runner is really, really simple. In fact, that is precisely what I did, using NodeJS, for my projects: @vbilopav/pgmigrations. Here is an example project and how it is used in GitHub Actions: teamserator/.github/workflows/build-and-test.yml
But in all fairness, it's so easy that anyone could do it.