Greg Schafer
Posted on April 28, 2021
Most web apps/services that use a relational database are built around a web framework and an Object-Relational Mapping (ORM) library, which typically have conventions that prescribe how to create and load test fixtures/data into the database for testing. If you're building a webapp without an ORM [1], the story for how to create and load test data is less clear. What tools and approaches are available, and which work best? There are a lot of articles around the internet that describe specific techniques or example code in isolation, but few that provide a broader survey of the many different approaches that are possible. I hope this article will help fill that gap, exploring and discussing different approaches for creating and loading test data in PostgreSQL.
[1] Wait a minute, why would you build a webapp without an ORM?! This question could spawn an entire article of its own and in fact, many other articles have debated about ORMs for the last couple decades. I won't dive into that debate — it's up to the creator to decide if a project should use an ORM or not, and that decision depends on a lot of project-specific factors, such as the expertise of the creator and their team, the types and velocity of data involved, the performance and scaling requirements, and much more.
If you're interested in generating test data instead of (or in addition to) loading test data, please check out the follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!
Follow Along with Docker
Want to follow along? I've collected sample data and scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.28_LoadingTestDataIntoPostgreSQL
As described in the repo's README, you can run examples using the official Postgres Docker image with:
# The base postgres image requires a password to be set, but we'll just be
# testing locally, so no need to set a strong password.
docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
--volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
--volume=$(pwd):/repo
postgres:latest -c log_statement=all
To explain this Docker command a bit:
- The base postgres image requires a password to be set (via the
POSTGRES_PASSWORD
environment variable), but we'll just be testing locally, so no need to set a strong password. - Executable scripts (
*.sh
and*.sql
files) in the/docker-entrypoint-initdb.d
folder inside the container will be executed as PostgreSQL starts up. The above command mountsschema.sql
into that folder, so the database tables will be created. - The repo is also mounted to
/repo
inside the container, so example SQL and CSV files are accessible. - The PostgreSQL server is started with the
log_statement=all
config override, which increases the logging verbosity.
The repo contains a variety of files that start with add-data-
which demonstrate different ways of loading and generating test data. After the Postgres Docker container is running, you can run add-data-
files in a new terminal window with a command like:
docker exec --workdir=/repo postgres \
psql --host=localhost --username=postgres \
--file=add-data-sql-copy-csv.sql
If you want to interactively poke around the database with psql
, use:
docker exec --interactive --tty postgres \
psql --host=localhost --username=postgres
Sample Schema
For example code and data, I'll use the following simple schema:
- Musical artists have a name
- An artist can have many albums (one-to-many), which have a title and release date
- Genres have a name
- Albums can belong to many genres (many-to-many)
Loading Static Data
The simplest way to get test data into PostgreSQL is to make a static dataset, which you can save as CSV files or embed in SQL files directly.
SQL COPY from CSV Files
In the code repo accompanying this blogpost, there are 4 small CSV files, one for each table of the sample schema. The CSV files contain headers and data rows as shown in the image below.
We can import the data from these CSV files into a PostgreSQL database with the SQL COPY command:
-- Excerpt from add-data-copy-csv.sql in the sample code repo
COPY artists FROM '/repo/artists.csv' CSV HEADER;
COPY albums FROM '/repo/albums.csv' CSV HEADER;
COPY genres FROM '/repo/genres.csv' CSV HEADER;
COPY album_genres FROM '/repo/album_genres.csv' CSV HEADER
The COPY command has a variety of options for controlling quoting, delimiters, escape characters, and more. You can even limit which rows are imported with a WHERE clause. One potential downside is you must run it as a database superuser or as a user with permissions to read and write and execute files on the server — this isn't a concern when loading data for local testing, but keep it in mind if you ever want to use it in a more restrictive or production-like environment.
Psql Copy from CSV Files
The PostgreSQL interactive terminal (called psql) provides a copy command that is very similar to SQL COPY:
-- Excerpt from add-data-copy-csv.psql in the sample code repo
\copy artists from 'artists.csv' csv header
\copy albums from 'albums.csv' csv header
\copy genres from 'genres.csv' csv header
\copy album_genres from 'album_genres.csv' csv header
There are some important differences between SQL COPY and psql copy:
- Like other psql commands, the psql version of the copy command starts with a backslash (
\
) and doesn't need to end with a semicolon (;
). - SQL COPY runs in the server environment whereas psql copy runs in the client environment. To clarify, the filepath you provide to SQL COPY should point to a file on the server's filesystem. The filepath you provide to psql copy points to a file on the filesystem where you're running the psql client. If you're following along using the Docker image and commands provided in this blogpost, the server and client are the same container, but if you ever want to load data from your local machine to a database on a remote server, then you'll want to use psql copy.
- As a corollary to the above, psql copy is less performant than SQL COPY, because all the data must travel from the client to the server, rather than being directly loaded by the server.
- SQL COPY requires absolute filepaths, but psql can handle relative filepaths.
- Psql copy runs with the privileges of the user you're connecting to the server as, so it doesn't require superuser or local file read/write/execute permissions like SQL COPY does.
Putting Data in SQL Directly
As an alternative to storing data in separate CSV files (which are loaded with SQL or psql commands), you can store data in SQL files directly.
SQL COPY from stdin and pg_dump
The SQL COPY and psql copy commands can load data from stdin instead of a file. They will parse and load all the lines between the copy command and \.
as rows of data.
-- Excerpt from add-data-copy-stdin.sql in the sample code repo
COPY public.artists (artist_id, name) FROM stdin CSV;
1,"DJ Okawari"
2,"Steely Dan"
3,"Missy Elliott"
4,"TWRP"
5,"Donald Fagen"
6,"La Luz"
7,"Ella Fitzgerald"
\.
COPY public.albums (album_id, artist_id, title, released) FROM stdin CSV;
1,1,"Mirror",2009-06-24
2,2,"Pretzel Logic",1974-02-20
3,3,"Under Construction",2002-11-12
4,4,"Return to Wherever",2019-07-11
5,5,"The Nightfly",1982-10-01
6,6,"It's Alive",2013-10-15
7,7,"Pure Ella",1994-02-15
\.
...
In fact, this COPY ... FROM stdin
approach is how [pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html)
outputs data if you're creating a dump or backup from an existing PostgreSQL database. However, pg_dump
uses a tab-separated format by default, rather than the comma-separated format shown above.
By default, pg_dump
also outputs SQL to re-create everything about the database (tables, constraints, views, functions, reset sequences, etc.), but you can instruct it to output only data with the --data-only
flag. To try out pg_dump
with the example Docker image, run:
docker exec --workdir=/repo postgres \
pg_dump --host=localhost --username=postgres postgres
SQL INSERTs
Another way to put data directly in SQL is to use INSERT statements. This approach could look like the following:
-- Excerpt from add-data-insert-static-ids.sql in the sample code repo
INSERT INTO artists (artist_id, name)
OVERRIDING SYSTEM VALUE
VALUES
(1, 'DJ Okawari'),
(2, 'Steely Dan'),
(3, 'Missy Elliott'),
(4, 'TWRP'),
(5, 'Donald Fagen'),
(6, 'La Luz'),
(7, 'Ella Fitzgerald');
INSERT INTO albums (album_id, artist_id, title, released)
OVERRIDING SYSTEM VALUE
VALUES
(1, 1, 'Mirror', '2009-06-24'),
(2, 2, 'Pretzel Logic', '1974-02-20'),
(3, 3, 'Under Construction', '2002-11-12'),
(4, 4, 'Return to Wherever', '2019-07-11'),
(5, 5, 'The Nightfly', '1982-10-01'),
(6, 6, 'It''s Alive', '2013-10-15'),
(7, 7, 'Pure Ella', '1994-02-15');
...
The OVERRIDING SYSTEM VALUE
clause lets us INSERT values into the primary key ID columns explicitly even though they are defined as GENERATED ALWAYS
.
The pg_dump
command's --column-inserts
option will output data as INSERT statements (a separate statement per row), rather than as the default TSV format. Using INSERTs instead of COPY will run much slower when restoring the data, so this is only recommended if you're restoring the data to a database that doesn't support COPY, such as sqlite3. Using INSERTs can be sped up somewhat with the --rows-per-insert
option, allowing you to INSERT many rows at a time per command, reducing the overhead of back-and-forth communication between client and server for every SQL statement.
Using INSERT statements, we could start moving away from statically declaring everything about our datasets — we could omit the primary key ID columns and lookup IDs as needed when inserting foreign keys, as in the following example:
-- Excerpt from add-data-insert-queried-ids.sql in the sample code repo
INSERT INTO artists (name)
VALUES
('DJ Okawari'),
('Steely Dan'),
('Missy Elliott'),
('TWRP'),
('Donald Fagen'),
('La Luz'),
('Ella Fitzgerald');
INSERT INTO albums (artist_id, title, released)
VALUES
((SELECT id FROM artists WHERE name = 'DJ Okawari'), 'Mirror', '2009-06-24'),
((SELECT id FROM artists WHERE name = 'Steely Dan'), 'Pretzel Logic', '1974-02-20'),
((SELECT id FROM artists WHERE name = 'Missy Elliott'), 'Under Construction', '2002-11-12'),
((SELECT id FROM artists WHERE name = 'TWRP'), 'Return to Wherever', '2019-07-11'),
((SELECT id FROM artists WHERE name = 'Donald Fagen'), 'The Nightfly', '1982-10-01'),
((SELECT id FROM artists WHERE name = 'La Luz'), 'It''s Alive', '2013-10-15'),
((SELECT id FROM artists WHERE name = 'Ella Fitzgerald'), 'Pure Ella', '1994-02-15');
...
This is hardly convenient, though, because we need to duplicate other row information (such as the artist name) in order to look up the corresponding ID. It gets even more complex if multiple artists have the same name! So, if you have a static dataset I'd suggest sticking to one of the previously mentioned approaches that use SQL COPY or psql copy.
Putting Data in CSVs vs in SQL Files
Is there a reason to prefer putting static datasets in CSVs or directly in SQL files? My thoughts boil down to the following points:
- CSVs are a widely understood and supported format (just make sure to be clear and consistent with encoding!). If your datasets will be maintained or created by people who prefer spreadsheet programs to database-admin and command-line tools, CSVs may be preferable.
- If you want to keep all your test data and database setup in one place, SQL files are a convenient way to do that.
- If your testing or continuous integration processes use
pg_dump
or its output, then you're already using datasets embedded in an SQL file — keep doing what makes sense for you!
I hope you learned something new and useful about the different approaches and tools available for loading static datasets into PostgreSQL. If you're looking to learn more check out the follow-up article about generating test data for PostgreSQL!
Posted on April 28, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.