Tomasz Wegrzanowski
Posted on January 21, 2022
Three most common data interchange formats are JSON, XML, and CSV. We already covered languages to manipulate JSON and XML. Obviously someone made one for CSV as well - CSVQ. I has SQL-like syntax, which makes perfect sense, as relational databases are basically glorified spreadsheets, so SQL is perfect for his.
I actually did something a bit similar myself before, but never got to the point of publishing it - it was a script that imported some CSV files into a fresh SQLite database, ran some SQLite code, then exported results back into CSV.
Data
Here's people.csv
with people and their favorite languages. GitHub Copilot created it for me for a previous episode. CSVQ can handle column names with spaces in them, but it requires some awkward quoting, so I replaced them with underscores.
first_name,last_name,favorite_language
Alice,Smith,Ruby
Bob,Smith,Python
Charlie,Brown,JavaScript
Daniel,Clark,PHP
Ed,Jones,C
Frank,Smith,HTML
Gary,Johnson,CSS
Heather,Williams,JavaScript
Ivan,Smith,Ruby
Jack,Jones,JavaScript
John,Brown,PHP
Karen,Clark,JavaScript
Larry,Jones,HTML
Mike,Williams,CSS
Nathan,Smith,Ruby
Oscar,Brown,JavaScript
Peter,Clark,PHP
Quinn,Jones,JavaScript
Ralph,Smith,HTML
Sally,Williams,CSS
Tom,Smith,Ruby
Will,Jones,JavaScript
Xavier,Brown,PHP
Yvonne,Clark,JavaScript
Zachary,Jones,HTML
Hello, World!
We can run queries from command line:
$ csvq 'select * from people where last_name = "Jones"'
+------------+-----------+-------------------+
| first_name | last_name | favorite_language |
+------------+-----------+-------------------+
| Ed | Jones | C |
| Jack | Jones | JavaScript |
| Larry | Jones | HTML |
| Quinn | Jones | JavaScript |
| Will | Jones | JavaScript |
| Zachary | Jones | HTML |
+------------+-----------+-------------------+
We can also run them from a REPL like environment:
$ csvq
csvq interactive shell
Press Ctrl+D or execute "EXIT;" to terminate this shell.
> select first_name || ' likes ' || favorite_language as fun_fact from people;
+--------------------------+
| fun_fact |
+--------------------------+
| Alice likes Ruby |
| Bob likes Python |
| Charlie likes JavaScript |
| Daniel likes PHP |
| Ed likes C |
| Frank likes HTML |
| Gary likes CSS |
| Heather likes JavaScript |
| Ivan likes Ruby |
| Jack likes JavaScript |
| John likes PHP |
| Karen likes JavaScript |
| Larry likes HTML |
| Mike likes CSS |
| Nathan likes Ruby |
| Oscar likes JavaScript |
| Peter likes PHP |
| Quinn likes JavaScript |
| Ralph likes HTML |
| Sally likes CSS |
| Tom likes Ruby |
| Will likes JavaScript |
| Xavier likes PHP |
| Yvonne likes JavaScript |
| Zachary likes HTML |
+--------------------------+
By default the output is formatted in a ASCII-art style table. You can even color-code it a bit by passing -c
option.
Aggregate Queries
We can save queries to files, like this one:
select favorite_language, count(*) as cnt
from people
group by favorite_language
order by count(*) desc;
Then we can run it from command line:
$ csvq -s most_popular.csvq
+-------------------+-----+
| favorite_language | cnt |
+-------------------+-----+
| JavaScript | 8 |
| Ruby | 4 |
| PHP | 4 |
| HTML | 4 |
| CSS | 3 |
| Python | 1 |
| C | 1 |
+-------------------+-----+
We can switch to CSV output:
$ csvq -f csv -s most_popular.csvq
favorite_language,cnt
JavaScript,8
Ruby,4
PHP,4
HTML,4
CSS,3
Python,1
C,1
Or to JSON output, -P
makes it pretty-printed:
$ csvq -P -f json -s most_popular.csvq
[
{
"favorite_language": "JavaScript",
"cnt": 8
},
{
"favorite_language": "Ruby",
"cnt": 4
},
{
"favorite_language": "PHP",
"cnt": 4
},
{
"favorite_language": "HTML",
"cnt": 4
},
{
"favorite_language": "CSS",
"cnt": 3
},
{
"favorite_language": "Python",
"cnt": 1
},
{
"favorite_language": "C",
"cnt": 1
}
]
We can also do other aggregate queries. Of course the details are a bit different from other SQL systems, but that's pretty much a given. SQL is more an idea than a language, and CSVQ doesn't even claim to be SQL, just SQL-like:
select favorite_language, listagg(first_name || ' ' || last_name, ',') as fans
from people
group by favorite_language
order by favorite_language;
$ csvq -f csv -s who_likes.csvq
favorite_language,fans
C,Ed Jones
CSS,"Gary Johnson,Mike Williams,Sally Williams"
HTML,"Frank Smith,Larry Jones,Ralph Smith,Zachary Jones"
JavaScript,"Charlie Brown,Heather Williams,Jack Jones,Karen Clark,Oscar Brown,Quinn Jones,Will Jones,Yvonne Clark"
PHP,"Daniel Clark,John Brown,Peter Clark,Xavier Brown"
Python,Bob Smith
Ruby,"Alice Smith,Ivan Smith,Nathan Smith,Tom Smith"
Creating and modifying tables
CSVQ also supports CREATE TABLE
, INSERT
, UPDATE
and all the other usual statements.
There are some minor differences from SQL - there are no types, and creating a table requires passing extension.
Here's an example:
create table `clicks.csv` (name, counter);
insert into clicks (name, counter) values ('upvote', 0);
insert into clicks (name, counter) values ('downvote', 0);
update clicks set counter = counter + 420 where name = 'upvote';
update clicks set counter = counter + 69 where name = 'downvote';
select * from clicks;
We can run it (-q
is quiet, so it stops extra messages that records got inserted or updated etc.):
$ csvq -q -s counters.csvq
+----------+---------+
| name | counter |
+----------+---------+
| upvote | 420 |
| downvote | 69 |
+----------+---------+
That creates the following clicks.csv
:
name,counter
upvote,420
downvote,69
Fibonacci
CSVQ supports recursive CTEs, so it's extremely straightforward. It's identicaly to SQLite version:
with recursive fib(a, b, i) as (
select 1, 1, 1
union all
select b, a + b, i + 1
from fib
where i + 1 <= 60
)
select a from fib;
csvq -s fib.csvq
+---------------+
| a |
+---------------+
| 1 |
| 1 |
| 2 |
| 3 |
| 5 |
| 8 |
| 13 |
| 21 |
| 34 |
| 55 |
| 89 |
| 144 |
| 233 |
| 377 |
| 610 |
| 987 |
| 1597 |
| 2584 |
| 4181 |
| 6765 |
| 10946 |
| 17711 |
| 28657 |
| 46368 |
| 75025 |
| 121393 |
| 196418 |
| 317811 |
| 514229 |
| 832040 |
| 1346269 |
| 2178309 |
| 3524578 |
| 5702887 |
| 9227465 |
| 14930352 |
| 24157817 |
| 39088169 |
| 63245986 |
| 102334155 |
| 165580141 |
| 267914296 |
| 433494437 |
| 701408733 |
| 1134903170 |
| 1836311903 |
| 2971215073 |
| 4807526976 |
| 7778742049 |
| 12586269025 |
| 20365011074 |
| 32951280099 |
| 53316291173 |
| 86267571272 |
| 139583862445 |
| 225851433717 |
| 365435296162 |
| 591286729879 |
| 956722026041 |
| 1548008755920 |
+---------------+
A different from SQLite is that SQLite uses double precision floats, while CSVQ uses 64bit numbers, so if you keep going these will eventually wrap to negative. That can be potentially problematic for some data processing, but it's good enough most of the time.
The obvious trick of changing 1
to 1.0
does not work. I'm not sure what exactly is CSVQ's numerical data type logic, as it supports small real numbers, but for big ones it uses integers. It claims to support 64bit integers and 64bit floats, but there's something weird going on.
FizzBuzz
CSVQ doesn't have any equivalent of PostgreSQL generate_series
, so we cannot just get numbers 1-100. We need to use a recursive CTE or temporary table or such.
CSVQ supports user-defined functions. The syntax is quite different from typical SQL, but it shouldn't be too hard to understand:
declare fizzbuzz function (@value)
as
begin
case
when @value % 15 = 0 then return 'FizzBuzz';
when @value % 5 = 0 then return 'Buzz';
when @value % 3 = 0 then return 'Fizz';
else return @value;
end case;
end;
with recursive generate_series(value) as (
select 1
union all select value+1
from generate_series
where value + 1 <= 100
)
select fizzbuzz(value)
from generate_series;
And we can get out FizzBuzz! (-N
skips the header)
$ csvq -N -f csv -s fizzbuzz.csvq
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
...
FizzBuzz
91
92
Fizz
94
Buzz
Fizz
97
98
Fizz
Buzz
Should you use CSVQ?
Yes! If you use CSVs a lot, it's a great tool.
CSVQ is very little known, but it's basically JQ for CSV, and JQ is widely beloved for good reasons.
It does pretty much everything you want for quick CSV processing, and because it's so SQL-like, it has very low learning curve.
For very complicated task, you'll eventually need to either import the CSVs into a more full-featured database (like SQLite or Excel or whatnot), or use a general purpose programming language like Ruby or Python, but CSVQ gets really far with extremely minimal code.
For some downsides, error messages you'll get are not great. Especially if you try some "standard" SQL, it will simply tell you position of error, but no suggestion what you should do instead. But honestly it's simple enough that it's not been much of a problem for me.
Code
All code examples for the series will be in this repository.
Posted on January 21, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.