Super simple and fast delimited CSV data normalization with AWK
István Lantos
Posted on July 25, 2018
There was a discussion here on dev.to which programming language or tool more powerful than people realize?
I shared my AWK snippet, where I normalize a 8GB sized delimited CSV data source for database import, only with AWK, as fast as a Go counterpart program. This task completes in 7 minutes on my home machine, same as my Go code:
#!/usr/bin/awk -f
# This block only runs once at the start, before the first line
# Use this to print CSV header on the top
BEGIN {
FS="|"; # input field separator
OFS="|"; # output field separator
}
# This block runs at every line
{
# We will order a new named variable to every column
line = $0; # variable `$0` stores the entire line
url = $1;
title = $2;
body = $3
tags = $4;
if (line ~ /^$/) next; # if line is blank, then skip it
if (NF != 4) next; # if column count is not equal to 4, then skip the line
# Skip any line where tags column contains the word "cars"
if (index(tags, "cars") != 0) { next; }
# Normalize the url column with regex by only keeping the article id
# Example input: <a href="https://example.com/article/foo123456">Hello</a>
gsub(/.*example\.com\/article\/|[\042].*/, "", url); # outputs: foo123456
# Skip lines that has non-alphanumeric characters in url column (like <>#&@)
# Skip lines that has empty url column (after gsub normalization)
# Skip lines where url starts with foo or bar
if (url !~ /[[:alnum:]]/ ||
length(url) == 0 ||
url ~ /^foo|^bar/) {
next;
}
# Replace multiple ; with one (needed for errorless CSV import in Postgres)
gsub(/[\073]+/, ";", tags);
# Print the line with OFS, aka: profit! :)
print url, title, body, tags;
}
Then you should run this with:
$ zcat my-csv-file.zip | ./normalize.awk | sort --field-separator="|" --key=1,1 > normalized-sorted.csv
Pro tip: combine this with tail
command when updating DB later on. ;)
Maybe you find this snippet useful and saves you countless time to develop similar solution in Python or Go. You're welcome!
If you have anything to add or modify, then shoot me!
Posted on July 25, 2018
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.