Transforming TSV data using ColdFusion & Miller (Command line FTW!)
James Moberg
Posted on September 5, 2024
We license data from multiple providers and discovered that not all data can be treated equally even though it's the same specification. For example, we work with two (2) separate Real Estate Transaction Standards (RETS) providers and their implementations are slightly different. The west coast provider that we've been working with for almost 20 years (ie, before RETS) has been consistent and we haven't encountered any data-related problems. While working with a new RETS provider located in the southeast, occasional issues have occurred where a single PostalCode
value contains ZIP+4 instead of the generic/base 5 digit ZIP. This has never been an issues with our long-time provider and this change was causing an error the data is bulk imported into Microsoft SQL Server. (NOTE: The RETS specification does allow for ZIP+4, but it's extremely rare. We've been importing the ZIP as an integer since first licensing MLS data.)
I was exploring whether or not the data could be transformed via the command line and discovered Miller (BSD2; Linux/Mac/Windows). Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.
With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.
The syntax for Miller isn't too complex and there's a ReadTheDocs website dedicated to the flags, verbs & functions and also hosts a bunch of FAQs & examples.
In a nutshell, I needed to:
- Set the
-I
flag to proceess files in-place - Set the input format:
--itsv
- Set the output format
--otsv
- Explicitly set the output row separator:
--ors '\r\n'
(if not set, it was convertingWindows CR/LF
toLinux LF
) - Use the
put
command to transform an existing column's data:put '$PostalCode = substr($PostalCode,0,4)'
While using CFExecute
, I discovered that Miller required the in-place processing to be performed in the same directory as the file. This approach required the use of a temporary Windows BAT file so that the process would be executed from the same directory otherwise it would complain that the output directory couldn't be found. The BAT file performs the following functions:
- Sets the working directory:
cd /d #arguments.fileDir#
- Performs the Miller function
- Self-deletes the BAT file after completion:
(goto) 2>nul & del ""%~f0""
The processing of Miller using this function is extremely fast (~400 ms). This approach is more portable and also probably more performant than reading in the entire TSV file using CF, analyzing each row, making the transformation and re-saving the TSV file.
Basic Proof-of-Concept
https://gist.github.com/JamoCA/94f614fa1b2d881f26f26f9926a9b1fa
Posted on September 5, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.