Transforming TSV data using ColdFusion & Miller (Command line FTW!)

gamesover

James Moberg

Posted on September 5, 2024

Transforming TSV data using ColdFusion & Miller (Command line FTW!)

We license data from multiple providers and discovered that not all data can be treated equally even though it's the same specification. For example, we work with two (2) separate Real Estate Transaction Standards (RETS) providers and their implementations are slightly different. The west coast provider that we've been working with for almost 20 years (ie, before RETS) has been consistent and we haven't encountered any data-related problems. While working with a new RETS provider located in the southeast, occasional issues have occurred where a single PostalCode value contains ZIP+4 instead of the generic/base 5 digit ZIP. This has never been an issues with our long-time provider and this change was causing an error the data is bulk imported into Microsoft SQL Server. (NOTE: The RETS specification does allow for ZIP+4, but it's extremely rare. We've been importing the ZIP as an integer since first licensing MLS data.)

I was exploring whether or not the data could be transformed via the command line and discovered Miller (BSD2; Linux/Mac/Windows). Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.

With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.

The syntax for Miller isn't too complex and there's a ReadTheDocs website dedicated to the flags, verbs & functions and also hosts a bunch of FAQs & examples.

In a nutshell, I needed to:

  • Set the -I flag to proceess files in-place
  • Set the input format: --itsv
  • Set the output format --otsv
  • Explicitly set the output row separator: --ors '\r\n' (if not set, it was converting Windows CR/LF to Linux LF)
  • Use the put command to transform an existing column's data: put '$PostalCode = substr($PostalCode,0,4)'

While using CFExecute, I discovered that Miller required the in-place processing to be performed in the same directory as the file. This approach required the use of a temporary Windows BAT file so that the process would be executed from the same directory otherwise it would complain that the output directory couldn't be found. The BAT file performs the following functions:

  • Sets the working directory: cd /d #arguments.fileDir#
  • Performs the Miller function
  • Self-deletes the BAT file after completion: (goto) 2>nul & del ""%~f0""

The processing of Miller using this function is extremely fast (~400 ms). This approach is more portable and also probably more performant than reading in the entire TSV file using CF, analyzing each row, making the transformation and re-saving the TSV file.

Basic Proof-of-Concept

https://gist.github.com/JamoCA/94f614fa1b2d881f26f26f9926a9b1fa

💖 💪 🙅 🚩
gamesover
James Moberg

Posted on September 5, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related