fast-csv for CSV files

I recently had to undertake pre-processing on a CSV file with NodeJS+Typescript before ingesting it into a system.

The CSV file in question presents a number of challenges:

The CSV file is large @ ~125k rows
Includes a header row but individual headers need to be renamed
There are redundant columns to remove
There may be additional columns that we also don't know about that need to be dropped
The columns need reordering
Blank lines must be skipped

Via a quick Google I found fast-csv.

An initial & superficial look at fast-csv highlights a few qualities making it attractive enough to explore further:

It is still actively being developed (at the time of this post) giving some assurance around bug fixes
Uses the MIT friendly open source license
Has no runtime dependencies minimizing any down stream license issues

In looking at the feature set, fast-csv is comprised of 'parse' and 'format' routines for ingesting and transforming CSV files. It also supports streams for fast processing of large files. The following describes how I made use of fast-csv features to meet the above requirements.

To start with here's the initial CSV file we will ingest:

beta,alpha,redundant,charlie,delta

betaRow1,alphaRow1,redundantRow1,charlieRow1,deltaRow1
betaRow2,alphaRow2,redundantRow2,charlieRow2,deltaRow2
betaRow3,alphaRow3,redundantRow3,charlieRow3,deltaRow3

Our goal is to rename and reorder the columns, drop the blank line, drop the 'redundant' column, and our program should be able to also drop the 'delta' column which it wont know about at all. The final output should look like:

NewAlpha,NewBeta,NewCharlie
alphaRow1,betaRow1,charlieRow1
alphaRow2,betaRow2,charlieRow2
alphaRow3,betaRow3,charlieRow3

The following code shows the solution:

import * as fs from 'fs';
import * as csv from 'fast-csv';

const inputFile = __dirname + '/../sample-data/input.csv';
const outputFile = __dirname + '/../sample-data/output.csv';

(async function () {

  const writeStream = fs.createWriteStream(outputFile);

  const parse = csv.parse(
    { 
      ignoreEmpty: true,
      discardUnmappedColumns: true,
      headers: ['beta','alpha','redundant','charlie'],
    });

  const transform = csv.format({ headers: true })
    .transform((row) => (
      {
        NewAlpha: row.alpha, // reordered
        NewBeta: row.beta,
        NewCharlie: row.charlie,
        // redundant is dropped
        // delta is not loaded by parse() above
      }
    ));

  const stream = fs.createReadStream(inputFile)
    .pipe(parse)
    .pipe(transform)
    .pipe(writeStream);
})();

In explaining the solution:

parse() options

ignoreEmpty takes care of skipping the blank line(s)
discardUnmappedColumns will drop any columns we don't specify in the following headers option, taking care of dropping the 'delta' column
headers maps the columns we are loading. Note how I've used discardUnmappedColumns to drop 'delta' but I'm still loading 'redundant'. The 'redundant' column is dropped in the format() options described next

format() options

headers directs the output to include the header row
The transform() row post-processor allows us to reorder the columns, rename the columns, and also drop the 'redundant' column

With a larger CSV file in hand, testing shows the above routine can process ~125k rows with 126 columns, from a file of approx 135MB in size, in ~19 seconds on my MBP 3.2Ghz i7.

fast-csv indeed.

Blog

fast-csv for CSV files

Chris Muir

Join Our Newsletter. No Spam, Only the good stuff.

Related