fast-csv for CSV files

chriscmuir

Chris Muir

Posted on March 5, 2021

fast-csv for CSV files

I recently had to undertake pre-processing on a CSV file with NodeJS+Typescript before ingesting it into a system.

The CSV file in question presents a number of challenges:

  1. The CSV file is large @ ~125k rows
  2. Includes a header row but individual headers need to be renamed
  3. There are redundant columns to remove
  4. There may be additional columns that we also don't know about that need to be dropped
  5. The columns need reordering
  6. Blank lines must be skipped

Via a quick Google I found fast-csv.

An initial & superficial look at fast-csv highlights a few qualities making it attractive enough to explore further:

  • It is still actively being developed (at the time of this post) giving some assurance around bug fixes
  • Uses the MIT friendly open source license
  • Has no runtime dependencies minimizing any down stream license issues

In looking at the feature set, fast-csv is comprised of 'parse' and 'format' routines for ingesting and transforming CSV files. It also supports streams for fast processing of large files. The following describes how I made use of fast-csv features to meet the above requirements.

To start with here's the initial CSV file we will ingest:

beta,alpha,redundant,charlie,delta

betaRow1,alphaRow1,redundantRow1,charlieRow1,deltaRow1
betaRow2,alphaRow2,redundantRow2,charlieRow2,deltaRow2
betaRow3,alphaRow3,redundantRow3,charlieRow3,deltaRow3
Enter fullscreen mode Exit fullscreen mode

Our goal is to rename and reorder the columns, drop the blank line, drop the 'redundant' column, and our program should be able to also drop the 'delta' column which it wont know about at all. The final output should look like:

NewAlpha,NewBeta,NewCharlie
alphaRow1,betaRow1,charlieRow1
alphaRow2,betaRow2,charlieRow2
alphaRow3,betaRow3,charlieRow3
Enter fullscreen mode Exit fullscreen mode

The following code shows the solution:

import * as fs from 'fs';
import * as csv from 'fast-csv';

const inputFile = __dirname + '/../sample-data/input.csv';
const outputFile = __dirname + '/../sample-data/output.csv';

(async function () {

  const writeStream = fs.createWriteStream(outputFile);

  const parse = csv.parse(
    { 
      ignoreEmpty: true,
      discardUnmappedColumns: true,
      headers: ['beta','alpha','redundant','charlie'],
    });

  const transform = csv.format({ headers: true })
    .transform((row) => (
      {
        NewAlpha: row.alpha, // reordered
        NewBeta: row.beta,
        NewCharlie: row.charlie,
        // redundant is dropped
        // delta is not loaded by parse() above
      }
    ));

  const stream = fs.createReadStream(inputFile)
    .pipe(parse)
    .pipe(transform)
    .pipe(writeStream);
})();
Enter fullscreen mode Exit fullscreen mode

In explaining the solution:

parse() options

  • ignoreEmpty takes care of skipping the blank line(s)
  • discardUnmappedColumns will drop any columns we don't specify in the following headers option, taking care of dropping the 'delta' column
  • headers maps the columns we are loading. Note how I've used discardUnmappedColumns to drop 'delta' but I'm still loading 'redundant'. The 'redundant' column is dropped in the format() options described next

format() options

  • headers directs the output to include the header row
  • The transform() row post-processor allows us to reorder the columns, rename the columns, and also drop the 'redundant' column

With a larger CSV file in hand, testing shows the above routine can process ~125k rows with 126 columns, from a file of approx 135MB in size, in ~19 seconds on my MBP 3.2Ghz i7.

fast-csv indeed.

💖 💪 🙅 🚩
chriscmuir
Chris Muir

Posted on March 5, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related