fast-csv for CSV files
Chris Muir
Posted on March 5, 2021
I recently had to undertake pre-processing on a CSV file with NodeJS+Typescript before ingesting it into a system.
The CSV file in question presents a number of challenges:
- The CSV file is large, at ~125k rows
- Includes a header row but individual headers need to be renamed
- There are redundant columns to remove
- There may be additional columns we don't know about in advance that also need to be dropped
- The columns need reordering
- Blank lines must be skipped
Via a quick Google I found fast-csv.
An initial & superficial look at fast-csv highlights a few qualities making it attractive enough to explore further:
- It is still actively developed (at the time of this post), giving some assurance around bug fixes
- Uses the permissive MIT open source license
- Has no runtime dependencies, minimizing any downstream licensing issues
Looking at the feature set, fast-csv consists of 'parse' and 'format' routines for ingesting and transforming CSV files. It also supports streams for fast processing of large files. The following describes how I made use of fast-csv's features to meet the above requirements.
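To make the parse/format split concrete, here's a toy sketch in plain TypeScript of what the two halves conceptually do to a single row. This is an illustration only, not fast-csv's actual internals (it ignores quoting, escaping, and streaming entirely):

```typescript
// Toy illustration of fast-csv's two halves (not its real internals):
// parse() turns a CSV line into a row object keyed by header,
// format() turns a row object back into a CSV line.
export function parseRow(line: string, headers: string[]): Record<string, string> {
  const fields = line.split(','); // no quoting support; illustration only
  const row: Record<string, string> = {};
  headers.forEach((h, i) => { row[h] = fields[i] ?? ''; });
  return row;
}

export function formatRow(row: Record<string, string>, headers: string[]): string {
  return headers.map((h) => row[h] ?? '').join(',');
}
```

Note that `formatRow` takes its own headers list, which is what makes renaming and reordering on the way out possible.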
To start with here's the initial CSV file we will ingest:
beta,alpha,redundant,charlie,delta
betaRow1,alphaRow1,redundantRow1,charlieRow1,deltaRow1
betaRow2,alphaRow2,redundantRow2,charlieRow2,deltaRow2
betaRow3,alphaRow3,redundantRow3,charlieRow3,deltaRow3
Our goal is to rename and reorder the columns, skip the blank line, drop the 'redundant' column, and also drop the 'delta' column, which our program won't know about at all. The final output should look like:
NewAlpha,NewBeta,NewCharlie
alphaRow1,betaRow1,charlieRow1
alphaRow2,betaRow2,charlieRow2
alphaRow3,betaRow3,charlieRow3
The following code shows the solution:
import * as fs from 'fs';
import { finished } from 'stream/promises';
import * as csv from 'fast-csv';

const inputFile = __dirname + '/../sample-data/input.csv';
const outputFile = __dirname + '/../sample-data/output.csv';

(async function () {
  const writeStream = fs.createWriteStream(outputFile);

  const parse = csv.parse({
    ignoreEmpty: true,
    discardUnmappedColumns: true,
    headers: ['beta', 'alpha', 'redundant', 'charlie'],
    renameHeaders: true, // discard the file's own header row in favour of the above
  });

  const transform = csv.format({ headers: true })
    .transform((row: any) => ({
      NewAlpha: row.alpha, // reordered
      NewBeta: row.beta,
      NewCharlie: row.charlie,
      // redundant is dropped
      // delta is not loaded by parse() above
    }));

  const stream = fs.createReadStream(inputFile)
    .pipe(parse)
    .pipe(transform)
    .pipe(writeStream);

  await finished(stream); // wait until the output file is fully written
})();
To explain the solution:
parse() options
- ignoreEmpty takes care of skipping the blank line(s)
- discardUnmappedColumns will drop any columns we don't specify in the following headers option, taking care of dropping the 'delta' column
- headers maps the columns we are loading. Note how I've used discardUnmappedColumns to drop 'delta' but I'm still loading 'redundant'. The 'redundant' column is dropped in the format() options described next
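The combined effect of ignoreEmpty and discardUnmappedColumns can be sketched in plain TypeScript. This is an illustration of the semantics only, not fast-csv's implementation:

```typescript
// Illustration of the parse() options' semantics (not fast-csv's code):
// - ignoreEmpty: blank lines produce no row at all
// - discardUnmappedColumns: fields beyond the supplied headers are dropped
export function parseWithOptions(
  line: string,
  headers: string[],
): Record<string, string> | null {
  if (line.trim() === '') return null; // ignoreEmpty: skip blank lines
  const fields = line.split(',');
  const row: Record<string, string> = {};
  // Only the first headers.length fields are kept, so the fifth column
  // ('delta' in the sample file) is discarded here.
  headers.forEach((h, i) => { row[h] = fields[i] ?? ''; });
  return row;
}
```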
format() options
- headers directs the output to include the header row
- The transform() row post-processor lets us reorder and rename the columns, and also drop the 'redundant' column
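The row post-processing inside the .transform() callback is just a plain object-to-object mapping, and can be written as a standalone, testable function:

```typescript
// The same row remapping as the .transform() callback above, standalone:
// rename the columns, reorder them, and drop 'redundant' by not copying it.
interface InputRow { beta: string; alpha: string; redundant: string; charlie: string; }
interface OutputRow { NewAlpha: string; NewBeta: string; NewCharlie: string; }

export function remapRow(row: InputRow): OutputRow {
  return {
    NewAlpha: row.alpha,   // moved to first position
    NewBeta: row.beta,
    NewCharlie: row.charlie,
    // 'redundant' is simply never copied across
  };
}
```

Pulling the mapping out like this also makes it easy to unit-test the column logic without touching the filesystem or streams.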
With a larger CSV file in hand, testing shows the above routine can process ~125k rows with 126 columns, from a file approximately 135 MB in size, in ~19 seconds on my 3.2 GHz i7 MacBook Pro.
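For a rough sense of throughput, those figures work out as follows (a back-of-the-envelope calculation from the numbers above):

```typescript
// Back-of-the-envelope throughput from the figures quoted above.
const rows = 125_000;
const megabytes = 135;
const seconds = 19;

const rowsPerSecond = Math.round(rows / seconds);      // roughly 6,600 rows/s
const mbPerSecond = +(megabytes / seconds).toFixed(1); // roughly 7 MB/s
```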
fast-csv indeed.