Building An Async CSV Parser With Serverless Cloud
We're Serverless!
Posted on March 3, 2022
Originally posted on Serverless
In this tutorial, we will build a CSV parser powered by Serverless Cloud. This tutorial covers just the API side of the application, with more instructions to come.
Recently, the Serverless Cloud team introduced Serverless Storage, a simple cloud storage service for all of your application’s file storage. Even more recently, storage listeners were introduced, letting your application react to storage events outside of an API call so that heavy batch processing can run in the background without holding up the frontend client. This tutorial uses Serverless Storage listeners to asynchronously parse CSV files uploaded via an API.
Getting Started
On your local machine, create a new directory called csv-parser, then open this new directory in any code editor you prefer. This directory will be the folder for your application files. Using your terminal, install the Serverless Cloud CLI if you haven’t already, and then run cloud within the csv-parser directory.
To install the Serverless Cloud CLI, run the following in your terminal:
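```bash
npm install -g @serverless/cloud
```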
You may be prompted to login if you haven’t already, then the CLI will ask you to name your new application and select a template. Enter “csv-parser” as the name, and choose the “JavaScript API” template.
Within just a few seconds, your new application will be generated in the directory, and deployed live to your personal development instance. The CLI will now enter development mode, streaming live logs and errors to your terminal as you iterate through the project. Development mode does not prevent commands, so you can continue to interact with the CLI as you are working on the project.
Building an upload endpoint
Our CSV Parser service will consist of three endpoints, one to upload and save a CSV, and two others to retrieve any processed data from the uploaded CSVs.
Starting with the “upload” endpoint, we will need a .post method that receives the files. The api interface from the SDK simplifies this process, automatically loading the file into memory for you.
Just copy the following code snippet into your index.js file, replacing all boilerplate code that was generated with the starter template.
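Here’s a sketch of that endpoint. I’m assuming a multer-style file object here (originalname and buffer properties), matching the buffer and original file name described below; adjust the property names if your SDK version exposes them differently.

```javascript
import { api, storage } from "@serverless/cloud";

api.post("/csv", async (req, res) => {
  // Use the "delimiter" query parameter, defaulting to a comma
  const delimiter = req.query.delimiter || ",";

  if (!req.files || req.files.length === 0) {
    return res.status(400).send({ message: "No CSV file was uploaded" });
  }

  for (const file of req.files) {
    // Assumed multer-style fields: originalname and buffer
    await storage.write(`csvs/${file.originalname}`, file.buffer, {
      // Store the delimiter as custom metadata for the listener to use later
      metadata: { delimiter },
    });
  }

  res.send({ message: "CSV uploaded" });
});
```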
Let’s walk through the code here: api.post will create a POST endpoint available at the /csv route. The route also takes a “delimiter” query parameter, in cases where the submitted CSV files use something other than a comma, but defaults to a comma if not provided.
With api.post, any posted files will be available in req.files, including a buffer and the original name of each uploaded file. Keep in mind that to get these file names, the request must be a multipart form upload; if you post a raw file as the request body, it will not carry any other data.
To avoid flooding our Storage root directory with CSVs, we are going to make a “csvs” directory, to keep things organized. Finally, with the file’s data in hand as a buffer, and a name settled upon, we can write this file to Serverless Storage using storage.write. You can save any custom metadata with storage.write as well, so here we will store the provided delimiter for later parsing.
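To test the endpoint, you can send a multipart form request with a tool like curl (the URL below is a placeholder for your instance’s URL, and the form field name is arbitrary):

```bash
curl -X POST "https://your-instance-url/csv?delimiter=;" \
  -F "file=@employees.csv"
```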
Asynchronously Parsing the CSV
With API calls, we have limited processing time (up to 29 seconds). We also want to send a response back to the caller as soon as possible to provide a better user experience. Since uploaded CSVs can be arbitrarily large, parsing and storing the data during the API call could be slow, and could potentially cause the call to time out. To remedy this, we will add an event listener that fires when new files are written to the “csvs” directory, and only when those files end in “.csv”.
First though, we need a parsing library! My personal favorite for CSV work in Node.js is papaparse, so we’ll use it for this tutorial. Feel free to use any library you are comfortable with to do the equivalent processing.
Without quitting the CLI, simply type “install papaparse”. This will add the package to your application, and automatically sync its content with your personal instance. We will also need to import papaparse into our index.js file.
To make all of our processed data queryable, we will be storing all the rows in Serverless Data. We will need to update our SDK import statement to include the data interface.
Copy this code snippet into your application (including the updated import statements), and your application will now be listening to any CSV write events.
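Here’s a sketch of the listener. I’m assuming the event object passed to the storage.on handler exposes the written file’s path as event.path; check the Storage docs if your SDK version names it differently.

```javascript
import { api, data, storage } from "@serverless/cloud";
import Papa from "papaparse";

// Fires after any file ending in ".csv" is written under the csvs/ directory
storage.on("write:csvs/*.csv", async (event) => {
  // Fetch the file's info, including the custom metadata saved at upload time
  const { metadata } = await storage.stat(event.path);
  const delimiter = (metadata && metadata.delimiter) || ",";

  // Load the entire file as a buffer, then convert it to a utf-8 string
  const buffer = await storage.readBuffer(event.path);
  const csv = buffer.toString("utf-8");

  // header: true maps each row to an object keyed by the CSV headers
  const { data: rows } = Papa.parse(csv, {
    delimiter,
    header: true,
    skipEmptyLines: true,
  });

  if (rows && rows.length) {
    for (const row of rows) {
      // Camel-case the expected headers into variables
      const id = row["ID"];
      const firstName = row["First Name"];
      const lastName = row["Last Name"];
      const email = row["Email"];

      // "employee" is the collection; the ID uniquely identifies the item
      await data.set(`employee:${id}`, { id, firstName, lastName, email });
    }
  }
});
```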
For this tutorial, I’ve used a simple employee spreadsheet that expects the CSV headers to be “ID”, “First Name”, “Last Name”, and “Email”. With just a few lines of code, you could adapt this to fit your use case, or even add the ability to dynamically parse the headers.
Now, let’s walk through this code. First, take a look at the first argument of storage.on. We want to react to storage writes to the csvs directory, and only process .csv files. We achieve this using the write action with the corresponding glob: write:csvs/*.csv. Next, we need the path of the file that just got saved, along with any metadata holding a possible custom delimiter. For this, we call storage.stat with the given path, which returns information about the file; in this case, though, we only need the metadata.
Now, we need to load the CSV into memory for processing. For this we will use storage.readBuffer, which returns the entire file as a buffer. Papaparse, though, takes an entire CSV as a string in its simplest form, so we convert the loaded buffer into a string using .toString('utf-8'), specifying utf-8 encoding just to be safe. Papaparse also takes some configuration options as the second argument, perfect for our saved delimiter!
Finally, we check if the CSV had any data at all. If it does, we iterate through the array of rows, converting each value into a variable with camel casing. With all the data of the row loaded and ready, we can save it to Serverless Data using data.set. Serverless Data is a powerful key/value store that lets you create collections of data, so our key will use “employee” as the collection name, and the employee’s ID to uniquely identify them within the collection. We use a colon “:” to separate collection names from their key.
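For example, a row for employee ID 42 (hypothetical values) would end up stored like this:

```javascript
// Hypothetical example: collection name "employee", item key "42"
await data.set("employee:42", {
  id: "42",
  firstName: "Ada",
  lastName: "Lovelace",
  email: "ada@example.com",
});
```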
Now any uploaded CSVs (in the expected format) will be parsed and saved!
Retrieving all the processed rows
If you’ve made it this far, your application is now able to take in CSV files, and parse them with the storage.on listener. But, don’t we want to do something with all that data? Of course we do!
Let’s make two new GET endpoints using the api interface: “/employees” and “/employee/:id”. We want to be able to get everything that has been saved, but also allow the API to return a single employee by ID.
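Here’s a sketch of both endpoints. I’m assuming the wildcard data.get call resolves to an object with an items array of { key, value } pairs; check the Serverless Data docs if your SDK version differs.

```javascript
// Returns every row saved from every uploaded CSV
api.get("/employees", async (req, res) => {
  // The wildcard matches every key in the "employee" collection
  const result = await data.get("employee:*");
  res.send({ employees: result.items.map((item) => item.value) });
});

// Returns a single employee by ID, or a 404 if none exists
api.get("/employee/:id", async (req, res) => {
  const employee = await data.get(`employee:${req.params.id}`);

  if (!employee) {
    return res.status(404).send({ message: "Employee not found" });
  }

  res.send({ employee });
});
```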
Copy and paste this into your project, and you will now be able to retrieve any stored employees.
To walk through the “/employees” endpoint, all we need to do is use data.get with employee:*, which returns every database entry in the “employee” collection, or, thought of another way, every key that begins with “employee:”. Since every row we write in the storage listener uses this pattern, the query returns every row from every uploaded CSV.
To get a single employee, we do something very similar, except with a path parameter, sending the full key to data.get. This query may not match any entry, so we need to check whether nothing came back and return a 404 status code to indicate that this employee ID does not exist (yet).
Summary
And there it is! An asynchronous CSV parser all built using Serverless Cloud. While there are a lot of improvements that can be made to the implementation here, this will do its job well. There is much more you can do with the power of Serverless Cloud that we didn’t touch on here. Check out the docs if you’re curious for more!