Processing 13 million rows from a CSV file in the Browser (Without freezing the screen)
Wesley Miranda
Posted on March 26, 2023
Have you ever thought about processing huge files in the browser, and even better, doing it without freezing the screen? Browsers are usually seen as tools for showing things on the screen, with all the heavy processing left to the server. Nowadays, however, browsers are much more powerful, and new APIs open up many possibilities to implement and improve our applications.
Application: We're going to create a pure JavaScript application that loads several big CSV files and processes them separately in different threads, transforming the CSV rows into JavaScript objects and sending a mocked request for each of these objects.
Main Principles:
Streams: Interface to load and process file chunks instead of the entire buffer.
Web Workers: Interface to deal with multithreading in the browser.
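To make the idea of chunks concrete, here is a minimal sketch that reads a Blob (a File is also a Blob) through its stream and sums the bytes, without ever holding the whole file in memory. The helper name countBytes is made up for this illustration and is not part of the application.

```javascript
// Hypothetical helper: counts the bytes of a Blob/File chunk by chunk.
async function countBytes(blob) {
  const reader = blob.stream().getReader()
  let total = 0
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    total += value.byteLength // value is a Uint8Array holding one chunk
  }
  return total
}
```

In the browser you could call it with a file taken from an input element, for example countBytes(input.files[0]).then(console.log).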
If you want to know about multithreading and Streams in NodeJS, you can access my tutorials through the links below:
Steps to Reproduce:
Create a thread worker to handle all the file processing
Inside our thread worker we are going to process the file using Streams and treat each chunk of the file.
At the end we are going to deal with the interface, to load the files and show the outcomes.
Requirements
I am using Google Chrome version 111.0.5563.65 to test the application.
You will need to serve the application from localhost, because Chrome blocks the creation of threads when the page is opened from a local file path. I am using the VS Code Live Server plugin to do that.
You can download the CSV file with 13 million rows here
Thread Worker
The thread worker is responsible for processing the file, extracting the rows as JavaScript objects, and, in the end, simulating the requests that send all these objects to a server.
To measure the file loading progress and transform the CSV rows into JavaScript objects, just like we did in my first tutorial, we are going to use Transform Streams, plus a Writable Stream to store the objects.
threadWorker.js
// (1)
let readableStream = null
let fileIndex = null
let bytesLoaded = 0
let linesSent = 0
const objectsToSend = []
let fileCompletelyLoaded = false
const readyEvent = new Event('ready')
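// (2)
const ObjectTranform = {
  headerLine: true,
  keys: [],
  tailChunk: '',
  start() {
    this.decoder = new TextDecoder('utf-8')
  },
  // Converts one complete CSV line into an object and passes it on.
  // Note: a plain split(',') does not support quoted fields containing commas.
  emitLine(line, controller) {
    const lineString = line.replace(/\r$/, '')
    if (lineString === '') return
    const values = lineString.split(',')
    if (this.headerLine) {
      this.keys = values
      this.headerLine = false
      return
    }
    const chunkObject = {}
    this.keys.forEach((element, index) => {
      chunkObject[element] = values[index]
    })
    controller.enqueue(JSON.stringify(chunkObject))
  },
  transform(chunk, controller) {
    // Prepend the unfinished line left over from the previous chunk.
    const text = this.tailChunk + this.decoder.decode(chunk, { stream: true })
    const lines = text.split('\n')
    // The last piece has no newline after it yet, so it may be incomplete.
    this.tailChunk = lines.pop()
    for (const line of lines) {
      this.emitLine(line, controller)
    }
  },
  flush(controller) {
    // Emit the final line when the file does not end with a newline.
    this.emitLine(this.tailChunk, controller)
  },
}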
// (3)
const ProgressTransform = {
  transform(chunk, controller) {
    bytesLoaded += chunk.length
    controller.enqueue(chunk)
    postMessage({ progressLoaded: bytesLoaded, progressSent: linesSent, index: fileIndex, totalToSend: 0 })
  },
  flush() {
    fileCompletelyLoaded = true
  }
}
// (4)
const MyWritable = {
  write(chunk) {
    objectsToSend.push(postRequest(JSON.parse(chunk)))
  },
  close() {
    if (fileCompletelyLoaded) {
      postMessage({ totalToSend: objectsToSend.length, index: fileIndex, progressLoaded: bytesLoaded, progressSent: linesSent })
      dispatchEvent(readyEvent)
    }
  },
  abort(err) {
    console.log("Sink error:", err);
  },
}
// (5)
const postRequest = async data => {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      linesSent++
      postMessage({ totalToSend: objectsToSend.length, progressSent: linesSent, progressLoaded: bytesLoaded, index: fileIndex })
      resolve(data)
    }, 3000)
  })
}
// (6)
addEventListener('ready', async () => {
  await Promise.all(objectsToSend)
})
// (7)
addEventListener("message", event => {
  fileIndex = event.data?.index
  readableStream = event.data?.file?.stream()
  readableStream
    .pipeThrough(new TransformStream(ProgressTransform))
    .pipeThrough(new TransformStream(ObjectTranform))
    .pipeTo(new WritableStream(MyWritable))
})
From the code sections above:
(1) Controller variables:
- readableStream: stores the stream of the file passed by the main thread.
- fileIndex: identifies which file is being processed.
- bytesLoaded: sums the number of bytes processed.
- linesSent: counts the objects that were sent.
- objectsToSend: array that stores one pending mocked request for each extracted object.
- fileCompletelyLoaded: flag that tells whether the file was completely loaded.
- readyEvent: event dispatched once the whole file has been processed, so we can wait for the requests to the server.
(2) Transform Stream that converts the rows into JavaScript objects. One of my previous tutorials has a section that explains this conversion in more detail.
(3) Transform Stream responsible for tracking the progress: it increments the bytesLoaded variable and sets the fileCompletelyLoaded flag when the file is loaded. We call the enqueue function to pass the chunk on to the rest of the Stream pipeline.
(4) Writable Stream responsible for starting a mocked request for each converted object, storing it, and dispatching the ready event when the pipeline closes.
(5) Function used to simulate a request with a 3-second delay.
(6) Listener that runs when the ready event is dispatched and waits for all the mocked requests to finish.
(7) Another listener, this one waiting for messages from the main thread with the file index and the file object.
OBS: The postMessage function is used to send messages to the main thread that created the worker.
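A chunk can end in the middle of a row, so the row conversion has to carry the unfinished line over to the next chunk. As a rough way to check that behavior outside the worker, the sketch below feeds hand-made string chunks through a small transformer and collects the resulting objects. The names createLineToObject and parseChunks are hypothetical, the chunks are plain strings instead of bytes, and quoted CSV fields are not handled.

```javascript
// Hypothetical transformer: buffers the unfinished line between chunks.
function createLineToObject() {
  let keys = null
  let tail = ''
  const emit = (line, controller) => {
    if (line === '') return
    const values = line.split(',')
    if (!keys) { keys = values; return } // the first line is the header
    controller.enqueue(Object.fromEntries(keys.map((k, i) => [k, values[i]])))
  }
  return {
    transform(text, controller) {
      const lines = (tail + text).split('\n')
      tail = lines.pop() // possibly incomplete, keep it for the next chunk
      for (const line of lines) emit(line, controller)
    },
    flush(controller) {
      emit(tail, controller) // last line when there is no final newline
    },
  }
}

// Feeds string chunks through the transformer and collects the objects.
async function parseChunks(chunks) {
  const source = new ReadableStream({
    start(controller) {
      for (const c of chunks) controller.enqueue(c)
      controller.close()
    },
  })
  const rows = []
  await source
    .pipeThrough(new TransformStream(createLineToObject()))
    .pipeTo(new WritableStream({ write(row) { rows.push(row) } }))
  return rows
}
```

For example, parseChunks(['id,name\n1,An', 'na\n2,Bob']) splits the row "1,Anna" across two chunks and should still resolve to two objects, one for Anna and one for Bob.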
UI behavior
Our focus is not the interface here, so we have a simple input that accepts multiple files and a div to show the progress and information related to the files.
main.js
// (1)
const input = document.getElementById('files')
const progress = document.getElementById('progress')
// (2)
const formatBytes = (bytes, decimals = 2) => {
  if (!+bytes) return '0 Bytes'
  const k = 1024
  const dm = decimals < 0 ? 0 : decimals
  const sizes = ['Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
  const i = Math.floor(Math.log(bytes) / Math.log(k))
  return `${parseFloat((bytes / Math.pow(k, i)).toFixed(dm))} ${sizes[i]}`
}
// (3)
input.addEventListener('change', (e) => {
  const files = e.target.files
  const workersInfo = []
  // Loop by index: for...in over a FileList also visits keys such as "length" and "item".
  for (let i = 0; i < files.length; i++) {
    const file = files[i]
    const worker = new Worker("threadWorker.js")
    worker.addEventListener("message", event => {
      if (event.data) {
        workersInfo[i] = {
          progressSent: event.data.progressSent,
          progressLoaded: event.data.progressLoaded,
          index: event.data.index,
          totalToSend: event.data.totalToSend,
          fileSize: file.size,
          fileName: file.name
        }
      }
      // join('') avoids the commas that an array prints inside a template string.
      progress.innerHTML = `
        <table align="center" border cellspacing="1">
          <thead>
            <tr>
              <th>File</th>
              <th>File Size</th>
              <th>Loaded</th>
              <th></th>
              <th>Total Rows</th>
              <th>Rows Sent</th>
              <th></th>
            </tr>
          </thead>
          <tbody>
            ${workersInfo.map(info =>
              `<tr>
                <td>${info.fileName}</td>
                <td>${formatBytes(info.fileSize)}</td>
                <td>${formatBytes(info.progressLoaded)}</td>
                <td><progress value="${Math.ceil(info.progressLoaded / info.fileSize * 100)}" max="100"></progress></td>
                <td>${info.totalToSend}</td>
                <td>${info.progressSent}</td>
                <td><progress value="${Math.ceil(info.progressSent / info.totalToSend * 100)}" max="100"></progress></td>
              </tr>`
            ).join('')}
          </tbody>
        </table>
      `
    })
    worker.postMessage({ index: i, name: file.name, file })
  }
})
From the code sections above:
(1) Gets the references used to manipulate the DOM.
(2) A function I found on Stack Overflow that formats a byte count in a readable way.
(3) Listener that runs when the user selects the files to process. When this event is dispatched, we create one thread for each file and use the postMessage function to pass the file to the created thread. The main thread also listens to messages from each thread to update the results on the screen.
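One detail of the progress bars is worth a small sketch: while a file is still loading, the worker reports totalToSend as 0, so dividing the rows sent by the total evaluates to NaN. A guarded helper could look like the one below. The name toPercent is an assumption for this example and does not exist in the application code.

```javascript
// Hypothetical helper: percentage for a <progress> value, safe when total is 0.
function toPercent(done, total) {
  if (!total) return 0
  return Math.min(100, Math.ceil((done / total) * 100))
}
```

It could replace the inline Math.ceil expressions in the table, for example toPercent(info.progressSent, info.totalToSend).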
Joining Everything
Now we can import all the scripts we created in an HTML file and put the necessary tags.
index.html
<!DOCTYPE html>
<html>
<head>
  <meta charset='utf-8'>
  <meta http-equiv='X-UA-Compatible' content='IE=edge'>
  <title>Multithreading browser</title>
  <meta name='viewport' content='width=device-width, initial-scale=1'>
  <!-- threadWorker.js is loaded by main.js through new Worker(), so it needs no script tag here -->
  <!-- defer makes main.js run after the input and the div exist in the DOM -->
  <script src='main.js' defer></script>
</head>
<body>
  <input type="file" multiple id="files" accept=".csv" /><br><br>
  <div id="progress"></div>
</body>
</html>
Takeaways
Modern browsers offer great features to improve performance and build nice things.
We can run heavy jobs in the browser, such as processing files, and take some responsibilities off the server.
You can take a look at the entire code here