Processing millions of records to CSV using Node.js Streams
Oluwatimilehin Adesoji
Posted on November 19, 2022
Efficient data management is essential in backend engineering, especially when working with huge amounts of data. A typical task is streaming millions of records from a MySQL database with Node.js. This article discusses the challenges involved and walks through a streaming solution for exporting those records to CSV.
The Role of Streams in Boosting Performance
Streams matter because they handle data in chunks, which keeps memory usage low and enables real-time responsiveness. Piping and chaining streams lets us quickly build modular code, while backpressure management keeps the data flow balanced between fast producers and slow consumers.
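To illustrate, here is a minimal sketch of a stream pipeline (the file names are placeholders) that copies and compresses a large file chunk by chunk; stream.pipeline() wires the streams together and handles backpressure and cleanup for us:
import { createReadStream, createWriteStream } from 'fs';
import { createGzip } from 'zlib';
import { pipeline } from 'stream';

pipeline(
  createReadStream('large-input.log'),     // source: read the file in chunks
  createGzip(),                            // transform: compress each chunk
  createWriteStream('large-input.log.gz'), // destination: write as data arrives
  (err) => {
    if (err) console.error('Pipeline failed:', err);
    else console.log('Pipeline succeeded');
  }
);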
Navigating the Challenge of Streaming Millions of Records
Fetching millions of records from a database can take a lot of time and resources. Conventional approaches, such as retrieving every record at once, can exhaust memory and increase latency. To get around these challenges, we stream the data in segments/chunks instead.
Now to the Implementation
We will use mysql2, which supports streaming, to connect to the MySQL database, and @json2csv/node to generate the CSV report.
Install the Dependencies
First, we install the two required dependencies using the command below:
npm install --save mysql2 @json2csv/node
Create a Database Pool
Next, we create a database connection pool using the mysql2 package installed earlier. The pool manages connections to the database and lets us reuse them across queries.
import mysql from 'mysql2';

// We use the callback-style mysql2 API (rather than mysql2/promise) because
// its query() method returns a Query object that can later be turned into a
// readable stream with .stream().
function createDatabasePool(): mysql.Pool {
  const pool = mysql.createPool({
    host: '127.0.0.1',
    user: 'USER',
    password: 'PASSWORD',
    database: 'DATABASE',
    port: 3306
  });

  return pool;
}
Stream to CSV
We then create a function that accepts a readable stream as its source, transforms and formats each row into CSV, and exposes the output through a PassThrough stream that can be piped anywhere, such as an HTTP response or a file.
import stream, { Readable } from 'stream';
import { Transform } from '@json2csv/node';

function streamToCsv(data: {
  source: Readable;
  headers: string[];
  onError?: (error: Error) => any;
  onFinish?: () => any;
}): { passThrough: stream.PassThrough } {
  // Transforms each row object coming from the database into a CSV line.
  const transformStream = new Transform(
    {
      fields: data.headers
    },
    {},
    {
      objectMode: true
    }
  );

  // The PassThrough stream is what we hand back to the caller, e.g. to pipe
  // into an HTTP response or a file.
  const passThrough = new stream.PassThrough();

  // pipeline() wires the streams together, manages backpressure, and reports
  // errors once via its callback, so no separate 'error' listener is needed.
  stream.pipeline(data.source, transformStream, passThrough, (err) => {
    if (err) {
      console.log('Error => ', err);
      if (data.onError) {
        data.onError(err);
      }
    }
  });

  if (data.onFinish) {
    passThrough.on('finish', data.onFinish);
  }

  return { passThrough };
}
function streamData(sqlQuery: string): Readable {
  const pool = createDatabasePool();

  // pool.query() returns a Query object; .stream() exposes the result set
  // as a readable stream that emits one row at a time.
  return pool.query(sqlQuery).stream();
}
Putting it all together
With everything set, we can now stream the data to CSV. We first create a query stream from the SQL query, then pass it to streamToCsv, and finally pipe the data written to the PassThrough into the Express response.
const sqlQuery = `SELECT id 'ID', name 'Name', email 'Email', created_at 'Date' FROM your_table`;
const readableStream = streamData(sqlQuery);
const { passThrough } = streamToCsv({
source: readableStream,
headers: ['ID', 'Name', 'Email', 'Date'],
onError: (error) => console.error(error),
onFinish: () => console.log('process complete')
});
passThrough.pipe(res); // res being Express response
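In an Express application, this would typically live inside a route handler. Below is a minimal sketch, assuming the streamData and streamToCsv functions defined above are in scope; the /export route, port, and report.csv filename are placeholders. It also sets the response headers so the client treats the payload as a downloadable CSV file:
import express from 'express';

const app = express();

app.get('/export', (req, res) => {
  // Tell the client a CSV file is coming so it can be downloaded directly.
  res.setHeader('Content-Type', 'text/csv');
  res.setHeader('Content-Disposition', 'attachment; filename="report.csv"');

  const sqlQuery = `SELECT id 'ID', name 'Name', email 'Email', created_at 'Date' FROM your_table`;
  const { passThrough } = streamToCsv({
    source: streamData(sqlQuery),
    headers: ['ID', 'Name', 'Email', 'Date'],
    onError: (error) => console.error(error),
    onFinish: () => console.log('process complete')
  });

  // Rows flow from MySQL -> CSV transform -> PassThrough -> HTTP response.
  passThrough.pipe(res);
});

app.listen(3000);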
Some Benefits of Streaming
Streaming in Node.js has multiple benefits that improve application performance, scalability, and real-time responsiveness:
Memory Efficiency: Streams handle data in segments rather than loading full datasets into memory. This reduces the memory footprint, making Node.js well-suited for processing huge files.
Pipelining and Chaining: Streams are simple to pipe and chain together, which encourages a modular and reusable approach to code design and makes it easier to build complex data-processing workflows (see the sketch after this list).
Improved Performance: Streams start processing data before the whole dataset has been received, allowing it to be handled in segments. This accelerates response times, which is particularly advantageous in high-concurrency scenarios.
Scalability: Streaming lets your application handle large datasets without degrading performance. It remains a sustainable approach as data volumes grow, making it ideal for applications with changing data requirements.
Parallel Processing: Streaming makes it possible to process data in parallel, improving system resource utilization. This is particularly helpful when the streamed data has to go through several operations at once.
Real-time Processing: Streams allow data to be processed as soon as it becomes available. Applications that need to react instantly to changing data, such as monitoring systems or analytics dashboards, depend on this.
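As a small illustration of pipelining and chaining, the CSV output produced by the streamToCsv function above could be compressed on the fly by adding a gzip transform to the chain. This is only a sketch; the query, headers, and report.csv.gz destination are examples:
import { createGzip } from 'zlib';
import { createWriteStream } from 'fs';
import { pipeline } from 'stream';

const { passThrough } = streamToCsv({
  source: streamData(`SELECT id 'ID', name 'Name' FROM your_table`),
  headers: ['ID', 'Name']
});

// Chain: CSV PassThrough -> gzip transform -> compressed file on disk.
pipeline(passThrough, createGzip(), createWriteStream('report.csv.gz'), (err) => {
  if (err) console.error('Compression pipeline failed:', err);
});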
Conclusion
In this article, we discussed how streams can improve application performance, especially when working with large datasets. By breaking data into smaller chunks for processing, we conserve memory and keep the application responsive even as data volumes grow.
Thank you for reading!