Optimizing massive MongoDB inserts: load 50 million records 33% faster!
Dmtro Harazdovskiy
Posted on October 16, 2022
Intro
Original article here
Working with 100k–1m database records is hardly a problem with the current MongoDB Atlas pricing plans. You can get the most out of them without any hassle: buy enough hardware, use indexes, and paginate.
But imagine your project recently received a massive customer dataset for a new ETL pipeline. Now you need to write a one-time upload script and later deal with KPI calculations over all (or part) of the dataset.
What about a dataset larger than 1 million records? 7 million? Or even more than 50🍋? Not all of the standard approaches work efficiently at such volumes.
Therefore, let’s take a look at how we can handle this volume by gradually increasing dataset size and tackling performance issues!
Disclaimer — All the code is located here — Github
Setup
To keep this experiment simple, I chose the M30 *Dedicated* cluster on MongoDB Atlas. There is a serverless Mongo plan, but let's leave it out of this experiment.
Atlas also has a built-in tool for loading sample datasets, but I recommend following along with the *rough* approach of writing custom scripts: you might have a self-hosted MongoDB server without Atlas, the Atlas CLI might not support your dataset's file type, and with your own scripts you control the flow of data yourself.
One more thing to mention is internet speed. The experiments were run on my local MacBook Pro over a 25–35 Mbps Wi-Fi connection, so the results may differ from a prod-ready EC2 instance, where the network is much faster.
Where could you find some sample data?
To play around with data generation and build a custom dataset, I can recommend https://generatedata.com/. I used it to generate 1🍋 records. At the time of writing, the basic yearly plan costs $25, and you won't regret it.
The tool works in the browser, so be ready for a heavy CPU load during generation.
Don't forget to tick *strip whitespace from generated content* to make your file ~20% smaller.
It took me ~3h to generate the 1🍋 dataset, 254MB in size. It is a JSON file that we are going to load into Mongo later on.
You will find links to ready-to-use datasets in the insertion sections below.
I recommend using this tool for generating no more than ~2m records, since it takes considerable time and taxes your laptop's CPU.
Want more data with less effort? Go to — Kaggle
A colleague of mine recommended a great data-science platform with already prepared datasets for testing. After digging through it, I found two that suit us best.
The first is the Yelp dataset. You can review all of its contents, but the main part is one big 5GB JSON file, yelp_academic_dataset_review.json, which contains 6.9🍋 JSON records. Use Dadroid to review massive JSON files.
The second is a big one: 45GB of Steam reviews. I did not know how many records were inside while downloading it, since all the files sit in paginated folders. Later on, we will write a script to sum up the count of all the records. The exact number will be revealed, but believe me, this can be called Big (little) Data.
How to insert the data?
I recommend keeping your dataset outside of the working folder. It will prevent your code editor from indexing that whole tree of files and slowing down text search.
Inserting 1 million records
Let's start with inserting the 1m generated records using just a simple insertMany operation. Template of the generated data:
Data sample:
I decided to try inserting all the docs in one piece without any chunks:
In this script, we simply call an anonymous function that is executed immediately. I'm going to use the native mongodb driver, since all the other libs are built on top of it. I also imported the promisified version of fs so I can await reading the JSON input before parsing it.
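The full script is in the repo linked above; a minimal sketch of the idea could look like this (the connection string, database and collection names, and the file path are placeholder assumptions):

```javascript
const fs = require('fs/promises');
const { MongoClient } = require('mongodb');

const client = new MongoClient(process.env.MONGO_URI); // placeholder connection string

(async () => {
  console.time('Reading json');
  // Read and parse the whole file in one go: fine for ~250MB, as we'll see, not for 5GB
  const data = JSON.parse(await fs.readFile('./data/generated-1m.json')); // hypothetical path
  console.timeEnd('Reading json');

  await client.connect();
  console.log('Connected successfully to server');

  const collection = client.db('performance').collection('generated'); // hypothetical names
  console.log('Started insertion process successfully to server');

  console.time('Inserting records');
  await collection.insertMany(data);
  console.timeEnd('Inserting records');

  await client.close();
})();
```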
It was interesting to see what the free plan is capable of, so I first ran this script on the free Atlas cluster.
> node insert/1m/index.js
Reading json: 3.279s
Connected successfully to server
Started insertion process successfully to server
Inserting records: 2:18.415 (m:ss.mmm)
Mongo did not even feel it! It took Node ~3 seconds to read the file and about 2.5 minutes for Mongo to write 1m records. Also, no more than 3 connections were created for that.
What about the M30 cluster? Let’s test how it would perform running the same script.
node insert/1m/index.js
Reading json: 2.874s
Connected successfully to server
Started insertion process successfully to server
Inserting records: 2:20.470 (m:ss.mmm)
No change in performance here. The current approach is *fine*; you can use it for a one-time insert. But the easy way is not for us. Moving on to the next section.
Inserting 7 million
Now let's work with the 7 million JSON records. Record example:
One review here is less massive than a record in the previously generated dataset; nevertheless, this volume is enough to put some stress on both Node and Mongo.
Would the previous approach work for such a data volume? Let's test. I changed the path to the new dataset and started my script.
node insert/1m/index.js
RangeError [ERR_FS_FILE_TOO_LARGE]: File size (5341868833) is greater than 2 GB
at new NodeError (node:internal/errors:372:5)
at readFileHandle (node:internal/fs/promises:377:11)
at async /Users/mongo-performance/insert/1m/index.js:16:36 {
code: 'ERR_FS_FILE_TOO_LARGE'
2GB is the size limit for a file that can be read into a single Node.js Buffer. That's why we should use streams here!
A quick reference from the doc:
A stream is an abstract interface for working with streaming data in Node.js. There are many stream objects provided by Node.js. For instance, a request to an HTTP server and process.stdout are both stream instances. Streams can be readable, writable, or both. All streams are instances of EventEmitter.
Now we need to make some rough changes to the script.
Line 23 — we create a readable stream and pipe it through the stream-json lib. This is done because consuming raw Buffers from Node.js would make reading more complicated: we would have to stringify the Buffer and parse JSON out of the string on our own, and stream-json does this for us.
Line 25 — the stream is consumed via an EventEmitter. The handler is subscribed to the *data* event, and each data payload contains exactly one JSON record from our 7m dataset file.
Line 27 — all the records are pushed into one accumulator array, arrayToInsert.
Line 29 — the remainder operator is used to trigger the if branch whenever another 100k records have been pushed to the array.
Line 30 — once we enter the if branch, the pipeline is paused to prevent new pushes into the array. Nothing new after that: insertMany, then wait 100ms after each call, resume the pipeline, and repeat until the end event fires and the script exits.
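Putting those pieces together, a minimal sketch of the streaming approach could look like this (the connection string, paths, and collection names are placeholders; I'm also assuming the file is one big JSON array, since for newline-delimited JSON the jsonl Parser from stream-json would be the equivalent):

```javascript
const fs = require('fs');
const { MongoClient } = require('mongodb');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

const client = new MongoClient(process.env.MONGO_URI); // placeholder connection string
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  await client.connect();
  console.log('Connected successfully to server');
  const collection = client.db('performance').collection('reviews'); // hypothetical names

  let arrayToInsert = [];
  let inserted = 0;

  // Pipe the 5GB file through stream-json so we get one parsed record per 'data' event
  const pipeline = fs
    .createReadStream('./data/yelp_academic_dataset_review.json') // hypothetical path
    .pipe(parser())
    .pipe(streamArray());

  pipeline.on('data', async ({ value }) => {
    arrayToInsert.push(value);

    // Every 100k records: pause the stream, flush the accumulator, wait a bit, resume
    if (arrayToInsert.length % 100000 === 0) {
      pipeline.pause();
      inserted += arrayToInsert.length;
      console.time(`Inserting time - ${inserted}`);
      await collection.insertMany(arrayToInsert);
      console.timeEnd(`Inserting time - ${inserted}`);
      arrayToInsert = [];
      await sleep(100);
      pipeline.resume();
    }
  });

  pipeline.on('end', async () => {
    // Flush whatever is left and close the connection
    if (arrayToInsert.length) await collection.insertMany(arrayToInsert);
    await client.close();
  });
})();
```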
In this example, we are inserting 100k-record batches sequentially with a slight delay between them. Testing time!
node ./insert/insert-7mil-m30.js
Connected successfully to server
arrayToInsert size - 354.85045185293717 mb
Inserting time - 100000: 45.277s
--------------
.................
arrayToInsert size - 142.23186546517442 mb
Inserting time - 6900000: 25.911s
--------------
Operation took - 1749.997 seconds = ~29 minutes
Process finished with exit code 0
It took about 29 minutes to insert, which now seems like an unreasonably long time to run.
As you can see, we were using about 30% of the system CPU, and 66 IOPS was the peak value during the insertion.
At 29 minutes per 7m docs, loading 50m records would take almost an eternity (~3.5h) 😊.
We can definitely improve on this by adding some parallel insert operations, and probably by increasing the 100k batch size to let more parallel operations be handled:
Lines 36–42 — radash.parallel is used to run 5 parallel insertions, each with a chunk of 20k records. The chunks were created with the lodash.chunk method.
Later on, this approach will be reused for larger insertions, so take your time to figure out how it works.
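A hedged sketch of that parallel-insert helper (the function name and log format are my own assumptions, modeled on the log labels in the results below). In the streaming script, the accumulator threshold is raised to 200k and this helper is called instead of the single insertMany:

```javascript
const { parallel } = require('radash');
const _ = require('lodash');

// Split the accumulated records into 20k-record chunks and insert 5 chunks at a time.
const insertChunksInParallel = async (collection, records) => {
  const chunks = _.chunk(records, 20000);
  await parallel(5, chunks, async (batch) => {
    const id = Date.now();
    const sizeMb = Buffer.byteLength(JSON.stringify(batch)) / 1024 / 1024;
    const label = `id: ${id} - stats: size ${sizeMb.toFixed(3)} mb, records: ${batch.length} - took: `;
    console.time(label);
    await collection.insertMany(batch);
    console.timeEnd(label);
  });
};
```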
Results:
node insert/7m/7m-m30-per200k-5X20k.js
Connected successfully to server
arrayToInsert size - 269.14077478740194 mb
id: 1664486093579 - stats: size 26.909 mb, records: 20000 - took: : 7.258s
id: 1664486093497 - stats: size 26.581 mb, records: 20000 - took: : 11.450s
................
id: 1664486950224 - stats: size 27.520 mb, records: 20000 - took: : 6.973s
id: 1664486951694 - stats: size 27.230 mb, records: 20000 - took: : 5.562s
--------------
Operation took - 867.229 seconds // ~14 minutes
As we can see, performance improved by about 50%! It's much better now. What about Mongo?
System CPU rose to 40%, which is 10% more than with the sequential 100k insertion.
IOPS also almost doubled from 50 to 100, and connections increased from 40 to ~60.
I'm sure there is still plenty of room for improvement here, but this result is fine for most cases. Now, using this parallelization, we can take on our Big (little) Data.
Insertion of more than 50 🍋
To make this test happen we need lots of data, right? So I chose the Steam dataset: user reviews of 50,000+ games on Steam. As I mentioned before, prior to messing around with the insertion script we need to figure out how many records are actually in there.
Quick overview of the dataset folders
This dataset consists of folders, each of them containing 10 to 2000 JSON files. Each JSON may contain any number of reviews.
In this case, the script should iterate over each folder, read every file, and accumulate a counter of the length of each reviews array. Take a look at one review inside a JSON file:
As you can see, this record is the biggest document among the compared datasets.
One thing I want to add to this script is input parameters to limit the range of folders it walks. That will let us test the script on a subset of the data and parallelize insertion by selecting a range of folders for each process. Now let's check the script itself!
Line 2 — the minimist package is used to make it easy to read CLI parameters inside the Node process. It can now be used like this: *node index.js --from=1 --to=55*
Line 11 — list the directories using fs.promises.readdir.
Line 13 — select the paths that are inside the dataset's folder and cut them down to the exact range we will be working with.
Line 18 — every path inside the loop points to a folder of JSON files, so the same operation is done here: list all the JSON files with fs.promises.readdir.
innerPaths contains all the paths to the JSON files we need to iterate over. I used radash.parallel to read the files in a directory faster. There are no streams here, since almost every file is smaller than 100kb.
Line 27 — parse the JSON out of the Buffer string and read the length of the reviews key, summing it into the doneReadingReview variable.
There is an important console log on line 33: since data insertion will be parallelized later, we need to know how many folders had been read by the time each next million of records was accumulated. Time to run this weirdo script!
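Before the run, here is a minimal sketch of what such a counting script could look like (the dataset location, the reviews key layout, and the progress-logging helper are assumptions based on the description above):

```javascript
const fs = require('fs/promises');
const path = require('path');
const minimist = require('minimist');
const { parallel } = require('radash');

const { from, to } = minimist(process.argv.slice(2)); // node index.js --from=1 --to=55

(async () => {
  console.time('Operation took');
  const datasetDir = './data/steam-reviews'; // hypothetical location
  const allPaths = await fs.readdir(datasetDir);
  console.log('All paths', allPaths.length);

  // Cut the folder list down to the range we were asked to process
  const selectedDirs = allPaths.slice(from, to);

  let doneReadingReview = 0;
  let donePaths = 0;
  let loggedMillions = 0;

  for (const dir of selectedDirs) {
    const innerPaths = await fs.readdir(path.join(datasetDir, dir));

    // Read every JSON file in this folder, a few at a time; the files are small, no streams needed
    await parallel(10, innerPaths, async (file) => {
      const content = await fs.readFile(path.join(datasetDir, dir, file));
      const { reviews = [] } = JSON.parse(content.toString());
      doneReadingReview += reviews.length;
    });

    donePaths += 1;

    // Log progress roughly every additional million reviews, so we can later
    // split the folder range into even parts for parallel insertion
    if (Math.floor(doneReadingReview / 1e6) > loggedMillions) {
      loggedMillions = Math.floor(doneReadingReview / 1e6);
      console.log(`done paths - ${donePaths} done reviews ${doneReadingReview}`);
    }
  }

  console.log('Reviews', doneReadingReview);
  console.timeEnd('Operation took');
})();
```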
node insert/50m+/read-size-of-dataset/index.js --from=1 --to=60000
All paths 51936
done paths - 1953 done reviews 1000066
done paths - 3339 done reviews 4000078
........
done paths - 44528 done reviews 59000025
done paths - 51410 done reviews 63000010
Reviews 63199505
Reviews 63199505
Operation took - 429.955 seconds // ~ 7 minutes
My laptop fans spun up to max RPM while reading this dataset. What do we have here:
51936 folders
incredible 63 199 505 records!
read time of 7 minutes (and that’s just counting the length)
Let's estimate how long it would take to insert 63🍋 using our best result from the previous steps (7🍋 per ~14 minutes): about 2 hours (126 minutes). I don't think we can settle for that.
Insertion of 63 million records
As we saw before, the power is in parallelization. Using the previous section's log output, we can divide the dataset into 6 roughly even parts and reuse the done paths values as --from and --to params for our next script. That gives these folder-range pairs:
1 - 14454
14454 - 25104
25104 - 28687
28687 - 34044
34044 - 42245
42245 - 52000
Since we already prepared our counting script to work on folder ranges, the next step is just to add the insertion. Let's review the finished script and its main parts!
There are 3 main parts that changed:
Lines 46–47 — we simply push the new reviews into an array
Lines 55–65 — when there are more reviews than the threshold, they are inserted into Mongo; and when the loop is on its last iteration, we forcibly insert the data to make sure nothing is left behind
Lines 87–100 — the insertion function, which chunks all the accumulated reviews into batches of 20k and inserts them in parallel (a sketch of these parts follows below)
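A rough sketch of those three parts, reusing the names from the counting-script sketch above (datasetDir, selectedDirs, fs, path, parallel) and _ (lodash) from the earlier helper; the 200k threshold and helper names are assumptions, and the fragment is meant to live inside the same async wrapper with an open Mongo collection:

```javascript
const INSERT_THRESHOLD = 200000; // hypothetical flush threshold
let reviewsToInsert = [];

// Part 3: chunk the accumulated reviews per 20k and insert 5 chunks in parallel
async function insertReviews(collection, reviews) {
  const chunks = _.chunk(reviews, 20000);
  await parallel(5, chunks, (batch) => collection.insertMany(batch));
}

for (const [index, dir] of selectedDirs.entries()) {
  const innerPaths = await fs.readdir(path.join(datasetDir, dir));

  await parallel(10, innerPaths, async (file) => {
    const content = await fs.readFile(path.join(datasetDir, dir, file));
    const { reviews = [] } = JSON.parse(content.toString());
    // Part 1: push the new reviews into the accumulator
    reviewsToInsert.push(...reviews);
  });

  // Part 2: flush once the threshold is crossed, or forcibly on the last folder
  const isLastFolder = index === selectedDirs.length - 1;
  if (reviewsToInsert.length >= INSERT_THRESHOLD || isLastFolder) {
    await insertReviews(collection, reviewsToInsert);
    reviewsToInsert = [];
  }
}
```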
Since we already defined the pairs for roughly equal chunks of the dataset, we can try running the insertion script in 6 parallel processes. To keep things simple, let's just open 6 terminal windows and run it! You could always use the built-in worker_threads module in Node.js or the pm2 tool, but the most primitive solution satisfies our experiment. Testing time!
Results of each parallel execution:
--from=1 --to=14454
Done reading reviews: 10 963 467
Script: took - 4963.341 seconds ~ 1h 22m
--------
--from=14454 --to=25104
Done reading reviews: 10 498 700
Script: took - 5016.944 seconds ~ 1h 23m
--------
--from=25104 --to=28687
Done reading reviews: 10 874 942
Script: took - 5050.838 seconds ~ 1h 24m
---------
--from=28687 --to=34044
Done reading reviews: 11 659 485
Script: took - 5088.016 seconds ~ 1h 24m
---------
--from=34044 --to=42245
Done reading reviews: 10 044 895
Script: took - 4796.953 seconds ~ 1h 19m
---------
--from=42245 --to=52000
Done reading reviews: 9 158 016
Script: took - 4197 seconds ~ 1h 9m
It took about 1 hour and 24 minutes to insert the 63🍋 records! That's 33% faster than our earlier prediction for a single-process run! We loaded almost 45GB of data in under an hour and a half. What about Mongo?
These inserts loaded the cluster pretty heavily. Max System CPU ~65%, process CPU was about 40% during the whole period. Connections stayed at 100 almost the entire time and IOPS averaged 125. I guess these results are good enough now!
One more thing
All the approaches described above are general performance improvements. But there is one more driver-specific hack you can use, depending on your project.
The MongoDB insertMany method has an options param you can tweak to make inserts even faster:
writeConcern
ordered
Using them may also speed up insertion.
But keep in mind that ordered: false means you cannot rely on the documents' insertion order.
writeConcern: {w: 0} requests no acknowledgment of the write operation (source).
If your app's requirements don't rely on either of these guarantees, you are welcome to use them!
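For example, a sketch of passing these options to one of the insertMany calls used above:

```javascript
// Unordered insert with no write acknowledgment: faster, but you give up the
// documents' insertion order and any confirmation that each write succeeded.
await collection.insertMany(arrayToInsert, {
  ordered: false,
  writeConcern: { w: 0 },
});
```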
Wrap up!
You can easily insert 1m records using insertMany. It should not have any dramatic impact on your app's performance.
Loading 1m to 10m records will probably require the streams API and some parallelization of inserts. It may impact the performance of the app because of the insertion time and the increased write load.
To load more than 10 million records quickly, you need to parallelize both processes and insertions. App performance can degrade dramatically, since cluster CPU and IOPS are heavily used, so schedule the script run or tune how many operations/processes execute in parallel.
Immense appreciation for reading up to this point; I hope it helps you become a better dev.
Subscribe to my Medium, follow me on Twitter, get connected on LinkedIn, and see code examples on GitHub!