Loran Kloeze
Posted on July 1, 2022
Imagine importing a lot of data in your Rails app. I'm talking about > 100,000 records. Is it a good idea to just run 100,000 x Address.create(data)? No, probably not. Let's find out what a better way is to bulk create data in your Rails database.
The app
We're creating an app that imports customer data from a csv file. The file contains 1,000 rows, each consisting of 6 fields. It looks something like this:
id,first_name,last_name,email,gender,ip_address
1,Lianne,Hosten,lhosten0@cnbc.com,Female,112.33.73.252
2,Aubrie,Dorin,adorin1@unesco.org,Female,254.88.120.47
...
The requirement is simple: we want each row in the csv to have a saved ActiveRecord counterpart called Customer.
Model
The model and associated migration:
# The model
class Customer < ApplicationRecord
end

# And its migration
class CreateCustomers < ActiveRecord::Migration[7.0]
  def change
    create_table :customers do |t|
      t.string :first_name
      t.string :last_name
      t.string :email
      t.string :gender
      t.string :ip_address

      t.timestamps
    end
  end
end
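For reference, a model and migration like this could be produced in one go with the Rails model generator. This is one possible invocation, not necessarily how the original app was built:

bin/rails generate model Customer first_name:string last_name:string email:string gender:string ip_address:string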
Inserting data
There are many ways to get the csv file into your app. You might download it from a URL, read it from the filesystem, or receive it from a submitted form with a file input. In any case, you'll usually end up with a csv file in a controller action.
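As a concrete example, here is a minimal sketch of the form-upload path. The route and view below are assumptions made to give params[:csv_file] in the following controller code something to point at; they are not part of the original app:

# config/routes.rb (assumed)
resources :customers do
  collection { post :create_bulk }
end

<%# app/views/customers/new_bulk.html.erb (assumed) %>
<%= form_with url: create_bulk_customers_path, multipart: true do |form| %>
  <%= form.file_field :csv_file %>
  <%= form.submit 'Import' %>
<% end %>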
Making it work
When you're developing software, most of the time you should first focus on getting things to work before worrying about performance. So let's create something that works.
require 'csv'
require 'benchmark'

class CustomersController < ApplicationController
  def create_bulk
    file = params[:csv_file] # Uploaded through a form

    time = Benchmark.measure do
      CSV.open(file.path, headers: true).each do |row|
        # row is a CSV::Row, so convert it to a plain attributes hash
        Customer.create(row.to_h)
      end
    end

    puts time
  end
end
Benchmark results: 3.884280 0.871490 4.755770 ( 13.748814)
Yeah, that took 13 seconds on my machine for creating only 1,000 customers. Imagine importing 250,000 of them...
Making it work fast
So, it's too slow. What are we going to do? Well, there's a method from ActiveRecord called insert_all. The method insert_all receives an array of attribute hashes, one per record, and creates a single SQL INSERT statement to insert all the records in one trip. For example:
customers = [
  { first_name: 'Lianne', last_name: 'Hosten', email: 'lhosten0@cnbc.com', gender: 'Female', ip_address: '112.33.73.252' },
  { first_name: 'Aubrie', last_name: 'Dorin', email: 'adorin1@unesco.org', gender: 'Female', ip_address: '254.88.120.47' },
  ...
]
Customer.insert_all(customers)
The resulting SQL statement:
Customer Bulk Insert (7.9ms) INSERT INTO "customers"
("first_name","last_name","email","gender","ip_address","created_at","updated_at")
VALUES
('Lianne', 'Hosten', 'lhosten0@cnbc.com', 'Female', '112.33.73.252', STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW'), STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW')),
('Aubrie', 'Dorin', 'adorin1@unesco.org', 'Female', '254.88.120.47', STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW'), STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW')),
...
Let's give this a try in our app.
...
class CustomersController < ApplicationController
  def create_bulk
    file = params[:csv_file] # Uploaded through a form

    time = Benchmark.measure do
      CSV.open(file.path, headers: true).each_slice(250) do |rows|
        Customer.insert_all(rows.map(&:to_h))
      end
    end

    puts time
  end
end
Benchmark results: 0.047267 0.035247 0.082514 ( 0.130676)
Hey, what is each_slice? Normally each enumerates a collection and gives you one element at a time to work on. each_slice instead gives you n elements at a time. Yep, it hands you back an array.
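A quick demonstration in irb:

(1..10).each_slice(4) { |slice| p slice }
# [1, 2, 3, 4]
# [5, 6, 7, 8]
# [9, 10]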
The csv contains 1,000 rows, so the code above generates 1,000 / 250 = 4 SQL INSERT queries instead of 1,000 INSERT queries. Another thing that speeds up the import is that the Customer model is never instantiated, which shaves off quite a few more CPU cycles.
It took 0.13 seconds instead of 13 seconds... 100 times faster.
Alright, why should I ever use #create again? Well, read on.
Should I use insert_all whenever possible?
Wow, 100 times faster, there must be a catch. Yes, there is, and it's a big one. The huge drawback here is that callbacks and validations are not triggered. I repeat: callbacks and validations are not triggered when using insert_all.
So make sure the data you use with insert_all is valid(ish). Check your callbacks to find out if there is anything that should run before or after your insert_all.
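If you do want a safety net, one option is to run the validations yourself on unsaved instances before handing the hashes to insert_all. This is a sketch of the idea, not code from the app above, and it reintroduces part of the instantiation cost:

rows = CSV.read(file.path, headers: true).map(&:to_h)

# insert_all skips validations, so check them manually on throwaway,
# unsaved instances first
invalid_rows = rows.reject { |attrs| Customer.new(attrs).valid? }

if invalid_rows.any?
  # log or report the offending rows instead of inserting blindly
elsif rows.any?
  Customer.insert_all(rows)
end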
Important notes

- Use insert_all only when callbacks and validations are okay to be ignored.
- The benchmark results that I compared are the so-called elapsed real times. Those are nice to give you a human sense of the time elapsed, but they are less ideal when you really want to compare the time the CPU was busy chewing on your code (see the sketch below).
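For what it's worth, you don't have to eyeball the printed line: Benchmark.measure returns a Benchmark::Tms object whose components you can read directly (standard Ruby stdlib, shown here on a made-up workload):

require 'benchmark'

result = Benchmark.measure do
  100_000.times { 'a' * 100 }
end

result.utime # user CPU time
result.stime # system CPU time
result.total # utime + stime (plus any child process CPU time)
result.real  # elapsed real (wall-clock) time, the number in parentheses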