Creating lots of records in Rails with insert_all

Loran Kloeze

Posted on July 1, 2022

Imagine importing a lot of data into your Rails app. I'm talking about more than 100,000 records. Is it a good idea to just call Address.create(data) 100,000 times? No, probably not. Let's find out a better way to bulk create data in your Rails database.

The app

We're creating an app that imports customer data from a CSV file. The file contains 1,000 rows of 6 fields each. It looks something like this:

id,first_name,last_name,email,gender,ip_address
1,Lianne,Hosten,lhosten0@cnbc.com,Female,112.33.73.252
2,Aubrie,Dorin,adorin1@unesco.org,Female,254.88.120.47
...

The requirement is simple: we want each row in the CSV to end up as a saved ActiveRecord model called Customer.

Model

The model and associated migration:

# The model
class Customer < ApplicationRecord
end

# And its migration
class CreateCustomers < ActiveRecord::Migration[7.0]
  def change
    create_table :customers do |t|
      t.string :first_name
      t.string :last_name
      t.string :email
      t.string :gender
      t.string :ip_address

      t.timestamps      
    end
  end
end

Inserting data

There are many ways to get the CSV file into your app. You might download it from a URL, read it from the filesystem or receive it through a submitted form with a file input. Either way, you'll usually end up with a CSV file in a controller action.
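
For instance, a minimal upload form could look like this. The route and the param name :csv_file are assumptions chosen to match the controller below; form_with switches to a multipart form automatically once it contains a file field:

# config/routes.rb
post 'customers/create_bulk', to: 'customers#create_bulk'

<%# app/views/customers/new_bulk.html.erb %>
<%= form_with url: '/customers/create_bulk' do |f| %>
  <%= f.file_field :csv_file %>
  <%= f.submit 'Import' %>
<% end %>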

Making it work

When you're developing software, most of the time you should focus on making things work before worrying about performance. So let's create something that works.

require 'csv'
require 'benchmark'

class CustomersController < ApplicationController

  def create_bulk
    file = params[:csv_file] # Uploaded through a form

    time = Benchmark.measure do
      # CSV.open with headers: true yields CSV::Row objects
      CSV.open(file.path, headers: true).each do |row|
        Customer.create(row.to_h) # one INSERT query per row
      end
    end

    puts time
  end
end


Benchmark results: 3.884280 0.871490 4.755770 ( 13.748814)
Yeah, that took 13 seconds on my machine to create only 1,000 customers. Imagine importing 250,000 of them...

Making it work fast

So, it's too slow. What are we going to do? Well, ActiveRecord has a method called insert_all.

The method insert_all receives an array of attribute hashes, one per record, and builds a single SQL INSERT statement that inserts all the records in one round trip to the database. For example:

customers = [
  {first_name: 'Lianne', last_name: 'Hosten', email: 'lhosten0@cnbc.com', gender: 'Female', ip_address: '112.33.73.252'},
  {first_name: 'Aubrie', last_name: 'Dorin', email: 'adorin1@unesco.org', gender: 'Female', ip_address: '254.88.120.47'},
  ...
]
Customer.insert_all(customers)

The resulting SQL statement:

Customer Bulk Insert (7.9ms)  INSERT INTO "customers"
("first_name","last_name","email","gender","ip_address","created_at","updated_at") 
VALUES 
('Lianne', 'Hosten', 'lhosten0@cnbc.com', 'Female', '112.33.73.252', STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW'), STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW')), 
('Aubrie', 'Dorin', 'adorin1@unesco.org', 'Female', '254.88.120.47', STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW'), STRFTIME('%Y-%m-%d %H:%M:%f', 'NOW')),
...
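
A side note: by default insert_all silently skips rows that violate a unique index, and the unique_by option lets you point it at a specific index (supported on SQLite and PostgreSQL). A sketch, assuming you have added a unique index on email:

# Skip rows whose email already exists (requires a unique index on :email)
Customer.insert_all(customers, unique_by: :email)

# Or update existing rows instead of skipping them
Customer.upsert_all(customers, unique_by: :email)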

Let's give this a try in our app.

...
class CustomersController < ApplicationController

  def create_bulk
    file = params[:csv_file] # Uploaded through a form

    time = Benchmark.measure do
      # Import in batches of 250 rows: one INSERT query per batch
      CSV.open(file.path, headers: true).each_slice(250) do |rows|
        Customer.insert_all(rows.map(&:to_h))
      end
    end

    puts time
  end
end


Benchmark results: 0.047267 0.035247 0.082514 ( 0.130676)

Hey, what is each_slice? Normally each enumerates a collection and yields one element at a time. each_slice instead yields the elements in chunks of n, and each chunk is an array.
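
A quick demonstration:

(1..10).each_slice(3) { |chunk| p chunk }
# [1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
# [10]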

The CSV contains 1,000 rows, so the code above generates 1,000 / 250 = 4 SQL INSERT queries instead of 1,000. Another thing that speeds up the import is that no Customer models are instantiated, which shaves off quite a few more CPU cycles.

It took 0.13 seconds instead of 13 seconds... 100 times faster.
Alright, why should I ever use #create again? Well, read on.

Should I use insert_all whenever possible?

Wow, 100 times faster; there must be a catch. Yes, there is, and it's a big one: callbacks and validations are not triggered.

I repeat: callbacks and validations are not triggered when using insert_all.

So make sure the data you use with insert_all is valid(ish), and check your callbacks to find out whether anything needs to run before or after the insert_all. One way to keep your validations in play is sketched below.
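
If you do want your model validations applied, one option is to validate an unsaved instance per row and only bulk insert the rows that pass. A sketch; note that this re-introduces part of the instantiation cost, so benchmark it against your own data:

rows = CSV.open(file.path, headers: true).map(&:to_h)

# Run the validations without saving; decide yourself what to do
# with the invalid rows: log them, report them back to the user, etc.
valid_rows, invalid_rows = rows.partition { |attrs| Customer.new(attrs).valid? }

Customer.insert_all(valid_rows) if valid_rows.any?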

Important notes

  • Use insert_all only when it's okay that callbacks and validations are skipped.
  • The benchmark results I compared are the so-called elapsed real times. Those are nice for a human sense of the time elapsed, but they are less ideal when you really want to compare how long the CPU was busy chewing on your code (see the sketch below).
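
For reference on that last note: Benchmark.measure returns a Benchmark::Tms object. The first three numbers are user CPU time, system CPU time and their sum; the number in parentheses is the elapsed real (wall-clock) time:

require 'benchmark'

tms = Benchmark.measure { 100_000.times { rand } }
puts tms       # user CPU, system CPU, total CPU, (elapsed real time)
puts tms.utime # user CPU time only
puts tms.real  # elapsed real time only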