Batching – A powerful way to solve N+1 queries every Rubyist should know
exAspArk
Posted on August 16, 2017
In this post, I’m going to tell you about batching as a technique to help avoid N+1 queries, existing battle-tested tools like Haskell Haxl and JavaScript DataLoader, and how similar approaches can be used in any Ruby program.
What are N+1 queries?
First, let’s find out what N+1 queries are and why they are called that. Imagine we have 2 SQL tables: users and posts. If we run the following code using ActiveRecord models:
posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)
users = posts.map { |post| post.user }
# SELECT * FROM users WHERE id = 1
# SELECT * FROM users WHERE id = 2
# SELECT * FROM users WHERE id = 3
This makes 1 query to select all posts, then N queries to select a user for each post. So the code produces 1+N queries. Because changing the order of the addends does not change the sum, it’s commonly called N+1 queries.
How do people usually solve N+1 queries?
Generally, there are 2 popular ways to solve the problem of N+1 queries in the Ruby world:
- Eager loading data in models
posts = Post.where(id: [1, 2, 3]).includes(:user)
# SELECT * FROM posts WHERE id IN (1, 2, 3)
# SELECT * FROM users WHERE id IN (1, 2, 3)
users = posts.map { |post| post.user }
It preloads the specified associations under the hood, which makes it easy to use. But an ORM can’t always help – for example, when we need to load data from a different source, such as another database.
- Preloading data and passing it through arguments
class Post < ApplicationRecord
def rating(like_count, angry_count)
like_count * 2 - angry_count
end
end
posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)
post_emoticons = Emoticon.where(post_id: posts.map(&:id))
like_count_by_post_id = post_emoticons.like.group(:post_id).count
angry_count_by_post_id = post_emoticons.angry.group(:post_id).count
# SELECT COUNT(*) FROM emoticons WHERE name = 'like' AND post_id IN (1, 2, 3) GROUP BY post_id
# SELECT COUNT(*) FROM emoticons WHERE name = 'angry' AND post_id IN (1, 2, 3) GROUP BY post_id
posts.map do |post|
post.rating(
like_count_by_post_id[post.id],
angry_count_by_post_id[post.id]
)
end
This method is very flexible and can work when the simpler includes approach can’t. It may also be more memory-efficient: in the example above we don’t load all emoticons to calculate a rating for each post. Instead, the database does all the hard work and returns just a count for each post, which is passed as an argument. However, passing data through arguments can get complicated, especially when there are several layers in between (e.g. loading Emoticons and passing them through Users down to Posts).
Both of these approaches can help us avoid N+1 queries. But the problem is that we have to know in advance, at the top level, which data to preload. And we will preload it every time, even when it’s not necessary. For example, with GraphQL these approaches don’t work, since we can’t predict which fields users are going to ask for in a query. Is there any other way to avoid N+1 queries? Yes, it’s called batching.
What is batching?
Batching isn’t a new way to solve N+1 queries. Facebook released the Haskell Haxl library in 2014, but the technique had been used long before that. It relies on such concepts as Monad, Applicative and Functor. I won’t get into explaining these ideas here, as they deserve a separate post (would you be interested in learning more about functional programming in Ruby?).
The idea of batching has also been implemented in other programming languages. One of the most well-known libraries is JavaScript DataLoader, which became very popular with the rise of GraphQL. Here is a great video about its source code by Lee Byron, an engineer at Facebook. The code is pretty straightforward and contains just 300 lines.
These are the general steps for batching:
- Passing an item to load in any part of the app.
- Loading and caching values for the passed items in batch.
- Getting the loaded value where the item was passed.
The main advantage of this technique is that batching is isolated: it allows you to load data where and when it’s needed.
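The three steps above can be sketched in plain Ruby with a toy loader. This is a hand-rolled illustration of the idea only – MiniLoader is a hypothetical class, not the API of any real library:

```ruby
# A toy illustration of the three batching steps above
# (MiniLoader is a made-up class, not a real library's API).
class MiniLoader
  def initialize(&batch_block)
    @batch_block = batch_block
    @items = []
  end

  # Step 1: pass an item to load and get back a lazy handle.
  def load(item)
    @items << item
    -> { values[item] }
  end

  # Step 2: load and cache values for all collected items in one batch.
  def values
    @values ||= @batch_block.call(@items.uniq)
  end
end

# The batch block receives all collected ids at once and returns a Hash
# mapping each id to its value – here a fake "users table".
loader = MiniLoader.new do |ids|
  ids.to_h { |id| [id, "user-#{id}"] } # one "query" instead of N
end

lazy_users = [1, 2, 3].map { |id| loader.load(id) }

# Step 3: resolve where needed – the batch block runs once, on first call.
lazy_users.map(&:call)
# => ["user-1", "user-2", "user-3"]
```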
Here is a basic example of using JavaScript DataLoader:
var batch = (userIds) => ...;
var loader = new DataLoader(userIds => batch(userIds));
// “load” schedules a job to dispatch a queue with
// Node.js “process.nextTick” and returns a promise
loader.load(userId1).then(user1 => console.log(user1));
loader.load(userId2).then(user2 => console.log(user2));
loader.load(userId3).then(user3 => console.log(user3));
First, we create a loader with a function that accepts all the collected items to load (userIds). These items are passed to our batch function, which loads all users at once. Then we can call the loader.load function, which returns a promise with the loaded value (user). OK, but what about Ruby?
Batching in Ruby
At Universe, we have monthly hackathons where everyone is free to experiment with any ideas and technologies such as Ethereum, Elixir, Progressive Web Apps, etc. During my last hackathon, in which I won one of the nominations, I was learning about existing techniques to avoid N+1 queries in GraphQL and building a tool to transform a GraphQL query into a MongoDB Aggregation Pipeline for batching. Check out our previous post, which describes how to use the Aggregation Pipeline to “join” different collections, filter them, serialize them and so on.
At the same time, while we’re migrating to GraphQL, we’re still supporting our RESTful APIs. Usually, N+1 DB queries and HTTP requests are the main problems which cause bottlenecks in our applications as we continue to scale our platform.
That is why we decided to create a new tool that allows us to solve N+1 queries in our existing RESTful APIs as well as in GraphQL – a simple tool which every Ruby developer will be able to understand and use. It’s called BatchLoader.
Laziness
class Post < ApplicationRecord
belongs_to :user
def user_lazy
# something cool with BatchLoader
end
end
posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)
users_lazy = posts.map { |post| post.user_lazy }
BatchLoader.sync!(users_lazy)
# SELECT * FROM users WHERE id IN (1, 2, 3)
BatchLoader doesn’t try to mimic implementations in other programming languages which have an asynchronous nature. So, it doesn’t use any extra primitives such as Promises. There is no reason to use them in Ruby unless you’re using something like EventMachine.
Instead, it uses the idea of “lazy objects”, which already exist in the Ruby standard library. For example, lazy enumerators allow you to manipulate elements and resolve them at the end, only when necessary:
range = 1..Float::INFINITY
values_lazy = range.lazy.map { |i| i * i }.take(10)
values_lazy.force
# => [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
As you can see, the 2 code blocks above have the same pattern:
- Collect lazy objects.
- Resolve them at the end.
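The same two-step pattern can be hand-rolled with a tiny lazy object: the block is stored on creation and only runs when forced. LazyValue below is a made-up class for illustration, not part of the standard library:

```ruby
# A minimal hand-rolled lazy object (LazyValue is a hypothetical class,
# shown only to illustrate the collect-then-resolve pattern).
class LazyValue
  def initialize(&computation)
    @computation = computation
  end

  # Run the stored block on first call and memoize the result.
  def force
    @value ||= @computation.call
  end
end

# 1. Collect lazy objects – nothing is computed yet.
lazy_squares = (1..5).map { |i| LazyValue.new { i * i } }

# 2. Resolve them at the end.
lazy_squares.map(&:force)
# => [1, 4, 9, 16, 25]
```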
Batching
Now let’s have a closer look at Post#user_lazy, which returns a lazy BatchLoader instance:
# app/models/post.rb
def user_lazy
BatchLoader.for(user_id).batch do |user_ids|
User.where(id: user_ids)
end
end
BatchLoader.for accepts an item (user_id) which is collected and used for batching later. Then we call the batch method, passing a block which will receive all the collected items (user_ids). Inside the block, we execute a batch query for our items (User.where).
JavaScript DataLoader maps passed items to loaded values implicitly. But it relies on these 2 constraints:
- The array of passed items (user_ids) must be the same length as the array of loaded values (users). This usually means that we have to add nils to the array for absent values.
- Each index in the array of passed items must correspond to the same index in the array of loaded values. This usually means that we should sort the loaded values.
BatchLoader, in this case, provides a load method which can simply be called to map a passed item (user_id) to a loaded value (user):
# app/models/post.rb
def user_lazy
BatchLoader.for(user_id).batch do |user_ids, batch_loader|
User.where(id: user_ids).each { |u| batch_loader.load(u.id, u) }
end
end
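To see why the explicit load call sidesteps DataLoader’s two constraints, consider a batch result that comes back unsorted and with a missing record. Loading through a plain Hash (which is roughly what the load call does for us) handles both cases without padding or sorting. The data below is made up for illustration:

```ruby
# Why explicit item-to-value mapping helps: the batch result doesn't
# need to be sorted or padded with nils. (Hypothetical example data.)
requested_ids = [1, 2, 3]

# Imagine the database returned rows out of order, and user 2 is gone.
rows = [{ id: 3, name: "Carol" }, { id: 1, name: "Alice" }]

loaded = {}
rows.each { |row| loaded[row[:id]] = row } # like batch_loader.load(u.id, u)

requested_ids.map { |id| loaded[id] }
# => [{:id=>1, :name=>"Alice"}, nil, {:id=>3, :name=>"Carol"}]
```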
RESTful API example
Now imagine we have a regular Rails application with N+1 HTTP requests:
# app/models/post.rb
class Post < ApplicationRecord
def rating
HttpClient.request(:get, "https://example.com/ratings/#{id}")
end
end
# app/controllers/posts_controller.rb
class PostsController < ApplicationController
def index
posts = Post.limit(10)
serialized_posts = posts.map do |post|
{id: post.id, rating: post.rating} # <== N+1 HTTP requests
end
render json: serialized_posts
end
end
We can batch the requests with a gem called parallel, executing all HTTP requests concurrently in threads. Thankfully, MRI releases the GIL (global interpreter lock) when a thread hits blocking I/O – an HTTP request in our case.
# app/models/post.rb
def rating_lazy
BatchLoader.for(post).batch do |posts, batch_loader|
Parallel.each(posts, in_threads: 10) do |post|
batch_loader.load(post, post.rating)
end
end
end
# app/controllers/posts_controller.rb
class PostsController < ApplicationController
def index
posts = Post.limit(10)
serialized_posts = posts.map do |post|
{id: post.id, rating: post.rating_lazy}
end
render json: BatchLoader.sync!(serialized_posts)
end
end
Thread-safety
The example with concurrent HTTP requests in threads will work only if HttpClient is thread-safe. BatchLoader#load is thread-safe out of the box, so it doesn’t need any extra dependencies.
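To illustrate the concern, here is a minimal sketch of what thread-safe collection of results looks like with plain threads and a Mutex. This is a simplified stand-in for the gem’s internal synchronization, not its actual code, and the rating calculation is faked:

```ruby
# A sketch of thread-safe result collection: several threads write into
# one shared Hash, guarded by a Mutex. (Simplified illustration, not
# BatchLoader's actual internals; the "rating" is faked.)
post_ids = [1, 2, 3, 4, 5]
results = {}
mutex = Mutex.new

threads = post_ids.map do |post_id|
  Thread.new do
    rating = post_id * 10 # stands in for a blocking HTTP call
    mutex.synchronize { results[post_id] = rating } # like batch_loader.load
  end
end
threads.each(&:join)

results.sort.to_h
# => {1=>10, 2=>20, 3=>30, 4=>40, 5=>50}
```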
GraphQL example
Batching is particularly useful with GraphQL. Using such techniques as preloading data in advance to avoid N+1 queries can be very complicated since a user can ask for any available fields in a query. Let’s take a look at the simple graphql-ruby schema example:
Schema = GraphQL::Schema.define do
query QueryType
end
QueryType = GraphQL::ObjectType.define do
name "Query"
field :posts, !types[PostType], resolve: ->(obj, args, ctx) do
Post.all
end
end
PostType = GraphQL::ObjectType.define do
name "Post"
field :user, !UserType, resolve: ->(post, args, ctx) do
post.user # <== N+1 queries
end
end
UserType = GraphQL::ObjectType.define do
name "User"
field :name, !types.String
end
If we execute a simple query like the following, we will get N+1 queries for each post.user:
query = "
{
posts {
user {
name
}
}
}
"
Schema.execute(query)
# SELECT * FROM posts WHERE id IN (1, 2, 3)
# SELECT * FROM users WHERE id = 1
# SELECT * FROM users WHERE id = 2
# SELECT * FROM users WHERE id = 3
To avoid this problem, all we have to do is change the resolver to use BatchLoader:
PostType = GraphQL::ObjectType.define do
name "Post"
field :user, !UserType, resolve: ->(post, args, ctx) do
BatchLoader.for(post.user_id).batch do |ids, batch_loader|
User.where(id: ids).each { |u| batch_loader.load(u.id, u) }
end
end
end
Then we set up GraphQL with the built-in lazy_resolve method:
Schema = GraphQL::Schema.define do
query QueryType
lazy_resolve BatchLoader, :sync
end
That’s it. GraphQL lazy_resolve will basically call the resolve lambda on the field. If it returns an instance of a lazy BatchLoader, it’ll call BatchLoader#sync later to get the actual loaded value automatically.
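Conceptually, this boils down to checking each resolved value and syncing only the lazy ones. The sketch below uses a made-up FakeLazy stand-in instead of the real BatchLoader, and is a simplification of what graphql-ruby does, not its actual implementation:

```ruby
# A conceptual sketch of lazy_resolve (FakeLazy is a hypothetical
# stand-in for BatchLoader; this is not graphql-ruby's real code).
class FakeLazy
  def initialize(value)
    @value = value
  end

  # Stands in for BatchLoader#sync, which returns the loaded value.
  def sync
    @value
  end
end

# Resolvers may return plain values or lazy ones...
resolved = [FakeLazy.new("Alice"), "Bob"]

# ...and the schema syncs only the lazy ones afterwards.
resolved.map { |value| value.is_a?(FakeLazy) ? value.sync : value }
# => ["Alice", "Bob"]
```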
Caching
BatchLoader also provides a caching mechanism out of the box. So, it won’t make queries for already loaded values. For example:
def user_lazy(id)
BatchLoader.for(id).batch do |ids, batch_loader|
User.where(id: ids).each { |u| batch_loader.load(u.id, u) }
end
end
user_lazy(1) # no request
# => <#BatchLoader>
user_lazy(1).sync # SELECT * FROM users WHERE id IN (1)
# => <#User>
user_lazy(1).sync # no request
# => <#User>
Conclusion
Overall, batching is a powerful technique for avoiding N+1 queries, and I believe every Ruby developer should know about it – not just people who use GraphQL or other programming languages. It allows you to decouple unrelated parts of your application and load data in batch where and when it’s needed, without sacrificing performance.
At Universe, we’re using BatchLoader on production for both RESTful API and GraphQL by sharing the same code. For more information, check out the README and the source code, which has just 150 lines.
Originally published on Medium.