A Bit of Advanced ActiveRecord Querying - With Benchmarks

Introduction

The SQL provides us many features to query with efficiency: sub-queries, custom joins, aggregations, etc. We can use all of this with an ORM like ActiveRecord, however often we ignore it. Why? I believe there are three main reasons for this:

You're not an advanced user of SQL;
You don't know enough of ActiveRecord to use its advanced features;
You simply are attained to some bad habits because of the facilities of your programming language (after all, it's much simpler to iterate over ruby objects doing simply comparisons than thinking in a complex custom join, right?).

All good with that until you don't face the side-effects of poor writing code: memory leak and low performance.

This post has the intention of deepening a little bit more in what ActiveRecord can do to help you avoid the problems mentioned above, and benchmark the ActiveRecord approach with some other solution using a ruby Enumerable.

Note: If you wanna follow along with the coding of this post, you can use this script with all the data, queries, and comparisons I used.

Domain

In this post, a simplified version of a Cryptocurrency Exchange will be used as a domain. To better understand what it's being queried in the next sections, here is an explanation of the meaning of each table:

networks: The networks with a running blockchain, each one working with a specific set of currencies;
currencies: Cryptocurrencies;
currency_networks: An association table linking a currency to a network, meaning that this currency can be processed in the specified network;
pairs: A pair of currencies that can be traded, e.g., "BTC-ETH" means "buying ETH (quote currency) with BTC (base currency)";
tickers: Last 24h information from a specific currency pair.

It's not in the scope of this post to discuss anything related to the right way of modeling this domain. The domain was built just thinking in exemplifying queries.

Querying `belongs_to` associations

Consider the following domain models:

class Network < ActiveRecord::Base
  has_many :currency_networks
  has_many :currencies, through: :currency_networks
end

class Currency < ActiveRecord::Base
  has_many :currency_networks
  has_many :networks, through: :currency_networks
end

class CurrencyNetwork < ActiveRecord::Base
  belongs_to :currency
  belongs_to :network
end

Let's say we wanna have all currency networks where the network is active. We can solve this using a ruby Enumerable:

CurrencyNetwork.all.select { |cn| cn.network.active? }

Although that solves our problem, we're potentially scaling to new ones:

Our select block is hitting the database for each CurrencyNetwork we retrieved;
We're retrieving more data than we really need with cn.network: we don't wanna retrieve the Network data, just the CurrencyNetwork. Bare in mind that when we call cn.network we are instantiating a new ActiveRecord object to store the Network result, consuming more memory than needed.

Those aren't big problems to a situation where we don't have many records in the database, as in the case of the script I'm running to exemplify this command, but you can imagine what that cause for your application depending on how frequent you run this action and the number of rows in your database.

We can improve this by delegating great part of the heavy lifting to our database doing this:

CurrencyNetwork.joins(:network).where(networks: { active: true })

If we append to_sql at the end of this query we can see the SQL that's producing:

SELECT "currency_networks".* 
  FROM "currency_networks" 
  INNER JOIN "networks" 
  ON "networks"."id" = "currency_networks"."network_id"

joins is a powerful resource responsible for joining our tables (INNER JOIN), what we can do because CurrencyNetwork belongs_to :network (specified on the ON clause). ActiveRecord provides us a clean way to access the table attributes from our belongs_to association with the hash syntax: { networks: { active: true } }.

What are the benefits of your query approach:

Performance: We only hit the database once, compared to the potentially thousands of times with the Enumerable solution;
Spare memory: We only receive ActiveRecord instances from CurrencyNetwork instead of the Network instances that came as plus with the Enumerable solution.

We don't need to rely upon a priori reasoning only, we can use memory_profiles and benchmark_ips to compare the memory consumption and iterations per second of each solution.

Belongs to query - Enumerable solution (memory profiler):
Total allocated: 120236 bytes (1290 objects)
Total retained:  27100 bytes (224 objects)

Belongs to query - ActiveRecord query solution (memory profiler):
Total allocated: 4584 bytes (52 objects)
Total retained:  360 bytes (7 objects)

Considering how the database was populated in the script used for this test, with the query solution we consume ~26x less memory with the total allocations and ~75x less memory with retained allocations.

IPS - Belongs to query - ActiveRecord query solution:    17716.1 i/s
IPS - Belongs to query - Enumerable solution:      656.8 i/s - 26.97x  (± 0.00) slower

Comparing the iterations per second, we can check that the Enumerable solution is 26.97x slower.

Querying `has_many` associations

Consider the same domain models of the previous section. We wanna do a slightly similar query that the previous one, but now our result must be the currencies belonging to active networks.

Enumerable solution:

Currency.all.select { |currency| currency.networks.where(active: true).any? }

In this example, I'm doing things propositional worse mixing a query with a where clause, but this is probably with what you'd come up to if not using joins.

ActiveRecord query solution:

Currency
  .joins(currency_networks: :network)
  .where(networks: { active: true })

You can notice that we use the same interface to query has_many as the one used with belongs_to, so there's nothing new here besides the pluralized relationship name in the joins.

The benefits are the same we saw in the previous section, and we can benchmark both approaches here too:

Has many query - Enumerable solution (memory profiler):
Total allocated: 171024 bytes (2107 objects)
Total retained:  24566 bytes (243 objects)

Has many query - ActiveRecord query solution (memory profiler):
Total allocated: 4464 bytes (50 objects)
Total retained:  280 bytes (6 objects)

We can see an, even more, accentuated memory consumption in this case with the Enumerable solution. The ActiveRecord query solution consumed ~38x less memory in the total allocated, and ~87x less memory in the total retained.

IPS - Has many query - ActiveRecord query solution:    23395.8 i/s
IPS - Has many query - Enumerable solution:      569.6 i/s - 41.08x  (± 0.00) slower

No surprise, the Enumerator solution is 41.08x slower.

Querying With Aggregations and Subqueries

Aggregations are actions that imply in perform any mathematical operation in your data, be it counts, sums, averages calculations, or comparisons to retrieve maximums and minimums.

Consider the following domain models:

class Currency < ActiveRecord::Base
  has_many :currency_networks
  has_many :networks, through: :currency_networks
  has_many :pairs_as_base, class_name: "Pair", foreign_key: :base_currency_id
  has_many :pairs_as_quote, class_name: "Pair", foreign_key: :quote_currency_id
end

class Pair < ActiveRecord::Base
  has_many :tickers

  belongs_to :base_currency, class_name: "Currency", foreign_key: :base_currency_id
  belongs_to :quote_currency, class_name: "Currency", foreign_key: :quote_currency_id
end

class Ticker < ActiveRecord::Base
  belongs_to :pair

  validates_presence_of :trade_count
end

Using aggregation we can answer how many trades were performed in all pairs:

Ticker.avg(:trade_count)
# => 458926.666666667

We deliver the heavy lifting of looking at each ticker record and performing the calculation to our database, which is good. At this point, you already have a notion of the performance improvement of ActiveRecord queries over Enumerable based solutions to do the same, so I'll spare you of the performance metrics in more simple examples like this one.

Bear in mind the result of this query I added as commentary in the snippet above: it's an integer, not an ActiveRecord relation. So it's not possible to keep chaining queries after calling an aggregation on that way.

You can expect another outcome if you're using aggregation over grouped columns. Let's say we wanna retrieve the sum of trade_count for each currency that is a base_currency of a Pair:

Currency
  .joins(pairs_as_base: :tickers)
  .group("currencies.name").sum(:trade_count)
# => {"Bitcoin"=>930518, "Ethereum"=>103466, "Polkadot"=>342796}

As a result, you have a hash with your aggregated answer.

Subqueries

Subqueries are queries inside queries, and can be placed basically anywhere in your query: SELECT, WHERE, FROM, etc. You can perform interesting queries combining aggregations with subqueries.

Let's say we wanna list all pairs are inside tickers that have the trade_count above the average trade counts of all tickers:

# Hits the database twice
Pair
  .joins(:tickers)
  .where("trade_count >= :avg", avg: Ticker.average(:trade_count))

# Hits the database only once
Pair
  .joins(:tickers)
  .where("trade_count >= (:avg)", avg: Ticker.select("AVG(trade_count)"))

Why the first query hits the database twice, and the second query only once? It's a detail of the ActiveRecord syntax.

Remember what we saw just before: calling the aggregation method as a ruby method directly from ActiveRecord object returns to us a number, and that can only be done by executing the query. So Ticker.average(:trade_count) is returning a number, and ActiveRecord is replacing :avg with it.

If you wanna have lazy using aggregation inside a query, we explicitly express the aggregation as an SQL function like here Ticker.select("AVG(trade_count)"). In the example of the single database hit, ActiveRecord is taking this query and injecting it in place of :avg. Be aware that here we're calling (:avg)! That's because a subquery must be evaluated inside (), otherwise an error will be raised. Appending to_sql in the single database hit query we can see the query that's performed:

SELECT "tickers".* 
  FROM "tickers" 
  WHERE (trade_count > (SELECT AVG(trade_count) FROM "tickers"))

Following to another query, we could ask now for the tickers that have their trade_count above or equal the average trade_count of all listed tickers:

avg_trade_count_sql = Ticker.select("AVG(trade_count)").to_sql
Ticker
  .group(:pair_id)
  .having("tickers.trade_count >= (#{avg_trade_count_sql})")

Here we use the subquery in the having clause, producing the following query:

SELECT "tickers".* 
  FROM "tickers" 
  GROUP BY "tickers"."pair_id" 
  HAVING (tickers.trade_count >= (SELECT AVG(trade_count) FROM "tickers"))

Or we could do a query to analyze this info in another way: get the average ticker trade_count for each pair, and only return the tickers that have trade_count above or equal the average trade_count of the tickers belonging to given pair:

avg_trade_count_by_pairs_sql = Ticker.select("pair_id, AVG(trade_count) as average").group(:pair_id).to_sql
Ticker
  .joins("INNER JOIN (" + average_trade_count_by_pairs_sql + ") trades ON tickers.pair_id = trades.pair_id")
  .where("tickers.trade_count >= trades.average")

This produces the following SQL:

SELECT "tickers".* 
  FROM "tickers" 
  INNER JOIN (
    SELECT pair_id, AVG(trade_count) as average 
    FROM "tickers" GROUP BY "tickers"."pair_id"
  ) trades 
  ON tickers.pair_id = trades.pair_id 
  WHERE (tickers.trade_count >= trades.average)

This is a tricky one, so let's break it down to understand how we're doing this. First off, we have our subquery SQL in avg_trade_count_by_pairs_sql. If executed individually, will return to you an aggregated result like the following:

Each pair_id is listed with an aggregated result that represents the average of trade_count of all of the tickets belonging to the pair (i.e., SELECT * FROM tickets WHERE pair_id = [THE PAIR ID]). This SQL is merged inside the Ticker custom join (INNER JOIN) we named as trades. In the end, we have tickers.trade_count from the current ticker row being iterated, and trades.average that we have from our subquery, and represents the average ticker_count of the ticker pair of the current iteration (association made because of our ON tickers.pair_id = trades.pair_id from the custom join).

We're already writing custom joins, anything beyond that it's more SQL specifics than ActiveRecord, so we can end up here. To finish, though, let's benchmark this query with a simple solution using ruby Enumerable:

  tickers = []
  Pair.all.each do |pair|
    avg = Ticker.where(pair: pair).average(:trade_count)
    selectable_tickers = Ticker.where(pair: pair).where("trade_count >= ?", avg)
    tickers.push(*selectable_tickers.to_a)
  end

Avg trade count by pairs - Enumerable solution (memory profiler)
Total allocated: 72357 bytes (872 objects)
Total retained:  573 bytes (9 objects)

Avg trade count by pairs - ActiveRecord query solution (memory profiler)
Total allocated: 6348 bytes (86 objects)
Total retained:  296 bytes (5 objects)

Our ActiveRecord solution consumes ~11x less memory in the total allocated, and ~2x less memory in the total retained.

IPS - Avg trade count by pairs - ActiveRecord query solution:    15095.6 i/s
IPS - Avg trade count by pairs - Enumerable solution:      740.6 i/s - 20.38x  (± 0.00) slower

Comparing the iterations per second, we can check that the Enumerable solution is 20.38x slower.

Conclusion

In this post, we saw how to use advanced ActiveRecord features to go beyond a simple where clause. We saw how to use ActiveRecord to query belongs_to and has_many associations, using joins, aggregations, and subqueries, combining all.

We also compared some queries with the equivalent using a ruby Enumerable based solution. For benchmarking both approaches we checked the memory consumption and iterations per second, which showed without surprise the Enumerable based solutions consuming in most cases dozens of times more memory because of extra allocations and being also dozens of times slower.

Of course, we should keep in mind that the benchmark was done using only a very small database. We weren't worried about checking how much worse it can get with larger databases, or searching for points of stabilization (a.k.a., "it doesn't get much worse than that"). The idea of benchmarking was to show you how bad it can be when you rely upon ruby Enumerator to do things that our database is capable of doing through ActiveRecord, which is a common bad "code smell".

This is it! If you have any comments or suggestions, don't hold back, let me know.

Options if you like my content and would like to support me directly (never required, but much appreciated):

BTC address: bc1q5l93xue3hxrrwdjxcqyjhaxfw6vz0ycdw2sg06

Blog

A Bit of Advanced ActiveRecord Querying - With Benchmarks

Rodrigo Walter Ehresmann

Introduction

Domain

Querying `belongs_to` associations

Querying `has_many` associations

Querying With Aggregations and Subqueries

Subqueries

Conclusion

Join Our Newsletter. No Spam, Only the good stuff.

Related

A Bit of Advanced ActiveRecord Querying - With Benchmarks

Rodrigo Walter Ehresmann

Introduction

Domain

Querying belongs_to associations

Querying has_many associations

Querying With Aggregations and Subqueries

Subqueries

Conclusion

Join Our Newsletter. No Spam, Only the good stuff.

Related

Querying `belongs_to` associations

Querying `has_many` associations