konung
Posted on February 20, 2023
TLDR: I'm now using Strategy #5
- Why Get Random Records?
- Strategy #1 - Use RANDOM()
- Strategy #2 - Pluck & Array.sample
- Strategy #3 - find_by with Min/Max ID
- Strategy #4 - Random Offset
- Strategy #5 - Where with Min/Max
- Tips and Caveats
- Your thoughts?
If you've ever needed to collect a random record from a database, whether using Rails or another framework/language, you may have found that there are several strategies to choose from. In this post, we'll explore the top 5 strategies for collecting random records, along with some tips and caveats to keep in mind.
Why Get Random Records?
Before we dive into the strategies, it's worth taking a moment to consider why someone might need to collect a random record from a database. Generally speaking, random behavior is rarely a desired feature in an application, as it leads to non-deterministic results. However, it may be needed for business reasons.
One common use case is displaying random content on a website or app, such as a "random image" feature or a "random author" on a book-tracking website. Another use case is testing or debugging, where it may be useful to select a random record to check for edge cases or unexpected behavior.
Strategy #1 - Use RANDOM()
The first strategy is to use the database's built-in RANDOM function to select a random record. For PostgreSQL or SQLite, use RANDOM(), and for MySQL or MariaDB, use RAND(). While this strategy is the most robust, it can be slow as the table grows, taking around 260ms on a table with 1,000,000 records in my sample data.
Pros:
- Most robust
Cons:
- Slow. As the table grows, takes longer and longer (~260ms on my table)
Author.order('RANDOM()').limit(1)
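For MySQL or MariaDB, the same one-liner would use RAND() instead (a sketch mirroring the line above):

```ruby
# MySQL / MariaDB variant of strategy #1
Author.order('RAND()').limit(1)
```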
Strategy #2 - Pluck & Array.sample
Another strategy is to select all IDs by plucking (or selecting) them from the database and using Ruby's Array#sample method to pick one at random. While this strategy is faster than strategy #1, it can still get slow as the table grows, taking around 50ms on a table with 1,000,000 records. Also consider the amount of memory such a strategy might consume.
Pros:
- Faster (~50ms on my table of 1_000_000 records)
Cons:
- As the table grows, it will start getting slower, and you may simply run out of resources, depending on the size of the machine (e.g. if you're running on a small VM)
Author.find(Author.pluck(:id).sample)
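Unrolled into steps, the memory trade-off is easier to see (this is just an expanded form of the one-liner above):

```ruby
# Pluck loads *every* ID into one Ruby array — that's the memory cost
# the caveat above is referring to on large tables.
all_ids   = Author.pluck(:id) # e.g. ~1_000_000 integers in memory
random_id = all_ids.sample    # pick one at random in Ruby
Author.find(random_id)
```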
Strategy #3 - find_by with Min/Max ID
A third strategy is to get the minimum and maximum values of the IDs in the database, use `rand(min..max)` to generate a random ID, and then use `find_by` to look up the record. This strategy is extremely fast, taking around 17ms on a table with 1,000,000 records, but it can be brittle if there are deleted records.
Pros:
- Extremely fast, if no deleted data (~17ms on my sample data)
- If the table grows, it doesn't affect the speed of the lookup
Cons:
- Requires Numeric IDs, can't use GUID or Varchar IDs
- Needs sequential IDs without missing IDs
- Can be brittle if any records were deleted and IDs are missing.
- Error handling & additional lookups can mitigate the problem, but as the number of missing IDs grows, this will slow down as well
Author.find_by(id: rand(1..Author.count))
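If you go this route, the error handling mentioned in the cons can be a simple retry loop. A rough sketch — random_author and its attempts argument are illustrative names, not anything standard:

```ruby
# Retry a few times in case the random ID lands in a gap left by a deleted record.
def random_author(attempts: 3)
  attempts.times do
    author = Author.find_by(id: rand(1..Author.count))
    return author if author
  end
  nil # give up after too many misses
end
```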
Strategy #4 - Random Offset
A fourth strategy is to use a random offset to look up a record. This strategy is not as fast as strategies #2 and #3, taking around 120ms on a table with 1,000,000 records, but it doesn't require error handling and works with any type of ID.
Pros:
- No error handling needed, doesn't depend on the type of ID
Cons:
- Not as fast - about 120ms on average on my sample data, but depending on where the offset is in the table, the actual queries can be anywhere between 20ms and 600ms. That's just how offset works.
Author.offset(rand(Author.count)).limit(1)
Strategy #5 - Where with Min/Max
A fifth strategy, and my preferred strategy, is to get the minimum and maximum values of the IDs in the database, use rand(min..max) to generate a random ID, and then use a where lookup to select the record. This strategy is extremely fast, taking only 1-6ms on a table with 1,000,000 records, and it's also robust, handling deleted records without issue.
Pros:
- Extremely fast - I get 1-6ms results
- If the table grows, it doesn't affect the speed of the lookup (actually faster than offset)
- If records get deleted, it will just find the first match after the deleted one.
- Can be modified into a scope and used with other scopes, e.g. Author.published_today.random_record - see the sketch after the code below. (Just keep in mind that it's essentially reimplementing a kind of offset, which may limit your data set for a lookup.) The reason we do it this way is that a plain offset would be slow because of how it works.
Cons:
- Not as clear as using just offset, but offset will get slower depending on how big your data is.
# `random_id` Can be memoized / cached,
# but beware of scoping before caching(!!!)
# as you can get empty results where many are expected
random_id = rand(Author.minimum(:id)..Author.maximum(:id))
Author.where("id >= ?", random_id).limit(1)
Out of these strategies, my preferred strategy used to be the third one. Even if the table has some deleted records, it is still faster on average for my use-case. Our deletes from that table are also rare, and we use soft deletes that I can use in my error handling. I sometimes used strategy #4, as it also works well.
However, since I first used this code, I answered a StackOverflow question that led to this blog post and came up with strategy #5, which is now my preferred option.
Tips and Caveats
A couple of tips to keep in mind when implementing these strategies:
- If the actual current count is not critical, you can preload Author.count and memoize it as Author.total_count, for example, or keep it in an app-load config that gets loaded on app restart (if you do these lookups often). This will remove the extra time needed to get Author.count and speed up option #3 to under 5ms and option #4 to about 100ms. Here's an example of how to do this:
def self.total_count
  @total_count ||= count
end
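With the cached count in place, options #3 and #4 would look roughly like this:

```ruby
# Options #3 and #4 using the memoized count instead of a live COUNT(*) query
Author.find_by(id: rand(1..Author.total_count))
Author.offset(rand(Author.total_count)).limit(1)
```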
However, I would warn against going crazy with optimizations here, such as caching min/max values, as depending on the context, scoping, etc., it may lead to unexpected results. If you use strategy #5, the bottleneck is not going to be in the database, so making 3 quick queries is preferable, unless it really, really matters for your WebScale™ app.
When it comes to error handling, you have a couple of options. You can do lookups until you find a record that exists, or you can do multiple lookups using #3 and select a random one from the result set. Depending on how many gaps you have in your data set, strategy #3 may not be the best choice.
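One way to read the "multiple lookups" option, assuming sequential numeric IDs, is to generate a handful of candidate IDs, fetch them in a single query, and sample whichever ones actually exist:

```ruby
# Several candidate IDs, one query, then sample in Ruby.
candidate_ids = Array.new(10) { rand(1..Author.count) }
Author.where(id: candidate_ids).to_a.sample
```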
Your thoughts?
Did I miss any approaches, or make a glaring blunder? Please comment and let me know :)
P.S.:
Cover Art by MidJourney AI for "random records from database" prompt