ActiveRecord For Databases Without Unique Ids

Honeybadger Staff

Posted on March 16, 2021

This article was originally written by Regan Ryan on the Honeybadger Developer Blog.

Sometimes unique situations and things out of our control lead to wildly unorthodox requirements. Recently, I had an experience where I needed to use ActiveRecord without relying on the database ID for any records. If anyone is considering doing the same, I highly recommend finding another way! But, let's move on to the rest of the story.

Decisions were made. Smaller databases (clones in structure but not in data) needed to be merged. I joined the project just as the team was putting the finishing touches on a script that copies and pastes database records from one database to another. It copied everything exactly as-is, including IDs.

Database A

id   fruit   user_id
...  ...     ...
123  orange  456
...  ...     ...

Database B

id   fruit   user_id
...  ...     ...
123  banana  74
...  ...     ...

Database A after merge

id   fruit   user_id
...  ...     ...
123  orange  456
123  banana  74
...  ...     ...

This breaks the fundamental reason for having IDs: unique identification. I didn't know the specifics, but I felt like all kinds of problems would show up once duplicate IDs were introduced into the system. I tried to say something, but I was new to the project, and others seemed certain this was the best path forward. In a few days, we were going to deploy the code and start handling data with duplicate IDs. The question was no longer, "should we do this?"; instead, the questions were, "how do we do this?" and "how much longer will this take?"
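
To make the problem concrete, here is a minimal sketch in plain Ruby (no database involved) of what the merge produces. The hashes are hypothetical stand-ins for the rows in the tables above, and the variable names are made up for illustration.

```ruby
# Rows are plain hashes standing in for database records (hypothetical data
# mirroring the tables above).
database_a = [{ id: 123, fruit: "orange", user_id: 456 }]
database_b = [{ id: 123, fruit: "banana", user_id: 74 }]

# Copying rows as-is, IDs included, is what the merge script did.
merged = database_a + database_b

# Group by id and keep only the ids that now point at more than one row.
duplicate_ids = merged
  .group_by { |row| row[:id] }
  .select { |_id, rows| rows.size > 1 }
  .keys

puts "Duplicate ids after merge: #{duplicate_ids.inspect}"
```

Once an id appears in that list, any lookup by id alone can no longer tell the two rows apart.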

Working with duplicate IDs

So, how do you handle data with duplicate IDs? The solution was to make a composite ID of several fields. Most of our DB fetches looked like this:

# This doesn't work: there may be 2 records with id: 123
FavoriteFruit.find(123)

# Multiple IDs scope the query to the correct record
FavoriteFruit.find_by(id: 123, user_id: 456)

All the ActiveRecord calls were updated in this way, and as I glanced through the code, it seemed to make sense. Until we deployed it.

All hell breaks loose

Shortly after we deployed the code, the phones started ringing. Customers were seeing numbers that didn't add up. They couldn't update their own records. All kinds of features were breaking.

What should we do? We didn't just deploy code; we also moved data from one database to another (and new data were created/updated after we deployed). It was not a simple rollback situation. We needed to fix things fast.

What is Rails doing?

The first step in debugging was to see what the current behavior was and how to reproduce the error. I took a clone of the production data and started a Rails console. Depending on your setup, you may not automatically see the SQL queries Rails runs when you execute an ActiveRecord query. Here's how to ensure SQL statements are visible on your console:

ActiveRecord::Base.logger = Logger.new(STDOUT)

After that, I tried some common Rails queries:

$ FavoriteFruit.find_by(id: 123, user_id: 456)

FavoriteFruit Load (0.6ms)
SELECT  "favorite_fruits".*
FROM "favorite_fruits"
WHERE "favorite_fruits"."id" = $1
AND "favorite_fruits"."user_id" = $2
[["id", "123"], ["user_id", "456"]]

find_by seemed to work fine, but then I saw some code like this:

fruit = FavoriteFruit.find_by(id: 123, user_id: 456)
...
...
fruit.reload

That reload made me curious, so I tested that too:

$ fruit.reload

FavoriteFruit Load (0.3ms)
SELECT  "favorite_fruits".*
FROM "favorite_fruits"
WHERE "favorite_fruits"."id" = $1
LIMIT $2
[["id", 123], ["LIMIT", 1]]

Uh oh. So, even though we initially fetched the correct record with find_by, whenever we called reload, it would take the ID of the record and do a simple find-by-id query, which, of course, would often give incorrect data due to our duplicate IDs.

Why did it do that? I examined the Rails source code for clues. This is a great aspect of coding with Ruby on Rails: the source code is plain Ruby and freely available. I simply googled "ActiveRecord reload" and quickly found this:

# File activerecord/lib/active_record/persistence.rb, line 602
def reload(options = nil)
  self.class.connection.clear_query_cache

  fresh_object =
    if options && options[:lock]
      self.class.unscoped { self.class.lock(options[:lock]).find(id) }
    else
      self.class.unscoped { self.class.find(id) }
    end

  @attributes = fresh_object.instance_variable_get("@attributes")
  @new_record = false
  self
end

This shows that reload is, more or less, a wrapper for self.class.find(id). Querying only by an ID was hardwired into this method. For us to work with duplicate IDs, we'd need to either override core Rails methods (never recommended) or stop using reload altogether.

Our Solution

So, we decided to go through every reload in the code and replace it with a find_by call that fetches the record by multiple keys.
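
A replacement along those lines could be wrapped in a small helper so the composite key is written in one place. The sketch below is an assumption, not the code we shipped: reload_unique and the ReloadUnique module are hypothetical names, and FakeFruit is a stand-in class with its own find_by so the example runs without a database. A real model would get find_by from ActiveRecord.

```ruby
# Hypothetical helper, not part of Rails: re-fetches a record by the same
# composite key used in the find_by calls, instead of by id alone.
module ReloadUnique
  def reload_unique
    self.class.find_by(id: id, user_id: user_id)
  end
end

# Minimal stand-in for a model so the sketch runs without a database.
class FakeFruit
  include ReloadUnique
  attr_reader :id, :user_id, :fruit

  ROWS = [
    { id: 123, fruit: "orange", user_id: 456 },
    { id: 123, fruit: "banana", user_id: 74 }
  ].freeze

  def initialize(id:, fruit:, user_id:)
    @id = id
    @fruit = fruit
    @user_id = user_id
  end

  # Returns the first row matching every given condition, or nil.
  def self.find_by(conditions)
    row = ROWS.find { |r| conditions.all? { |key, value| r[key] == value } }
    row && new(**row)
  end
end

banana = FakeFruit.find_by(id: 123, user_id: 74)
puts banana.reload_unique.fruit # the banana row, not the orange one
```

Unlike reload, this sketch returns a freshly fetched object rather than refreshing the receiver in place, so callers would need to use the return value.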

However, that resolved only some of the bugs. After more digging, I decided to test our update calls:

$ fruit = FavoriteFruit.find_by(id: 123, user_id: 456)
$ fruit.update(last_eaten: Time.now)

FavoriteFruit Update (43.3ms)
UPDATE "favorite_fruits"
SET "last_eaten" = $1
WHERE "favorite_fruits"."id" = $2
[["last_eaten", "2020-04-16 06:24:57.989195"], ["id", 123]]

Uh oh. You can see that even though find_by scoped the record by specific fields, when we called update on the Rails record, it generated a simple WHERE id = x query, which touches every record sharing that ID. How did we get around this?

We made a custom update method, update_unique, which looks like this:

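class FavoriteFruit
  # Updates only the row matching both id and user_id.
  def update_unique(attributes)
    # update_all skips callbacks, so run the save callbacks manually.
    run_callbacks :save do
      self.class
        .where(id: id, user_id: user_id)
        .update_all(attributes)
    end
    # update_all returns a row count, so fetch the updated record again.
    self.class.find_by(id: id, user_id: user_id)
  end
end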

This let us update records scoped to more than just the ID:

$ fruit.update_unique(last_eaten: Time.now)

FavoriteFruit Update All (3.2ms)
UPDATE "favorite_fruits"
SET "last_eaten" = '2020-04-16 06:24:57.989195'
WHERE "favorite_fruits"."id" = $1
AND "favorite_fruits"."user_id" = $2
[["id", "123"], ["user_id", "456"]]

This code ensured a narrow scope for updating records, but by calling the class's update_all method, we lost the callbacks that normally come with updating a record. Therefore, we had to manually run the callbacks and do another database call to retrieve the updated record since update_all doesn't return the updated record. The final product isn't too messy, but it's definitely more difficult to read than fruit.update.

The Real Solution

Due to sunk costs, management, and time constraints, our solution was to monkey patch Rails into using multiple keys for all database calls. This worked, in the sense that customers would still buy and use the product, but it was a bad idea for several reasons:

  • Any future development might inadvertently reintroduce bugs by using common Rails methods. New developers will need strict training to keep the code free of hidden bugs, such as those introduced by calling reload.
  • The code is more complicated, less clear, and less maintainable. This is technical debt that slows down development speed more and more as the project goes on.
  • Testing slows down a lot. You need to test not only that a function works but also that it works when various objects have duplicate IDs. It takes more time to write tests, and then each time the test suite is run, it takes more time to run through all the extra tests. Testing can also easily miss bugs if each developer on the project doesn't carefully test all possible scenarios.

The real solution to this problem is to never have duplicate IDs in the first place. If data need to be transferred from one database to another, then the script doing that should collect and insert the data without IDs, allowing the receiving database to use its standardized auto-increment counter to give each record its own unique ID.
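
As a rough sketch of that idea, the plain-Ruby example below strips the source ID before inserting and lets the receiving side assign a new one. The Table class is a hypothetical in-memory stand-in for a table with an auto-increment counter, and the id_map is an assumption about what a real script would probably need in order to fix up references afterwards.

```ruby
# Hypothetical in-memory stand-in for a table with an auto-increment id.
class Table
  attr_reader :rows

  def initialize
    @rows = []
    @next_id = 1
  end

  # Inserts attributes without an id and lets the table assign the next one.
  def insert(attributes)
    row = attributes.merge(id: @next_id)
    @next_id += 1
    @rows << row
    row
  end
end

target = Table.new
target.insert(fruit: "orange", user_id: 456)

source_rows = [{ id: 1, fruit: "banana", user_id: 74 }]

# Strip the source id before inserting, and remember old id => new id so
# that rows referencing the old id can be updated afterwards.
id_map = source_rows.to_h do |row|
  new_row = target.insert(row.reject { |key, _| key == :id })
  [row[:id], new_row[:id]]
end

puts target.rows.map { |row| row[:id] }.inspect
```

If other tables point at the old IDs (for example through a foreign key such as user_id), those references would have to be rewritten using a map like id_map, which is where most of the work in a real migration is likely to be.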

Another solution would be to use UUIDs for all records. This type of ID is a long string of characters created at random (instead of step-by-step counting, as with an integer ID). Because a collision between two random UUIDs is extremely unlikely, moving data to other databases would not cause ID conflicts.
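
For illustration, Ruby's standard library can generate random UUIDs with SecureRandom.uuid. The sketch below only shows the IDs themselves; the variable names are made up, and how a real application would store them depends on its database.

```ruby
require "securerandom"

# Each database generates its ids independently; no shared counter is needed.
database_a_ids = Array.new(3) { SecureRandom.uuid }
database_b_ids = Array.new(3) { SecureRandom.uuid }

merged_ids = database_a_ids + database_b_ids
puts merged_ids.first
puts "All unique: #{merged_ids.uniq.size == merged_ids.size}"
```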

The bottom line is that Rails was built on the assumption that IDs are unique per record and that they offer a quick and easy way to manipulate specific data in the database. Rails is an opinionated framework, and the beauty of this is how smoothly everything runs, as long as you stick to the Rails way of doing things. This applies not only to Rails but also to many other aspects of programming. When things get complicated, we should know how to identify the problem; however, if we write clear, maintainable, and conventional code, we can avoid many of these complications in the first place.
