Could not create unique index: how to solve duplication errors
Thomas Bracher
Posted on September 23, 2022
Unique indexes enforce the uniqueness (as their name implies) of one or more columns' values. They can guarantee one-to-one relationships in your model. For instance, let's have a unique shopping cart per user:
=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);
ERROR: could not create unique index "uidx_shopping_carts_by_user_id"
DETAIL: Key (user_id)=(2) is duplicated.
Wow! What was that? It seems that the index can not be created: some existing rows already infringe the constraint.
Oopsy
Let's explore the data a bit. Can we spot culprits?
=# select user_id from shopping_carts order by user_id;
user_id
---------
1
2
2
3
4
5
Oopsy, "2" appears twice. This can be quite an issue. It can go wrong for 3 reasons:
- The user can see the wrong cart, leading to items disappearing
- Your data team has to apply heuristics to reunify the duplicates
- Worst case: it leads to complex bugs because everyone suppose shopping carts are unique
Unfortunately, postgres cannot create a unique index as long as the duplicate items exist.
First things first: how many duplicates do you have? Some self joining request does the trick:
=# select
-= count(*)
-= from shopping_carts a
-= join (select
-= user_id,
-= count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-= on b.user_id = a.user_id;
count
-------
76533
There are too many duplicates for a manual cleanup... Create a dashboard to track the duplicate injection rate. If your table already includes a created_at
datetime, you can use a variation of the previous query:
=# select
-# last_occurrence,
-# count(*)
-# from shopping_carts a
-# join (select
-# user_id,
-# max(date(created_at)) as last_occurrence,
-# count(*)
-# from shopping_carts
-# group by user_id
-# having count(*) > 1) as b
-# on b.user_id = a.user_id
-# group by last_occurrence
-# order by last_occurrence DESC;
last_occurrence | count
-----------------+-------
2022-09-13 | 193
2022-09-12 | 484
We have the duplicate rate per day. This is useful to check if our future fixes work.
Remove the root cause
Let's consider your endpoint implements the following pseudo-ruby code:
def create
ShoppingCarts.create!(...)
end
A quick fix is to check the existence of the row before creating it:
def create
unless ShoppingCarts.where(user_id: current_user.id).exists? do
ShoppingCarts.create!(...)
end
end
This way no duplicate creation, right? You deploy and check the numbers the next day:
last_occurrence | count
-----------------+-------
2022-09-14 | 8
2022-09-13 | 193
2022-09-12 | 484
Wait what? Still more duplicates!!!!
Transactions
This is a race condition happening. Two requests check existence and insert at the same time.
request 1 | request 2 | select count(*) from shopping_carts where id = 2 |
---|---|---|
exists()... == false | - | 0 |
- | exists()... == false | 0 |
insert()... | - | 1 |
- | insert()... | 2 |
Transactions are gonna help, but not directly.
def create
ActiveRecord::Base.transaction do
unless ShoppingCarts.where(user_id: current_user.id).exists? do
ShoppingCarts.create!(...)
end
end
end
You deploy this code and watch its effects:
last_occurrence | count
-----------------+-------
2022-09-15 | 15
2022-09-14 | 8
2022-09-13 | 193
2022-09-12 | 484
The problem is not solved. Transactions are configured to committed read
isolation by default. If you replay the previous race condition under this isolation, you'll find no violation. Transactions do not avoid race conditions. Fortunately, there are tools to stop the bleeding: locks.
Locks are usually managed by your database for operations like indexation, etc. Locks are a common way to mutually exclude processes from shared resources. You can also manually lock a row. Mutual exclusion is preventing 2 processes to access the same resource.
In postgres you can lock rows or tables using the for udpate
keyword in a select
query:
def create
ActiveRecord::Base.transaction do
unless ShoppingCarts.lock("FOR UPDATE").where(user_id: current_user.id).exists? do
ShoppingCarts.create!(...)
end
end
end
You deploy and check the numbers the next day:
last_occurrence | count
-----------------+-------
2022-09-16 | 11
2022-09-15 | 15
2022-09-14 | 8
2022-09-13 | 193
2022-09-12 | 484
Oopsy, the lock does not work? Because the select
query returns no rows, no lock is upheld. We need to lock something that exists. Great thing that we have a unique user!
def create
current_user.with_lock do
unless ShoppingCarts.where(user_id: current_user.id).exists? do
ShoppingCarts.create!(...)
end
end
end
Here the transaction is finally helpful. The lock is released when the transaction is committed. You can deploy and see the result:
last_occurrence | count
-----------------+-------
2022-09-16 | 11
2022-09-15 | 15
2022-09-14 | 8
2022-09-13 | 193
2022-09-12 | 484
No line for 2022-09-17: victory! Next, we remove duplicates and create the unique index.
Removing duplicates
Removing duplicates can be a touchy matter. I find the following heuristic true most of the time: the latest modified row is the most updated. Some frameworks automatically generate updated_at
columns. This can be useful to apply the heuristic. We use updated_at
column to discriminate the old rows to delete:
select
b.id
from shopping_carts a
join (select
user_id,
max(updated_at) AS last_updated_at
from shopping_carts
group by user_id
having (count(*) > 1)) b
on a.user_id = b.user_id
and a.updated_at < b.last_updated_at
This way only the last updated rows remain. Let's check if there's no duplicate left:
=# select
-= count(*)
-= from shopping_carts a
-= join (select
-= user_id,
-= count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-= on b.user_id = a.user_id;
count
-------
0
Nice!
Creating the unique index
With no duplicate in the table left, we can create the index:
=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);
This time, there's no error. Every user has at most one shopping cart. When attempting to create duplicates, you encounter the following error:
=# insert into shopping_carts(user_id) values (2), (2);
ERROR: duplicate key value violates unique constraint "uidx_shopping_carts_by_user_id"
No process can violate the business rule, even under race conditions! Great value for the buck if you ask me.
Conclusion
I hope you learned some sql knowledge. Here are the main takeaways:
- Learn some sql syntax to navigate easily in your data
- Columns
created_at
andupdated_at
managed by your ORM are great tools for maintenance - Transactions do not avoid race conditions
- Introduce unique constraints before you feed the data (this applies to other constraints)
I had to solve a similar problem at work and we are hiring talented engineers to solve this kind of problems.
Take care.
Photo by Stefan Cosma on Unsplash
Posted on September 23, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 29, 2024