toooooooooomy
Posted on March 15, 2021
Introduction
The working process of production data deletion is completely different from development tasks.
We should take it as another kind of task, and think with another way. The coding examples are based on RoR in this article.
Estimate the workload
It is difficult to estimate without investigation to work through data deletion.
I can say, do not estimate via the usual development style estimation process.
As a rough estimation, it could take 2 weeks at least to complete the whole process.
Because we should answer the questions listed below in advance.
Furthermore, performance could be a problem after you finish up writing code.
I do recommend doing a sensitive investigation, discuss it with all involved people including QA developers before start working on it.
It would be the fastest way.
Questions we should answer before start coding
- How much data do we have to delete?
- How can we test the deletion?
- What is the data to delete?
- How many related resources belong to target data?
- How can we estimate the deletion time?
- Can we have an alternative approach if it's difficult to delete the whole data?
General coding tips to handle huge data to delete
I show the example via Ruby on Rails(RoR) code but it would be able to translate to other languages easily.
Let's say, you are working with an E-commerce app and we have orders, and order_items belong to each order.
Avoid N+1
Do not iterate all associated relations
If the target data amount is huge, using iteration could be a cause of a terrible performance problem.
Order.where(user_id: 10).all.each do |order|
order.order_items.each do |order_item|
...
end
end
In this example, if the order amount is millions, this leads significant N+1 problem.
We should consider the way not to use ORM magics.
Use delete_all instead of destroy_all
In RoR, there are basically 2 methods to delete records, delete and destroy. destroy
methods calls callbacks defined on a model before deleting the object. On the other hand, delete
doesn't see callbacks set up on the model, simply calling a delete SQL.
Given that, delete_all is way faster than destroy_all, and it makes a significant performance difference if the data is big.
And, it means we should take care of the operation we defined in the before_destroy
or after_destroy
callbacks manually.
Use batches
If it comes to delete even thousands of records, we should handle it carefully to use database resources sufficiently. Probably simply calling delete_all
is not enough.
In that case, use something like built-in method find_in_batches or delete_in_batches gem.
The batch size should be around some thousands to prevent DB stuck. In find_in_batches, default is 1,000. In delete_in_batches, default is 10,000.
Order.find_in_batches(batch_size: 5000).each do |orders|
orders.delete_all
sleep(1) # throttle down DB
end
Set asleep to throttle down the DB
Like the above example, looks good to throttle down by using sleep in batches when we delete in batches. The RoR official website recommends it.
When the deletion is quite slow
Possibly you couldn't find this problem until using the full prod pruned data. That is why using prod pruned data is important.
Anyhow, in many cases, the cause of the slow query is full table scan due to no index as you know.
Case 1: No index on the column used in where SQL
DELETE users where email = 'something@localhost';
SQL tries to look for target records by email but there is no index on email so that it does a full table scan. It is simple and you are probably familiar with it.
Case 2: No index on the column to use constraint check
DELETE addresses;
Table "addresses"
Column | Type | Modifiers
-----------+------+------------------------------
id | int | not null default nextval(...
street_no | int | not null
.
.
.
Indexes:
"idx_primary" PRIMARY KEY, btree (id)
Referenced by:
TABLE "users" CONSTRAINT "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID
Table "users"
Column | Type | Modifiers
-------------+--------+--------------------------------
id | bigint | not null default nextval(...
address_id | int |
.
.
.
Indexes:
"idx_primary" PRIMARY KEY, btree (id)
Foreign-key constraints:
"fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID
SQL tries to delete addresses but addresses referenced by users table, and it has a foreign key constraint on address_id. But because address_id does not have index, SQL does a full table scan for users when deleting addresses
When we need further consideration
If the target data has a lot of associations, we should consider deleting some of them tentatively.
For example, an order has some order_items, purchaced_users, addresses, items, etc...
But in this case, we will create orphan data which doesn't have the parent data(i, e. relations of orders and order_items, deleting only orders)
This would be tech debt, so when we take this solution, we should have common sense with the people involved.
Create to_be_deleted tables
To delete only required data and delete orphan data in the future, consider creating "to_be_deleted" tables which have ids of target data to be deleted. Those tables look like below.
-- orders_to_be_deleted
Table "orders_to_be_deleted"
Column | Type | Modifiers
----------+---------+-----------
order_id | integer | not null
Indexes:
"index_orders_to_be_deleted_on_order_id" btree (order_id)
By using these id tables, we can refer to deleted ids after deleting orders themselves for example.
Create a DB backup before starting the deletion
Thanks for this tip Raphael!
No matter how careful we prepare, unexpected situations always come in front of us. Make sure you can have the backup with your team.
Conclusion
Those are what we should consider when deleting data.
I'm glad if someone won't do the same mistakes as mine when working on data deletion projects.
Posted on March 15, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.