Introduction

The working process of production data deletion is completely different from development tasks.
We should take it as another kind of task, and think with another way. The coding examples are based on RoR in this article.

Estimate the workload

It is difficult to estimate without investigation to work through data deletion.
I can say, do not estimate via the usual development style estimation process.
As a rough estimation, it could take 2 weeks at least to complete the whole process.
Because we should answer the questions listed below in advance.
Furthermore, performance could be a problem after you finish up writing code.
I do recommend doing a sensitive investigation, discuss it with all involved people including QA developers before start working on it.
It would be the fastest way.

Questions we should answer before start coding

How much data do we have to delete?
How can we test the deletion?
What is the data to delete?
How many related resources belong to target data?
How can we estimate the deletion time?
Can we have an alternative approach if it's difficult to delete the whole data?

General coding tips to handle huge data to delete

I show the example via Ruby on Rails(RoR) code but it would be able to translate to other languages easily.
Let's say, you are working with an E-commerce app and we have orders, and order_items belong to each order.

Avoid N+1

Do not iterate all associated relations

If the target data amount is huge, using iteration could be a cause of a terrible performance problem.

Order.where(user_id: 10).all.each do |order|
  order.order_items.each do |order_item|
     ...
  end
end

In this example, if the order amount is millions, this leads significant N+1 problem.
We should consider the way not to use ORM magics.

Use delete_all instead of destroy_all

In RoR, there are basically 2 methods to delete records, delete and destroy. destroy methods calls callbacks defined on a model before deleting the object. On the other hand, delete doesn't see callbacks set up on the model, simply calling a delete SQL.
Given that, delete_all is way faster than destroy_all, and it makes a significant performance difference if the data is big.
And, it means we should take care of the operation we defined in the before_destroy or after_destroy callbacks manually.

Use batches

If it comes to delete even thousands of records, we should handle it carefully to use database resources sufficiently. Probably simply calling delete_all is not enough.
In that case, use something like built-in method find_in_batches or delete_in_batches gem.
The batch size should be around some thousands to prevent DB stuck. In find_in_batches, default is 1,000. In delete_in_batches, default is 10,000.

Order.find_in_batches(batch_size: 5000).each do |orders|
  orders.delete_all
  sleep(1) # throttle down DB
end

Set asleep to throttle down the DB

Like the above example, looks good to throttle down by using sleep in batches when we delete in batches. The RoR official website recommends it.

When the deletion is quite slow

Possibly you couldn't find this problem until using the full prod pruned data. That is why using prod pruned data is important.
Anyhow, in many cases, the cause of the slow query is full table scan due to no index as you know.

Case 1: No index on the column used in where SQL

DELETE users where email = 'something@localhost';

SQL tries to look for target records by email but there is no index on email so that it does a full table scan. It is simple and you are probably familiar with it.

Case 2: No index on the column to use constraint check

DELETE addresses;

                  Table "addresses"

   Column  | Type |     Modifiers
-----------+------+------------------------------
 id        | int  | not null default nextval(...
 street_no | int  | not null
.
.
.
Indexes:
    "idx_primary" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "users" CONSTRAINT "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID

                  Table "users"
    Column   |  Type  |         Modifiers
-------------+--------+--------------------------------
 id          | bigint | not null default nextval(...
 address_id  | int    |
.
.
.
Indexes:
    "idx_primary" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID

SQL tries to delete addresses but addresses referenced by users table, and it has a foreign key constraint on address_id. But because address_id does not have index, SQL does a full table scan for users when deleting addresses

When we need further consideration

If the target data has a lot of associations, we should consider deleting some of them tentatively.
For example, an order has some order_items, purchaced_users, addresses, items, etc...
But in this case, we will create orphan data which doesn't have the parent data(i, e. relations of orders and order_items, deleting only orders)
This would be tech debt, so when we take this solution, we should have common sense with the people involved.

Create to_be_deleted tables

To delete only required data and delete orphan data in the future, consider creating "to_be_deleted" tables which have ids of target data to be deleted. Those tables look like below.

-- orders_to_be_deleted
Table "orders_to_be_deleted"
  Column  |  Type   | Modifiers
----------+---------+-----------
 order_id | integer | not null
Indexes:
    "index_orders_to_be_deleted_on_order_id" btree (order_id)

By using these id tables, we can refer to deleted ids after deleting orders themselves for example.

Create a DB backup before starting the deletion

Thanks for this tip Raphael!
No matter how careful we prepare, unexpected situations always come in front of us. Make sure you can have the backup with your team.

Conclusion

Those are what we should consider when deleting data.
I'm glad if someone won't do the same mistakes as mine when working on data deletion projects.

Blog

Work with Production Data Deletion

toooooooooomy