A performant script for updating a large table 🚀
Jimmy Yeung
Posted on September 6, 2021
From time to time, it's inevitable to update some fields in the database via a manual script, whether it's data patching or some ad-hoc requirement from the business side.
I often have a hard time writing a script that updates a large table. A single run takes so much time, not to mention the re-runs needed to debug or optimise it.
Use Case
In one of my sprints, I needed to append a JSON object to a jsonb column, where the JSON object is computed differently for each row. There are millions of records in the Postgres table, and it takes hours to run the script once for testing. To minimise the time taken by each test run so that I could finish my work on time, I compared different approaches to find the fastest one:
(In the examples below, I'm using 10,000 records for each case.)
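For context, here is a minimal sketch of the kind of model the snippets below assume. `MyModel`, `my_table`, and `json_array_field` are just the names used in the examples, and the UUID primary key is an assumption inferred from the `::uuid` cast in the raw-SQL version.

```python
import uuid

from django.db import models


class MyModel(models.Model):
    # Assumed UUID primary key (the raw-SQL example later casts ids to uuid).
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # Stored as jsonb in Postgres; each row holds a JSON array that we append to.
    json_array_field = models.JSONField(default=list)

    class Meta:
        db_table = "my_table"
```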
- Naive Django ORM `.save()` way
```python
for record in records:
    json_value = <computation on json_value>
    record.json_array_field.append(json_value)
    record.save()
```
with `cProfile`: `6951470 function calls (6891465 primitive calls) in 31.840 seconds`
This is definitely the slowest way, as it hits the database and executes an update query for every single row. It's only the first try; we can definitely make things faster.
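As an aside, the post doesn't show how the `cProfile` numbers are collected; one possible way (a sketch, with `update_records` standing in for whichever approach is being measured) is:

```python
import cProfile
import pstats

# Profile one run of the update logic; the report header is the
# "N function calls in X seconds" line quoted throughout this post.
with cProfile.Profile() as profiler:
    update_records(records)  # hypothetical wrapper around the loop above

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```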
- Django ORM `bulk_update` way
```python
for record in records:
    json_value = <computation on json_value>
    record.json_array_field.append(json_value)

MyModel.objects.bulk_update(records, fields=["json_array_field"])
```
with `cProfile`: `5072220 function calls (4612182 primitive calls) in 20.213 seconds`
Slightly better. `bulk_update` is actually suggested by Django for optimization, as it executes just one query. But it still takes a lot of time, so I tried to find a better way to improve.
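The reason that single query can still be slow is its size: `bulk_update` packs one CASE branch per row into the UPDATE statement. A quick way to peek at the generated SQL (a sketch using Django's test utility `CaptureQueriesContext`):

```python
from django.db import connection
from django.test.utils import CaptureQueriesContext

# Capture the UPDATE that bulk_update emits for a handful of rows.
with CaptureQueriesContext(connection) as ctx:
    MyModel.objects.bulk_update(records[:3], fields=["json_array_field"])

print(ctx.captured_queries[0]["sql"])
# Roughly: UPDATE "my_table" SET "json_array_field" = CASE
#   WHEN ("id" = ...) THEN '...' WHEN ("id" = ...) THEN '...' END
#   WHERE "id" IN (...)
```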
- Django `bulk_update` with `batch_size`
```python
for record in records:
    json_value = <computation on json_value>
    record.json_array_field.append(json_value)

MyModel.objects.bulk_update(
    records, fields=["json_array_field"], batch_size=1000
)
```
with `cProfile`: `5077900 function calls (4617619 primitive calls) in 11.765 seconds`
> If updating a large number of columns in a large number of rows, the SQL generated can be very large. The batch_size parameter controls how many objects are saved in a single query. (Django docs)
Performance is further improved by dividing the updates into 10 batches. Not bad, but again, we can make things faster.
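To double-check that, the number of statements a given `batch_size` produces can be counted with the same `CaptureQueriesContext` trick (assuming the 10,000-record setup above):

```python
from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    MyModel.objects.bulk_update(
        records, fields=["json_array_field"], batch_size=1000
    )

# With 10,000 records and batch_size=1000, this should print 10.
print(len(ctx.captured_queries))
```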
- SQL way
```python
import json

from django.db import connection
from psycopg2.extras import execute_values

values_to_be_updated = [
    (json.dumps(<computation on json_value>), i.id) for i in records
]
with connection.cursor() as cursor:
    execute_values(
        cursor,
        "UPDATE my_table "
        "SET json_array_field = json_array_field::jsonb || "
        "(v.new_json_value)::jsonb "
        "FROM (VALUES %s) as v (new_json_value, bid) "
        "WHERE id = v.bid::uuid;",
        values_to_be_updated,
    )
```
with `cProfile`: `802769 function calls (802756 primitive calls) in 3.603 seconds`
That calls it a day, since the update now takes just ~3 seconds, one tenth of the `.save()` approach! I felt pretty dumb for trying the different ORM methods first and spending hours waiting for them to finish. I should have used the SQL way from the start. (Imagine how huge the difference would be if we were updating millions or billions of rows.)
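One more knob worth knowing: `execute_values` itself sends the VALUES list in pages (psycopg2's `page_size` defaults to 100), so for very large updates it may be worth raising it. A sketch reusing the same query:

```python
from django.db import connection
from psycopg2.extras import execute_values

with connection.cursor() as cursor:
    execute_values(
        cursor,
        "UPDATE my_table "
        "SET json_array_field = json_array_field::jsonb || "
        "(v.new_json_value)::jsonb "
        "FROM (VALUES %s) as v (new_json_value, bid) "
        "WHERE id = v.bid::uuid;",
        values_to_be_updated,
        page_size=1000,  # default is 100; fewer, larger statements per round trip
    )
```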
Key Takeaway (for me)
Maybe it's well known already, but I just want to say it again: when writing a one-off bulk insert/update script against a table with a large number of records, do consider writing raw SQL from the very beginning. That could save you a lot of time waiting and optimising.
But if you need to prioritise readability, or you don't care about performance, the ORM could be a better choice. Just remember that in terms of performance, SQL always wins ;D