HAP
Posted on July 2, 2021
Background
I'm working on a project that is developed using django. Part of the is project requires deleting n-level-deep relational data with a huge quantity of detail records at the lowest level.
Django allows for the setting of on-delete actions in the ForeignKey
instantiation. Unfortunately, the on-delete action is not set in the database. It turns out that django is designed to handle cascade actions in the ORM at the application level. What this means is that it can use up a lot of memory resources as it walks relations, instantiates objects and deletes them through its ORM functionality.
Our project is deployed in Openshift. This ORM activity was causing the pod to be halted as it OOM'd. This was a drag for everyone and was causing data processing operations to back up.
So, what to do about it? I decided to not worry about django messaging at the lower cascade levels and instead manage the cascade myself. This required the development of some utility functions to compile QuerySet instances into SQL and another function to recursively walk model relations. The idea was to have the relation walker recursively reach the lowest relations level, then execute SQL directly on its way back up. Also, there is no query execution until the recursion works its way back. The benefit here is that we can use the ORM to construct a top-level query for records to delete, then the cascade function will find all of the models that depend on records from the target and execute the SQL necessary to simulate the delete cascade functionality that a database like PostgreSQL can handle.
This design was intended to be a compromise for ease of use in our application and not have to write a complicated django application extension or change the entire applications model inheritance to use a subclass of models.Model
.
Design
Utility Functions
This required the creation of small set of utility functions to handle compilation of a QuerySet
instance into either DELETE
or UPDATE
SQL statements. Also, a common function was needed to directly execute the compiled SQL with parameters
The delete SQL compiler:
from django.db import transaction
from django.db.models.sql.compiler import SQLDeleteCompiler
def get_delete_sql(query):
"""
Compile a DELTE SQL statement from a QuerySet instance
Params:
query (QuerySet) : The query to compile to SQL
Returns (tuple):
(sql, params) : sql is the sql statement to exeucte
params are any parameters for the statement to use. This is a tuple.
"""
return SQLDeleteCompiler(query.query, transaction.get_connection(), query.db).as_sql()
The update SQL compiler:
from django.db import models
def get_update_sql(query, **updatespec):
"""
Compile the query with the update specifications into an UPDATE SQL statement and parameters
Params:
query (QuerySet) : The QuerySet that will select the row(s) to update
updatespec (dict) : {column: new_value} expressed in the function call as `column=new_value` named parameters
Returns (tuple):
(sql, params) : sql is the sql statement to exeucte
params are any parameters for the statement to use. This is a tuple.
"""
assert query.query.can_filter()
query.for_write = True
q = query.query.chain(models.sql.UpdateQuery)
q.add_update_values(updatespec)
q._annotations = None
return q.get_compiler(query.db).as_sql()
The executor:
import logging
from django.db import transaction
from sqlparse import format as format_sql
LOG = logging.getLogger(__name__)
def execute_compiled_sql(sql, params=None):
"""
Execute the SQL with any parameters directly using connection.cursor()
Params:
sql (str) : The parameterized SQL statement.
params (tuple/list) : Parameters for the sql statement or None
Returns :
int : rows affected by the statement
"""
rows_affected = 0
with transaction.get_connection().cursor() as cur:
params = params or None
LOG.debug(format_sql(cur.mogrify(sql, params).decode("utf-8"), reindent_aligned=True))
cur.execute(sql, params)
rows_affected = cur.rowcount
return rows_affected
This executor function will log the actual sql statement (with parameters inline) at the DEBUG
level.
Cascade Function
The cascade function will walk all relations defined in the django model and will directly execute CASCADE
and SET_NULL
on-delete actions. Anything else is passed to the database and will return any database exception needed on error.
To call this function, all you need is the model from which you wish to start the cascade actions and a query against that model that will return the row(s) you wish to delete.
def cascade_delete(from_model, instance_pk_query, skip_relations=[], base_model=None, level=0):
"""
Performs a cascading delete by walking the Django model relations and executing compiled SQL
to perform the on_delete actions instead or running the collector.
Parameters:
from_model (models.Model) : A model class that is the relation root
instance_pk_query (QuerySet) : A query for the records to delete and cascade from
base_model (None; Model) : The root model class, If null, this will be set for you.
level (int) : Recursion depth. This is used in logging only. Do not set.
skip_relations (Iterable of Models) : Relations to skip over in case they are handled explicitly elsewhere
"""
if base_model is None:
base_model = from_model
instance_pk_query = instance_pk_query.values_list("pk").order_by()
LOG.info(f"Level {level} Delete Cascade for {base_model.__name__}: Checking relations for {from_model.__name__}")
for model_relation in from_model._meta.related_objects:
related_model = model_relation.related_model
if related_model in skip_relations:
LOG.info(f"SKIPPING RELATION {related_model.__name__} from caller directive")
continue
if model_relation.on_delete.__name__ == "SET_NULL":
filterspec = {f"{model_relation.remote_field.column}__in": models.Subquery(instance_pk_query)}
updatespec = {f"{model_relation.remote_field.column}": None}
LOG.info(
f" Executing SET NULL constraint action on {related_model.__name__}"
f" relation of {from_model.__name__}"
)
rec_count = execute_update_sql(related_model.objects.filter(**filterspec), **updatespec)
LOG.info(f" Updated {rec_count} records in {related_model.__name__}")
elif model_relation.on_delete.__name__ == "CASCADE":
filterspec = {f"{model_relation.remote_field.column}__in": models.Subquery(instance_pk_query)}
related_pk_values = related_model.objects.filter(**filterspec).values_list(related_model._meta.pk.name)
LOG.info(f" Cascading delete to relations of {related_model.__name__}")
cascade_delete(
related_model, related_pk_values, base_model=base_model, level=level + 1, skip_relations=skip_relations
)
LOG.info(f"Level {level}: delete records from {from_model.__name__}")
if level == 0:
del_query = instance_pk_query
else:
filterspec = {f"{from_model._meta.pk.name}__in": models.Subquery(instance_pk_query)}
del_query = from_model.objects.filter(**filterspec)
rec_count = execute_delete_sql(del_query)
LOG.info(f"Deleted {rec_count} records from {from_model.__name__}")
Note the skip_relations override. This is necessary for special cases that are handled outside of the cascade process and would either be unnecessary work or would possibly introduce an error. For my project, it was to skip partitioned tables where the delete involved simply dropping a partition. This would be unnecessary work and would be inefficient since all partitions could be scanned.
So the cascade will be handled appropriately as it checks all relations at every level and executes deletes bottom-up. This allows us to use the existing model relations and allows us to use the ORM QuerySet to build queries as normal.
I hope that you will find this useful if you are also using django and need to delete large amounts of related data via cascading actions.
Addendum
The project I'm working on makes heavy use of deferred constraints. If you are using this cascade delete and your database has deferred constraints, you may want to set them to immediate for this transaction.
Posted on July 2, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.