Aggregation in Django JSONFields

saschalalala

Sascha

Posted on May 25, 2019

Aggregation in Django JSONFields

This is my first post on dev.to and I think about it more as a note for myself and perhaps someone finds this as interesting as I do.

What I am doing (going to do) with Django and Postgres

I am currently working with some Twitter JSON data that I analyze. I use different libraries and tools for that, e.g. pandas and networkx for looking at the data and creating networks, also Kibana and Elasticsearch for some more quantitative analysis and visualization (Tweets per day and things like that).

However, I identified some users that I want to concentrate my analysis on, read some content, look at some user level visualizations, things like that. It's quite cumbersome to do these things in a jupyter notebook because it is lacking the easy usage of a webapplication that concentrates on these things.

Why JSONField and why Postgres

First thing is, Django works very well with Postgres and I don't want to integrate an extra NoSQL database to the project. Also, I have no idea, how well Django works with e.g. MongoDB. Second, I have no idea, how NoSQL databases "work". I work with MySQL and Postgres for years and even if I am no expert, I usually get along quite well. The third point (which is a little mixed up with the first one), is, I want to use the ORM and don't want to learn some new syntax just for data retrieval.

Data preparation

I don't want to go into detail on that part, because I have a quite large and self-made toolchain. The most important part is that at the end I have a pandas dataframe that I can export to a flat json structure. I decided not to create a Django model that can be used with this json structure because I am quite sure that the datastructure will change quite often in the future and I don't want to write migrations and touch my core application just because I added some fields to my datastructure.

The datastructure looks like this:

{
    "created_at": 15588051070000,
    "favorite_count": 4,
    "id": <unique_id>,
    "is_quote": null,
    "is_reply": null,
    "is_retweet": null,
    "quote_status_id": 0,
    "quote_text": null,
    "quote_user_id": null,
    "quote_user_name": null,
    "reply_count": 0,
    "reply_status_id": 0,
    "reply_user_id": null,
    "reply_user_name": null,
    "retweet_count": 5,
    "retweet_status_id": 0,
    "retweet_text": null,
    "retweet_user_id": null,
    "retweet_user_name": null,
    "text": "<tweet_text>",
    "text_is_extended": false,
    "user_id": <user_id>,
    "user_name": "<user_name>"
}

and I am mostly interested on the text, the creation time and the amount of retweets. There is some annotating and filtering necessary in order to calculate some of these values.

My first important query

Of course, I want my database to do the heavy lifting when it comes to data aggregation/grouping/filtering etc. If I wouldn't want the database to do this, I could just use json flatfiles and load them into python dictionaries, pandas dataframes or javascript objects.

The information I need on the first page of my webapp is the following:

All usernames with their respective amount of tweets and the sum of their retweets ordered by this sum.

That's not that hard to do in a "normal" Django application, I think you would do something like this:

# I wrote this from the top of my head so if it is wrong,
# please correct me in the comments

tweets = Tweet.objects.values("user_name")
    .annotate(
        tweet_count=Count("user_name"),
        retweets=Sum("retweet_count")
    )
    .order_by("-retweets")

Well, of course, I cannot do this, because my Tweet model has no Tweet.user_name or something like this. It just looks like this:

class Tweet(models.Model):
    name = models.CharField(max_length=200)
    data = JSONField()
    created_at = models.DateTimeField(null=True)

and all of the data I need are json formatted inside Tweet.data. I read about JSONField in the Django docs (https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/fields/#key-index-and-path-lookups) and I knew already that I can do things like

Tweet.objects.filter(data__user_name="whoever")

which is really cool because that is more or less exactly the same syntax you use when you have a normal relational database model. So I thought, I could to the same in my annotations. Just change the Sum("retweet_count") to Sum("data__retweet_count") and I would be fine.

It turns out that this is not the case, because these aggregation functions only "see" the data thing of the object and not its content. So while it works to count these things, summing the numbers inside data__retweet_count is not doable with this approach.

After reading some really annoyingly complex solutions on stackoverflow that used RAWSql queries and such, I found the solution that I am using and that works really good:

Tweet.objects.annotate(
    user_name=Cast(
        KeyTextTransform("user_name", "data"), models.TextField()
    )
)
.values("user_name")
.annotate(
    tweet_count=Count("data"),
    retweets=Sum(
        Cast(
            KeyTextTransform("retweet_count", "data"), models.IntegerField()
        )
    ),
)
.order_by("-retweets")

So while that looks a lot more complicated than the query above, in reality it is not and I will explain the things that happen there one after the other.

The most important thing here is KeyTextTransform that you have to import from django.contrib.postgres.fields.jsonb. What it does is extracting the specified key from the specified json object for you (on an ORM level, I have no idea, how that works).

So the first annotation with the user_name is basically just Get the user_name from the json thing as a string (Textfield) and name it user_name. The only reason I do that is because if I would not, that value would be called data__user_name in the resulting data structure and I don't want that.

The next annotation is first the tweet count which is just a counter of the occurence of the username inside my database so I could put everything in there. The retweets annotation is basically the same as the user_name thing but with a cast to Integer instead of Text and a summation of the numbers inside there.

So, to summarize this query: The only things that are added on top are the key lookup in the json datastructure and the casting of that value to the datatype I need. That's all.

Final results

I put this query in a Manager and my Tweet model looks like this:

import logging
import pytz

from django.conf import settings
from django.contrib.postgres.fields import JSONField
from django.contrib.postgres.fields.jsonb import KeyTextTransform
from django.db import models
from django.db.models import Count, Sum
from django.db.models.functions import Cast

from django.utils import timezone

logger = logging.getLogger(__name__)


class TweetManager(models.Manager):
    def counts_only(self):
        return (
            Tweet.objects.annotate(
                user_name=Cast(
                    KeyTextTransform("user_name", "data"), models.TextField()
                )
            )
            .values("user_name")
            .annotate(
                tweet_count=Count("data"),
                retweets=Sum(
                    Cast(
                        KeyTextTransform("retweet_count", "data"), models.IntegerField()
                    )
                ),
            )
            .order_by("-retweets")
        )


class Tweet(models.Model):
    name = models.CharField(max_length=200)
    data = JSONField()
    created_at = models.DateTimeField(null=True)
    objects = TweetManager()

    def __str__(self):
        return self.name

    def save(self, *args, **kwargs):
        try:
            self.name = f"{self.data['user_name']}_{self.data['id']}"
            # the creation datetime is stored as miliseconds timestamp
            naive_created_at = timezone.datetime.fromtimestamp(
                self.data["created_at"] / 1000
            )
            self.created_at = pytz.timezone(settings.TIME_ZONE).localize(
                naive_created_at, is_dst=None
            )
        except Exception as e:
            logger.error(e)
        super().save(*args, **kwargs)

I now can call it in my view:

Tweet.objects.counts_only()

and get a result that I can pass to the frontend and work with it in my template or javascript without recalculating anything:

[{'user_name': 'user_0', 'tweet_count': 25, 'retweets': 2760},
{'user_name': 'user_1', 'tweet_count': 1, 'retweets': 891},
{'user_name': 'user_2', 'tweet_count': 165, 'retweets': 2265},
{'user_name': 'user_3', 'tweet_count': 12, 'retweets': 1769},
{'user_name': 'user_4', 'tweet_count': 59, 'retweets': 1663},
{'user_name': 'user_5', 'tweet_count': 6, 'retweets': 1657},
{'user_name': 'user_6', 'tweet_count': 8, 'retweets': 1420},
{'user_name': 'user_7', 'tweet_count': 26, 'retweets': 1186},
{'user_name': 'user_8', 'tweet_count': 32, 'retweets': 1076},
{'user_name': 'user_9', 'tweet_count': 473, 'retweets': 932}]
💖 💪 🙅 🚩
saschalalala
Sascha

Posted on May 25, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related