Need to Create a Simple Content Aggregator Website using Django? It's a Minimalist Task

Introduction

The web contains a lot of content spread across thousands of websites. Different websites often carry related content that a single website could host efficiently. Because related content is scattered in this way, users have to visit several websites for the same information, which can be tedious and time-consuming.

Several content gathering websites exist to enhance user experience. The sites gather information about various topics in one place so that users do not have to scour the web unnecessarily. This article discusses the engineering behind these content aggregator sites. Developers interested in designing content aggregators may find this article helpful.

Prerequisites:

An understanding of Python basics and the Django framework
Basic knowledge of RSS feeds

RSS feeds

Content aggregator sites collect content from Really Simple Syndication (RSS) feeds. RSS feeds are XML files that websites with frequently changing content publish to list their latest items. An RSS feed typically contains the key information about each piece of web content. For a news item, for instance, the feed could contain the following information:

Name of the publishing site
Title of content
Publication date
URL of the content
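
To make these fields concrete, here is a minimal sketch of what an RSS 2.0 feed looks like and where each of the four values lives. The feed content, the site name, the URLs, and the helper name read_items are all invented for illustration. The sketch reads the XML with Python's standard library only to expose the structure; it is not how the rest of this article parses feeds.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; real feeds carry more elements per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Hypothetical feed used for illustration</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/first-headline</link>
      <pubDate>Mon, 06 Sep 2021 12:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(xml_text):
    """Return one dict per <item> holding the four fields listed above."""
    channel = ET.fromstring(xml_text).find("channel")
    site_name = channel.findtext("title")  # name of the publishing site
    return [
        {
            "site_name": site_name,
            "title": item.findtext("title"),
            "pub_date": item.findtext("pubDate"),  # still plain text here
            "link": item.findtext("link"),
        }
        for item in channel.findall("item")
    ]


for row in read_items(SAMPLE_RSS):
    print(row)
```

Note that the publication date arrives as text, so it has to be converted before it can be stored in a date column.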

Python's feedparser library

A Python library known as feedparser is a useful tool that a content aggregator site developer can use to parse and extract content from RSS feeds. However, before using the library, the developer should decide the kind of data to retrieve from RSS feeds. If one is building a news aggregator website, one would perhaps need to retrieve the metadata listed in the previous section for each news item at the minimum.

To use the feedparser library to parse RSS feeds:

Create a Python virtual environment and install Django
Install the feedparser library using pip
```
pip install feedparser
```
Create a Django project and app

Define a model in the app's models.py file with the necessary fields; Django turns it into a database table. For instance, for a news website, the model could be something like this:

 # app/models.py
 from django.db import models

 # Create your models here
 class News(models.Model):
     title = models.CharField(max_length=200)
     pub_date = models.DateTimeField()
     link = models.URLField()
     site_name = models.CharField(max_length=100)

     def __str__(self) -> str:
         return f"{self.site_name}: {self.title}"

Run Django migrations to include your table in the database

 python manage.py makemigrations
 python manage.py migrate

To check that feedparser works, one can manually retrieve content from an RSS feed and display it in the terminal. To achieve this:

Start a terminal session inside your virtual environment
Import the feedparser library
Use the parse function to extract data from an RSS feed

In code,

(venv) python manage.py shell
>>> import feedparser
>>> data = feedparser.parse('some_rss_feed_url')
>>> site_name = data.channel.title
>>> print(site_name)

To make the project more robust and efficient, one can automate the data retrieval process. Additionally, it is paramount to implement a storage mechanism. For this purpose, it may be necessary to create a custom Django command.

A Custom Django Command

The manage.py file is the entry point for running commands within a Django project, and Django lets developers add their own. A custom command lives inside the project app directory in a subdirectory called management/commands. When the app is listed in INSTALLED_APPS, manage.py discovers each module in that directory and exposes it as a command named after the file; the command executes only when it is invoked by name. Every custom command defines a class named Command that inherits from the BaseCommand class, so the command file should import BaseCommand.

Inside the app/management/commands directory:

Create a file with a suitable name that will contain the command code to execute
Import the necessary dependencies:
- BaseCommand class
- feedparser library
- Your model
Create a command class that inherits from BaseCommand
Create a handle method that does two things:
1. Retrieves data from a feed
2. Stores the data in a database as illustrated below

# app/management/commands/news.py

# Assumes python-dateutil is installed (pip install python-dateutil)
from dateutil import parser
from django.core.management.base import BaseCommand
import feedparser
from app.models import News

class Command(BaseCommand):
    help = "Parses an RSS feed and stores new items in the database"

    def handle(self, *args, **kwargs):
        feed = feedparser.parse("some_website_rss_feed_url")
        name = feed.channel.title

        # Store each parsed item unless its link is already in the db.
        # The model has no field for the feed's item id, so link is the key.
        for item in feed.entries:
            if not News.objects.filter(link=item.link).exists():
                news_item = News(
                    title=item.title,
                    pub_date=parser.parse(item.published),
                    link=item.link,
                    site_name=name,
                )
                news_item.save()
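
The command above builds pub_date from item.published, which is a text timestamp. As a hedged alternative that needs no extra package, RSS 2.0 dates follow the RFC 822 format, so the standard library's email.utils.parsedate_to_datetime can usually convert them. The helper name parse_rss_date and the sample date are made up for this sketch, and feeds in other formats (Atom, for example) use a different date syntax that this function does not handle.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rss_date(text):
    """Convert an RFC 822 date string, as used by RSS 2.0, to a datetime."""
    return parsedate_to_datetime(text)


# Hypothetical value of item.published taken from an RSS 2.0 feed
published = parse_rss_date("Mon, 06 Sep 2021 12:30:00 +0000")
print(published.isoformat())  # 2021-09-06T12:30:00+00:00
```

Because the sample string carries a UTC offset, the result is a timezone-aware datetime, which is what a DateTimeField expects when Django's time zone support is enabled.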

With this set-up in place, running python manage.py news (the command takes its name from the file news.py) retrieves data from the specified RSS feed URL and stores it in the database. One can access the model from the admin panel to inspect the data, provided the model is registered in the app's admin.py file. To hook up URLs from several feeds into the project, follow this tutorial.
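
As a rough idea of what handling several feeds could involve, the sketch below loops over a list of feed URLs and flattens their entries into one list. It is an illustration under stated assumptions, not the approach of any particular tutorial: collect_items, fake_parse, and the URLs are hypothetical names, and the parse argument is assumed to behave like feedparser.parse, returning an object with feed.title and entries. A stand-in parser is used so that the example runs without network access.

```python
from types import SimpleNamespace


def collect_items(feed_urls, parse):
    """Parse every URL with `parse` and return a flat list of item dicts.

    `parse` is assumed to behave like feedparser.parse: it returns an
    object with a `feed.title` attribute and an `entries` list.
    """
    rows = []
    for url in feed_urls:
        parsed = parse(url)
        for entry in parsed.entries:
            rows.append({
                "site_name": parsed.feed.title,
                "title": entry.title,
                "link": entry.link,
            })
    return rows


def fake_parse(url):
    """Stand-in for feedparser.parse so the sketch runs offline."""
    entry = SimpleNamespace(title=f"Headline from {url}", link=f"{url}/1")
    return SimpleNamespace(feed=SimpleNamespace(title=url), entries=[entry])


FEED_URLS = ["https://example.com/a.xml", "https://example.org/b.xml"]  # hypothetical
rows = collect_items(FEED_URLS, fake_parse)
print(len(rows))  # 2
```

Inside the custom command, one would pass feedparser.parse in place of fake_parse and save each row through the model, skipping links that are already stored.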

Conclusion

The tutorial above discusses the main building blocks necessary to create a simple content aggregator website using Django. The main tools a developer needs are RSS feed URLs, a clear understanding of the kind of data to retrieve from the RSS feeds, the feedparser library, and a custom Django command. With these tools, the developer can conceptualize and construct a suitable data model, parse RSS feeds, and implement a command that handles how the data is stored in the project's database.
