Sebastian
Posted on September 25, 2023
Natural Language Processing is a fascinating area of machine learning and artificial intelligence. This blog post starts a concrete NLP project about working with Wikipedia articles for clustering, classification, and knowledge extraction. The inspiration, and the general approach, stems from the book Applied Text Analysis with Python.
Over the course of the next articles, I will show how to implement a Wikipedia article crawler, how to collect articles into a corpus, how to apply text preprocessing, tokenization, encoding and vectorization, and finally applying machine learning algorithms for clustering and classification.
The technical context of this article is Python v3.11 and several additional libraries, most importantly nltk v3.8.1 and wikipedia-api v0.6.0. All examples should work with newer versions too.
This article originally appeared at my blog admantium.com.
Project Outline
The project's goal is to download, process, and apply machine learning algorithms to Wikipedia articles. First, selected articles from Wikipedia are downloaded and stored. Second, a corpus is generated: the totality of all text documents. Third, each document's text is preprocessed, e.g. by removing stop words and symbols, and then tokenized. Fourth, the tokenized text is transformed into a vector to obtain a numerical representation. Finally, different machine learning algorithms are applied.
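As a minimal sketch, these steps can be outlined as placeholder functions. The names and the simplistic bag-of-words vectorization are illustrative only, not the project's final API:

```python
# Illustrative pipeline sketch; function names are placeholders.

def download(article_names):
    # Step 1: fetch selected Wikipedia articles (stubbed out here)
    return {name: f"<text of {name}>" for name in article_names}

def build_corpus(documents):
    # Step 2: collect all text documents into one corpus
    return list(documents.values())

def preprocess(text):
    # Step 3: simplified preprocessing and tokenization
    return [token.lower() for token in text.split() if token.isalnum()]

def vectorize(tokens, vocabulary):
    # Step 4: bag-of-words vector over a fixed vocabulary
    return [tokens.count(word) for word in vocabulary]

corpus = build_corpus(download(["Machine learning"]))
tokens = preprocess(corpus[0])
vector = vectorize(tokens, sorted(set(tokens)))
```

Step five, applying machine learning algorithms, then operates on such vectors.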
In this first article, steps one and two are explained.
Prerequisites
I like to work in a Jupyter Notebook and use the excellent dependency manager Poetry. Run the following commands in a project folder of your choice to install all required dependencies and to start the Jupyter notebook in your browser.
# Complete the interactive project creation
poetry init
# Add core dependencies
poetry add nltk@^3.8.1 jupyterlab@^4.0.0 scikit-learn@^1.2.2 wikipedia-api@^0.5.8 matplotlib@^3.7.1 numpy@^1.24.3 pandas@^2.0.1
# Download NLTK data packages
python3 -c "import nltk; \
nltk.download('punkt'); \
nltk.download('averaged_perceptron_tagger'); \
nltk.download('reuters'); \
nltk.download('stopwords');"
# Start jupyterlab
poetry run jupyterlab
A fresh Jupyter Notebook should open in your browser.
Python Libraries
In this blog post, the following Python libraries will be used:
- wikipedia-api: Page objects representing Wikipedia articles with their title, text, categories, and related pages.
- NLTK: PlaintextCorpusReader for a traversable object that gives access to documents, provides tokenization methods, and computes statistics about all files; sent_tokenizer and word_tokenizer for generating tokens.
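To illustrate what tokenization produces, here is a rough stand-in using regular expressions. NLTK's tokenizers, whose data packages we download above, handle many more edge cases such as abbreviations and contractions; this sketch only conveys the idea:

```python
import re

text = "Machine learning is a field of artificial intelligence. It studies algorithms that learn from data."

# Rough sentence split on sentence-ending punctuation,
# a stand-in for NLTK's sentence tokenizer
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

# Rough word split into words and punctuation marks,
# a stand-in for NLTK's word tokenizer
tokens = re.findall(r"\w+|[^\w\s]", sentences[0])

print(len(sentences))  # 2
print(tokens[:4])      # ['Machine', 'learning', 'is', 'a']
```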
Part 1: Wikipedia Article Crawler
The project starts with the creation of a custom Wikipedia crawler. Although we could work with Wikipedia corpus datasets from various sources, such as the built-in corpora in NLTK, a custom crawler provides the best control over file format, content, and content freshness.
Downloading and processing raw HTML can be time-consuming, especially when we also need to determine related links and categories from it. A very handy library comes to the rescue: wikipedia-api does all of this heavy lifting for us. Based on it, let's develop the core features in a stepwise manner.
First, we create a base class that defines its own Wikipedia object and determines where to store the articles.
import os
import re
import wikipediaapi as wiki_api

class WikipediaReader():
    def __init__(self, dir = "articles"):
        self.pages = set()
        self.article_path = os.path.join("./", dir)
        self.wiki = wiki_api.Wikipedia(
            language = 'en',
            extract_format = wiki_api.ExtractFormat.WIKI)
        try:
            os.mkdir(self.article_path)
        except Exception as e:
            pass
This also defines pages, a set of page objects that the crawler visited. A page object is tremendously helpful because it gives access to an article's title, text, categories, and links to other pages.
Second, we need a helper method that receives an article name and, if that article exists, adds a new page object to the set. We need to wrap the call in a try-except block because some articles containing special characters, such as Tomasz Imieliński, cannot be processed correctly. Also, there are several meta-articles that we do not need to store.
def add_article(self, article):
    try:
        page = self.wiki.page(self._get_page_title(article))
        if page.exists():
            self.pages.add(page)
            return page
    except Exception as e:
        print(e)
Third, we want to extract the categories of an article. Each Wikipedia article defines categories both in a visible section at the bottom of a page - see the following screenshot - and in metadata that is not rendered as HTML. Therefore, the initial list of categories might look confusing. Take a look at this example:
wr = WikipediaReader()
wr.add_article("Machine Learning")
ml = wr.list().pop()
print(ml.categories)
# {'Category:All articles with unsourced statements': Category:All articles with unsourced statements (id: ??, ns: 14),
# 'Category:Articles with GND identifiers': Category:Articles with GND identifiers (id: ??, ns: 14),
# 'Category:Articles with J9U identifiers': Category:Articles with J9U identifiers (id: ??, ns: 14),
# 'Category:Articles with LCCN identifiers': Category:Articles with LCCN identifiers (id: ??, ns: 14),
# 'Category:Articles with NDL identifiers': Category:Articles with NDL identifiers (id: ??, ns: 14),
# 'Category:Articles with NKC identifiers': Category:Articles with NKC identifiers (id: ??, ns: 14),
# 'Category:Articles with short description': Category:Articles with short description (id: ??, ns: 14),
# 'Category:Articles with unsourced statements from May 2022': Category:Articles with unsourced statements from May 2022 (id: ??, ns: 14),
# 'Category:Commons category link from Wikidata': Category:Commons category link from Wikidata (id: ??, ns: 14),
# 'Category:Cybernetics': Category:Cybernetics (id: ??, ns: 14),
# 'Category:Learning': Category:Learning (id: ??, ns: 14),
# 'Category:Machine learning': Category:Machine learning (id: ??, ns: 14),
# 'Category:Short description is different from Wikidata': Category:Short description is different from Wikidata (id: ??, ns: 14),
# 'Category:Webarchive template wayback links': Category:Webarchive template wayback links (id: ??, ns: 14)}
Therefore, we do not store these special categories at all, filtering them out with several substring checks.
def get_categories(self, title):
    page = self.add_article(title)
    if page:
        if list(page.categories.keys()) and len(list(page.categories.keys())) > 0:
            categories = [c.replace('Category:', '').lower()
                          for c in list(page.categories.keys())
                          if c.lower().find('articles') == -1
                          and c.lower().find('pages') == -1
                          and c.lower().find('wikipedia') == -1
                          and c.lower().find('cs1') == -1
                          and c.lower().find('webarchive') == -1
                          and c.lower().find('dmy dates') == -1
                          and c.lower().find('short description') == -1
                          and c.lower().find('commons category') == -1]
            return dict.fromkeys(categories, 1)
    return {}
Fourth, we now define the crawl method. It's a customizable breadth-first search that starts from an article, gets all related pages, adds them to the set of page objects, and then processes them again until either the total number of articles is exhausted or the depth level is reached. To be honest: I only ever crawled 1000 articles with it.
def crawl_pages(self, article, depth = 3, total_number = 1000):
    print(f'Crawl {total_number} :: {article}')
    page = self.add_article(article)
    childs = set()
    if page:
        for child in page.links.keys():
            if len(self.pages) < total_number:
                print(f'Add article {len(self.pages)}/{total_number} {child}')
                self.add_article(child)
                childs.add(child)
    depth -= 1
    if depth > 0:
        for child in sorted(childs):
            if len(self.pages) < total_number:
                self.crawl_pages(child, depth, len(self.pages))
Let's start crawling the machine learning articles:
reader = WikipediaReader()
reader.crawl_pages("Machine Learning")
print(reader.list())
# Crawl 1000 :: Machine Learning
# Add article 1/1000 AAAI Conference on Artificial Intelligence
# Add article 2/1000 ACM Computing Classification System
# Add article 3/1000 ACM Computing Surveys
# Add article 4/1000 ADALINE
# Add article 5/1000 AI boom
# Add article 6/1000 AI control problem
# Add article 7/1000 AI safety
# Add article 8/1000 AI takeover
# Add article 9/1000 AI winter
Finally, when a set of page objects is available, we extract their text content and store it in files whose names are a cleaned-up version of the article title. A caveat: file names need to retain the capitalization of their article name, or else we cannot get the page object again, because a search with a lower-cased article name does not return results.
def process(self, update=False):
    for page in self.pages:
        filename = re.sub(r'\s+', '_', f'{page.title}')
        filename = re.sub(r'[\(\):]', '', filename)
        file_path = os.path.join(self.article_path, f'{filename}.txt')
        if update or not os.path.exists(file_path):
            print(f'Downloading {page.title} ...')
            content = page.text
            with open(file_path, 'w') as file:
                file.write(content)
        else:
            print(f'Not updating {page.title} ...')
Here is the complete source code of the WikipediaReader class.
import os
import re
import wikipediaapi as wiki_api

class WikipediaReader():
    def __init__(self, dir = "articles"):
        self.pages = set()
        self.article_path = os.path.join("./", dir)
        self.wiki = wiki_api.Wikipedia(
            language = 'en',
            extract_format = wiki_api.ExtractFormat.WIKI)
        try:
            os.mkdir(self.article_path)
        except Exception as e:
            pass

    def _get_page_title(self, article):
        return re.sub(r'\s+', '_', article)

    def add_article(self, article):
        try:
            page = self.wiki.page(self._get_page_title(article))
            if page.exists():
                self.pages.add(page)
                return page
        except Exception as e:
            print(e)

    def list(self):
        return self.pages

    def process(self, update=False):
        for page in self.pages:
            filename = re.sub(r'\s+', '_', f'{page.title}')
            filename = re.sub(r'[\(\):]', '', filename)
            file_path = os.path.join(self.article_path, f'{filename}.txt')
            if update or not os.path.exists(file_path):
                print(f'Downloading {page.title} ...')
                content = page.text
                with open(file_path, 'w') as file:
                    file.write(content)
            else:
                print(f'Not updating {page.title} ...')

    def crawl_pages(self, article, depth = 3, total_number = 1000):
        print(f'Crawl {total_number} :: {article}')
        page = self.add_article(article)
        childs = set()
        if page:
            for child in page.links.keys():
                if len(self.pages) < total_number:
                    print(f'Add article {len(self.pages)}/{total_number} {child}')
                    self.add_article(child)
                    childs.add(child)
        depth -= 1
        if depth > 0:
            for child in sorted(childs):
                if len(self.pages) < total_number:
                    self.crawl_pages(child, depth, len(self.pages))

    def get_categories(self, title):
        page = self.add_article(title)
        if page:
            if list(page.categories.keys()) and len(list(page.categories.keys())) > 0:
                categories = [c.replace('Category:', '').lower()
                              for c in list(page.categories.keys())
                              if c.lower().find('articles') == -1
                              and c.lower().find('pages') == -1
                              and c.lower().find('wikipedia') == -1
                              and c.lower().find('cs1') == -1
                              and c.lower().find('webarchive') == -1
                              and c.lower().find('dmy dates') == -1
                              and c.lower().find('short description') == -1
                              and c.lower().find('commons category') == -1]
                return dict.fromkeys(categories, 1)
        return {}
Let’s use the Wikipedia crawler to download articles related to machine learning.
reader = WikipediaReader()
reader.crawl_pages("Machine Learning")
print(reader.list())
# Downloading The Register ...
# Not updating Bank ...
# Not updating Boosting (machine learning) ...
# Not updating Ian Goodfellow ...
# Downloading Statistical model ...
# Not updating Self-driving car ...
# Not updating Behaviorism ...
# Not updating Statistical classification ...
# Downloading Search algorithm ...
# Downloading Support vector machine ...
# Not updating Deep learning speech synthesis ...
# Not updating Expert system ...
Part 2: Wikipedia Corpus
All articles are downloaded as text files to the articles folder. To provide an abstraction over all these individual files, the NLTK library provides different corpus reader objects. Such an object not only provides quick access to individual files, but can also generate statistical information, such as the vocabulary, the total number of individual tokens, or the document with the largest number of words.
Let's use the PlaintextCorpusReader class as a starting point, and just initialize it so that it points to the articles:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from time import time

class WikipediaCorpus(PlaintextCorpusReader):
    pass

corpus = WikipediaCorpus('articles', r'[^\.ipynb].*', cat_pattern=r'[.*]')
print(corpus.fileids())
# ['2001_A_Space_Odyssey.txt',
# '2001_A_Space_Odyssey_film.txt',
# '2001_A_Space_Odyssey_novel.txt',
# '3D_optical_data_storage.txt',
# 'A*_search_algorithm.txt',
# 'A.I._Artificial_Intelligence.txt',
# 'AAAI_Conference_on_Artificial_Intelligence.txt',
# 'ACM_Computing_Classification_System.txt',
Ok, this is good enough. Let's extend it with two methods to compute the vocabulary and the maximum number of words. For the vocabulary, we will use the NLTK helper class FreqDist, a dictionary-like object that counts all word occurrences. The vocab method consumes all texts via the simple helper self.words(), from which non-letter and non-number characters are removed.
def vocab(self):
    return nltk.FreqDist(re.sub(r'[^A-Za-z0-9,;\.]+', ' ', word).lower() for word in self.words())
To get the maximum number of words, we traverse all documents with fileids(), determine the length of words(doc), and record the highest value:
def max_words(self):
    max_len = 0
    for doc in self.fileids():
        l = len(self.words(doc))
        max_len = l if l > max_len else max_len
    return max_len
Finally, let's add a describe method for generating statistical information (this idea also stems from the above-mentioned book Applied Text Analysis with Python).
This method starts a timer to record how long the corpus processing lasts, and then uses the built-in methods of the corpus reader object and the just-created methods to compute the number of files, paragraphs, sentences, and words, the vocabulary size, and the maximum number of words inside a document.
def describe(self, fileids=None, categories=None):
    started = time()
    return {
        'files': len(self.fileids()),
        'paras': len(self.paras()),
        'sents': len(self.sents()),
        'words': len(self.words()),
        'vocab': len(self.vocab()),
        'max_words': self.max_words(),
        'time': time() - started
    }
Here is the final WikipediaCorpus class:
import re
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from time import time

class WikipediaCorpus(PlaintextCorpusReader):
    def vocab(self):
        return nltk.FreqDist(re.sub(r'[^A-Za-z0-9,;\.]+', ' ', word).lower() for word in self.words())

    def max_words(self):
        max_len = 0
        for doc in self.fileids():
            l = len(self.words(doc))
            max_len = l if l > max_len else max_len
        return max_len

    def describe(self, fileids=None, categories=None):
        started = time()
        return {
            'files': len(self.fileids()),
            'paras': len(self.paras()),
            'sents': len(self.sents()),
            'words': len(self.words()),
            'vocab': len(self.vocab()),
            'max_words': self.max_words(),
            'time': time() - started
        }
At the time of writing, after crawling Wikipedia articles about artificial intelligence and machine learning, the following statistics were obtained:
corpus = WikipediaCorpus('articles', r'[^\.ipynb].*', cat_pattern=r'[.*]')
corpus.describe()
{'files': 1163,
'paras': 96049,
'sents': 238961,
'words': 4665118,
'vocab': 92367,
'max_words': 46528,
'time': 32.60307598114014}
Conclusion
This article is the starting point for an NLP project to download, process, and apply machine learning algorithms on Wikipedia articles. Two aspects were covered. First, the creation of the WikipediaReader class, which finds articles by name and can extract their title, content, categories, and mentioned links. The crawler is controlled with two variables: the total number of crawled articles and the depth of crawling. Second, the WikipediaCorpus, an extension of the NLTK PlaintextCorpusReader. This object provides convenient access to individual files, sentences, and words, as well as corpus-wide data such as the number of files or the vocabulary, the number of unique tokens. The next article continues with building a text processing pipeline.