Python NLP Library: NLTK

NLTK is a sophisticated library. Continuously developed since 2001, it supports all classical NLP tasks, from tokenization, stemming, and part-of-speech tagging to semantic indexing and dependency parsing. It also has a rich set of additional features, such as built-in corpora, different models for its NLP tasks, and integration with scikit-learn and other Python libraries.

This article is a concise introduction to NLTK. You will see NLTK in action through short code snippets that you can use for a variety of NLP tasks.

This article originally appeared at my blog admantium.com.

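The technical context of this article is Python v3.11 and NLTK v3.8.1. Most examples should work with newer versions too, although NLTK 3.9 and later may expect differently named data packages for some downloads (for example punkt_tab in place of punkt).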

NLTK Library Installation

NLTK can be installed via Python pip:

python3 -m pip install nltk

Several NLTK features require additional data, such as stop word lists or the integrated corpora. This data is fetched with the built-in downloader. Here is an example:

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('reuters')

Other parts, like some specialized tokenizers and taggers, require Java libraries to be installed. See this GitHub Gist to get started.

NLP Tasks

NLTK supports several NLP tasks. Here is a short overview, and the next sections provide more details:

Text Processing
- Tokenization
- Stemming
- Lemmatization
Text Syntax
- Part-of-Speech Tagging
Text Semantics
- Named Entity Recognition
Document Semantics
- Clustering
- Classification

Furthermore, NLTK supports these additional features:

Datasets
Corpus Management
Machine Learning Clustering and Classification Models

Text Processing

Tokenization

Tokenization is an essential first step in text processing. In general, the tokenization approach should be chosen depending on project requirements and subsequent NLP tasks. For example, when a text contains multi-word nouns that represent entities or persons, but the tokenizer just splits on whitespace, named entity recognition becomes hard.

NLTK provides a simple whitespace tokenizer, several built-in tokenizers, such as NIST or Stanford, and options for custom tokenizers based on regular expressions.

Here is an example of the built-in sentence and word tokenizer:

from nltk.tokenize import sent_tokenize, word_tokenize

# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''

sentences = []
for sent in sent_tokenize(paragraph):
  sentences.append(word_tokenize(sent))

sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
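
The whitespace and regular-expression tokenizers mentioned above need no downloaded models. The following is a minimal sketch of the difference; the sample sentence and the name-aware pattern are illustrative assumptions, not NLTK defaults.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

text = "Brad Rutter and Ken Jennings lost to Watson in 2011."

# Splitting on whitespace keeps punctuation attached and breaks names apart
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)
# ['Brad', 'Rutter', 'and', 'Ken', 'Jennings', 'lost', 'to', 'Watson', 'in', '2011.']

# Illustrative pattern (an assumption): runs of capitalized words stay
# together as one token, everything else is split into plain words
name_aware = RegexpTokenizer(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+|\w+")
regexp_tokens = name_aware.tokenize(text)
print(regexp_tokens)
# ['Brad Rutter', 'and', 'Ken Jennings', 'lost', 'to', 'Watson', 'in', '2011']
```

Keeping multi-word names in one token is one way to make later entity handling easier, but such a simple pattern will also glue together any capitalized words that happen to be adjacent.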

Stemming and Lemmatization

Like tokenization, the choice of a suitable stemming approach (replacing inflected words with their word stem, like cooking with cook) and lemmatization approach (replacing the inflected forms of a word with its dictionary form, the lemma) depends on the subsequent NLP tasks. Lemmatization has a special role because it requires some part-of-speech tagging or word sense disambiguation to correctly identify the lemma.

NLTK provides several stemmer modules, such as Porter, Lancaster and ISRI. For lemmatization, only WordNet is provided.
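
Because the stemmers share the same stem() method, they are easy to swap and compare. Here is a small sketch that runs three of them over a hand-picked word list (the words are arbitrary examples); none of these stemmers needs extra downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
words = ["cooking", "disciplines", "experienced"]

# Apply every stemmer to every word and collect the results per stemmer
stems = {name: [stemmer.stem(w) for w in words] for name, stemmer in stemmers.items()}
for name, result in stems.items():
    print(f"{name:10} {result}")
```

The stemmers differ in how aggressively they cut suffixes, so it is worth checking the output on your own vocabulary before settling on one.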

Let's compare stemming and lemmatization of the first sentence in the Wikipedia article about artificial intelligence.

from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

stemmer = LancasterStemmer()

stemmed_sent = [stemmer.stem(word) for word in word_tokenize(sent)]
print(stemmed_sent)
# ['art', 'intellig', 'was', 'found', 'as', 'an', 'academ', 'disciplin',

And the same sentence processed with the WordNet lemmatizer:

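# Requires the WordNet data, e.g. via nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize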

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

lemmatizer = WordNetLemmatizer()

lemmas = [lemmatizer.lemmatize(word) for word in word_tokenize(sent)]
print(lemmas)
# ['Artificial', 'intelligence', 'wa', 'founded', 'a', 'an', 'academic', 'discipline'

Text Syntax

Part-of-Speech Tagging

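NLTK also provides different part-of-speech (POS) taggers. By default, the built-in tagger produces Penn Treebank tags such as JJ, NN or VBD. When pos_tag is called with tagset='universal', these are mapped to the following coarse-grained annotations: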

Tag	Meaning
ADJ	adjective
ADP	adposition
ADV	adverb
CONJ	conjunction
DET	determiner, article
NOUN	noun
NUM	numeral
PRT	particle
PRON	pronoun
VERB	verb
.	punctuation marks
X	other

Taking the first sentence from the Wikipedia article about artificial intelligence, part of speech tagging produces the following result.

from nltk import pos_tag

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

pos_tag(word_tokenize(sent))

# [('Artificial', 'JJ'),
#  ('intelligence', 'NN'),
#  ('was', 'VBD'),
#  ('founded', 'VBN'),
#  ('as', 'IN'),
#  ('an', 'DT'),
#  ('academic', 'JJ'),
#  ('discipline', 'NN'),

The other NLTK POS taggers have additional requirements: the Stanford tagger needs external Java libraries to be downloaded, while the Brill tagger has to be trained on a tagged corpus first.
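
NLTK also ships simple rule-based and trainable taggers that can be chained through a backoff. The following sketch combines a UnigramTagger with a RegexpTagger fallback; the patterns and the tiny training set are made up for illustration, and a real project would train on a tagged corpus.

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Hand-written fallback rules (illustrative, far from complete)
patterns = [
    (r"^\d+$", "CD"),    # numbers
    (r".*ing$", "VBG"),  # gerunds
    (r".*ed$", "VBD"),   # simple past
    (r".*", "NN"),       # everything else: noun
]
regexp_tagger = RegexpTagger(patterns)

# Tiny hypothetical training set of already tagged tokens
train_sents = [[("the", "DT"), ("was", "VBD"), ("in", "IN"), ("an", "DT")]]
tagger = UnigramTagger(train_sents, backoff=regexp_tagger)

tagged = tagger.tag(["the", "discipline", "was", "founded", "in", "1956"])
print(tagged)
# [('the', 'DT'), ('discipline', 'NN'), ('was', 'VBD'), ('founded', 'VBD'), ('in', 'IN'), ('1956', 'CD')]
```

Words seen during training get their learned tag, and everything else falls through to the regular expressions.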

Text Semantics

Named Entity Recognition

NLTK includes pre-trained NER taggers, but several additional packages need to be downloaded first.

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

The NER tagger consumes a POS tagged sentence and adds the classification labels to the representation. Using it on the sample paragraph yields no results, so the following example takes another sentence from the Wikipedia article in which persons are mentioned.

from nltk.tokenize import word_tokenize

# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
sentence = '''
In 2011, in a Jeopardy! quiz show exhibition match, IBM's question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.
'''

tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
tagged_sentence
# [('In', 'IN'),
#  ('2011', 'CD'),
#  (',', ','),
#  ('in', 'IN'),
#  ('a', 'DT'),
#  ('Jeopardy', 'NN'),

print(nltk.ne_chunk(tagged_sentence))
# (S
#   In/IN
#   2011/CD
#   ,/,
#   in/IN
#   a/DT
#   Jeopardy/NN
#   !/.
#   quiz/NN
#   show/NN
#   exhibition/NN
#   match/NN
#   ,/,
#   (ORGANIZATION IBM/NNP)
#   's/POS
#   question/NN
#   answering/NN
#   system/NN
#   ,/,
#   (PERSON Watson/NNP)

As you can see, IBM is recognized as an organization and Watson is labeled as a person.
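
The result of ne_chunk is an nltk.tree.Tree in which every recognized entity is a labeled subtree. A minimal sketch for collecting the entities follows; to stay independent of downloaded models, it uses a hand-built tree that mirrors part of the output above.

```python
from nltk.tree import Tree

# Hand-built stand-in for part of the ne_chunk() output shown above
tree = Tree("S", [
    ("match", "NN"),
    (",", ","),
    Tree("ORGANIZATION", [("IBM", "NNP")]),
    ("'s", "POS"),
    ("system", "NN"),
    (",", ","),
    Tree("PERSON", [("Watson", "NNP")]),
])

entities = []
# Every subtree except the sentence root represents one named entity
for subtree in tree.subtrees(filter=lambda t: t.label() != "S"):
    name = " ".join(token for token, tag in subtree.leaves())
    entities.append((subtree.label(), name))

print(entities)
# [('ORGANIZATION', 'IBM'), ('PERSON', 'Watson')]
```

The same loop should work on the tree returned by nltk.ne_chunk(tagged_sentence).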

Document Semantics

Clustering

Three clustering algorithms are supported; see the complete documentation.

K-Means
EM Cluster
Group Average Agglomerative Clusterer (GAAC)
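
As a rough sketch of the clustering API, here is K-Means applied to four made-up two-dimensional vectors. It assumes NumPy is installed, since the NLTK clusterers operate on NumPy arrays; the seed value is arbitrary.

```python
import random

import numpy as np
from nltk.cluster import KMeansClusterer, euclidean_distance

# Toy vectors (made up): two points near the origin, two far away
vectors = [np.array(v) for v in [[1.0, 1.0], [1.5, 1.0], [8.0, 8.0], [9.0, 9.0]]]

clusterer = KMeansClusterer(
    2,                      # number of clusters
    euclidean_distance,     # distance function
    rng=random.Random(42),  # fixed seed for reproducible initial means
)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)  # cluster ids are arbitrary; the grouping is what matters

# New vectors can be assigned to the learned clusters afterwards
new_point_cluster = clusterer.classify(np.array([8.5, 8.5]))
print(new_point_cluster)
```

For text, the vectors would come from a feature extraction step such as term frequencies per document.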

Classification

The following classifiers are implemented in NLTK; see also the complete documentation.

Decision Tree
Maximum Entropy Modelling
Megam maxent optimization
Naive Bayes (and variants)

External packages, like TextCat for language identification, the Java library Weka, or scikit-learn classifiers, are also supported.

Additional Features

Datasets

NLTK provides more than 100 built-in corpora; see the complete list. Some examples: Reuters news articles, the Treebank 2 Wall Street Journal corpus, Twitter news, or the WordNet lexical database.

Here is an example of how to access the Reuters corpus.

from nltk.corpus import reuters

print(reuters.categories()[:10])
#['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee']

print(reuters.fileids()[:10])
# ['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']

sample = 'test/14829'
categories = reuters.categories(sample)

print(categories)
# ['crude', 'nat-gas']

content = ""
with reuters.open(sample) as stream:
    content = stream.read()

print(f"Categories {categories} / file {sample}")
# Categories ['crude', 'nat-gas'] / file test/14829

print(f"Content:\n{content}")
# Content:
# JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS
# The Ministry of International Trade and
# Industry (MITI) will revise its long-term energy supply/demand
# outlook by August to meet a forecast downtrend in Japanese
# energy demand, ministry officials said.
#     MITI is expected to lower the projection for primary energy
# supplies in the year 2000 to 550 mln kilolitres (kl) from 600
# mln, they said.
#     The decision follows the emergence of structural changes in
# Japanese industry following the rise in the value of the yen
# and a decline in domestic electric power demand.
#     MITI is planning to work out a revised energy supply/demand
# outlook through deliberations of committee meetings of the
# Agency of Natural Resources and Energy, the officials said.
#     They said MITI will also review the breakdown of energy
# supply sources, including oil, nuclear, coal and natural gas.
#     Nuclear energy provided the bulk of Japan's electric power
# in the fiscal year ended March 31, supplying an estimated 27
# pct on a kilowatt/hour basis, followed by oil (23 pct) and
# liquefied natural gas (21 pct), they noted.

Corpus Management

Corpus Reader

NLTK's corpus reader objects support reading, filtering, decoding, and preprocessing of structured file lists or zip files.

Many different corpus reader objects exist; see the full list. The most common readers are:

PlaintextCorpusReader: Reads text documents in which paragraphs are separated by blank lines
Markdown: Processes Markdown files whose categories are represented in the file names
Tagged: Special corpus reader objects that expect an already tagged corpus, such as CoNLL. Note that tagged versions already exist for several built-in corpora.
Twitter: Processes tweets in JSON format
XML: Processes XML files

As a short example, here is a PlaintextCorpusReader that will read *.txt files.

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')

print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt']

print(corpus.sents())
# [['In', 'the', 'field', 'of', 'artificial', 'intelligence', '(', 'AI', '),', 'AI', 'alignment', 'research', 'aims', 'to', 'steer', 'AI', 'systems', 'towards', 'humans', '’', 'intended', 'goals', ',', 'preferences', ',', 'or', 'ethical', 'principles', '.'], ['An', 'AI', 'system', 'is', 'considered', 'aligned', 'if', 'it', 'advances', 'the', 'intended', 'objectives', '.'], ...]
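
To try the reader without an existing article folder, the following sketch builds a throwaway corpus in a temporary directory. The file names and contents are made up. It uses words() and raw(), which, unlike sents(), do not depend on the downloaded sentence tokenizer.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Build a throwaway corpus directory with two made-up documents
with tempfile.TemporaryDirectory() as root:
    Path(root, "first.txt").write_text("NLTK reads plain text.", encoding="utf-8")
    Path(root, "second.txt").write_text("Each file is one document.", encoding="utf-8")
    Path(root, "notes.md").write_text("ignored", encoding="utf-8")

    # Only files matching the regular expression become part of the corpus
    corpus = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = corpus.fileids()
    first_words = list(corpus.words("first.txt"))  # materialize before cleanup
    raw_second = corpus.raw("second.txt")

print(file_ids)
# ['first.txt', 'second.txt']
print(first_words)
# ['NLTK', 'reads', 'plain', 'text', '.']
print(raw_second)
# Each file is one document.
```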

Text Collection

Another utility to access structured information from a corpus is the TextCollection class. Instantiated on tokenized texts, it provides the following functions:

collocations(num, window_size): Prints up to num collocations, that is, words that frequently appear together within window_size tokens
collocation_list(num, window_size): Returns the collocated words as a list of tuples
common_contexts(words, num): Prints the contexts that the given list of words share
concordance(word, width, lines): Prints the concordance for the given word (an individual word or a phrase given as a list of words)
concordance_list(word, width, lines): Returns the concordance as a list
count(word): Absolute number of occurrences of word
tf, idf, tf_idf: Term frequency, inverse document frequency, and their product
generate: Creates random text based on a trigram language model
vocab: Frequency distribution of all tokens
plot: Draws the frequency distribution

Here is an example:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.text import TextCollection

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
col = TextCollection(corpus.sents())

print(col.count('the'))
# 973

# common_contexts() prints its result itself and returns None
col.common_contexts(['intelligence'])
# artificial_( general_( artificial_. artificial_is general_,
# artificial_, artificial_in artificial_". artificial_and "_"
# artificial_was general_and general_. artificial_; artificial_" of_or
# artificial_– artificial_to artificial_: and_.
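
The tf, idf and tf_idf methods treat every element of the collection as one document. Here is a minimal sketch with three made-up, already tokenized documents:

```python
from nltk.text import TextCollection

# Three tiny, made-up documents, each already tokenized
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
col = TextCollection(documents)

total_the = col.count("the")
print(total_the)
# 3

# 'the' occurs in every document, so its idf (and tf-idf) is zero
tfidf_the = col.tf_idf("the", documents[0])
print(tfidf_the)
# 0.0

# 'dog' occurs in only one document and therefore outscores 'cat'
tfidf_dog = col.tf_idf("dog", documents[1])
tfidf_cat = col.tf_idf("cat", documents[0])
print(tfidf_dog > tfidf_cat)
# True
```

Note that a collection built from corpus.sents() counts each sentence as a document; to compute idf per file, one option is to build the collection from one token list per file ID instead.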

Machine Learning Clustering and Classification Models

NLTK provides several clustering and classification algorithms. But before using any algorithm, features need to be manually designed and extracted from the texts.

On the API documentation page about classification, the steps are defined as follows:

Define the features that are relevant to the ML task
Implement methods that extract the features from the corpora (e.g. word frequency from documents)
Create a list of (feature_dict, label) tuples, in which each feature_dict is a Python dictionary that maps feature names to values, and pass this list to the training algorithm

Let’s illustrate this with an example from the NLTK Handbook to build a text classifier.

First, we build a vocabulary of all words:

import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')

vocab = nltk.FreqDist(w.lower() for w in corpus.words())
#  FreqDist({'the': 65590, ',': 63310, '.': 52247, 'of': 39000, 'and': 30868, 'a': 30130, 'to': 27881, 'in': 24501, '-': 19867, '(': 18243, ...})

# Reuse the frequency distribution computed above as the list of word features
word_features = list(vocab)
# ['the', ':']

Second, we define a method that returns the one-hot-encoded word vector that expresses whether a word is present in the document or not. The resulting feature vector must contain boolean values in order to be usable for classification tasks.

def document_features(document):
    # Lowercase the tokens so that they match the lowercased vocabulary
    document_words = set(w.lower() for w in corpus.words(document))
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

f = document_features('Artificial_intelligence.txt')
# {'contains(the)': True,
#  'contains(,)': True,
#  'contains(.)': True,

Third, we select a classification algorithm and pass the featurized documents to it.

# The file ID serves as the label of each document
featuresets = [(document_features(d), d) for d in corpus.fileids()]
# featuresets is now a list of (feature_dict, label) tuples

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
# <nltk.classify.naivebayes.NaiveBayesClassifier at 0x185ec5dd0>
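
To see the three steps end to end on data that fits on one screen, here is a self-contained toy sketch. The sentences and the topic labels are made up, and sentence_features is a hypothetical helper that mirrors document_features above.

```python
import nltk

# Made-up labelled sentences; the labels are hypothetical topic names
labelled = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the senate passed the bill", "politics"),
    ("the president signed the law", "politics"),
]
vocabulary = sorted({w for text, _ in labelled for w in text.split()})

def sentence_features(text):
    """Boolean bag-of-words features, one entry per vocabulary word."""
    words = set(text.split())
    return {f"contains({w})": (w in words) for w in vocabulary}

# A list of (feature_dict, label) tuples is what the trainer expects
train_set = [(sentence_features(text), label) for text, label in labelled]
classifier = nltk.NaiveBayesClassifier.train(train_set)

sports_prediction = classifier.classify(sentence_features("the team scored"))
print(sports_prediction)
# sports

politics_prediction = classifier.classify(sentence_features("the president passed a law"))
print(politics_prediction)
# politics
```

With real corpora, the labels would come from document categories, and a held-out test set would be scored with nltk.classify.accuracy.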

Summary

NLTK is a versatile library that supports several NLP tasks. For the core tasks of tokenizing, stemming/lemmatization, and part-of-speech tagging, built-in methods as well as methods from scientific papers are included. For managing a corpus of documents, NLTK handles text, Markdown, XML and other formats, and provides an API to fetch files, categories, sentences and words. Especially helpful is the TextCollection class, which enables gathering word collocations and computing term frequencies. Finally, NLTK also offers clustering and classification algorithms such as K-Means, Decision Trees or Naive Bayes.
