First steps in text processing with NLTK: text tokenization and analysis
jesusramirezs
Posted on November 17, 2020
I have already had the opportunity to talk about NLTK in two of my previous articles (link#1, link#2).
In this article, I would like to review some possibilities of NLTK.
The kind of examples discussed in articles like this one fall under what is called natural language processing (NLP). We can apply these techniques to different categories of texts to obtain very varied results:
- Automatic summaries
- Sentiment analysis
- Keyword extraction for search engines
- Content recommendation
- Opinion research (in marketplaces, aggregators, etc...)
- Offensive language filters
- ...
NLTK is not the only package in this field. There are other alternatives for these types of tasks, such as:
- Apache OpenNLP.
- Stanford NLP suite.
- Gate NLP library.
To start experimenting, we will first install NLTK from the command line, which is very simple:
pip install nltk
Once the NLTK library is installed, we can download additional resources from within Python, like the Punkt sentence tokenizer:
import nltk
nltk.download('punkt')
One of the most important things to do before tackling any natural language processing task is "text tokenization". This phase is critical: without it, processing the text becomes much more challenging.
Tokenization, also known as text segmentation or linguistic analysis, consists of conceptually dividing text or text strings into smaller parts such as sentences, words, or symbols. As a result of the tokenization process, we will get a list of tokens.
NLTK includes both a sentence tokenizer and a word tokenizer. A text can be split into sentences; sentences can be tokenized into words, and so on.
We have, for example, this text (from Wikipedia - Stoicism):
para = "Stoicism is a school of Hellenistic philosophy founded by Zeno of Citium in Athens in the early 3rd century BC. It is a philosophy of personal ethics informed by its system of logic and its views on the natural world. According to its teachings, as social beings, the path to eudaimonia (happiness, or blessedness) is found in accepting the moment as it presents itself, by not allowing oneself to be controlled by the desire for pleasure or by the fear of pain, by using one's mind to understand the world and to do one's part in nature's plan, and by working together and treating others fairly and justly."
Tokenizing in Python is simple. We import the sent_tokenize function from NLTK, which returns a list with one token per sentence.
from nltk.tokenize import sent_tokenize
tokenized_l1 = sent_tokenize(para)
print(tokenized_l1)
and we will get the following result:
['Stoicism is a school of Hellenistic philosophy founded by Zeno of Citium in Athens in the early 3rd century BC.', 'It is a philosophy of personal ethics informed by its system of logic and its views on the natural world.', "According to its teachings, as social beings, the path to eudaimonia (happiness, or blessedness) is found in accepting the moment as it presents itself, by not allowing oneself to be controlled by the desire for pleasure or by the fear of pain, by using one's mind to understand the world and to do one's part in nature's plan, and by working together and treating others fairly and justly."]
Likewise, we can tokenize a sentence to obtain a list of words:
from nltk.tokenize import word_tokenize
sentence1 = tokenized_l1[0]
print(word_tokenize(sentence1))
['Stoicism', 'is', 'a', 'school', 'of', 'Hellenistic', 'philosophy', 'founded', 'by', 'Zeno', 'of', 'Citium', 'in', 'Athens', 'in', 'the', 'early', '3rd', 'century', 'BC', '.']
Well, let's do something a little closer to a real case; for example, extract some statistics from an article. We can take the content of a web page and then analyze its text to draw some conclusions.
For this, we can use urllib.request to get the HTML content of our target page:
import urllib.request
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Stoicism')
html = response.read()
print(html)
and then use BeautifulSoup, a very useful Python library for extracting data from HTML and XML documents with different levels of filtering and noise removal. We can extract only the text of the page without HTML markup by using get_text(), or with a custom selection like in the example code below.
pip install beautifulsoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html.parser")
title = soup.select("#firstHeading")[0].text
paragraphs = soup.select("p")
intro = '\n'.join([ para.text for para in paragraphs[0:4]])
print(intro)
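If we don't need that level of control, BeautifulSoup's get_text() gives us all the visible text of the page in one call, at the cost of pulling in navigation text, references and other noise:
full_text = soup.get_text()  # everything on the page, without HTML markup
print(full_text[:500])       # peek at the first 500 characters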
Finally, we can convert the retrieved text into tokens, splitting it as described above:
tokens = word_tokenize(intro)
print (tokens)
From here, we can apply different tools to “standardize” our token set. For example, to convert all tokens to lowercase:
new_tokens = []
for token in tokens:
    new_token = token.lower()
    new_tokens.append(new_token)
tokens = new_tokens
...remove punctuation:
import re
new_tokens = []
for token in tokens:
    new_token = re.sub(r'[^\w\s]', '', token)
    if new_token != '':
        new_tokens.append(new_token)
tokens = new_tokens
...replace numbers with their textual representation using Inflect, a library that generates plurals, singular nouns, ordinals and indefinite articles, and converts numbers to words:
pip install inflect
import inflect
p = inflect.engine()
new_tokens = []
for token in tokens:
    if token.isdigit():
        new_token = p.number_to_words(token)
        new_tokens.append(new_token)
    else:
        new_tokens.append(token)
tokens = new_tokens
... and remove stopwords, which are words that don't add significant meaning to the text.
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
new_tokens = []
for token in tokens:
    if token not in stopwords.words('english'):
        new_tokens.append(token)
tokens = new_tokens
Finally, lemmatisation will allow us to extract the root of each word and thus ignore any inflection (verbal conjugations, plurals...)
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = []
for token in tokens:
    # pos='v' looks up each token as a verb when searching for its lemma
    lemma = lemmatizer.lemmatize(token, pos='v')
    lemmas.append(lemma)
tokens = lemmas
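Putting it all together, the whole standardization pipeline can be wrapped in a single helper function. This is just a sketch that chains the same steps we applied one by one (lowercasing, punctuation removal, spelling out numbers, stopword removal and lemmatization); the name normalize_tokens is my own choice, not something provided by NLTK:
import re
import inflect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def normalize_tokens(text):
    """Tokenize a text and apply the normalization steps described above."""
    p = inflect.engine()
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    normalized = []
    for token in word_tokenize(text):
        token = token.lower()                  # lowercase
        token = re.sub(r'[^\w\s]', '', token)  # strip punctuation
        if token == '':
            continue
        if token.isdigit():                    # spell out numbers
            token = p.number_to_words(token)
        if token in stop_words:                # drop stopwords
            continue
        normalized.append(lemmatizer.lemmatize(token, pos='v'))
    return normalized

tokens = normalize_tokens(intro)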
Tokenization and lemmatization are techniques I used extensively in my latest project.
Once all this standardization is done, we can move on to a simple analysis; for example, calculating the frequency distribution of those tokens using the NLTK function FreqDist(), which does exactly this job:
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
material:1
found:2
ethical:1
phrase:2
present:1
others:1
justly:1
teach:2
athens:1
natural:2
especially:1
corruptions:1
health:1
live:1
believe:1
sufficient:1
emotionally:1
misfortune:1
citium:1
moral:1
behave:1
sage:2
mind:1
everything:1
zeno:1
radical:1
epictetus—emphasized:1
pain:1
pleasure:1
moment:1
blessedness:1
allow:1
hold:1
philosophy:3
also:1
two:1
know:1
rule:1
adiaphora:1
people:1
fear:1
bc:1
use:1
vicious:1
together:1
accord:1
maintain:1
fairly:1
say:1
prohairesis:1
act:1
ethics:3
form:1
work:1
nature:3
would:1
human:1
social:1
wealth:1
stoicism:1
understand:2
three:1
one:5
mean:1
think:2
system:1
best:1
judgment:1
oneself:1
pleasure:1
accept:1
hellenistic:1
school:1
virtue:4
personal:1
person:2
errors:1
century:1
call:1
equally:1
eudaimonia:1
order:1
free:1
find:1
similar:1
destructive:1
calm:1
plan:1
value:1
alongside:1
indication:1
accordance:1
view:2
certain:1
good:3
seneca:1
bad:1
many:1
result:1
consider:1
root:1
belief:1
resilient:1
path:1
truly:1
control:1
include:1
logic:1
early:1
desire:1
world:2
life:1
inform:1
major:1
happiness:2
part:1
tradition:1
though:1
emotions:1
treat:1
since:1
stoics:3
individual:1
aim:1
aristotelian:1
be:2
upon:1
external:1
stoic:3
approach:1
And finally, we can represent the result graphically:
pip install matplotlib
freq.plot(20, cumulative=False)
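If you prefer a textual summary instead of a chart, FreqDist also offers most_common() (it behaves like collections.Counter), which returns the n most frequent tokens as (token, count) pairs:
# Top 20 tokens and their counts, most frequent first
for token, count in freq.most_common(20):
    print(f'{token}: {count}')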
This first analysis can help us classify a text and determine how we could index or frame the article within a content aggregator.
We can apply formal techniques to this classification, such as a Naive Bayes classifier. In its simplest, "naive" mode, we use the conditional probabilities of the words in a text to determine which category it belongs to. The algorithm is called "naive" because it calculates each word's conditional probability separately, as if the words were independent of each other. Once we have each word's conditional probability, we multiply them all together (a Pi-product) to obtain the likelihood that the text belongs to the category.
I would like to cover a simple example of applying a Naive Bayes classifier in another article.
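In the meantime, here is a minimal sketch of the idea using NLTK's built-in NaiveBayesClassifier. The tiny training set, the labels 'philosophy' and 'sports', and the bag-of-words feature extractor are made up purely for illustration:
import nltk
from nltk.tokenize import word_tokenize

# Toy training data: (text, category) pairs -- purely illustrative
train = [
    ("stoicism teaches virtue ethics and logic", "philosophy"),
    ("zeno founded a school of philosophy in athens", "philosophy"),
    ("the team scored a goal in the final match", "sports"),
    ("the player won the championship game", "sports"),
]

def bag_of_words(text):
    # Simple bag-of-words features: each word present maps to True
    return {word: True for word in word_tokenize(text.lower())}

train_set = [(bag_of_words(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("a hellenistic school of ethics")))
# with this toy data, most likely 'philosophy'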
To finish this article, we can tackle an apparently difficult task which becomes easy to achieve in Python, at least at a first level of approach: sentiment analysis. Using NLTK, we can analyze the sentiment of each sentence in a text. Sentiment analysis is a machine learning technique based on natural language processing that aims to obtain subjective information from a series of texts or documents.
To do this, we must download the next package:
nltk.download('vader_lexicon')
This package implements VADER (Valence Aware Dictionary for Sentiment Reasoning), a model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity of emotion.
Next, we split the text to be analyzed into sentences, using the same tokenization process as before to obtain each sentence separately. The text is an excerpt from William Lyon Phelps' The Pleasure of Books (from http://www.historyplace.com/speeches/phelps.htm).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

text = "The habit of reading is one of the greatest resources of mankind; and we enjoy reading books that belong to us much more than if they are borrowed. A borrowed book is like a guest in the house; it must be treated with punctiliousness, with a certain considerate formality. You must see that it sustains no damage; it must not suffer while under your roof. You cannot leave it carelessly, you cannot mark it, you cannot turn down the pages, you cannot use it familiarly. And then, some day, although this is seldom done, you really ought to return it..."

tokenized_text = sent_tokenize(text)
...and finally we instantiate the sentiment analyzer and apply it to each sentence.
analyzer = SentimentIntensityAnalyzer()
for sentence in tokenized_text:
    print(sentence)
    scores = analyzer.polarity_scores(sentence)
    for key in scores:
        print(key, ': ', scores[key])
    print()
As a result we can examine each of the phrases separately.
These are some example results:
The habit of reading is one of the greatest resources of mankind; and we enjoy reading books that belong to us much more than if they are borrowed.
neg : 0.0
pos : 0.222
neu : 0.778
compound : 0.8126
A borrowed book is like a guest in the house; it must be treated with punctiliousness, with a certain considerate formality.
neg : 0.0
pos : 0.333
neu : 0.667
compound : 0.7579
You must see that it sustains no damage; it must not suffer while under your roof.
neg : 0.254
pos : 0.134
neu : 0.612
compound : -0.3716
You cannot leave it carelessly, you cannot mark it, you cannot turn down the pages, you cannot use it familiarly.
neg : 0.0
pos : 0.138
neu : 0.862
compound : 0.2235
...
And there is no doubt that in these books you see these men at their best.
neg : 0.215
pos : 0.192
neu : 0.594
compound : 0.128
For each sentence, several different scores are obtained, as seen in the output above:
- neg (negative): tells us how negative the sentence is.
- neu (neutral): indicates the neutrality of the sentence, with a score between zero and one.
- pos (positive): same as the previous ones, but indicating how positive the sentence is.
- compound: a value between -1 and 1 that indicates at a glance whether the sentence is positive or negative. Values close to -1 suggest it is very negative, values close to zero that it is neutral, and values close to 1 that it is very positive (see the sketch below).
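A convention commonly used with VADER is to treat a compound score of 0.05 or more as positive, -0.05 or less as negative, and anything in between as neutral. Here is a small sketch along those lines, reusing the analyzer and tokenized_text from above (the label_sentiment helper and the column formatting are my own):
def label_sentiment(compound):
    # Thresholds commonly used with VADER's compound score
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

for sentence in tokenized_text:
    compound = analyzer.polarity_scores(sentence)['compound']
    print(f'{label_sentiment(compound):8} {sentence[:60]}')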
If you are interested in this field of analysis, you can probably find texts better suited to sentiment analysis, like political opinions, product reviews... this is just an example.
Thanks for reading this article. If you have any questions, feel free to comment below.