Mahmoud Sehsah
Posted on January 27, 2024
Introduction
NLTK (Natural Language Toolkit) is one of the most popular Python libraries for working with human language data (i.e., text). This tutorial guides you through installation, basic concepts, and some key functionality of NLTK.
1. Installation
First, install NLTK with pip. In your command line (Terminal, Command Prompt, etc.), enter the following command:
pip install nltk
In a Jupyter or Colab notebook cell, prefix the command with an exclamation mark instead (!pip install nltk).
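To confirm the installation worked, a quick check along these lines can help. It only imports the package and prints its version, so it needs no downloaded data; the variable name is just for illustration.

```python
# Sanity check: import NLTK and report which version pip installed.
import nltk

installed_version = nltk.__version__
print("NLTK version:", installed_version)
```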
2. Understanding the Role of nltk.download() in NLTK Setup
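nltk.download() fetches the corpora and pretrained models that many NLTK functions depend on. Called with no arguments, it opens an interactive downloader; to fetch a single package without the prompt, pass its name, for example nltk.download('punkt').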
import nltk
nltk.download()
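Since downloads only need to happen once per machine, one option is to check for a resource before fetching it. The sketch below is a minimal illustration, not part of NLTK itself: `ensure_resource` and its `download` parameter are hypothetical names, while `nltk.data.find` and `nltk.download` are the real calls it wraps.

```python
import nltk

def ensure_resource(resource_path, package_name, download=nltk.download):
    """Download package_name only if resource_path is not already installed.

    Returns True if a download was triggered, False if the data was present.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        download(package_name, quiet=True)
        return True
```

For instance, `ensure_resource('corpora/stopwords', 'stopwords')` would download the stopword lists only on the first run. If a function raises a LookupError for a missing resource, the error message names the package to download, which can differ between NLTK versions.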
3. Tokenization
Tokenization is the process of splitting a text into meaningful units, such as words or sentences.
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Hello there! How are you? I hope you're learning a lot from this tutorial."
# Sentence Tokenization
sentences = sent_tokenize(text)
print(sentences)
# Word Tokenization
words = word_tokenize(text)
print(words)
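word_tokenize and sent_tokenize rely on downloaded tokenizer data. As a point of comparison, NLTK also ships purely rule-based tokenizers that need no download. The sketch below uses a shortened version of the sample text to show how two of them treat punctuation.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Hello there! How are you?"

# Keep only runs of word characters, dropping punctuation entirely.
word_only = RegexpTokenizer(r"\w+").tokenize(text)
print(word_only)   # ['Hello', 'there', 'How', 'are', 'you']

# Split into alphanumeric runs and punctuation runs, keeping both.
with_punct = wordpunct_tokenize(text)
print(with_punct)  # ['Hello', 'there', '!', 'How', 'are', 'you', '?']
```

Which tokenizer fits depends on whether punctuation matters for the later steps in your pipeline.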
4. Part-of-Speech (POS) Tagging
POS tagging means labeling words with their part of speech (noun, verb, adjective, etc.).
from nltk import pos_tag
from nltk.tokenize import word_tokenize
words = word_tokenize("I am learning NLP with NLTK")
pos_tags = pos_tag(words)
print(pos_tags)
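pos_tag returns a list of (token, tag) pairs, using Penn Treebank tags by default. A common next step is to pick out words by tag. The sketch below works on a hand-written sample in that shape; the tags are illustrative rather than captured output, and `words_with_tag_prefix` is a hypothetical helper name.

```python
# Hand-written sample in the shape pos_tag() returns: (token, tag) pairs.
# The tags are illustrative Penn Treebank labels, not captured output.
tagged = [("I", "PRP"), ("am", "VBP"), ("learning", "VBG"),
          ("NLP", "NNP"), ("with", "IN"), ("NLTK", "NNP")]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")  # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, "VB")  # VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)  # ['NLP', 'NLTK']
print(verbs)  # ['am', 'learning']
```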
5. Stopwords
Stopwords are common words that are usually removed from text because they carry little meaningful information.
from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.tokenize import word_tokenize
words = word_tokenize("Hello there! How are you? I hope you're learning a lot from this tutorial.")
stop_words = set(stopwords.words('english'))
# The stopword list is lowercase, so compare the lowercased token
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
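Two details are easy to miss when filtering: NLTK's English stopword list is lowercase, and punctuation tokens are not stopwords, so they survive unless removed separately. The sketch below illustrates both points in plain Python. `SAMPLE_STOP_WORDS` is a tiny hand-picked stand-in for `set(stopwords.words('english'))`, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny hand-picked stand-in for set(stopwords.words('english')).
SAMPLE_STOP_WORDS = {"how", "are", "you", "i", "a", "from", "this", "there"}

def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitively) and tokens with no letters or digits."""
    return [
        token for token in tokens
        if token.lower() not in stop_words and any(ch.isalnum() for ch in token)
    ]

tokens = ["Hello", "there", "!", "How", "are", "you", "?", "I", "hope"]
filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)  # ['Hello', 'hope']
```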
6. Stemming
Stemming is a process of stripping suffixes from words to extract the base or root form, known as the 'stem'. For example, the stem of the words 'waiting', 'waited', and 'waits' is 'wait'.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "It's important to be waiting patiently when you're learning to code."
words = word_tokenize(sentence)
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
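The 'wait' example from the definition can be checked directly. PorterStemmer is rule-based and needs no downloaded data, so this short sketch runs on a fresh install.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
forms = ["waiting", "waited", "waits"]

# Each inflected form is reduced to the same stem.
stems = [ps.stem(form) for form in forms]
print(stems)  # ['wait', 'wait', 'wait']
```

Keep in mind that a stem is the result of suffix rules, so it is not guaranteed to be a dictionary word.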
7. Lemmatization
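Lemmatization is the process of reducing a word to its base or dictionary form, known as the 'lemma'. Unlike stemming, lemmatization looks words up in a vocabulary (WordNet, in NLTK's case), so the result is a real word. For instance, 'is', 'are', and 'am' are all lemmatized to 'be' when they are treated as verbs. Note that WordNetLemmatizer.lemmatize() treats every word as a noun unless you pass a part of speech, for example pos='v'.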
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')  # for a non-default location, pass download_dir=<your nltk_data root>, not the corpora/wordnet subfolder
lemmatizer = WordNetLemmatizer()
sentence = "The leaves on the ground were raked by the gardener, who was also planting bulbs for the coming spring."
words = word_tokenize(sentence)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
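lemmatize() accepts an optional pos argument ('n', 'v', 'a', or 'r') and defaults to noun, which is why verb forms are often left unchanged when it is omitted. One common approach is to derive that argument from POS tags. The helper below is a rough sketch of such a mapping; `penn_to_wordnet_pos` is a hypothetical name, and the fallback to noun is an assumption chosen to match the lemmatizer's default.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the one-letter POS codes WordNet uses."""
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("J"):
        return "a"  # adjective
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

print(penn_to_wordnet_pos("VBD"))  # v
print(penn_to_wordnet_pos("NNS"))  # n
```

With the wordnet data downloaded, the result could be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))` for each `(word, tag)` pair produced by `pos_tag(words)`.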
8. Frequency Distribution
A frequency distribution counts how many times each token occurs in a text.
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
words = word_tokenize("I need to write a very, very simple sentence")
fdist = FreqDist(words)
print(fdist.most_common(1))
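Beyond most_common(), a FreqDist can be queried like a dictionary. The sketch below uses a plain str.split() on a punctuation-free version of the sentence so that it runs without any tokenizer data; that simplification is an assumption made for the example.

```python
from nltk.probability import FreqDist

# Whitespace split keeps the example free of downloaded tokenizer data.
tokens = "i need to write a very very simple sentence".split()
fdist = FreqDist(tokens)

print(fdist["very"])         # 2, the count of a single token
print(fdist.N())             # 9, the total number of tokens counted
print(fdist.B())             # 8, the number of distinct tokens
print(fdist.most_common(1))  # [('very', 2)]
```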
9. Named Entity Recognition (NER)
NER is used to identify entities like names, locations, dates, etc., in the text.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "I will travel to Spain"
# Tokenize the sentence
words = word_tokenize(sentence)
# Part-of-speech tagging
pos_tags = pos_tag(words)
# Named entity recognition
named_entities = ne_chunk(pos_tags)
# Print named entities
print(named_entities)
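ne_chunk returns an nltk.Tree in which each recognized entity is a labeled subtree and every other token stays a plain (word, tag) pair. To pull the entities out as a list, the tree can be walked as sketched below. The tree here is built by hand as a stand-in for real output, so the GPE label on "Spain" is illustrative, and `extract_entities` is a hypothetical helper.

```python
from nltk.tree import Tree

# Hand-built stand-in for the kind of tree ne_chunk() returns; the GPE label
# on "Spain" is illustrative, not captured output.
sample_tree = Tree("S", [("I", "PRP"), ("will", "MD"), ("travel", "VB"),
                         ("to", "TO"), Tree("GPE", [("Spain", "NNP")])])

def extract_entities(tree):
    """Return (entity text, entity label) for each labeled chunk under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

print(extract_entities(sample_tree))  # [('Spain', 'GPE')]
```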