
Getting Started with Natural Language Toolkit (NLTK)


Mahmoud Sehsah

Posted on January 27, 2024


Introduction

NLTK (Natural Language Toolkit) is one of the most popular Python libraries for working with human language data (i.e., text). This tutorial guides you through the installation process, basic concepts, and some key functionalities of NLTK.

Link for the Notebook

1. Installation

First, you need to install NLTK, which you can do easily with pip. In a notebook cell, run the following command (the leading ! runs it as a shell command; in a regular terminal, drop the ! and just run pip install nltk):

!pip install nltk
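To confirm the installation worked, you can print the installed version from Python; a quick check:

import nltk
print(nltk.__version__)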

2. Understanding the Role of nltk.download() in NLTK Setup

NLTK ships its corpora, tokenizer models, and taggers separately from the library itself. nltk.download() fetches these resources on demand, keeping them up to date and simplifying setup; called with no arguments, it opens an interactive downloader.

import nltk
nltk.download()  # with no arguments, opens the interactive downloader
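In a script, it is usually more convenient to download specific resources by name instead of opening the interactive window. A short sketch that fetches the resources used later in this tutorial:

import nltk

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # lemmatizer dictionary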

3. Tokenization

Tokenization is the process of splitting a text into meaningful units, such as words or sentences.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello there! How are you? I hope you're learning a lot from this tutorial."

# Sentence Tokenization
sentences = sent_tokenize(text)
print(sentences)

# Word Tokenization
words = word_tokenize(text)
print(words)
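The two tokenizers also compose naturally: if you need word tokens grouped per sentence, tokenize the sentences first and then the words within each one. A minimal sketch:

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello there! How are you? I hope you're learning a lot from this tutorial."

# One list of word tokens per sentence
words_by_sentence = [word_tokenize(s) for s in sent_tokenize(text)]
print(words_by_sentence)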


4. Part-of-Speech (POS) Tagging

POS tagging means labeling words with their part of speech (noun, verb, adjective, etc.).

from nltk import pos_tag
from nltk.tokenize import word_tokenize

words = word_tokenize("I am learning NLP with NLTK")
pos_tags = pos_tag(words)
print(pos_tags)
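The tagger uses Penn Treebank tags such as PRP, VBP, and VBG, which can be cryptic at first. NLTK can describe any tag for you, assuming the 'tagsets' resource has been downloaded:

import nltk

nltk.download('tagsets')       # descriptions of the Penn Treebank tag set
nltk.help.upenn_tagset('VBG')  # prints the definition and examples for VBG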


5. Stopwords

Stopwords are common words that are usually removed from text because they carry little meaningful information.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = word_tokenize("Hello there! How are you? I hope you're learning a lot from this tutorial.")
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print(filtered_words)
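Note that NLTK's English stopword list is lowercase, so capitalized tokens such as 'How' slip through the filter above. A common refinement is to compare lowercased tokens instead; a minimal sketch:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = word_tokenize("Hello there! How are you?")
stop_words = set(stopwords.words('english'))

# Compare the lowercased token so 'How' matches the stopword 'how'
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)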


6. Stemming

Stemming is a process of stripping suffixes from words to extract the base or root form, known as the 'stem'. For example, the stem of the words 'waiting', 'waited', and 'waits' is 'wait'.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "It's important to be waiting patiently when you're learning to code."
words = word_tokenize(sentence)
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
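You can verify the behaviour described above by stemming the example words from the definition directly:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["waiting", "waited", "waits"]:
    print(word, "->", ps.stem(word))  # each reduces to 'wait'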


7. Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, known as the 'lemma'. Unlike stemming, lemmatization considers the context and converts the word to its meaningful base form. For instance, 'is', 'are', and 'am' would all be lemmatized to 'be'.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')  # pass download_dir='...' to install into a custom NLTK data directory

lemmatizer = WordNetLemmatizer()
sentence = "The leaves on the ground were raked by the gardener, who was also planting bulbs for the coming spring."
words = word_tokenize(sentence)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
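Note that lemmatize() treats every word as a noun by default, which is why verbs such as 'was' pass through unchanged in the example above. Passing a part-of-speech hint produces the dictionary forms described earlier; a minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos='v' tells the lemmatizer to treat each word as a verb
for word in ["is", "are", "am"]:
    print(word, "->", lemmatizer.lemmatize(word, pos='v'))  # all map to 'be'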


8. Frequency Distribution

A frequency distribution counts how often each vocabulary item appears in a text.

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

words = word_tokenize("I need to write a very, very simple sentence")
fdist = FreqDist(words)
print(fdist.most_common(1))  # the single most frequent token

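A FreqDist behaves like a dictionary of counts, so you can also look up individual tokens or list the top few; a short sketch:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

fdist = FreqDist(word_tokenize("I need to write a very, very simple sentence"))
print(fdist['very'])         # count for one token
print(fdist.most_common(3))  # the three most frequent tokens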

9. Named Entity Recognition (NER)

NER is used to identify entities like names, locations, dates, etc., in the text.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "I will travel to Spain"
# Tokenize the sentence
words = word_tokenize(sentence)
# Part-of-speech tagging
pos_tags = pos_tag(words)
# Named entity recognition
named_entities = ne_chunk(pos_tags)
# Print named entities
print(named_entities)

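ne_chunk() returns an nltk.Tree in which each named entity is a labelled subtree, while ordinary tokens stay as (word, tag) tuples. A small sketch of extracting the entities programmatically:

from nltk import Tree
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

tree = ne_chunk(pos_tag(word_tokenize("I will travel to Spain")))
for subtree in tree:
    # Entity subtrees carry a label such as GPE (geopolitical entity)
    if isinstance(subtree, Tree):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), "->", entity)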

