Processing Hindi text with spaCy
Rahul Gupta
Posted on August 21, 2020
Note: I understand that this post can be hard to follow for non-Hindi readers, so I have included English translations after the Hindi words.
Tons of resources are available for processing English (and most Roman-script languages) text, but not so many for other languages. In this post, we will explore how we can use spaCy to process Hindi text.
Here we will be using the spaCy module for processing and indic-nlp-datasets for getting the data. We will use text from the novel Devdas by Sharat Chandra to demonstrate common NLP tasks.
Let's install these two libraries.
pip install spacy
pip install indic-nlp-datasets
from idatasets.devdas import load_devdas
devdas = load_devdas()
# devdas.data is a generator of paragraphs
paragraphs = list(devdas.data)
text = " ".join(paragraphs)
words = text.split(" ")
So, words now holds a list of all the words in the novel.
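As an aside, splitting on spaces is a fairly crude way to tokenize. Since this post is about spaCy, here is a minimal sketch (an optional alternative, not required for the rest of the post): spaCy ships a blank Hindi pipeline, spacy.blank("hi"), whose rule-based tokenizer can split the paragraphs into tokens while letting us drop punctuation and whitespace.
import spacy

# Blank Hindi pipeline: just the rule-based tokenizer, no trained components.
nlp = spacy.blank("hi")

spacy_words = []
for doc in nlp.pipe(paragraphs):
    # Keep word-like tokens, skipping punctuation and whitespace.
    spacy_words.extend(tok.text for tok in doc if not (tok.is_punct or tok.is_space))
For the rest of the post we will stick with the simple split, but the same Counter analysis below would work on spacy_words as well.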
from collections import Counter
cnt = Counter(words)
cnt.most_common(10)
# prints
# [('के', 696), // of
# ('ने', 676),
# ('नही', 672), // not
# ('से', 626), // to
# ('मे', 562), // in
# ('की', 480), // of
# ('है', 444), // is
# ('देवदास', 437),// Devdas
# ('को', 336), // 's
# ('पार्वती', 332)] // Parvati
We see that the top words are not especially meaningful; they are mostly connectors and particles. Let's use spaCy's Hindi stop word list to get rid of those.
from spacy.lang.hi import STOP_WORDS as STOP_WORDS_HI
stop_words_set = set(STOP_WORDS_HI)
non_stop_words = [word for word in words if word not in stop_words_set]
non_stop_cnt = Counter(non_stop_words)
non_stop_cnt.most_common(10)
# prints
# [('नही', 782), // not
# ('देवदास', 472), // Devdas
# ('कहा-', 390), // said
# ('पार्वती', 345), // Parvati
# ('क्या', 237), // what
# ('दिन', 187), // day
# ('बात', 168),// Talk
# ('तुम', 168), // you
# ('मै', 160), // I
# ('चन्द्रमुखी', 154)] // Chandramukhi
Now we see more interesting words appearing as common words. Three of these 10 most common words ('देवदास', 'पार्वती', 'चन्द्रमुखी') [Devdas, Parvati, Chandramukhi] correspond to the three main characters around whom the whole love-triangle story revolves.
Printing the most common words is great, but it isn't enough to justify a cushy data scientist job. :D So, let's make it prettier using a word cloud.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(
width=400,
height=300,
max_font_size=50,
max_words=1000,
background_color="white",
stopwords=STOP_WORDS_HI,
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This gives us the plot below.
Wait, where have all the words gone?
After googling a bit, I found the GitHub issue below, which explains that we need a Devanagari font to render the image correctly.
https://github.com/amueller/word_cloud/issues/70
So, we modify the code to use a custom font file
font="gargi.ttf"
wordcloud = WordCloud(
width=400,
height=300,
max_font_size=50,
max_words=1000,
background_color="white",
stopwords=STOP_WORDS_HI,
font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This yields the image below.
You may notice that the word cloud now renders Hindi letters, but it doesn't contain the most frequent words that we saw before. Also, it doesn't have any of the vowel signs ("मात्रा"). So, what's happening here?
The issue below explains that the default "\w+" regex pattern doesn't work as expected for languages other than English. An easy workaround is to pass our own regex, which matches all Hindi letters, including the vowel signs.
https://github.com/amueller/word_cloud/issues/272
So, let's fix that
wordcloud = WordCloud(
width=400,
height=300,
max_font_size=50,
max_words=1000,
background_color="white",
stopwords=STOP_WORDS_HI,
regexp=r"[\u0900-\u097F]+",
font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This looks alright. A few things to note here:
- Names of all the prominent characters show up in the word cloud.
- "नहीं"(Not) word appear a lot. Which signals that characters are often not in agreement with each other.
Next up, we will talk about how you can do some other tasks, such as part-of-speech tagging and automatically finding names of characters/cities/organizations in a sentence.
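As a rough preview (a sketch, not the final approach of the next post): spaCy exposes part-of-speech tags and named entities through the same Doc API regardless of language, but it does not ship a pretrained Hindi model, so you would need to train one or plug in a third-party pipeline. The "hi_model" name below is a hypothetical placeholder for such a trained pipeline.
import spacy

# "hi_model" is a hypothetical path/package name for a trained Hindi pipeline;
# spaCy itself does not provide a pretrained Hindi model out of the box.
nlp = spacy.load("hi_model")
doc = nlp("देवदास ने पार्वती से कहा")  # "Devdas said to Parvati"

for token in doc:
    print(token.text, token.pos_)   # part-of-speech tag for each token

for ent in doc.ents:
    print(ent.text, ent.label_)     # named entities (people, places, ...)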
Hope you enjoyed reading it.
If you want to play around with it in Colab, check out the link below.