Processing Hindi text with SpaCy

rahul1990gupta

Rahul Gupta

Posted on August 21, 2020

Note: This post may be hard to follow for readers who don't know Hindi, so I have included English translations after the Hindi words.

Tons of resources are available for processing English text (and text in most languages written in the Roman script), but not so much for other languages. In this post, we will explore how we can use spaCy to process Hindi text.

Here we will use the spaCy module for processing and indic-nlp-datasets for getting the data. We will use the text of the novel Devdas by Sharat Chandra to demonstrate common NLP tasks.

Let's install these two libraries.

pip install spacy 
pip install indic-nlp-datasets

Now load the novel and split it into words.

from idatasets.devdas import load_devdas

devdas = load_devdas()
# devdas.data is a generator of paragraphs
paragraphs = list(devdas.data)
text = " ".join(paragraphs)
words = text.split(" ")

So, words holds a list of all the words in the novel. Let's count the most frequent ones.

from collections import Counter 
cnt = Counter(words)

cnt.most_common(10)
# prints
# [('के', 696), // of
#  ('ने', 676), // (ergative case marker)
#  ('नही', 672), // not
#  ('से', 626), // from
#  ('मे', 562), // in
#  ('की', 480), // of (feminine)
#  ('है', 444), // is
#  ('देवदास', 437), // Devdas
#  ('को', 336), // to
#  ('पार्वती', 332)] // Parvati

We see that the top words are not especially meaningful; they are mostly function words such as postpositions and auxiliaries. Let's use spaCy's Hindi stop word list to get rid of them.

from spacy.lang.hi import STOP_WORDS as STOP_WORDS_HI

# STOP_WORDS_HI is already a set, so there is no need to rebuild it for every word
non_stop_words = [word for word in words if word not in STOP_WORDS_HI]

non_stop_cnt = Counter(non_stop_words)

non_stop_cnt.most_common(10)

# prints 
# [('नही', 782), // not
#  ('देवदास', 472), // Devdas 
#  ('कहा-', 390), // said
#  ('पार्वती', 345), // Parvati
#  ('क्या', 237), // what 
#  ('दिन', 187), // day 
#  ('बात', 168),// Talk 
#  ('तुम', 168), // you
#  ('मै', 160), // I 
#  ('चन्द्रमुखी', 154)] // Chandramukhi

Now more interesting words appear among the most common ones. Three of these 10 words ('देवदास', 'पार्वती', 'चन्द्रमुखी') [Devdas, Parvati, Chandramukhi] correspond to the three main characters around whom the whole love-triangle story revolves.
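
One of the top tokens in the list, 'कहा-', still carries a trailing hyphen because text.split(" ") only splits on spaces. Here is a minimal sketch of a stricter tokenizer, assuming we only care about runs of Devanagari characters. The names tokenize_hindi and top_words and the tiny stop word set are made up for illustration; in the post's setting you would pass STOP_WORDS_HI instead.

```python
import re
from collections import Counter

# Devanagari block U+0900-U+097F, skipping the danda (U+0964) and
# double danda (U+0965), which are sentence punctuation.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def tokenize_hindi(text):
    """Return runs of Devanagari characters, dropping ASCII punctuation and dandas."""
    return DEVANAGARI_WORD.findall(text)


def top_words(text, stop_words, n=10):
    """Count tokens that are not stop words and return the n most common."""
    tokens = [t for t in tokenize_hindi(text) if t not in stop_words]
    return Counter(tokens).most_common(n)


# Hypothetical sample sentence and stop word set, for illustration only.
sample = "देवदास ने कहा- नही। पार्वती ने कहा- नही।"
print(tokenize_hindi(sample))
print(top_words(sample, stop_words={"ने"}, n=2))
```

With this tokenizer, 'कहा-' and 'कहा' are counted as the same word, so the counts will differ slightly from the ones printed above.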

Printing the most common words is great, but it isn't enough to justify a cushy data scientist job. :D So, let's make it prettier using WordCloud.

from wordcloud import WordCloud

import matplotlib.pyplot as plt

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This gives us the plot below.
[Image: word cloud output]
Wait, where have all the words gone?

After googling a bit, I found the GitHub issue below, which explains that we need a Devanagari font to render the image correctly.
https://github.com/amueller/word_cloud/issues/70
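
Before passing a font to WordCloud, it can help to confirm that a Devanagari font file is actually present on the machine. The sketch below is only an illustration: the helper find_font, the candidate file names, and the search directories (typical Linux locations) are all assumptions, so adjust them to your system.

```python
from pathlib import Path

# Hypothetical candidate file names; change these to fonts you have installed.
CANDIDATE_FONTS = (
    "gargi.ttf",
    "Lohit-Devanagari.ttf",
    "NotoSansDevanagari-Regular.ttf",
)


def find_font(search_dirs, candidates=CANDIDATE_FONTS):
    """Return the path of the first matching .ttf file found, or None."""
    wanted = {name.lower() for name in candidates}
    for directory in search_dirs:
        root = Path(directory).expanduser()
        if not root.is_dir():
            continue  # skip directories that do not exist
        for path in sorted(root.rglob("*.ttf")):
            if path.name.lower() in wanted:
                return str(path)
    return None


# Assumed font locations on a typical Linux system.
found = find_font(["/usr/share/fonts", "~/.fonts"])
print(found)
```

If this prints None, download a Devanagari font and point font_path at the file directly.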

So, we modify the code to pass a custom font file.


font="gargi.ttf"

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.
[Image: word cloud output]
You may notice that WordCloud now renders Hindi letters, but the cloud doesn't contain the most frequent words that we saw before. It also drops all the vowel signs ("मात्रा", matra). So, what's happening here?

The issue below explains that the "\w+" regex pattern doesn't work as expected for languages other than English. An easy workaround is to pass our own regex that matches all Hindi letters, including the vowel signs.
https://github.com/amueller/word_cloud/issues/272
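
To see the problem outside of WordCloud, here is a small sketch using only the standard library. It assumes the root cause is that Python's \w does not match combining marks (Unicode categories Mn and Mc), which is what the vowel signs are; the variable names are just for illustration.

```python
import re
import unicodedata

word = "देवदास"  # Devdas

# \w skips the vowel signs, so the word is cut at every matra.
word_tokens = re.findall(r"\w+", word)

# An explicit Devanagari range keeps the word in one piece.
range_tokens = re.findall(r"[\u0900-\u097F]+", word)

print(word_tokens)
print(range_tokens)

# Show the Unicode category of each code point: letters are Lo,
# vowel signs are Mn or Mc.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

The first list contains fragments without any vowel signs, which matches what we saw in the word cloud.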

So, let's fix that.


wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    regexp=r"[\u0900-\u097F]+", 
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.
[Image: word cloud output]

This looks alright. A few things to note here:

  • The names of all the prominent characters show up in the word cloud.
  • The word "नहीं" (not) appears a lot, which suggests that the characters are often not in agreement with each other.
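
An alternative that this post does not use is to skip WordCloud's own tokenizer and hand it the word counts directly through generate_from_frequencies. The sketch below assumes that approach; word_frequencies and render_cloud are hypothetical helper names, and the sample word list is made up for illustration.

```python
from collections import Counter


def word_frequencies(words, stop_words):
    """Count words, skipping stop words and empty strings."""
    return Counter(w for w in words if w and w not in stop_words)


def render_cloud(frequencies, font_path):
    """Build a word cloud from precomputed counts (needs the wordcloud package)."""
    # Imported here so the counting helper works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(
        width=400,
        height=300,
        max_font_size=50,
        max_words=1000,
        background_color="white",
        font_path=font_path,
    )
    return cloud.generate_from_frequencies(frequencies)


# Hypothetical sample; in the post this would be words and STOP_WORDS_HI.
freqs = word_frequencies(["देवदास", "ने", "", "देवदास", "पार्वती"], stop_words={"ने"})
print(freqs.most_common(2))
```

Because the counts are computed before WordCloud sees them, the regexp setting no longer matters for this route, though the Devanagari font is still needed for rendering.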

Next up, we will talk about how you can do other tasks, such as part-of-speech analysis and automatically finding the names of characters, cities, and organizations in a sentence.

Hope you enjoyed reading it.
If you want to play around with it in Colab, check out the link below.
Open In Colab
