Processing Hindi text with spaCy(2): Finding Synonyms

rahul1990gupta

Rahul Gupta

Posted on August 21, 2020


In this post, we will explore word embeddings and how we can use them to determine similarity between words, sentences and documents.

So, let's use spaCy to convert raw text into spaCy docs/tokens and look at the vector embeddings.

from spacy.lang.hi import Hindi 
nlp = Hindi()
sent1 = 'मुझे भोजन पसंद है।'
doc = nlp(sent1)
doc[0].vector
# array([], dtype=float32)


Oops! There is no vector for the token: the blank Hindi pipeline ships without word embeddings. Luckily, pre-trained Hindi word embeddings are available online from Facebook's fastText project, so we will download them and load them into spaCy.

import requests

# Download the compressed fastText vectors for Hindi (~1 GB)
url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz"
r = requests.get(url, allow_redirects=True)
fpath = url.split("/")[-1]  # cc.hi.300.vec.gz
with open(fpath, "wb") as fw:
    fw.write(r.content)

The word-vector file is about 1 GB in size, so it will take some time to download.
Let's see how we can use external word embeddings in spaCy. The spaCy documentation describes the process here: https://spacy.io/usage/vectors-similarity#converting
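Before converting, it helps to know what is inside the file. The fastText `.vec` format is plain text: the first line holds the vocabulary size and the vector dimension, and every following line holds a word followed by its vector components. Here is a minimal parsing sketch over a made-up two-word sample (the words and numbers are illustrative, not taken from the real file):

```python
import io
import numpy as np

# A tiny stand-in for the fastText .vec text format:
# first line is "<vocab_size> <dim>", then one word per line
# followed by its vector components, space-separated.
sample = io.StringIO(
    "2 3\n"
    "भोजन 0.1 0.2 0.3\n"
    "फल 0.4 0.5 0.6\n"
)

n_words, dim = map(int, sample.readline().split())
vectors = {}
for line in sample:
    parts = line.rstrip().split(" ")
    vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

print(n_words, dim)     # 2 3
print(vectors["भोजन"])  # [0.1 0.2 0.3]
```

The real file has the same shape, just with ~2 million rows of 300 dimensions each, which is why spaCy's conversion command below takes a while.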

Once the word vectors are downloaded, let's load them into a spaCy model from the command line:

python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz

Let's load the model in spaCy and put it to work:
import spacy

nlp_hi = spacy.load("./hi_vectors_wiki_lg")
doc = nlp_hi(sent1)
doc[0].vector

Now the vector is available to use in spaCy. Let's use these embeddings to determine the similarity of two very similar sentences.

sent2 = 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।'
doc1 = nlp_hi(sent1)
doc2 = nlp_hi(sent2)

# Both the sent1 and sent2 are very similar, so, we expect their similarity score to be high
doc1.similarity(doc2) # prints 0.86
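Under the hood, `Doc.similarity` computes the cosine similarity of the two doc vectors, where a doc vector is the average of its token vectors. A self-contained sketch with made-up 4-dimensional "token vectors" (the numbers are illustrative only):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product normalized by vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy token vectors for two short, near-identical sentences
doc1_vecs = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
doc2_vecs = np.array([[1.0, 0.0, 0.9, 0.1], [0.0, 1.0, 1.1, 0.0]])

# A doc vector is the average of its token vectors
sim = cosine(doc1_vecs.mean(axis=0), doc2_vecs.mean(axis=0))
print(sim)  # close to 1.0
```

This is why two sentences that share most of their vocabulary score high, even when their word order differs.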

Now, let's use these embeddings to find synonyms of a word.

def get_similar_words(word):
    vector = word.vector
    # most_similar returns a (keys, best_rows, scores) tuple;
    # by default only the single best match is returned
    keys, rows, scores = nlp_hi.vocab.vectors.most_similar(vector.reshape(1, 300))

    ret = []
    for key in keys[0]:
        try:
            ret.append(nlp_hi.vocab[key].text)
        except KeyError:
            pass
    return ret

get_similar_words(doc[1])  # prints ['भोजन']
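What `most_similar` does internally is essentially a brute-force nearest-neighbour search over the vector table. A self-contained sketch with a toy table (the words and vectors below are made up for illustration):

```python
import numpy as np

# Toy vector table standing in for nlp_hi.vocab.vectors
words = ["भोजन", "खाना", "फल", "कार"]
table = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.5, 0.4, 0.8],
    [0.0, 0.1, 1.0],
], dtype=np.float32)

def most_similar(query, n=2):
    # Cosine similarity of the query against every row of the table
    sims = table @ query / (np.linalg.norm(table, axis=1) * np.linalg.norm(query))
    best = np.argsort(-sims)[:n]
    return [words[i] for i in best]

most_similar(table[0])  # ['भोजन', 'खाना'] — the word itself ranks first
```

Note that the query word is always its own nearest neighbour, which explains the output above: with the default of one result, you only ever get the word back.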

That's not very useful: the only match returned is the word itself. Maybe the word vectors are too sparse, or trained on too small a vocabulary. Let's look into the NLTK library to see if we can use a Hindi WordNet to find similar words. However, the NLTK documentation mentions that it doesn't support Hindi yet. So, the search continues.

After a bit of googling, I found that a research group at IIT Bombay has been developing WordNets for Indian languages for quite a while.
Check out this link for more details.
They published a Python library, pyiwn, for easy access. It hasn't been merged into NLTK because the coverage of Hindi synsets isn't large enough yet.
With that, let's install the library:

pip install pyiwn 
import pyiwn 
iwn = pyiwn.IndoWordNet(lang=pyiwn.Language.HINDI)
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets

# [Synset('कच्चा.adjective.2283'),
# Synset('अधपका.adjective.2697'),
# Synset('आम.noun.3462'),
# Synset('आम.noun.3463'),
# Synset('सामान्य.adjective.3468'),
# Synset('सामूहिक.adjective.3469'),
# Synset('आँव.noun.6253'),
# Synset('आँव.noun.8446'),
# Synset('आम.adjective.39736')]

It's very interesting to see that the synsets cover both meanings of the word: the mango fruit and the adjective "common". Let's pick one synset and look at the synonyms it contains.

aam = aam_all_synsets[2]

# Let's look at the definition
aam.gloss()
# prints 'एक फल जो खाया या चूसा जाता है'

# This will print examples where the word is being used
aam.examples()
# ['तोता पेड़ पर बैठकर आम खा रहा है ।',
# 'शास्त्रों ने आम को इंद्रासनी फल की संज्ञा दी है ।']

# Now, let's look at the synonyms for the word 
aam.lemma_names()
# ['आम',
# 'आम्र',
# 'अंब',
# 'अम्ब',
# 'आँब',
# 'आंब',
# 'रसाल',
# 'च्यूत',
# 'प्रियांबु',
# 'प्रियाम्बु',
# 'केशवायुध',
# 'कामायुध',
# 'कामशर',
# 'कामांग']

Let's print some hyponyms of our synset.
A is a hyponym of B if A is a kind of B; for example, a pigeon is a bird, so "pigeon" is a hyponym of "bird".

iwn.synset_relation(aam, pyiwn.SynsetRelations.HYPONYMY)[:5]
# [Synset('सफेदा.noun.1294'),
# Synset('अंबिया.noun.2888'),
# Synset('सिंदूरिया.noun.8636'),
# Synset('जरदालू.noun.4724'),
# Synset('तोतापरी.noun.6892')]

Conclusion

Now that we have played around with a WordNet for a while, let's recap what a WordNet is. A WordNet stores the meanings of words along with the relationships between them. So, in a sense, WordNet = language dictionary + thesaurus + hierarchical IS-A relationships for nouns + more.

Note: If you want to play around with the notebooks, you can click the link below

Open word-embeddings-with-spacy in Colab

Open synonyms-with-pyiwn in Colab
