Processing Hindi text with spaCy(2): Finding Synonyms
Rahul Gupta
Posted on August 21, 2020
In this post, we will explore word embeddings and how we can use them to determine the similarity of words, sentences and documents.
So, let's use spaCy to convert raw text into spaCy docs/tokens and look at the vector embeddings.
from spacy.lang.hi import Hindi
nlp = Hindi()
sent1 = 'मुझे भोजन पसंद है।'
doc = nlp(sent1)
doc[0].vector
# array([], dtype=float32)
Oops! There is no vector for the token: the blank Hindi pipeline ships without word embeddings. Luckily, pretrained Hindi word embeddings are available online from Facebook's fastText project. So, we will download them and load them into spaCy.
import requests
url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz"
r = requests.get(url, allow_redirects=True)
fpath = url.split("/")[-1]
with open(fpath, "wb") as fw:
    fw.write(r.content)
The word-vector file is about 1 GB in size. So, it will take some time to download.
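Because the file is this large, reading the whole response into memory with `r.content` may be wasteful. Here is a minimal sketch of a streamed download that uses only the standard library. The helper name `download_to_file` and the 1 MB chunk size are assumptions made for illustration, not part of the original notebook.

```python
import shutil
import urllib.request


def download_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Copy the response body to dest_path in fixed-size chunks (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as fw:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so the full file is never held in memory at once.
        shutil.copyfileobj(response, fw, chunk_size)
    return dest_path
```

With the variables defined earlier, this could be called as `download_to_file(url, fpath)`.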
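Next, let's see how we can use external word embeddings in spaCy.
Here is a link to the spaCy documentation on how to do this: https://spacy.io/usage/vectors-similarity#converting
Once the word vectors are downloaded, let's load them into a spaCy model from the command line. Note that `init-model` is the spaCy v2 command; in spaCy v3 it was replaced by `spacy init vectors`.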
python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz
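It can help to peek at the file before converting it. The fastText `.vec` format is plain text: the first line holds the vocabulary size and the vector dimension, and every following line is a word followed by its space-separated values. The sketch below reads the header and the first few rows; the function name `read_vec_rows` and the `limit` parameter are hypothetical and only serve this illustration.

```python
import gzip


def read_vec_rows(path, limit=5):
    """Return (vocab_size, dim, rows) from a gzipped fastText .vec file."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # The first line holds the vocabulary size and the vector dimension.
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            if len(rows) >= limit:
                break
            parts = line.rstrip().split(" ")
            # The first field is the word, the rest are its vector components.
            rows.append((parts[0], [float(x) for x in parts[1:]]))
    return vocab_size, dim, rows
```

For example, `read_vec_rows("cc.hi.300.vec.gz", limit=3)` should report a dimension of 300 for the file downloaded above, assuming the download completed.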
Now let's load the model in spaCy and put it to work
import spacy
nlp_hi = spacy.load("./hi_vectors_wiki_lg")
doc = nlp_hi(sent1)
doc[0].vector
Now we see that a vector is available for the token. Let's use these embeddings to measure the similarity of two sentences that mean nearly the same thing
sent2 = 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।'
doc1 = nlp_hi(sent1)
doc2 = nlp_hi(sent2)
# Both the sent1 and sent2 are very similar, so, we expect their similarity score to be high
doc1.similarity(doc2) # prints 0.86
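By default, spaCy takes a document's vector to be the average of its token vectors and computes similarity as the cosine of the angle between two such vectors. The sketch below reproduces that idea in plain Python on made-up 3-dimensional vectors; the function names and the toy numbers are assumptions for illustration, not spaCy's actual implementation.

```python
import math


def average_vector(token_vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "token vectors" (made-up numbers, not real embeddings).
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentence_b = [[1.0, 1.0, 0.0]]
score = cosine_similarity(average_vector(sentence_a), average_vector(sentence_b))
# score is approximately 1.0 because both averages point in the same direction.
```

Since averaging discards word order, two sentences built from similar words can score high even when their meanings differ.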
Now, let's use these embeddings to find synonyms of a word.
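def get_similar_words(word, n=1):
    # most_similar returns a (keys, best_rows, scores) tuple with the
    # n nearest neighbours per query vector (n defaults to 1).
    vector = word.vector
    keys, _, _ = nlp_hi.vocab.vectors.most_similar(vector.reshape(1, 300), n=n)
    ret = []
    for key in keys[0]:
        try:
            # Look up the lexeme for each hash key to recover its text.
            candidate = nlp_hi.vocab[key]
            ret.append(candidate.text)
        except KeyError:
            pass
    return ret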
get_similar_words(doc[1]) # prints ['भोजन']
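That's not very useful: the only result is the query word itself.
One likely reason is that `most_similar` returns just the single nearest neighbour per query by default (`n=1`), and the nearest neighbour of a word's vector is that same word. It is also possible that the vectors loaded into the model cover only a limited vocabulary.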
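To see why a nearest-neighbour lookup tends to return the query word first, here is a toy sketch of ranking words by cosine similarity in plain Python. The vocabulary, the vectors and the `skip_self` option are all invented for illustration; this is not how spaCy implements `most_similar` internally.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar_words(query, vectors, n=3, skip_self=True):
    """Rank the words in `vectors` by cosine similarity to the word `query`."""
    q = vectors[query]
    scored = [
        (word, cosine(q, vec))
        for word, vec in vectors.items()
        if not (skip_self and word == query)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]


# Made-up 3-dimensional vectors for a tiny vocabulary.
toy_vectors = {
    "food": [0.9, 0.1, 0.0],
    "meal": [0.8, 0.2, 0.0],
    "mango": [0.5, 0.5, 0.1],
    "car": [0.0, 0.1, 0.9],
}
```

Asking for a single neighbour without skipping the query, `most_similar_words("food", toy_vectors, n=1, skip_self=False)`, gives back only `"food"`. Requesting more neighbours and dropping the query word is what surfaces candidates such as `"meal"`.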
Let's look into the NLTK library to see if we can use a Hindi WordNet to find words similar to a given word. However, the NLTK documentation mentions that it doesn't support the hin language (Hindi) yet. So, the search continues.
After a bit of googling, I found out that a research group at IITB has been developing WordNet for Indian languages for quite a while. See the IndoWordNet project for more details. They published a Python library, pyiwn, for easy accessibility. It hasn't been integrated into NLTK yet because the coverage of Hindi synsets isn't sufficient so far.
With that, let's install the library
pip install pyiwn
import pyiwn
iwn = pyiwn.IndoWordNet(lang=pyiwn.Language.HINDI)
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets
# [Synset('कच्चा.adjective.2283'),
# Synset('अधपका.adjective.2697'),
# Synset('आम.noun.3462'),
# Synset('आम.noun.3463'),
# Synset('सामान्य.adjective.3468'),
# Synset('सामूहिक.adjective.3469'),
# Synset('आँव.noun.6253'),
# Synset('आँव.noun.8446'),
# Synset('आम.adjective.39736')]
It's interesting to see that the synsets for the word cover both of its meanings: mango and common. Let's pick one synset and look at the different synonyms in it
aam = aam_all_synsets[2]
# Let's look at the definition
aam.gloss()
# prints 'एक फल जो खाया या चूसा जाता है'
# This will print examples where the word is being used
aam.examples()
# ['तोता पेड़ पर बैठकर आम खा रहा है ।',
# 'शास्त्रों ने आम को इंद्रासनी फल की संज्ञा दी है ।']
# Now, let's look at the synonyms for the word
aam.lemma_names()
# ['आम',
# 'आम्र',
# 'अंब',
# 'अम्ब',
# 'आँब',
# 'आंब',
# 'रसाल',
# 'च्यूत',
# 'प्रियांबु',
# 'प्रियाम्बु',
# 'केशवायुध',
# 'कामायुध',
# 'कामशर',
# 'कामांग']
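Since a word like 'आम' has several senses, it can be handy to list the synonyms of every sense side by side. The helper below is a small sketch that relies only on the `gloss()` and `lemma_names()` methods used above; the name `synonyms_by_sense` is hypothetical and not part of pyiwn.

```python
def synonyms_by_sense(synsets, word):
    """Return a list of (gloss, synonyms) pairs, one per synset, excluding `word` itself."""
    senses = []
    for synset in synsets:
        # Keep every lemma of this sense except the word we searched for.
        names = [name for name in synset.lemma_names() if name != word]
        senses.append((synset.gloss(), names))
    return senses
```

Assuming the `iwn` object created earlier, `synonyms_by_sense(iwn.synsets('आम'), 'आम')` would pair each definition with its own synonyms, which keeps the mango sense and the "common" sense apart.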
Let's print some hyponyms for our synset.
A is a hyponym of B if A is a type of B. For example, a pigeon is a bird, so pigeon is a hyponym of bird.
iwn.synset_relation(aam, pyiwn.SynsetRelations.HYPONYMY)[:5]
# [Synset('सफेदा.noun.1294'),
# Synset('अंबिया.noun.2888'),
# Synset('सिंदूरिया.noun.8636'),
# Synset('जरदालू.noun.4724'),
# Synset('तोतापरी.noun.6892')]
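Hyponymy is transitive: if totapuri is a kind of mango and mango is a kind of fruit, then totapuri is also a kind of fruit. The sketch below walks such an IS-A hierarchy breadth-first over a hand-made dictionary. The data and the function name `all_hyponyms` are made up for illustration and do not use pyiwn.

```python
from collections import deque

# Toy IS-A data (invented for illustration): term -> direct hyponyms.
HYPONYMS = {
    "fruit": ["mango", "apple"],
    "mango": ["totapuri", "safeda"],
    "apple": [],
}


def all_hyponyms(term, hyponyms):
    """Breadth-first walk collecting direct and indirect hyponyms of `term`."""
    seen = set()
    order = []
    queue = deque(hyponyms.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Terms without an entry are treated as leaves.
        queue.extend(hyponyms.get(current, []))
    return order
```

Here `all_hyponyms("fruit", HYPONYMS)` visits the direct hyponyms first and then the mango varieties one level down.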
Conclusion
Now that we have played around with WordNet for a while, let's recap what a WordNet is. WordNet aims to store the meanings of words along with the relationships between them. So, in a sense, WordNet = Language Dictionary + Thesaurus + Hierarchical IS-A relationships for nouns + More.
Note: If you want to play around with the notebooks, you can click the link below