Word Embeddings in NLP
Ravi
Posted on September 7, 2024
What are word embeddings?
Word embeddings are dense vector representations of words in a continuous vector space.
A key feature of word embeddings is that they capture the meaning of words and the relationships between them: words with similar meanings or relationships end up close together in the vector space. This is what enables mathematical operations on words.
Word embeddings allow us to perform mathematical operations on words, such as adding and subtracting vectors to find analogies, e.g., king - man + woman ≈ queen.
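Here is a minimal sketch of both ideas, assuming gensim and its downloader are available; "glove-wiki-gigaword-50" is one of the pretrained vector sets gensim can fetch, and the exact scores and neighbours depend on which vectors you load.

```python
# A minimal sketch: similarity and analogy arithmetic with pretrained vectors.
import gensim.downloader as api

# Downloads (once) and loads 50-dimensional GloVe vectors trained on Wikipedia and Gigaword.
vectors = api.load("glove-wiki-gigaword-50")

# Words used in similar contexts sit close together in the vector space.
print(vectors.similarity("king", "queen"))    # relatively high cosine similarity
print(vectors.similarity("king", "banana"))   # much lower similarity

# Vector arithmetic for analogies: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```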
Popular Word Embedding Methods:
Word2Vec:
Two main architectures:
- Continuous Bag-of-Words (CBOW)
- Skip-gram
CBOW predicts a word based on its surrounding context, while skip-gram predicts the context words given a target word. Both architectures learn vector representations of words by looking at how each word is used alongside other words.
CBOW vs. Skip-gram
Let's start by understanding what each of these is really doing. In the continuous bag-of-words model, we predict the target word from the words surrounding it; that is the main idea behind CBOW.
In skip-gram, it is the other way around: we predict the surrounding context words given the target word.
Let's consider an example of each and see how it works.
Start with continuous bag of words. As mentioned, CBOW predicts a word from the given context. For a given text, we slide a context window over it: the context words surrounding a target word form the input, the hidden layer averages the embeddings of those context words, and the output layer predicts the target word from that averaged embedding. To build the intuition, let's look at a concrete example.
Suppose the text is "the cat sat on the mat" and "on" is the target word. With a context window of size four (two words on each side), we take two steps to the left and two steps to the right, that is, two words before and two words after the target. The context words are therefore "cat", "sat", "the", "mat", and the word we want to predict is "on". So the training pair has "cat sat the mat" as the input and "on" as the output. We create such (context, target) pairs across the text and train the model on them. Once training is complete, the weights of the hidden layer become the word embedding representations of the words. That's the idea behind the continuous bag-of-words approach.
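As an illustration of how such (context, target) pairs can be generated, here is a minimal sketch (not the author's code); the sentence and window size match the example above.

```python
# A minimal sketch: build CBOW-style (context, target) training pairs
# with a window of two words on each side of the target.
sentence = "the cat sat on the mat".split()
window = 2

pairs = []  # (context_words, target_word)
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# e.g. ['cat', 'sat', 'the', 'mat'] -> on
```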
Now consider the other approach, skip-gram. In skip-gram, the input is a single target word, the hidden layer holds the vector representation of that target word, and the output is a prediction of the context words surrounding it. The intuition: take the same text, "the cat sat on the mat". Given "sat", we want to predict "cat"; given "sat", we also want to predict "the", "on", and so on for the other context words.
To get a better understanding, consider preparing training samples from "the quick brown fox jumps over the lazy dog" with "fox" as the target word: the pairs would be (fox, quick), (fox, brown), (fox, jumps), and (fox, over). For each pair, "fox" is the input, it passes through the hidden layer, and the output layer is trained to predict the corresponding context word, for example "quick". We do this for every training pair we generate, and at the end the learned weights of the hidden layer are the word embeddings of the target words. That is the skip-gram approach.
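Again as an illustration (not the author's code), here is a minimal sketch of generating skip-gram (target, context) pairs, assuming the same window of two words on each side.

```python
# A minimal sketch: build skip-gram (target, context) training pairs
# with a window of two words on each side of the target.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []  # (target_word, context_word)
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# The pairs centred on "fox":
print([p for p in pairs if p[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```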
Now that you have seen both methods, let's take a look at the differences between the continuous bag-of-words and skip-gram approaches. In practice, CBOW is faster to train and represents frequent words well, while skip-gram is slower but tends to handle rare words better and works well on smaller datasets. Most libraries let you switch between the two with a single parameter, as sketched below.
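For example, in gensim's Word2Vec implementation the choice between CBOW and skip-gram is just the sg parameter; the toy corpus and hyperparameters below are illustrative only.

```python
# A minimal sketch: training Word2Vec in both modes with gensim.
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat".split(),
    "the quick brown fox jumps over the lazy dog".split(),
]

# sg=0 selects CBOW, sg=1 selects skip-gram.
cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skip_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# The learned hidden-layer weights are the word embeddings.
print(cbow_model.wv["cat"].shape)               # (50,)
print(skip_model.wv.most_similar("cat", topn=3))
```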
GloVe (Global Vectors for Word Representation)
Combines global matrix factorization and local context window methods.
Captures both global statistics and local co-occurrence information.
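If you want to try GloVe vectors directly, here is a minimal sketch; it assumes you have downloaded and unzipped glove.6B.zip from the Stanford NLP site so that glove.6B.100d.txt is on disk, and that you are on gensim 4.x, where no_header=True reads GloVe's headerless text format.

```python
# A minimal sketch: loading pretrained GloVe vectors into gensim.
from gensim.models import KeyedVectors

glove = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt", binary=False, no_header=True
)

# GloVe vectors reflect global co-occurrence statistics as well as local context.
print(glove.most_similar("ice", topn=5))
print(glove.similarity("ice", "steam"))
```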
fastText:
Extension of Word2Vec that takes into account subword information (character n-grams)
Handles out-of-vocabulary (OOV) words better.
fastText handles out-of-vocabulary words better because each word's vector is built up from its character n-grams, so even a word that never appeared during training can be given a meaningful vector from the subwords it shares with known words. A minimal sketch of this follows.
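The sketch below uses gensim's FastText; the toy corpus, the parameters, and the made-up word "embeddingz" are illustrative assumptions.

```python
# A minimal sketch: fastText builds vectors for unseen words from character n-grams.
from gensim.models import FastText

corpus = [
    "word embeddings capture meaning".split(),
    "fasttext builds vectors from character ngrams".split(),
]

model = FastText(corpus, vector_size=50, window=3, min_count=1, epochs=50)

# "embeddingz" never appeared in the corpus, yet fastText can still compose a
# vector for it from the character n-grams it shares with seen words.
print("embeddingz" in model.wv.key_to_index)   # False: out of vocabulary
print(model.wv["embeddingz"].shape)            # (50,) vector built from n-grams
```

With the main embedding methods covered, let's look at how word embeddings are actually used in NLP.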
Using Word Embeddings in NLP
1. Pre-trained embeddings: Start with embeddings pre-trained on a large corpus (e.g., Word2Vec or GloVe trained on Wikipedia).
2. Input features: Use the word embeddings as input features for your NLP model.
3. Fine-tuning (optional): Optionally, fine-tune the embeddings on your specific task and dataset for better performance.
4. Downstream tasks: Train your NLP model (e.g., a text classifier or machine translation system) on top of the word embeddings. A small end-to-end sketch follows this list.
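Here is a minimal sketch of steps 1, 2, and 4 together: pretrained vectors, averaged into sentence features, feeding a simple classifier. The model name, the tiny sentiment dataset, and the averaging strategy are illustrative assumptions rather than a recommended pipeline.

```python
# A minimal sketch: pretrained word embeddings as input features for a classifier.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

vectors = api.load("glove-wiki-gigaword-50")   # step 1: pre-trained embeddings

def embed(text):
    # Step 2: average the embeddings of in-vocabulary words into one feature vector.
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

texts = ["this movie was wonderful", "absolutely terrible film",
         "a delightful experience", "boring and bad"]
labels = [1, 0, 1, 0]                          # 1 = positive, 0 = negative

X = np.stack([embed(t) for t in texts])
clf = LogisticRegression().fit(X, labels)      # step 4: downstream task

print(clf.predict([embed("what a wonderful film")]))   # expected: [1]
```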
Visualization of Word Embeddings:
In such a plot you can see that words appearing in similar contexts, for example computer and chair, sit close together in the vector space. Likewise, mother and father lie near each other, much as king and queen do. Words that are used in similar contexts end up with similar vector representations; that is the core idea behind word embeddings.
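To produce a plot like this yourself, here is a minimal sketch that projects a few pretrained GloVe vectors down to two dimensions with PCA; the word list is an illustrative choice and the exact layout will vary with the vectors used.

```python
# A minimal sketch: visualizing word embeddings in 2-D with PCA.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
words = ["king", "queen", "man", "woman", "mother", "father", "computer", "chair"]

# Reduce the 50-dimensional vectors to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2-D with PCA")
plt.show()
```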
Now you might be wondering: why do we need word embeddings, and what is so special about them?
Semantic relationships: They capture the semantic relationships between words, which improves the overall performance of NLP models.
Reduced dimensionality: Compared to one-hot encoding, word embeddings significantly reduce the dimensionality of word representations (a vocabulary of tens of thousands of words needs just as many one-hot dimensions, versus a few hundred dimensions for embeddings).
Transfer learning: Pre-trained word embeddings can be used to boost performance on tasks with limited data.