Cracking the Code: Understanding and Developing the NLP Core of Contexto.me Using GloVe Technique

Esteve Segura · Posted on November 8, 2023

What is the Contexto.me game?

Contexto.me gameplay

Contexto.me is a compelling linguistic game, inspired by Semantle.com, that harnesses the power of semantic distances in language. The objective is simple yet captivating: players must discern a hidden word, with the game providing feedback on the ‘distance’ between each word they enter and the target answer.

This ‘distance’ refers not to any physical measurement, but to the semantic gap between words, as determined by their use and relatedness in natural language.

How Machines Learn the Semantic Distance Between Words: A Simple Explanation

Imagine teaching a computer to understand language like we do. To make this possible, the computer needs to see words not as individual letters or sounds, but as points in a vast space based on their meanings and uses. This is what we call ‘semantic distance’: words used similarly are closer, unrelated words are farther apart.

First, the machine scans a massive amount of text. Like a detective, it notices **how often and where each word appears**. This part of the process is what the animation below illustrates.

Next, the machine starts placing these words in the ‘semantic space’. **At first, words are scattered randomly; as the machine learns from the text, it moves similar words closer together.** It is like a dance of words settling into their right places, which the animation also shows.

The magic behind all this? A technique called GloVe (Global Vectors for Word Representation). GloVe observes how words interact and uses this to create the semantic space. This is how Contexto.me can tell you how ‘close’ your word is to the target word, making each game a journey through language.

GloVe learning and scanning words, while grouping them by repetition in the corpus

Transforming Words into Vectors: Mathematical Operations Made Possible

Wait, are you telling me that I can add words? Or multiply words? … Yes.

One of the fascinating aspects of converting words into vectors, or numerical representations, is that we can perform mathematical operations on them, much like we would with traditional numbers. Let’s explore this using familiar examples.

Consider the words ‘woman’ and ‘crown’. In our semantic space, each word is represented as a vector — a point with a specific direction. When we add the vectors for ‘woman’ and ‘crown’, the result aligns closely with the vector for ‘queen’. This is because, in our language usage, the combination of a ‘woman’ with a ‘crown’ often relates to the concept of a ‘queen’.

Taking it one step further, if we add ‘queen’ and ‘land’, we move towards the vector for ‘kingdom’. This is because a ‘queen’ associated with ‘land’ frequently signifies a ‘kingdom’.

An example of how word vectors are added together
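To see this in code, here is a minimal sketch, assuming the `embeddings`, `word2id`, and `id2word` structures built by the script later in this post; the `closest_word` helper is hypothetical, written for illustration only.

import numpy as np

def closest_word(vector, embeddings, id2word, exclude=()):
    # Hypothetical helper: return the vocabulary word whose vector points
    # most nearly in the same direction as `vector` (highest cosine score),
    # skipping the input words themselves.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vector)
    scores = embeddings @ vector / np.where(norms == 0, 1e-9, norms)
    for idx in np.argsort(-scores):
        if id2word[idx] not in exclude:
            return id2word[idx]

# With a rich enough corpus, adding 'woman' and 'crown' should land near 'queen':
# closest_word(embeddings[word2id['woman']] + embeddings[word2id['crown']],
#              embeddings, id2word, exclude={'woman', 'crown'})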

In Contexto.me, these concepts form the foundation for gameplay. The primary operation it performs isn’t addition, but distance calculation. It assesses the ‘distance’ or difference between the vector for the player’s input word and the target word. By understanding these distances, players can navigate the semantic space, using their linguistic knowledge and insights to reach the target word.

Calculating the distances of several words from the target word QUEEN
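The following sketch shows how such a ranking could be computed with the `embeddings` and `word2id` built later in this post; it is an illustration of the idea, not Contexto.me’s actual implementation.

from sklearn.metrics.pairwise import cosine_similarity

def rank_guesses(target, guesses, embeddings, word2id):
    # Score every guess by its cosine similarity to the target word and
    # return the guesses ordered from semantically closest to farthest.
    target_vec = embeddings[word2id[target]].reshape(1, -1)
    scored = [(guess, cosine_similarity(
                  target_vec, embeddings[word2id[guess]].reshape(1, -1))[0][0])
              for guess in guesses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# rank_guesses('queen', ['king', 'crown', 'banana'], embeddings, word2id)
# might return [('king', 0.91), ('crown', 0.83), ('banana', 0.12)] (illustrative numbers)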

The Magic of GloVe: A Friendly Guide to Understanding Its Python Implementation

The Python script shared here illustrates how the GloVe (Global Vectors for Word Representation) technique can be implemented. Let’s dissect it into key parts to understand the entire process better.

import os
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

The script begins by importing the necessary Python libraries. These include ‘os’ for handling file paths, ‘numpy’ for numerical operations, ‘lil_matrix’ from ‘scipy.sparse’ for creating a matrix, ‘TruncatedSVD’ from ‘sklearn.decomposition’ for singular value decomposition (SVD), and ‘cosine_similarity’ from ‘sklearn.metrics.pairwise’ to calculate the similarity between word vectors.

def create_co_occurrence_matrix(corpus, window_size=4):

This function constructs a co-occurrence matrix, which is essential in GloVe. It records how often each word (row) occurs with every other word (column). The ‘window_size’ parameter sets the number of words to the left and right of a given word considered as its context.
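To make the counting concrete, here is a toy run of the same logic on a six-word corpus, with `window_size=1` for readability:

from collections import Counter

toy_corpus = "the cat sat on the mat".split()
window_size = 1  # one word of context on each side

pair_counts = Counter()
for i in range(len(toy_corpus)):
    # Look at every word within the window around position i
    for j in range(max(0, i - window_size), min(len(toy_corpus), i + window_size + 1)):
        if i != j:
            pair_counts[(toy_corpus[i], toy_corpus[j])] += 1

print(pair_counts[('the', 'cat')])  # 1: 'cat' appears once right next to 'the'
print(pair_counts[('sat', 'cat')])  # 1: 'cat' is one position to the left of 'sat'

The full function below does exactly this, but stores the counts in a sparse matrix indexed by word IDs.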

def perform_svd(matrix, n_components=300):

This function applies Singular Value Decomposition (SVD) on the co-occurrence matrix to reduce its dimensionality while preserving its most important semantic features. The ‘n_components’ parameter sets the number of dimensions for the output vectors.
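As a quick standalone illustration of this step (a random 10×10 matrix stands in for a real co-occurrence matrix):

import numpy as np
from sklearn.decomposition import TruncatedSVD

matrix = np.random.rand(10, 10)  # stand-in for a tiny co-occurrence matrix

# Compress each row (one word) from 10 raw context counts down to 3 dimensions
svd = TruncatedSVD(n_components=3)
vectors = svd.fit_transform(matrix)
print(vectors.shape)  # (10, 3): 10 'words', each now a dense 3-dimensional vector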

def create_word_embeddings(corpus):

This function calls the previous two functions to create the word embeddings. It generates the co-occurrence matrix and then applies SVD to create the final word vectors or ‘embeddings’.

def get_word_similarity(embeddings, word2id, word1, word2):

Lastly, this function calculates the cosine similarity between any two word vectors. It provides a measure of how semantically similar the words are: 1 means the vectors point in the same direction (maximally similar), values near 0 indicate little semantic relatedness, and negative values are possible when vectors point in opposing directions.

similarity = get_word_similarity(embeddings, word2id, 'sun', 'sky')
print(f"The distance between the two words is: {similarity}")

# The distance between the two words is: 0.9828447113750172

Finally, the script calculates and prints the cosine similarity between ‘sun’ and ‘sky’ (labelled a ‘distance’ in the output), giving a glimpse of the calculation Contexto.me performs on every guess.
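If you want to peek under the hood, the cosine similarity that sklearn computes is just the dot product of the two vectors divided by the product of their lengths. A minimal numpy version:

import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors: 1.0 means they point in the
    # same direction, 0.0 means they are orthogonal (unrelated).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0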

Full implementation of the code

# Import the necessary libraries:
import os  # For reading files and managing paths
import numpy as np  # For performing mathematical operations
from scipy.sparse import lil_matrix  # For handling sparse matrices
from sklearn.decomposition import TruncatedSVD  # For Singular Value Decomposition (SVD)
from sklearn.metrics.pairwise import cosine_similarity  # For calculating cosine similarity between vectors

# Define the path to the corpus folder and obtain the list of text files
corpus_folder = "./corpus"
file_names = [f for f in os.listdir(corpus_folder) if f.endswith(".txt")]

# Initialize an empty list to store the words from the corpus
corpus = []

# Read each text file in the corpus folder and append the words to the corpus list
for file_name in file_names:
    file_path = os.path.join(corpus_folder, file_name)
    with open(file_path, "r", encoding="utf-8") as corpus_file:
        for line in corpus_file:
            corpus.extend(line.strip().split())

# Function to create a co-occurrence matrix from the corpus with a given window size
def create_co_occurrence_matrix(corpus, window_size=4):
    vocab = set(corpus)  # Create a set of unique words in the corpus
    word2id = {word: i for i, word in enumerate(vocab)}  # Create a word-to-index dictionary for the words
    id2word = {i: word for i, word in enumerate(vocab)}  # Create an index-to-word dictionary for the words
    matrix = lil_matrix((len(vocab), len(vocab)))  # Initialize an empty sparse matrix of size len(vocab) x len(vocab)

    # Iterate through the corpus to fill the co-occurrence matrix
    for i in range(len(corpus)):
        for j in range(max(0, i - window_size), min(len(corpus), i + window_size + 1)):
            if i != j:
                matrix[word2id[corpus[i]], word2id[corpus[j]]] += 1

    return matrix, word2id, id2word

# Function to perform SVD on the co-occurrence matrix and reduce the dimensionality
def perform_svd(matrix, n_components=300):
    n_components = min(n_components, matrix.shape[1] - 1)
    svd = TruncatedSVD(n_components=n_components)
    return svd.fit_transform(matrix)

# Function to create word embeddings from the corpus using the co-occurrence matrix and SVD
def create_word_embeddings(corpus):
    matrix, word2id, id2word = create_co_occurrence_matrix(corpus)  # Create the co-occurrence matrix
    word_embeddings = perform_svd(matrix)  # Perform SVD on the matrix
    return word_embeddings, word2id, id2word

# Create the word embeddings from the given corpus
embeddings, word2id, id2word = create_word_embeddings(corpus)

# Function to calculate the cosine similarity between two word vectors
def get_word_similarity(embeddings, word2id, word1, word2):
    word1_vector = embeddings[word2id[word1]]  # Get the vector representation of word1
    word2_vector = embeddings[word2id[word2]]  # Get the vector representation of word2

    # Compute the cosine similarity between the two vectors
    similarity = cosine_similarity(word1_vector.reshape(1, -1), word2_vector.reshape(1, -1))

    return similarity[0][0]

# Example usage: Calculate the similarity between the word embeddings for 'sun' and 'sky'
similarity = get_word_similarity(embeddings, word2id, 'sun', 'sky')
print(f"The distance between the two words is: {similarity}")

If you don’t want to run the code on your machine, run it in Google Colab.
You can also clone it from GitHub (corpus included).

The Importance of Corpus in GloVe Technique

The lifeblood of any Natural Language Processing (NLP) technique, including GloVe, is a ‘corpus’. A corpus is a large and structured set of texts that the algorithm learns from. Just as humans learn language by reading, listening, and understanding context, machines need a corpus to learn the semantic relationships between words.

In the realm of GloVe, the corpus plays a pivotal role. The algorithm scrutinizes the corpus to determine how often each pair of words co-occurs within a certain context window. From this, GloVe constructs a co-occurrence matrix that serves as the foundation for generating word vectors. Essentially, the quality and diversity of the corpus directly influence the ability of GloVe to capture and quantify semantic meanings accurately.

Where can you find corpora for your own NLP projects? Fortunately, numerous resources are available online. Here are a few (a short loading sketch follows the list):

  1. Project Gutenberg: Offers over 60,000 free eBooks, primarily from the public domain. It’s an excellent resource for historical and classic texts.
  2. Wikipedia dump: The entire text of English Wikipedia is available for download, providing a vast and diverse language resource.
  3. The Brown Corpus: Compiled at Brown University, this corpus contains 500 samples of English-language text, totaling roughly one million words.
  4. The Reuters Corpus: Contains 10,788 news documents totaling 1.3 million words. It’s specifically useful for applications like news article classification.
  5. Common Crawl: An open repository of web crawl data that can be accessed and analyzed by everyone.
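As one concrete example, the Brown Corpus from the list above ships with the NLTK library; this short sketch loads it into the same flat list of words that the script above expects:

import nltk
nltk.download('brown')  # one-time download of the Brown Corpus

from nltk.corpus import brown

# Flatten the corpus into a plain list of lowercase words
corpus = [word.lower() for word in brown.words()]
print(len(corpus))  # roughly 1.16 million words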

One last detail to understand GloVe

A larger, diverse corpus leads to more accurate results in word similarity and semantic distance measurements. This is due to the wider range of word contexts it provides for the learning algorithm. Hence, a substantial corpus is essential for optimal outcomes.
