Algorithm explained: Text similarity using a vector space model
Pascal Thormeier
Posted on November 16, 2020
Part 3 of Algorithms explained! Every few weeks I write about an algorithm and explain and implement it!
Is there an algorithm you always wanted to know about? Leave a comment!
During my studies I attended a course called "NLP" where we learned different techniques of natural language processing. One exceptionally nifty technique I learned about was text similarity using a vector space model. I was amazed by how little it actually takes to be able to compare two or more texts.
In this post I'll explain how text similarity can be measured with a vector space model and do a practical example in PHP!
Getting and preprocessing data
We'll use a Wikipedia dump for this. I downloaded the contents of the Wikipedia articles for Albert Einstein, Marie Curie (both scientists), Ludwig van Beethoven and Florence Price (both composers).
The hypothesis is that the vector space model should yield a higher similarity value for the two scientists and the two composers than for a scientist and a composer.
First, we need to load and preprocess those articles a bit:
<?php
/**
* Removes clutter from a text.
*
* @param string $text
* @return string
*/
function preprocess(string $text): string {
// Remove any citation marks.
$text = preg_replace('/\[\d+]/', '', $text);
// Remove any phonetics (text between slashes). The non-greedy quantifier
// makes sure we don't delete everything between the first and last slash.
$text = preg_replace('/\/.*?\//', '', $text);
return $text;
}
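// Quick sanity check with a made-up snippet (the input is illustrative, not
// from the actual Wikipedia dumps): prints "Einstein  was a physicist."
// (note the leftover double space).
echo preprocess('Einstein[1] /ˈaɪnstaɪn/ was a physicist.') . "\n";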
/**
* Reads out an array of texts from a given directory.
*
* @param string $dir
* @return array
*/
function getTexts(string $dir): array {
$texts = [];
foreach (glob($dir . '/*.txt') as $file) {
$texts[] = preprocess(file_get_contents($file));
}
return $texts;
}
$texts = getTexts('./texts');
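For reference, this expects a directory next to the script with one plain text file per article. The file names below are only my guess, the article doesn't name them:

./texts/
  einstein.txt
  curie.txt
  price.txt
  beethoven.txt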
Since a raw string by itself isn't very useful for comparison, we'll create arrays out of all of those texts and filter them down to the relevant information.
Splitting it into tokens
Next, we need to tokenize those texts. This means that we create an array of tokens (words) for each text. We get rid of any punctuation and special characters by only keeping the characters a to z, umlauts, numbers and apostrophes; everything else acts as a separator between tokens. We lowercase the text first, though, to make further processing simpler.
<?php
// ...
/**
* Splits a given text into words.
*
* @param string $text
* @return array
*/
function tokenize(string $text): array {
$text = strtolower($text);
preg_match_all('/[a-z0-9äöü\']+/', $text, $tokens);
return $tokens[0];
}
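A quick sanity check with a hypothetical sentence shows what the tokenizer produces:

<?php
// ...
// Yields ["it's", "the", "year", "1879"]
var_dump(tokenize("It's the year 1879!"));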
Stopwords - what to ignore
The documents we downloaded are huge. The vector space we're going to create will have several thousand dimensions, which gets impractical, especially for debugging. In order to reduce the number of dimensions, we need to filter out stopwords.
Python's NLTK has an extensive set of corpora we can use for all kinds of things. Number 70 is a "Stopword" corpus that includes a list of common words in many different languages that carry little to no semantic value ("the", "it", "been", "because", "both", etc.).
We download the ZIP file and extract the English list of stopwords into the same directory as the script. Then we load the stopwords and make an array out of them. We'll also define a function to delete stopwords from a given array of tokens:
<?php
// ...
$stopwords = explode("\n", trim(file_get_contents('./stopwords_english.txt')));
/**
* Deletes all stopwords from a list of text tokens.
*
* @param array $textTokens
* @param array $stopwords
* @return array
*/
function deleteStopwords(array $textTokens, array $stopwords): array {
return array_values(array_diff($textTokens, $stopwords));
}
This makes sure that we only ever look at the relevant pieces of information by reducing the noise created by these stopwords.
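Applied to a hypothetical token list, it looks like this:

<?php
// ...
// Yields ["year", "1879", "great"], assuming "the" and "was"
// are in the stopword list.
var_dump(deleteStopwords(['the', 'year', '1879', 'was', 'great'], $stopwords));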
So far so good. Now what?
How to actually compare texts with a vector space model
The vector space model for text similarity is pretty straightforward: it creates a vector space in which each dimension represents a single word. The words are taken from all texts that are considered.
Each document is a single vector within the vector space. Each dimension of a document vector holds how often the corresponding word appears in that text.
To compare two documents, the cosine similarity is used. Since word counts can't be negative, this yields a value between 0 and 1, where 0 means no similarity and 1 means a perfect match.
Let's make a quick example: We want to compare these four texts:
1. "Hello World"
2. "Hello Code World"
3. "Hello Code"
4. "Code World Hello Code"
We first need to figure out the vector space. Quickly glancing over these texts yields three distinct words:
Hello
World
Code
This is our vector space - 3 dimensions. Now we can create vectors for this space out of our four texts:
1. "Hello World" -> (1, 1, 0)
2. "Hello Code World" -> (1, 1, 1)
3. "Hello Code" -> (1, 0, 1)
4. "Code World Hello Code" -> (1, 1, 2)
Since we're in a 3D vector space, we could even plot these four vectors as arrows pointing from the origin.
Now, with the cosine similarity, we can calculate the similarity between any two texts:

\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}

where \theta is the angle between the two vectors, A_i is the number of occurrences of the word at place i in the vector space in text A (B_i analogously for text B), and n is the number of dimensions.

Let's say we want to find the text most similar to 1. "Hello World". We can calculate its norm right away: that's \lVert A \rVert = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}.

Now, let's calculate the cosine similarity for the first and the second text (1. "Hello World"/(1, 1, 0) and 2. "Hello Code World"/(1, 1, 1)):

\cos(\theta) = \frac{1 \cdot 1 + 1 \cdot 1 + 0 \cdot 1}{\sqrt{2} \cdot \sqrt{3}} = \frac{2}{\sqrt{6}} \approx 0.816

For the third and the fourth text we get values of \frac{1}{\sqrt{2} \cdot \sqrt{2}} = 0.5 and \frac{2}{\sqrt{2} \cdot \sqrt{6}} \approx 0.577, respectively.
So the most similar text to "Hello World" is "Hello Code World", followed by "Code World Hello Code" and "Hello Code."
Let's program this.
Building the vector space
The previous steps were necessary to reduce the number of dimensions of our vector space. To create a usable vector space model, we start with a zero vector:
<?php
// ...
/**
* Creates a zero vector of all words in all tokenized texts.
*
* @param array $tokenizedTexts
* @return array
*/
function createZeroVector(array $tokenizedTexts): array {
$keys = array_unique(array_merge(...$tokenizedTexts));
return array_fill_keys($keys, 0);
}
This creates an associative array with each word of every text as keys and 0 as their values.
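With a couple of made-up token lists, this would produce:

<?php
// ...
// Yields ['hello' => 0, 'world' => 0, 'code' => 0]
var_dump(createZeroVector([['hello', 'world'], ['hello', 'code']]));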
Now we can create a vector for each text:
<?php
// ...
/**
* Creates a vector for a given tokenized text.
*
* @param array $tokenizedText
* @param array $zeroVector
* @return array
*/
function createTextVector(array $tokenizedText, array $zeroVector): array {
// array_replace keeps the zero vector's key order and, unlike array_merge,
// doesn't renumber numeric keys (which purely numeric tokens such as "1879"
// would otherwise cause, misaligning the vectors).
return array_replace(
$zeroVector,
array_count_values($tokenizedText)
);
}
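Using the toy example from before together with a matching zero vector, the fourth text turns into the (1, 1, 2) we worked out by hand:

<?php
// ...
$zeroVector = ['hello' => 0, 'world' => 0, 'code' => 0];
// Yields ['hello' => 1, 'world' => 1, 'code' => 2]
var_dump(createTextVector(['code', 'world', 'hello', 'code'], $zeroVector));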
And apply those to our texts:
<?php
// ...
// Load and preprocess texts.
$texts = getTexts('./texts');
// Create tokenized versions of these texts.
$tokenizedTexts = array_map('tokenize', $texts);
$tokenizedTexts = array_map(function ($textTokens) use ($stopwords) {
return deleteStopwords($textTokens, $stopwords);
}, $tokenizedTexts);
// Create vector space model.
$zeroVector = createZeroVector($tokenizedTexts);
var_dump(count($zeroVector)); // 5541
// Create vectors for every text.
$vectors = array_map(function ($textTokens) use ($zeroVector) {
return createTextVector($textTokens, $zeroVector);
}, $tokenizedTexts);
Wow, a vector space with 5541 dimensions. That's a lot, but still a bit less than the 5664 we would've worked with if we hadn't removed the stopwords. PHP, being good at arrays, shouldn't have any trouble handling this.
Doing the comparison
Remember the formula for cosine similarity?

\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}

Since there are a lot of sums involved here, we can go for some array_reduce and array_map combinations for the norms of the vectors \lVert A \rVert = \sqrt{\sum_{i=1}^{n} A_i^2} and the dot product A \cdot B = \sum_{i=1}^{n} A_i B_i:
<?php
// ...
/**
* Convenience function: Squares a number.
*
* @param int $x
* @return int
*/
function square(int $x): int {
return $x**2;
}
/**
* Convenience function: Sums two values, to be used with array_reduce.
*
* @param int $carry
* @param int $x
* @return int
*/
function summarize(int $carry, int $x): int {
return $carry + $x;
}
/**
* Calculates the cosine similarity of two vectors.
*
* @param array $a
* @param array $b
* @return float
*/
function getSimilarity(array $a, array $b): float {
$dotProduct = array_reduce(
array_map(function ($x, $y) {
return $x * $y;
}, $a, $b),
'summarize',
0
);
$normA = array_reduce(array_map('square', $a), 'summarize', 0);
$normB = array_reduce(array_map('square', $b), 'summarize', 0);
return $dotProduct / (sqrt($normA) * sqrt($normB));
}
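Before letting it loose on the Wikipedia texts, we can sanity-check the function against the hand-calculated toy example from earlier:

<?php
// ...
// "Hello World" vs. "Hello Code World": 2 / sqrt(6), roughly 0.816.
var_dump(getSimilarity([1, 1, 0], [1, 1, 1]));
// "Hello World" vs. "Code World Hello Code": roughly 0.577.
var_dump(getSimilarity([1, 1, 0], [1, 1, 2]));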
Awesome, now we've got everything set up to compare our previously loaded texts:
<?php
// ...
$m = [
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
];
for ($x = 0; $x < 4; $x++) {
for ($y = 0; $y < 4; $y++) {
$m[$y][$x] = sprintf('%0.3f', getSimilarity($vectors[$y], $vectors[$x]));
}
}
echo ' | Curie | Einstein | Price | Beethoven |' . "\n";
echo '------------+--------+----------+-------+-----------+' . "\n";
echo ' Curie | '.$m[0][0].' | '.$m[1][0].' | '.$m[2][0].' | '.$m[3][0].' |' . "\n";
echo '------------+--------+----------+-------+-----------+' . "\n";
echo ' Einstein | '.$m[0][1].' | '.$m[1][1].' | '.$m[2][1].' | '.$m[3][1].' |' . "\n";
echo '------------+--------+----------+-------+-----------+' . "\n";
echo ' Price | '.$m[0][2].' | '.$m[1][2].' | '.$m[2][2].' | '.$m[3][2].' |' . "\n";
echo '------------+--------+----------+-------+-----------+' . "\n";
echo ' Beethoven | '.$m[0][3].' | '.$m[1][3].' | '.$m[2][3].' | '.$m[3][3].' |' . "\n";
echo '------------+--------+----------+-------+-----------+' . "\n";
Which yields this result table:
| Curie | Einstein | Price | Beethoven |
------------+--------+----------+-------+-----------+
Curie | 1.000 | 0.221 | 0.157 | 0.178 |
------------+--------+----------+-------+-----------+
Einstein | 0.221 | 1.000 | 0.151 | 0.189 |
------------+--------+----------+-------+-----------+
Price | 0.157 | 0.151 | 1.000 | 0.305 |
------------+--------+----------+-------+-----------+
Beethoven | 0.178 | 0.189 | 0.305 | 1.000 |
------------+--------+----------+-------+-----------+
Although the similarity isn't particularly high for any combination of texts, we do see that Price and Beethoven have a higher similarity (0.305) than Price and Einstein (0.151). Also, the similarity between Curie and Einstein is higher than the similarity between Curie and Price.
The model works!
Takeaway thoughts
This method has some significant advantages and disadvantages. First of all, there's no ML involved: no need to train a neural network or anything, it works with plain linear algebra. It also allows us to rank similarities and calculate the possible relevance of documents relative to a single reference document.
The method loses text context, though: synonyms are not counted as such, and the order of terms, which may also matter, is ignored. This can be worked around by using N-grams or lexical databases (see the sketch below). Stemming the terms could also improve results: right now, for example, "test" and "testing" are considered different words.
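To illustrate the N-gram idea, here's a minimal bigram helper. It's a hypothetical sketch, not part of the script above: feeding bigrams instead of single tokens into the vector space would preserve some word order.

<?php
// ...
/**
* Builds bigrams (pairs of adjacent tokens) out of a list of tokens.
*
* @param array $tokens
* @return array
*/
function bigrams(array $tokens): array {
$bigrams = [];
for ($i = 0; $i < count($tokens) - 1; $i++) {
$bigrams[] = $tokens[$i] . ' ' . $tokens[$i + 1];
}
return $bigrams;
}
// Yields ["hello code", "code world"]
var_dump(bigrams(['hello', 'code', 'world']));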
All in all, this method is a handy tool to compare different texts, but needs additional tooling to reduce the number of false positives or false negatives.
Also, here's a PHPSandbox with the code and the texts! (You can execute it by opening up the terminal and typing php ./index.php.)
I hope you enjoyed reading this article! I write tech articles in my free time and like to drink coffee every once in a while.
If you want to support my efforts, please consider buying me a coffee or following me on Twitter!