Rust Keyword Extraction: Creating the YAKE! algorithm from scratch
Afonso Barracha
Posted on April 27, 2024
Disclosure: The material in this Tutorial has not been reviewed, endorsed, or approved of by the Rust Foundation. For more information on the Rust Foundation Trademark Policy, click here.
Series Intro
You know, like many of you, I was blown away by ChatGPT, especially GPT-4, which reignited my love for machine learning after a two-year break. But once I was back, I hit the same issue that made me quit: slow training times.
I tried unsupervised NLP models and keyword extraction, but even those were too slow for a simple VPS. That is when I realised Rust and Actix-Web could be the answer.
Therefore, let's explore Rust for NLP together and dive into the awesome world of text mining and keyword extraction.
What will I cover in this series
I will mainly cover 3 algorithms:
- TF-IDF: Term Frequency-Inverse Document Frequency algorithm converts raw text into insightful numerical data.
- Co-occurrence Matrix: the Co-occurrence Matrix uncovers underlying patterns and associations in a language by examining word pairs and building semantic networks in textual data.
- RAKE: Rapid Automatic Keyword Extraction Algorithm identifies crucial words and phrases to enhance text summarization, topic modeling, and information retrieval in any text-based project.
Introduction
Although I was not originally planning to cover it in this series, in this article we will dive into YAKE! (Yet Another Keyword Extraction Algorithm). This powerful unsupervised algorithm lets us extract meaningful keywords without relying on dictionaries or massive datasets, and is based on the YAKE! Collection-Independent Automatic Keyword Extractor paper.
NOTE: since I implemented YAKE! only recently (within the week before this publication), I am covering this algorithm before RAKE to discuss its implementation details while they are fresh.
Components
The algorithm consists of six main components:
- Text pre-processing: preparing the text for analysis.
- Feature extraction: identifying important characteristics of words.
- Individual term scoring: assigning a score to each potential keyword.
- Candidate keyword list generation: creating a list of potential keywords.
- Data deduplication: removing redundant keywords.
- Ranking: ordering keywords by their computed scores.
Scoring
Each individual term is ranked based on the following characteristics:
- Casing: whether the term has uppercased versions;
- Word Position: the term's position within the document (e.g., appearing at the beginning);
- Word Frequency: how often the term appears in the document;
- Word Relatedness to Context: how strongly the term is related to surrounding words (i.e., does it frequently appear alongside other significant words?);
- Word DifSentence: the number of different sentences in which the term appears.
The individual term score is calculated using this formula:
H = (Relatedness * Position) / (Casing + (Frequency / Relatedness) + (DifSentence / Relatedness))
After the individual terms are scored, each candidate n-gram keyword's score is determined as follows:
S(kw) = Π(H) / (TF(kw) * (1 + Σ(H)))
Since higher scores indicate less important keywords, the final keyword score is:
Score = 1 / S(kw)
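To make these formulas concrete, here is a minimal, self-contained sketch that plugs made-up feature values into them; the numbers and the two-term keyword are purely illustrative and not taken from a real document:

fn main() {
    // Illustrative only: toy feature values for the two terms of a
    // hypothetical keyword, as (casing, position, frequency, relatedness, dif_sentence)
    let terms = [(1.0_f32, 1.6, 0.8, 1.2, 0.5), (0.0, 1.9, 0.5, 1.1, 0.3)];

    // Individual term score:
    // H = (Relatedness * Position) / (Casing + Frequency/Relatedness + DifSentence/Relatedness)
    let h: Vec<f32> = terms
        .iter()
        .map(|(c, p, f, r, d)| (r * p) / (c + f / r + d / r))
        .collect();

    // Keyword score: S(kw) = Π(H) / (TF(kw) * (1 + Σ(H))), with a made-up TF(kw) of 3
    let tf_kw = 3.0_f32;
    let prod: f32 = h.iter().product();
    let sum: f32 = h.iter().sum();
    let s_kw = prod / (tf_kw * (1.0 + sum));

    // Lower S(kw) means a more important keyword, so the final score is inverted
    println!("H = {:?}, S(kw) = {}, Score = {}", h, s_kw, 1.0 / s_kw);
}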
Implementation
For a more programmatically efficient approach, we will slightly deviate from the order presented in the original paper. Our implementation will involve the following steps:
- Text pre-processing and tokenization:
  - Prepare the text for analysis (e.g., lowercasing, removing punctuation, etc.);
  - Divide the text into individual tokenized sentences.
- Candidate selection:
  - Candidate creation: for an appropriate n-gram size, create a hashmap to store candidate keywords;
  - Deduplication: use a separate hashmap to track unique candidate words, preventing redundancy.
- Context building:
  - Occurrence tracking: create a hashmap to store stemmed word forms, along with their casing information and locations (sentences) within the text;
  - Left/right context: employ a window size and a sliding-window approach to build a hashmap that captures stemmed words and their left and right neighbouring words.
- Feature extraction and individual term scoring:
  - Calculate the relevant features (Casing, Word Position, Word Frequency, Word Relatedness to Context, Word DifSentence) for each candidate word;
  - Apply the formula above to compute individual word scores.
- Keyword scoring:
  - Use the calculated individual word scores and the keyword formula to determine the final score for each keyword.
- Ranking with similar candidate elimination:
  - We only eliminate similar candidates when extracting the N best keywords, as it is an expensive O(n log n) operation.
Set-up
Create a library lib.rs if you don't have one, and create a folder named yake with a mod.rs file inside.
And add the following two crates:
$ cargo add unicode-segmentation
$ cargo add regex
Inside of the module file add the following scaffold:
use std::collections::HashMap;
pub struct Yake {
keyword_rank: HashMap<String, f32>,
term_rank: HashMap<String, f32>,
size: usize,
threshold: f32,
}
Text pre-processing and tokenization
Create two files, one for pre-processing and one for building sentences (do not forget to add them to the mod file).
- text_pre_processor.rs:

use std::borrow::Cow;

use regex::Regex;

// Create a regex for removing special white spaces (tabs, new lines, etc.)
fn get_space_regex() -> Option<Regex> {
    Regex::new(r"[\n\t\r]").ok()
}

// Export a struct to encapsulate the logic
pub struct TextPreProcessor;

impl<'a> TextPreProcessor {
    pub fn process_text(text: &'a str) -> Cow<'a, str> {
        let space_regex = get_space_regex();
        // 1. Trim whitespace from both sides of the text
        let trimmed_text = text.trim();

        if let Some(regex) = space_regex {
            // 2. Replace special whitespace characters with regular spaces
            regex.replace_all(trimmed_text, " ")
        } else {
            // This should never happen, but we do not want the system to panic
            trimmed_text.into()
        }
    }
}
- sentences_builder.rs:

use std::borrow::Cow;

use regex::Regex;
use unicode_segmentation::UnicodeSegmentation;

fn get_special_char_regex() -> Option<Regex> {
    Regex::new(r"('s|,|\.|\s)").ok()
}

// The struct for a tokenized sentence
pub struct Sentence<'a> {
    // Cased words
    pub words: Vec<Cow<'a, str>>,
    // Lowercased words
    pub stemmed: Vec<String>,
    // Length of the sentence
    pub length: usize,
}

impl<'a> Sentence<'a> {
    pub fn new(s: &'a str, special_char_regex: &Option<Regex>) -> Self {
        let words = s
            .split_word_bounds()
            .filter_map(|w| {
                // Trim the word to remove whitespaces
                let trimmed = w.trim();

                if trimmed.is_empty() {
                    return None;
                }

                // Remove special characters from strings
                if let Some(regex) = special_char_regex {
                    let value = regex.replace_all(trimmed, "");

                    if value.is_empty() {
                        return None;
                    }

                    Some(value)
                } else {
                    Some(trimmed.into())
                }
            })
            .collect::<Vec<Cow<'a, str>>>();

        Self {
            stemmed: words
                .iter()
                .map(|w| w.to_lowercase())
                .collect::<Vec<String>>(),
            length: words.len(),
            words,
        }
    }
}

// Struct to encapsulate the logic
pub struct SentencesBuilder;

impl<'a> SentencesBuilder {
    // Split the pre-processed text into tokenized sentences
    pub fn build_sentences(text: &'a str) -> Vec<Sentence<'a>> {
        let special_char_regex = get_special_char_regex();

        // Use unicode sentences to get the text's individual sentences
        text.unicode_sentences()
            .map(|s| Sentence::new(s.trim(), &special_char_regex))
            .collect()
    }
}
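As a quick sanity check, here is a hypothetical usage sketch of these two modules together; it assumes both are declared as public modules inside yake/mod.rs so they can be reached from a test or binary in the same crate:

use crate::yake::sentences_builder::SentencesBuilder;
use crate::yake::text_pre_processor::TextPreProcessor;

fn main() {
    let raw = "Rust is fast.\n\tIt is also memory safe. Rustaceans love it!";

    // Normalize special whitespace before splitting into sentences
    let processed = TextPreProcessor::process_text(raw);
    let sentences = SentencesBuilder::build_sentences(&processed);

    for sentence in &sentences {
        // Cased tokens alongside their lowercased (stemmed) counterparts
        println!("{:?} -> {:?}", sentence.words, sentence.stemmed);
    }
}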
Candidate selection
Since candidate selection and context building both depend on a single loop through all the sentences, create one file for all of this logic, candidate_selection_and_context_builder.rs, which will contain:
- Candidate selection;
- Deduplication map creation;
- Context building.
Candidate Selection
Start by creating the Candidate struct with two fields:
- lexical_form: the stemmed form;
- surface_forms: all the ways the candidate appears in the text.
pub struct Candidate<'a> {
// Stemmed form
pub lexical_form: Vec<&'a str>,
// All the ways that stemmed form appears in the text
pub surface_forms: Vec<Vec<&'a str>>,
}
impl<'a> Candidate<'a> {
// A helper to create a new candidate
pub fn new(stems: Vec<&'a str>) -> Self {
Self {
lexical_form: stems,
surface_forms: Vec::new(),
}
}
// A helper to mutate the candidate variations
fn add(&mut self, words: Vec<&'a str>) {
self.surface_forms.push(words);
}
}
type Candidates<'a> = HashMap<String, Candidate<'a>>;
To build the candidates, we take every contiguous n-gram (from one word up to the chosen n-gram size) within each sentence. Hence the code needs to do the following:
- Find combinations of words within each sentence: we look at different n-gram groupings of words to see if they could be meaningful keywords;
- Filter out irrelevant candidates: we skip candidates containing punctuation, common words ("the", "and", etc.), and numbers to focus on the most important terms;
- Track unique candidates: we keep only one entry per potential keyword, even if it appears in slightly different forms (each new form is added to Candidate.surface_forms).
use std::{
cmp::min,
collections::{HashMap, HashSet, VecDeque},
};
use unicode_segmentation::UnicodeSegmentation;
use super::sentences_builder::Sentence;
// ...
fn is_punctuation(word: &str, punctuation: &HashSet<&str>) -> bool {
word.is_empty() || ((word.graphemes(true).count() == 1) && punctuation.contains(word))
}
fn is_invalid_word(word: &str, punctuation: &HashSet<&str>, stop_words: &HashSet<&str>) -> bool {
is_punctuation(word, punctuation) || stop_words.contains(word) || word.parse::<f32>().is_ok()
}
pub struct CandidateSelectionAndContextBuilder;
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
sentences: &'a [Sentence],
ngram: usize,
window_size: usize,
stop_words: HashSet<&'a str>,
punctuation: HashSet<&'a str>,
) -> (
// Candidate Selection
Candidates<'a>,
// ...
) {
// Iterate through every sentence
sentences.iter().enumerate().fold(
(
Candidates::new(),
// ...
),
|(mut candidates), (i, sentence)| {
sentence.words.iter().enumerate().fold(
VecDeque::<(&str, usize)>::with_capacity(window_size + 1),
|mut buffer, (j, w1)| {
// Iterate through each n-gram combination
// Ensure n-grams don't extend beyond the end of the sentence
(j + 1..=min(j + ngram, sentence.length)).for_each(|k: usize| {
// Take the stems for the candidate; these will later form the unique key
let stems = sentence.stemmed[j..k]
.iter()
.map(|s| s.as_str())
.collect::<Vec<&'a str>>();
// Filter out punctuation, stop words and numbers
if stems
.iter()
.any(|w| is_invalid_word(&w, &punctuation, &stop_words))
{
return;
}
let words = sentence.words[j..k]
.iter()
.map(|s| s.as_ref())
.collect::<Vec<&'a str>>();
// Join the stems into a unique candidate key
let key = stems.join(" ");
let entry = match candidates.get_mut(&key) {
Some(entry) => entry,
None => {
// ...
candidates.entry(key).or_insert(Candidate::new(stems))
}
};
entry.add(words);
});
});
(candidates)
})
}
}
Explanation: We do 3 iterations:
- Iterate through every sentence;
- Iterate through every word in a sentence;
- Iterate through the next n-gram - 1 words after the current word.
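To visualise what this loop produces, here is a standalone sketch (independent of the implementation above) that prints the raw n-gram candidates for a toy sentence with ngram = 3, before any filtering:

fn main() {
    let words = ["rust", "is", "a", "fast", "systems", "language"];
    let ngram = 3;

    // For every starting position, emit the 1..=ngram contiguous word groups
    for j in 0..words.len() {
        for k in (j + 1)..=(j + ngram).min(words.len()) {
            // In the real code, candidates containing stop words ("is", "a"),
            // punctuation, or numbers are rejected by is_invalid_word
            println!("{}", words[j..k].join(" "));
        }
    }
}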
Deduplication map creation
Since our algorithm prioritizes combinations of words (n-grams) over single words, we need a way to track how often individual words appear within important multi-word keywords. This is where the deduplication map comes in.
Add a new HashMap to select_candidates_and_build_context:
// ...
type DedupMap<'a> = HashMap<&'a str, f32>;
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// ...
) {
// Iterate through every sentence
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
// ...
),
|(mut candidates, mut dedup_map), (i, sentence)| {
// ...
(candidates, dedup_map)
})
}
}
And add the stems to the map when the candidate contains more than one word:
// ...
type DedupMap<'a> = HashMap<&'a str, f32>;
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// ...
) {
// Iterate through every sentence
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
// ...
),
|(mut candidates, mut dedup_map), (i, sentence)| {
sentence.words.iter().enumerate().fold(
VecDeque::<(&str, usize)>::with_capacity(window_size + 1),
|mut buffer, (j, w1)| {
// ...
let entry = match candidates.get_mut(&key) {
Some(entry) => entry,
None => {
// Check if stems has multiple words
if stems.len() > 1 {
// Iterate through every word
// and add them or increment it
stems.iter().for_each(|w| {
let entry = dedup_map.entry(*w).or_insert(0.0);
*entry += 1.0;
});
}
candidates.entry(key).or_insert(Candidate::new(stems))
}
};
entry.add(words);
});
});
(candidates, dedup_map)
})
}
}
Explanation: Each time we find a potential keyword with multiple words, we increment the count of each of its individual parts in the dedup_map.
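For example, here is a tiny, purely illustrative sketch of how the counts accumulate when the hypothetical candidates "machine learning" and "deep learning" are found:

use std::collections::HashMap;

fn main() {
    let mut dedup_map: HashMap<&str, f32> = HashMap::new();

    // Two hypothetical multi-word candidates found in the text
    for candidate in [["machine", "learning"], ["deep", "learning"]] {
        for word in candidate {
            *dedup_map.entry(word).or_insert(0.0) += 1.0;
        }
    }

    // {"machine": 1.0, "learning": 2.0, "deep": 1.0}
    println!("{:?}", dedup_map);
}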
Context building
Building the term context is divided into two steps:
- Left/Right context building;
- Occurrences tracking.
Left/Right context building
To understand the relationships between words, we need to identify which words frequently appear to the left and right of other important terms. This is where the LeftRightContext map comes in.
Sliding window: to build these contexts, we will use a sliding-window technique. Imagine it as a queue that moves along the sentence, capturing a fixed number (window_size) of words at a time.
Add a new HashMap that tracks the left and right neighbours of a term:
// ...
pub type LeftRightContext<'a> = HashMap<&'a str, (Vec<&'a str>, Vec<&'a str>)>;
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// Context Builder
// ...
LeftRightContext<'a>,
) {
// Iterate through every sentence
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
// ...
LeftRightContext::new(),
),
|(mut candidates, mut dedup_map, mut lr_contexts), (i, sentence)| {
// ...
(candidates, dedup_map, lr_contexts)
})
}
}
Create a buffer that tracks the current words for a given window size:
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// Context Builder
// ...
LeftRightContext<'a>,
) {
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
// ...
LeftRightContext::new(),
),
|(mut candidates, mut dedup_map, mut occurrences, mut lr_contexts), (i, sentence)| {
sentence.words.iter().enumerate().fold(
VecDeque::<(&str, usize)>::with_capacity(window_size + 1),
|mut buffer, (j, w1)| {
// ...
// Context Building
let key1 = sentence.stemmed[j].as_str();
let w1_str = w1.as_ref();
// ...
// Iterate through the sliding window
buffer.iter().for_each(|(w2, k)| {
// add left context
let entry_1 =
lr_contexts.entry(key1).or_insert((Vec::new(), Vec::new()));
entry_1.0.push(*w2);
// add right context
let entry_2 = lr_contexts
.entry(sentence.stemmed[*k].as_str())
.or_insert((Vec::new(), Vec::new()));
entry_2.1.push(w1_str);
});
buffer.push_back((w1_str, j));
// Keep the sliding window at the window size
if buffer.len() > window_size {
buffer.pop_front();
}
buffer
},
);
(candidates, dedup_map, lr_contexts)
},
)
}
}
Explanation: Each time the window moves, we record the terms appearing to the left and right of the central term.
Occurrences tracking
To understand the significance of a term, we need to know in which sentences it appears and the different forms it takes (e.g., capitalized, lowercase). This is where the Occurrences map plays a role.
Create a new HashMap that tracks each term's form and its sentence index:
// ...
pub type Occurrences<'a> = HashMap<&'a str, Vec<(&'a str, usize)>>;
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// Context Builder
Occurrences<'a>,
LeftRightContext<'a>,
) {
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
Occurrences::new(),
LeftRightContext::new(),
),
|(mut candidates, mut dedup_map, mut occurrences, mut lr_contexts), (i, sentence)| {
// ...
(candidates, dedup_map, occurrences, lr_contexts)
},
)
}
}
Filter out invalid words before adding terms to the occurrences:
// ...
pub type Occurrences<'a> = HashMap<&'a str, Vec<(&'a str, usize)>>;
// ...
impl<'a> CandidateSelectionAndContextBuilder {
pub fn select_candidates_and_build_context(
// ...
) -> (
// Candidate Selection
Candidates<'a>,
DedupMap<'a>,
// Context Builder
Occurrences<'a>,
LeftRightContext<'a>,
) {
sentences.iter().enumerate().fold(
(
Candidates::new(),
DedupMap::new(),
Occurrences::new(),
LeftRightContext::new(),
),
|(mut candidates, mut dedup_map, mut occurrences, mut lr_contexts), (i, sentence)| {
sentence.words.iter().enumerate().fold(
VecDeque::<(&str, usize)>::with_capacity(window_size + 1),
|mut buffer, (j, w1)| {
// Context Building
let key1 = sentence.stemmed[j].as_str();
let w1_str = w1.as_ref();
// Remove invalid words
if !is_invalid_word(key1, &punctuation, &stop_words) {
// Add all valid occurrences
let entry = occurrences.entry(key1).or_default();
entry.push((w1_str, i));
}
// ...
},
);
(candidates, dedup_map, occurrences, lr_contexts)
},
)
}
}
Explanation: Each time we encounter a valid term, we add it to the Occurrences map along with the specific form it has in the sentence and the sentence index.
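To make the two maps more tangible, here is a stripped-down, standalone sketch of the same sliding-window bookkeeping for a single toy sentence (every word treated as valid, sentence index 0, window_size = 1); the expected contents are shown in the final comment:

use std::collections::{HashMap, VecDeque};

fn main() {
    let words = ["Rust", "makes", "Rust", "developers", "happy"];
    let stemmed = words
        .iter()
        .map(|w| w.to_lowercase())
        .collect::<Vec<String>>();
    let window_size = 1;

    let mut occurrences: HashMap<&str, Vec<(&str, usize)>> = HashMap::new();
    let mut lr_contexts: HashMap<&str, (Vec<&str>, Vec<&str>)> = HashMap::new();
    let mut buffer: VecDeque<(&str, usize)> = VecDeque::new();

    for (j, w1) in words.iter().copied().enumerate() {
        let key1 = stemmed[j].as_str();
        // Every word is treated as valid here, and there is only sentence index 0
        occurrences.entry(key1).or_default().push((w1, 0));

        for &(w2, k) in &buffer {
            // w2 is a left neighbour of key1, and w1 is a right neighbour of stemmed[k]
            lr_contexts.entry(key1).or_default().0.push(w2);
            lr_contexts.entry(stemmed[k].as_str()).or_default().1.push(w1);
        }

        buffer.push_back((w1, j));
        if buffer.len() > window_size {
            buffer.pop_front();
        }
    }

    // Expected: "rust" -> (["makes"], ["makes", "developers"]), "makes" -> (["Rust"], ["Rust"]),
    // "developers" -> (["Rust"], ["happy"]), "happy" -> (["developers"], [])
    println!("{:?}\n{:?}", occurrences, lr_contexts);
}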
Feature extraction and individual term scoring
Feature extraction and individual term scoring, as the name suggests, is composed of two steps, which follow the formulas below.
- Feature extraction:
  - Explanation:
    - Casing: does the term appear capitalized or in uppercase?
    - Frequency: how often the normalized term occurs in the text.
    - Position: where the term tends to appear within the document (earlier sentences are more important).
    - Relatedness: how strongly the term is connected to other significant words in the text (i.e., does it behave like a stop word?).
    - Difference: does the term appear in a variety of sentences?
  - Formulas:
    - Casing: MAX(TF_U, TF_C) / (1 + LOG(TF))
      - TF_U: the uppercased term frequency;
      - TF_C: the capitalized term frequency;
      - TF: the term frequency.
    - Frequency: TF / (MEAN(TOTAL_TF) + STD(TOTAL_TF))
      - TOTAL_TF: the term frequencies of all words.
    - Position: LOG(LOG(3 + MEDIAN(IDX)))
      - IDX: the indexes of the sentences in which the term appears.
    - Relatedness: 1 + ((WDL / WIL) + (WDR / WIR)) * (TF / MAX(TOTAL_TF))
      - WDL: word degree information left, the number of unique terms directly neighbouring the central term on the left;
      - WIL: word interaction strength left, the total number of edges neighbouring the central term on the left;
      - WDR: same as WDL but for the right side;
      - WIR: same as WIL but for the right side.
    - Difference: UNIQUE(IDX) / TOTAL_SENTENCES
      - TOTAL_SENTENCES: the total number of sentences in the text.
- Individual term scoring:
  - Explanation: we combine the individual features into a single score that determines the term's inverse importance.
  - Formula: H = (Relatedness * Position) / (Casing + (Frequency / Relatedness) + (Difference / Relatedness))
These formulas are derived from the original YAKE! Python implementation, which can be found here.
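As a sanity check on the formulas, here is a small hand-worked sketch for a single hypothetical term; every count below is made up purely for illustration:

fn main() {
    // Hypothetical counts for one term
    let tf = 4.0_f32; // total occurrences
    let tf_upper = 0.0_f32; // fully uppercased occurrences
    let tf_capitalized = 2.0_f32; // capitalized occurrences
    let tf_mean = 2.0_f32; // mean term frequency over all terms
    let tf_std = 1.5_f32; // standard deviation of term frequencies
    let tf_max = 6.0_f32; // highest term frequency in the text
    let median_idx = 1.0_f32; // median sentence index of the occurrences
    let (wdl, wil, wdr, wir) = (2.0_f32, 3.0, 3.0, 4.0); // left/right context counts
    let unique_sentences = 3.0_f32;
    let total_sentences = 5.0_f32;

    let casing = tf_upper.max(tf_capitalized) / (1.0 + tf.ln());
    let frequency = tf / (tf_mean + tf_std);
    let position = (3.0 + median_idx).ln().ln();
    let relatedness = 1.0 + ((wdl / wil) + (wdr / wir)) * (tf / tf_max);
    let difference = unique_sentences / total_sentences;

    // H = (Relatedness * Position) / (Casing + Frequency/Relatedness + Difference/Relatedness)
    let h = (relatedness * position)
        / (casing + frequency / relatedness + difference / relatedness);
    println!("casing={casing}, frequency={frequency}, position={position}, relatedness={relatedness}, difference={difference}, H={h}");
}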
Set up
Create a new file named feature_extraction.rs and add a struct to encapsulate this step's logic (score_words):
use std::collections::{HashMap, HashSet};
use unicode_segmentation::UnicodeSegmentation;
use super::candidate_selection_and_context_builder::{LeftRightContext, Occurrences};
pub struct FeatureExtractor;
impl<'a> FeatureExtractor {
pub fn score_words(
occurrences: Occurrences<'a>,
contexts: LeftRightContext<'a>,
sentences_len: f32,
) -> HashMap<&'a str, f32> {
// ...
}
}
Before scoring each term, we need to calculate the TOTAL_TF statistics (its max, mean, and standard deviation), since they depend on all occurrences:
// ...
impl<'a> FeatureExtractor {
pub fn score_words(
occurrences: Occurrences<'a>,
contexts: LeftRightContext<'a>,
sentences_len: f32,
) -> HashMap<&'a str, f32> {
// Calculate aggregate statistics across all occurrences
// Calculate the total and max term frequency
let (tf_total, tf_max_u) = occurrences.values().fold((0, 0), |(tf_total, tf_max), v| {
let length = v.len();
(tf_total + length, tf_max.max(length))
});
let tf_max = tf_max_u as f32;
// Calculate the mean of term frequencies
let tf_mean = tf_total as f32 / (occurrences.len() as f32 + f32::EPSILON);
// Calculate standard deviation of term frequencies
let tf_std = occurrences
.values()
.fold(0.0_f32, |a, v| a + (v.len() as f32 - tf_mean).powi(2))
.sqrt();
// ...
}
}
Feature extraction
- Casing:

impl<'a> FeatureExtractor {
    pub fn score_words(
        // ...
    ) -> HashMap<&'a str, f32> {
        // ...
        occurrences
            .into_iter()
            .map(|(word, occurrence)| {
                // The term frequency is the number of its occurrences
                let tf = occurrence.len() as f32;

                let (tf_upper, tf_capitalized) = occurrence.iter().fold(
                    (0.0_f32, 0.0_f32),
                    |(tf_upper, tf_capitalized), (w, _)| {
                        (
                            // Check if the word is all uppercased and has a length greater than 1
                            tf_upper
                                + if w.graphemes(true).count() > 1
                                    && &w.to_uppercase().as_str() == w
                                {
                                    1.0
                                } else {
                                    0.0
                                },
                            // Check if the word starts with an uppercase character
                            // and contains any lowercased character
                            tf_capitalized
                                + if w.chars().next().unwrap_or(' ').is_uppercase()
                                    && (w.graphemes(true).count() == 1
                                        || w.chars().skip(1).any(|c| c.is_lowercase()))
                                {
                                    1.0
                                } else {
                                    0.0
                                },
                        )
                    },
                );

                // Apply the `MAX(TF_U, TF_C) / (1 + LOG(TF))` formula
                let casing = tf_upper.max(tf_capitalized) / (1.0 + tf.ln());
                // ...
            })
            .collect()
    }
}
- Frequency:

impl<'a> FeatureExtractor {
    pub fn score_words(
        // ...
    ) -> HashMap<&'a str, f32> {
        // ...
        occurrences
            .into_iter()
            .map(|(word, occurrence)| {
                // The term frequency is the number of its occurrences
                let tf = occurrence.len() as f32;
                // ...
                // Apply the `TF / (MEAN(TOTAL_TF) + STD(TOTAL_TF))` formula
                let frequency = tf / (tf_mean + tf_std + f32::EPSILON);
                // ...
            })
            .collect()
    }
}
- Position:

impl<'a> FeatureExtractor {
    pub fn score_words(
        // ...
    ) -> HashMap<&'a str, f32> {
        // ...
        occurrences
            .into_iter()
            .map(|(word, occurrence)| {
                // ...
                // Take the tf in usize
                let occ_len = occurrence.len();
                // Calculate the median sentence index
                let median = if occ_len == 0 {
                    0.0
                } else if occ_len == 1 {
                    occurrence[0].1 as f32
                } else if occ_len % 2 == 1 {
                    // Odd number of occurrences: take the middle sentence index
                    occurrence[occ_len / 2].1 as f32
                } else {
                    // Even number of occurrences: average the two middle sentence indexes
                    let mid = occ_len / 2;
                    (occurrence[mid].1 + occurrence[mid - 1].1) as f32 / 2.0
                };
                // Apply the `LOG(LOG(3 + MEDIAN(IDX)))` formula
                let position = (3.0 + median).ln().ln();
                // ...
            })
            .collect()
    }
}
- Relatedness:

impl<'a> FeatureExtractor {
    pub fn score_words(
        // ...
    ) -> HashMap<&'a str, f32> {
        // ...
        occurrences
            .into_iter()
            .map(|(word, occurrence)| {
                // ...
                // Get the context
                let left_right_context = contexts
                    .get(word)
                    .map(|(v1, v2)| (v1.as_slice(), v2.as_slice()))
                    .unwrap_or((&[], &[]));
                // Calculate the `WDL` by creating a hashset of the left neighbours
                let left_context_unique = left_right_context
                    .0
                    .iter()
                    .copied()
                    .collect::<HashSet<&str>>();
                // Calculate `(WDL / WIL)`
                let wl = if !left_context_unique.is_empty() {
                    left_context_unique.len() as f32
                        / (left_right_context.0.len() as f32 + f32::EPSILON)
                } else {
                    0.0_f32
                };
                // Calculate the `WDR` by creating a hashset of the right neighbours
                let right_context_unique = left_right_context
                    .1
                    .iter()
                    .copied()
                    .collect::<HashSet<&str>>();
                // Calculate `(WDR / WIR)`
                let wr = if !right_context_unique.is_empty() {
                    right_context_unique.len() as f32
                        / (left_right_context.1.len() as f32 + f32::EPSILON)
                } else {
                    0.0_f32
                };
                // Apply the `1 + ((WDL / WIL) + (WDR / WIR)) * (TF / MAX(TOTAL_TF))` formula
                let relatedness = 1.0 + ((wl + wr) * (tf / tf_max));
                // ...
            })
            .collect()
    }
}
- Difference:

impl<'a> FeatureExtractor {
    pub fn score_words(
        // ...
    ) -> HashMap<&'a str, f32> {
        // ...
        occurrences
            .into_iter()
            .map(|(word, occurrence)| {
                // ...
                // Collect the unique sentence indexes into a hash set
                let unique_sentences = occurrence
                    .iter()
                    .map(|occ| occ.1)
                    .collect::<HashSet<usize>>();
                // Apply the `UNIQUE(IDX) / TOTAL_SENTENCES` formula
                let different = unique_sentences.len() as f32 / (sentences_len + f32::EPSILON);
                // ...
            })
            .collect()
    }
}
Individual term scoring
For this step, we just need to calculate the inverse score by combining all the previously calculated features with the formula from earlier: H = (Relatedness * Position) / (Casing + (Frequency / Relatedness) + (Difference / Relatedness)).
impl<'a> FeatureExtractor {
pub fn score_words(
// ...
) -> HashMap<&'a str, f32> {
// ...
occurrences
.into_iter()
.map(|(word, occurrence)| {
// ...
(
word,
(relatedness * position)
/ (casing + (frequency / relatedness) + (different / relatedness)),
)
})
.collect()
}
}
Keyword scoring
After calculating scores for individual terms, we combine them to determine the importance of multi-word keywords (n-grams). YAKE! uses the following formula to calculate a keyword's "inverse score":
S(kw) = Π(H) / (TF(kw) * (1 + Σ(H)))
Where:
- S(kw): the keyword's inverse score (lower scores indicate more important keywords);
- Π(H): the product of the individual term scores within the keyword;
- TF(kw): the total frequency of the keyword in the text;
- Σ(H): the sum of the individual term scores within the keyword.
We will also score the individual terms simply by inverting the inverse score:
W = 1 / H
This section will be divided into three steps:
- Keyword scoring;
- Term scoring;
- Building YAKE!.
Set up
Create a new file named yake_logic.rs and add a struct to encapsulate all the functions necessary to build YAKE!:
- YakeLogic struct and build_yake:
use std::collections::{HashMap, HashSet};

use super::{
    candidate_selection_and_context_builder::{Candidate, CandidateSelectionAndContextBuilder},
    feature_extraction::FeatureExtractor,
    sentences_builder::SentencesBuilder,
    text_pre_processor::TextPreProcessor,
};

pub struct YakeLogic;

impl YakeLogic {
    pub fn build_yake(
        text: &str,
        stop_words: HashSet<&str>,
        punctuation: HashSet<&str>,
        ngram: usize,
        window_size: usize,
    ) -> (HashMap<String, f32>, HashMap<String, f32>) {
        // ...
    }
    // ...
}
- score_candidates:

impl YakeLogic {
    // ...
    fn score_candidates<'a>(
        candidates: HashMap<String, Candidate<'a>>,
        dedup_hashmap: HashMap<&'a str, f32>,
        word_scores: &HashMap<&'a str, f32>,
    ) -> HashMap<String, f32> {
        // ...
    }
    // ...
}
- score_terms:

impl YakeLogic {
    // ...
    fn score_terms(word_scores: HashMap<&str, f32>) -> HashMap<String, f32> {
        // ...
    }
}
Keyword scoring
Scoring the keywords comprises 3 steps:
- Iterate through each candidate;
- Calculate the score and invert it;
- Normalize the score by the max score.
impl YakeLogic {
// ...
fn score_candidates<'a>(
candidates: HashMap<String, Candidate<'a>>,
dedup_hashmap: HashMap<&'a str, f32>,
word_scores: &HashMap<&'a str, f32>,
) -> HashMap<String, f32> {
let candidates_len = candidates.len();
// Iterate through each candidate
let (vec_scores, max) = candidates.into_iter().fold(
(
Vec::<(String, f32)>::with_capacity(candidates_len),
f32::EPSILON,
),
|(mut acc, max), (k, pc)| {
// Calculate the product and sum
let (prod, sum) = pc.lexical_form.iter().fold(
// Penalize single words that are used in other keywords
(*dedup_hashmap.get(k.as_str()).unwrap_or(&1.0), 0.0),
|acc, w| {
let weight = *word_scores.get(*w).unwrap_or(&0.0);
(acc.0 * weight, acc.1 + weight)
},
);
let tf = pc.surface_forms.len() as f32;
// Guard against a zero denominator in `1.0 + sum`
let sum = if sum == -1.0 { 1.0 - f32::EPSILON } else { sum };
// Apply the `S(kw) = Π(H) / TF(kw)(1 + Σ(H))` formula
let score = prod / (tf * (1.0 + sum));
// Since lower is better, invert the score
let inverse_score = 1.0 / score;
acc.push((k, inverse_score));
(acc, max.max(inverse_score))
},
);
// Normalize the scores
vec_scores
.into_iter()
.map(|(k, v)| (k, v / max))
.collect::<HashMap<String, f32>>()
}
// ...
}
Term scoring
Like the "Keyword scoring", term scoring comprises of 3 steps:
- Iterate throuch each term;
- Inverse the inverse term score;
- Normalize the score by the max score.
impl YakeLogic {
// ...
fn score_terms(word_scores: HashMap<&str, f32>) -> HashMap<String, f32> {
let word_scores_len = word_scores.len();
// Iterate through each term
let (vec_scores, max) = word_scores.into_iter().fold(
(
Vec::<(String, f32)>::with_capacity(word_scores_len),
f32::EPSILON,
),
|(mut acc, max), (k, score)| {
// Invert the score
let inverse_score = 1.0 / score;
acc.push((k.to_string(), inverse_score));
(acc, max.max(inverse_score))
},
);
// Normalize the scores
vec_scores
.into_iter()
.map(|(k, v)| (k, v / max))
.collect::<HashMap<String, f32>>()
}
}
Building YAKE!
Putting the previous sections together, we can now build the algorithm:
impl YakeLogic {
pub fn build_yake(
text: &str,
stop_words: HashSet<&str>,
punctuation: HashSet<&str>,
ngram: usize,
window_size: usize,
) -> (HashMap<String, f32>, HashMap<String, f32>) {
// Preprocess the input text
let text = TextPreProcessor::process_text(text);
// Split the text into sentences
let sentences = SentencesBuilder::build_sentences(&text);
// Identify keywords and build terms context
let (candidates, dedup_hashmap, occurrences, lr_contexts) =
CandidateSelectionAndContextBuilder::select_candidates_and_build_context(
&sentences,
ngram,
window_size,
stop_words,
punctuation,
);
// Calculate feature scores for individual terms
let word_scores =
FeatureExtractor::score_words(occurrences, lr_contexts, sentences.len() as f32);
// Score keywords and individual terms
(
Self::score_candidates(candidates, dedup_hashmap, &word_scores),
Self::score_terms(word_scores),
)
}
// ...
}
The build_yake function follows this flow: pre-process the text, split it into sentences, select the candidates and build the context maps, score the individual terms, and finally score both keywords and terms.
Ranking with similar candidates elimination
By reading the paper we see that:
In the fifth step, we eliminate similar candidates coming from the previous steps. For this, we use the Levenshtein distance
Hence we need to implement the Levenshtein distance algorithm before ranking the keywords.
Levenshtein distance
The Levenshtein distance algorithm measures the "edit distance" between two words. In more technical terms, it calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
use std::cmp::{max, min};
use unicode_segmentation::UnicodeSegmentation;
fn calculate_distance(str1: &str, str2: &str) -> usize {
// Return early with minimal computation
if (str1.is_empty() && str2.is_empty()) || str1 == str2 {
return 0;
}
// Split string slices into user-perceived characters
let graphemes1 = str1.graphemes(true);
let graphemes2 = str2.graphemes(true);
// Create a table to store transformations
let len = graphemes2.clone().count() + 1;
let first_row = (0..len).collect::<Vec<usize>>();
let last_row = graphemes1
    .enumerate()
    .fold(first_row, |row, (i, char1)| {
        let mut new_row = vec![i + 1; len];
        graphemes2.clone().enumerate().for_each(|(j, char2)| {
            // Calculating the cost of converting substrings
            // The cost is 0 if the characters are the same, otherwise 1 (for a substitution)
            let cost = if char1 == char2 { 0 } else { 1 };
            // Take the minimum of a deletion, an insertion, or a substitution
            new_row[j + 1] = min(row[j + 1] + 1, min(new_row[j] + 1, row[j] + cost));
        });
        new_row
    });
last_row[len - 1]
}
pub struct Levenshtein<'a>(&'a str, &'a str, usize);
impl<'a> Levenshtein<'a> {
pub fn new(str1: &'a str, str2: &'a str) -> Self {
Self(str1, str2, calculate_distance(str1, str2))
}
pub fn ratio(&self) -> f32 {
let max_len = max(
self.0.graphemes(true).count(),
self.1.graphemes(true).count(),
);
// Calculate the percentage of change between the two string slices
1.0 - (self.2 as f32 / max_len as f32)
}
}
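A quick usage sketch, assuming the Levenshtein struct above is in scope (the strings are arbitrary examples):

fn main() {
    // "keyword" -> "keywords" requires a single insertion
    let lev = Levenshtein::new("keyword", "keywords");
    // 1.0 - (1 / 8) = 0.875
    println!("similarity ratio: {}", lev.ratio());
    // With a threshold of e.g. 0.85, these two candidates would be treated as
    // near-duplicates, and only the higher-ranked one would be kept
}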
Ranking keywords/terms
Ranking keywords is composed of two steps:
- Sorting the keywords/terms maps;
- Building the ranking vector by removing similar candidates.
On the mod.rs file, add the following functions:
- sort_ranked_map:

use std::cmp::{min, Ordering};
use std::collections::hash_map::RandomState;

fn sort_ranked_map<'a>(map: &'a HashMap<String, f32, RandomState>) -> Vec<(&'a String, &'a f32)> {
    let mut map_values = map.iter().collect::<Vec<(&'a String, &'a f32)>>();
    map_values.sort_by(|a, b| {
        // Sort by descending score, breaking ties alphabetically
        let order = b.1.partial_cmp(a.1).unwrap_or(Ordering::Equal);

        if order == Ordering::Equal {
            return a.0.cmp(b.0);
        }

        order
    });
    map_values
}
- build_ranked_scores:

fn build_ranked_scores(vec: &mut Vec<(String, f32)>, word: &str, score: f32, threshold: f32) {
    // Skip words that are too similar to an already ranked word
    if vec
        .iter()
        .any(|(w, _)| Levenshtein::new(w, word).ratio() >= threshold)
    {
        return;
    }

    vec.push((word.to_string(), score));
}
Putting it all together
Implement a new function on the Yake struct (the punctuation list is optional, falling back to a default PUNCTUATION set):
// ...
mod levenshtein;
// ...
pub struct Yake {
keyword_rank: HashMap<String, f32>,
term_rank: HashMap<String, f32>,
size: usize,
threshold: f32,
}
impl Yake {
pub fn new(
    text: &str,
    stop_words: &[String],
    punctuation: Option<&[String]>,
    threshold: f32,
    ngram: usize,
    window_size: usize,
) -> Self {
    let (keyword_rank, term_rank) = YakeLogic::build_yake(
        text,
        stop_words.iter().map(|s| s.as_str()).collect(),
        match punctuation {
            Some(p) => p.iter().map(|s| s.as_str()).collect(),
            // PUNCTUATION is a default punctuation set defined elsewhere in the module
            None => PUNCTUATION.iter().copied().collect(),
        },
        ngram,
        window_size,
    );
Self {
size: keyword_rank.len(),
keyword_rank,
term_rank,
threshold,
}
}
// ...
}
And add a function to get the ranked keyword scores:
impl Yake {
// ...
pub fn get_ranked_keyword_scores(&self, n: usize) -> Vec<(String, f32)> {
// Calculate the capacity
let capacity = min(self.size, n);
let result = sort_ranked_map(&self.keyword_rank).into_iter().try_fold(
Vec::<(String, f32)>::with_capacity(capacity),
|mut acc, (word, score)| {
// Short-circuit and exit if Vec reached capacity
if acc.len() == capacity {
return Err(acc);
}
build_ranked_scores(&mut acc, word, *score, self.threshold);
Ok(acc)
},
);
match result {
Ok(v) => v,
Err(v) => v,
}
}
}
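Finally, here is a hypothetical end-to-end usage sketch from somewhere inside the same crate; the stop-word list and the parameter values (threshold, n-gram size, window size) are arbitrary examples, and Yake::new follows the signature defined above:

// Adjust the path to wherever the yake module is declared in your crate
use crate::yake::Yake;

fn main() {
    let text = "Rust is a multi-paradigm, general-purpose programming language. \
                Rust emphasizes performance, type safety, and concurrency.";
    let stop_words = ["is", "a", "and"]
        .iter()
        .map(|s| s.to_string())
        .collect::<Vec<String>>();

    // threshold = 0.85, ngram = 2, window_size = 1
    let yake = Yake::new(text, &stop_words, None, 0.85, 2, 1);

    // Print the 5 best keywords with their normalized scores
    for (keyword, score) in yake.get_ranked_keyword_scores(5) {
        println!("{keyword}: {score:.3}");
    }
}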
Conclusion
In this article we explored the YAKE! algorithm, a versatile and efficient unsupervised keyword extraction technique. We delved into its implementation, covering the following steps:
- Pre-processing: preparing the text for analysis;
- Candidate selection and context building: identifying potential keywords and their relationships;
- Feature extraction and scoring: calculating features to determine word and keyword importance;
- Keyword ranking and deduplication: sorting keywords by their calculated scores and eliminating redundant candidates.
All the code discussed in this article can be accessed through this repository. For integration with existing projects, consider using the keyword_extraction crate available on crates.io.
About the Author
Hey there! I am Afonso Barracha, a back-end developer with a soft spot for GraphQL. If you enjoyed reading this article, why not show some love by buying me a coffee?
Lately, I have been diving deep into more advanced subjects. As a result, I have switched from sharing my thoughts every week to posting once or twice a month. This way, I can make sure to bring you the highest quality content possible.
Do not miss out on any of my latest articles – follow me here on dev and LinkedIn to stay updated. I would be thrilled to welcome you to our ever-growing community! See you around!