Distributed High-Speed Spell Checking & Correction In Node.js
Mohamed El-Bably
Posted on May 3, 2023
Introduction
Spell-checking is a widely used feature in software applications such as search engines, email clients, word processors and chatbots.
SymSpellEx (Extended SymSpell)
A Node.js spelling correction & fuzzy search library based on the Symmetric Delete Spelling Correction algorithm (SymSpell)
Why SymSpellEx
- Simple spell checking algorithms, such as the Norvig Algorithm, are not efficient and can be slow.
- I needed a spell checker that would work on a distributed system with multiple nodes, to perform spelling correction without duplicating data on each node.
- I needed to ensure that any changes or updates to the training data would be reflected across all connected nodes.
Edit Distance
Edit distance is a measure of how different two strings are in terms of the minimum number of operations required to transform one string into the other.
Edit distance algorithms compute the minimum number of edit operations required to transform one string into another, where edit operations include insertion, deletion, and substitution of characters. Some popular edit distance algorithms include:
- Levenshtein distance: This algorithm computes the minimum number of insertions, deletions, and substitutions required to transform one string into another.
- Damerau-Levenshtein distance: This is similar to Levenshtein distance, but also allows for transpositions (i.e., swapping adjacent characters).
- Jaro distance: This algorithm computes the similarity between two strings based on the number of matching characters and the number of transpositions needed to make the strings identical.
- Jaro-Winkler distance: This is an extension of the Jaro distance that assigns higher weights to prefix matches.
Example
The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:
kitten → sitten (substitute "s" for "k")
sitten → sittin (substitute "i" for "e")
sittin → sitting (insert "g" at the end)
The total number of operations needed to transform "kitten" into "sitting" is three, which is the edit distance between the two strings.
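As an illustration, here is a minimal TypeScript sketch of the classic dynamic-programming algorithm for Levenshtein distance (not part of SymSpellEx, purely illustrative):

function levenshtein(source: string, target: string): number {
    const m = source.length;
    const n = target.length;
    // dp[i][j] = distance between the first i chars of source and the first j chars of target
    const dp: number[][] = Array.from({length: m + 1}, () => new Array<number>(n + 1).fill(0));
    for (let i = 0; i <= m; i++) dp[i][0] = i; // delete all i characters
    for (let j = 0; j <= n; j++) dp[0][j] = j; // insert all j characters
    for (let i = 1; i <= m; i++) {
        for (let j = 1; j <= n; j++) {
            const cost = source[i - 1] === target[j - 1] ? 0 : 1;
            dp[i][j] = Math.min(
                dp[i - 1][j] + 1,       // deletion
                dp[i][j - 1] + 1,       // insertion
                dp[i - 1][j - 1] + cost // substitution (free on a match)
            );
        }
    }
    return dp[m][n];
}

console.log(levenshtein('kitten', 'sitting')); // 3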
Naive approach
The naive approach checks each word in the dictionary to find the word with the smallest edit distance to the misspelled word.
For example, if the misspelled word is "teh", the algorithm checks every word in the dictionary to find the one with the smallest edit distance. It may find "the" as a possible correction, since "the" is only one edit away (a transposition of the adjacent characters "e" and "h", when the distance measure allows transpositions).
This approach is computationally very expensive, which makes it impractical to use.
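In code, the naive scan might look like this minimal sketch, reusing the levenshtein function above (the dictionary here is a toy example):

function naiveCorrect(word: string, dictionary: string[]): string {
    let best = word;
    let bestDistance = Infinity;
    // One full edit-distance computation per dictionary word, for every query
    for (const candidate of dictionary) {
        const distance = levenshtein(word, candidate);
        if (distance < bestDistance) {
            bestDistance = distance;
            best = candidate;
        }
    }
    return best;
}

console.log(naiveCorrect('teh', ['the', 'cat', 'dog'])); // "the"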
Peter Norvig Algorithm
The Peter Norvig algorithm is a spell-checking approach that generates all possible terms within a given edit distance of the query term and searches for them in a dictionary to find the correct spelling. The edit distance is the number of operations needed to transform the query term into the correct spelling, where an operation can be a deletion, transposition, replacement, or insertion of a single character.
The algorithm follows these steps:
1. Generate all possible candidate words within an edit distance of n from the input word, by applying deletions, transpositions, replacements, and insertions to the input word.
2. Filter the candidate words to keep only the ones that appear in a pre-existing corpus of text, in this case a large English text corpus.
3. Rank the remaining candidate words based on their frequency of occurrence in the corpus.
4. Return the most likely candidate word as the corrected spelling.
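The generation step (edit distance 1) can be sketched in TypeScript as follows; this is a rough port of the well-known Python version from Norvig's essay, not code from SymSpellEx:

const ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

// All candidate words within edit distance 1 of `word`
function edits1(word: string): Set<string> {
    const candidates = new Set<string>();
    for (let i = 0; i <= word.length; i++) {
        const left = word.slice(0, i);
        const right = word.slice(i);
        if (right.length > 0) {
            candidates.add(left + right.slice(1)); // deletion
        }
        if (right.length > 1) {
            candidates.add(left + right[1] + right[0] + right.slice(2)); // transposition
        }
        for (const c of ALPHABET) {
            if (right.length > 0) {
                candidates.add(left + c + right.slice(1)); // replacement
            }
            candidates.add(left + c + right); // insertion
        }
    }
    return candidates;
}

Candidates within edit distance 2 are obtained by applying edits1 to every result of edits1, which is where the candidate count explodes.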
Example:
Suppose the input word is "the", and we have a large English text corpus to draw from. We first generate all possible candidate words within an edit distance of 2 from "the". This gives us a set of words such as "thee", "them", "then", "there", "these", "theta", and so on.
Next, we filter this set to only include the candidate words that appear in the English text corpus. Suppose this narrows down our set to just "the" and "they".
Finally, we rank the remaining candidate words by their frequency of occurrence in the corpus. Since "the" is a very common word in English, it is much more likely to be the intended spelling than "they". Therefore, "the" is returned as the corrected spelling for the input word "the".
However, this method can be expensive and language-dependent. For a word of length n, an alphabet of size a, and an edit distance of 1, there will be n deletions, n-1 transpositions, a * n alterations, and a * (n+1) insertions.
The number of candidate words can therefore be calculated as:
2n + 2an + a - 1, or 54n + 25 assuming the English alphabet (a = 26).
That is already 457 candidates for n = 8 at edit distance 1, and at edit distance 2 the candidate set explodes to about 90902 words for n = 8.
In some languages, such as Chinese, the alphabet can be enormous, resulting in even more possible words to search through.
Symmetric Delete Spelling Correction algorithm
The Symmetric Delete spelling correction algorithm simplifies the process of generating possible spellings and searching in a dictionary. It does so by only using deletion as an operation, instead of deletion combined with transpose, replace, and insert operations. As a result, it is much faster, being six orders of magnitude faster than traditional methods for edit distance of 3. Moreover, it is language-independent.
Why is the algorithm fast?
All possible spelling-error variants (deletes only) are pre-calculated during the training phase and stored in a hash table, which makes searching fast as well, with an average lookup complexity of O(1).
This makes the algorithm very fast, but it also requires a large memory footprint, and the training phase takes a considerable amount of time to build the dictionary the first time.
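The core idea can be illustrated with a toy pre-calculation (hypothetical names, not the library's internal code): every dictionary term is indexed under all of its delete-variants, so a lookup becomes a handful of hash probes.

// All variants of `term` reachable by deleting up to `maxDistance` characters
function deletes(term: string, maxDistance: number): Set<string> {
    const variants = new Set<string>([term]);
    let frontier: Set<string> = new Set([term]);
    for (let d = 0; d < maxDistance; d++) {
        const next = new Set<string>();
        for (const word of frontier) {
            for (let i = 0; i < word.length; i++) {
                next.add(word.slice(0, i) + word.slice(i + 1));
            }
        }
        next.forEach((v) => variants.add(v));
        frontier = next;
    }
    return variants;
}

// Training: map every delete-variant back to the original dictionary terms
const dictionary = new Map<string, string[]>();
function trainTerm(term: string, maxDistance = 2): void {
    for (const variant of deletes(term, maxDistance)) {
        const terms = dictionary.get(variant) ?? [];
        terms.push(term);
        dictionary.set(variant, terms);
    }
}

trainTerm('computer');
// Lookup: probe the table with the input and its delete-variants, O(1) each
console.log(dictionary.get('compter')); // ["computer"]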
For more details check SymSpell
SymSpellEx
SymSpellEx builds on SymSpell to provide extensibility: different edit distance algorithms or different data stores can be plugged in.
Features
- Very fast
- Word suggestions
- Word correction
- Multiple languages supported - the algorithm and the implementation are language-independent
- Extendable - edit distance algorithms and data stores can be implemented to extend the library's functionality
Main Components
1. Tokenizer
This interface can be implemented to provide a different tokenizer for the library; a simple core tokenizer is provided.
export interface Tokenizer {
    tokenize(input: string): Array<Token>;
}
2. Edit Distance
This interface can be implemented to provide a different algorithm for calculating the edit distance between two words.
interface EditDistance {
    name: string;
    calculateDistance(source: string, target: string): number;
}
Built-in Edit Distance Algorithms
- Damerau–Levenshtein distance
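As an illustration, plugging in plain Levenshtein distance could look like the following hypothetical sketch (reusing the levenshtein function from earlier; this is not a built-in class):

class LevenshteinDistance implements EditDistance {
    name = 'levenshtein';
    calculateDistance(source: string, target: string): number {
        return levenshtein(source, target);
    }
}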
3. Data Store
This interface can be implemented to provide additional ways to store data beyond the built-in stores (Memory, Redis).
A data store should handle storage for these 2 data types:
- Terms: list data structure to store terms and retrieve them by index
- Entries: hash table data structure to store dictionary entries and retrieve data by term (key)
A data store should also handle storage for multiple languages and switching between them; check the implemented data stores for reference.
export interface DataStore {
    name: string;
    // Lifecycle
    initialize(): Promise<void>;
    isInitialized(): boolean;
    // Switch the active language
    setLanguage(language: string): Promise<void>;
    // Terms: list storage, addressed by index
    pushTerm(value: string): Promise<number>;
    getTermAt(index: number): Promise<string>;
    getTermsAt(indexes: Array<number>): Promise<Array<string>>;
    // Entries: hash storage, addressed by term (key)
    getEntry(key: string): Promise<Array<number>>;
    getEntries(keys: Array<string>): Promise<Array<Array<number>>>;
    setEntry(key: string, value: Array<number>): Promise<boolean>;
    hasEntry(key: string): Promise<boolean>;
    // Length of the longest stored entry key
    maxEntryLength(): Promise<number>;
    clear(): Promise<void>;
}
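As an illustration, a deliberately simplified custom store might look like the following sketch (hypothetical class, single language, no persistence; the built-in stores below are the real reference implementations):

class SimpleMemoryStore implements DataStore {
    name = 'simple-memory';
    private terms: string[] = [];
    private entries = new Map<string, Array<number>>();
    private maxLength = 0;
    private initialized = false;

    async initialize(): Promise<void> { this.initialized = true; }
    isInitialized(): boolean { return this.initialized; }
    // A real store would keep one namespace per language; this sketch ignores it
    async setLanguage(language: string): Promise<void> {}
    async pushTerm(value: string): Promise<number> {
        // Return the index of the newly appended term
        return this.terms.push(value) - 1;
    }
    async getTermAt(index: number): Promise<string> { return this.terms[index]; }
    async getTermsAt(indexes: Array<number>): Promise<Array<string>> {
        return indexes.map((i) => this.terms[i]);
    }
    async getEntry(key: string): Promise<Array<number>> {
        return this.entries.get(key) ?? [];
    }
    async getEntries(keys: Array<string>): Promise<Array<Array<number>>> {
        return Promise.all(keys.map((k) => this.getEntry(k)));
    }
    async setEntry(key: string, value: Array<number>): Promise<boolean> {
        this.entries.set(key, value);
        this.maxLength = Math.max(this.maxLength, key.length);
        return true;
    }
    async hasEntry(key: string): Promise<boolean> { return this.entries.has(key); }
    async maxEntryLength(): Promise<number> { return this.maxLength; }
    async clear(): Promise<void> {
        this.terms = [];
        this.entries.clear();
        this.maxLength = 0;
    }
}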
Built-in data stores
- Memory: Stores data in memory, using an array structure for terms and a high-speed hash table (megahash) to manage dictionary entries.
May be limited by the Node.js process memory limit, which can be overridden.
- Redis: Stores data in a Redis database, using a list structure for terms and a hash to store dictionary data.
This is a very efficient way to train and store data: it allows access from multiple processes and/or nodes/machines, so the data is trained once on a centralized, distributed store. With enough memory, data can be updated in production under a different Redis namespace without interruption, and dumping or migrating the data is easy.
The Redis data store uses Lua scripting to set entries efficiently, running multiple Redis commands server-side through a custom command, hSetEntry. The script stores the entry and keeps the recorded maximum entry length up to date:
async initialize(): Promise<void> {
    await this._redis.defineCommand('hSetEntry', {
        numberOfKeys: 2,
        lua: `
            -- previously recorded max entry length
            local olen = redis.call("hget", "${this._configNamespace}", ARGV[2])
            -- store the entry itself
            local value = redis.call("hset", KEYS[1], KEYS[2], ARGV[1])
            local nlen = #KEYS[2]
            -- update the recorded max length if this key is longer
            if(not olen or nlen > tonumber(olen)) then
                redis.call("hset", "${this._configNamespace}", ARGV[2], nlen)
            end
            return value
        `
    });
    this._initialized = true;
}
Usage
Training
For single-term training you can use the add function:
import {SymSpellEx, MemoryStore} from 'symspell-ex';

const LANGUAGE = 'en';
// Create a SymSpellEx instance and inject a new store instance
const symSpellEx = new SymSpellEx(new MemoryStore());
await symSpellEx.initialize();
// Train data
await symSpellEx.add("argument", LANGUAGE);
await symSpellEx.add("computer", LANGUAGE);
For multiple terms (Array) you can use the train function:
const terms = ['argument', 'computer'];
await symSpellEx.train(terms, 1, LANGUAGE);
Searching
The search function can be used to get multiple suggestions, if available, up to the maxSuggestions value.
Arguments:
- input: String (wrong/invalid word we need to correct)
- language: String (language to be used in search)
- maxDistance: Number, optional, default = 2 (maximum distance for suggestions)
- maxSuggestions: Number, optional, default = 5 (maximum number of suggestions to return)
Returns an Array<Suggestion> of suggestion objects.
Example
await symSpellEx.search('argoments', 'en');
Example Suggestion object:
{
    "term": "argoments",
    "suggestion": "arguments",
    "distance": 2,
    "frequency": 155
}
Correction
The correct function can be used to get the best suggestion for an input word or sentence, in terms of edit distance and frequency.
Arguments:
- input: String (wrong/invalid word we need to correct)
- language: String (language to be used in search)
- maxDistance: Number, optional, default = 2 (maximum distance for suggestions)
Returns a Correction object which contains the original input and the corrected output string, with an array of suggestions.
Example
await symSpellEx.correct('Special relatvity was orignally proposed by Albert Einstein', 'en');
Returns this Correction object (the output depends entirely on the quality of the training data that was pushed into the store):
{
    "suggestions": [],
    "input": "Special relatvity was orignally proposed by Albert Einstein",
    "output": "Special relativity was originally proposed by Albert Einstein"
}
Github Repository
https://github.com/m-elbably/symspell-ex
Conclusion
Spell-checking is a widely used feature in software applications. Simple spell-checking algorithms are not efficient and can be slow.
There are many popular edit distance algorithms such as Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, and Jaro-Winkler distance.
The naive approach involves checking each word in the dictionary to find the word with the smallest edit distance to the misspelled word, which is computationally expensive.
The Peter Norvig algorithm generates all possible terms within a given edit distance of the query term and searches for them in a dictionary to find the correct spelling.
The Symmetric Delete spelling correction algorithm simplifies the generation of possible spellings and the dictionary search by using deletes only, which makes it much faster and language-independent.