Distributed High-Speed Spell Checking & Correction In Node.js

melbably

Mohamed El-Bably

Posted on May 3, 2023

Distributed High-Speed Spell Checking & Correction In Node.js

Introduction

Spell-checking is a widely used feature in software applications such as search engines, email clients, word processors and chatbots.

SymSpellEx (Extended SymSpell)

Node.js spelling correction & Fuzzy search library based on Symmetric Delete Spelling Correction algorithm (SymSpell)

Why SymSpellEx

  • Simple spell checking algorithms, such as the Norvig Algorithm, are not efficient and can be slow.
  • I needed a spell checker that would work on a distributed system with multiple nodes, to perform spelling correction without duplicating data on each node.
  • I needed to ensure that any changes or updates to the training data would be reflected across all connected nodes.

Edit Distance

Edit distance is a measure of how different two strings are in terms of the minimum number of operations required to transform one string into the other.

Edit distance algorithms compute the minimum number of edit operations required to transform one string into another, where edit operations include insertion, deletion, and substitution of characters. Some popular edit distance algorithms include:

  • Levenshtein distance: This algorithm computes the minimum number of insertions, deletions, and substitutions required to transform one string into another.
  • Damerau-Levenshtein distance: This is similar to Levenshtein distance, but also allows for transpositions (i.e., swapping adjacent characters).
  • Jaro distance: This algorithm computes the similarity between two strings based on the number of matching characters and the number of transpositions needed to make the strings identical.
  • Jaro-Winkler distance: This is an extension of the Jaro distance that assigns higher weights to prefix matches.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

kitten → sitten (substitute "s" for "k")
sitten → sittin (substitute "i" for "e")
sittin → sitting (insert "g" at the end)
Enter fullscreen mode Exit fullscreen mode

The total number of operations needed to transform "kitten" into "sitting" is three, which is the edit distance between the two strings.

Naive approach

It involves checking each word in the dictionary to find the word with the smallest edit distance to the misspelled word.

For example, if the misspelled word is "teh", the algorithm will check every word in the dictionary to find the word with the smallest edit distance. The algorithm may find "the" as a possible correction, since it only requires one edit (a substitution of "h" with "e")

This approach is computationally very expensive, which makes it not practical to use.

Peter Norvig Algorithm

Peter Norvig Algorithm is a spell-checking algorithm that generates all possible terms with a given edit distance from the query term and searches them in a dictionary to find the correct spelling. The edit distance is calculated based on the number of operations needed to transform the query term to the correct spelling, where an operation can be a deletion, transposition, replacement, or insertion of a single character.

The algorithm follows these steps:

  1. Generate all possible candidate words within an edit distance of n from the input word, by applying deletions, transpositions, replacements, and insertions to the input word.

  2. Filter the candidate words to keep only the ones that appear in a pre-existing corpus of text, in this case, a large English text corpus.

  3. Rank the remaining candidate words based on their frequency of occurrence in the corpus.

  4. Return the most likely candidate word as the corrected spelling.

Example:
Suppose the input word is "the", and we have a large English text corpus to draw from. We first generate all possible candidate words within an edit distance of 2 from "the". This gives us a set of words such as "thee", "them", "then", "there", "these", "theta", and so on.

Next, we filter this set to only include the candidate words that appear in the English text corpus. Suppose this narrows down our set to just "the" and "they".

Finally, we rank the remaining candidate words by their frequency of occurrence in the corpus. Since "the" is a very common word in English, it is much more likely to be the intended spelling than "they". Therefore, "the" is returned as the corrected spelling for the input word "the".

However, this method can be expensive and language-dependent. For a word of length n, an alphabet size a, and an edit distance of 1, there will be n deletions, n-1 transpositions, a * n alterations, and a * (n+1) insertions.

Candidate words number can be calculated using:
2n+2an+a-1 or 54n + 25 (Assuming we are using english alphabet having a = 26), will lead to 90902 candidate words search when n = 8.

In some languages, such as Chinese, the alphabet can be enormous, resulting in even more possible words to search through.

Symmetric Delete Spelling Correction algorithm

The Symmetric Delete spelling correction algorithm simplifies the process of generating possible spellings and searching in a dictionary. It does so by only using deletion as an operation, instead of deletion combined with transpose, replace, and insert operations. As a result, it is much faster, being six orders of magnitude faster than traditional methods for edit distance of 3. Moreover, it is language-independent.

Why the algorithm is fast?

Pre-calculation in training phase, while all possible spelling error variants as generated (deletes only) and stored in hash table, which makes it also fast when searching with an average search time complexity of O(1).

This makes the algorithm very fast, but it also required a large memory footprint, and the training phase takes a considerable amount of time to build the dictionary first time.

For more details check SymSpell

SymSpellEx

SymSpellEx is building on SymSpell to provide extensibility by implementing different edit distance algorithms or implementing different data store.

SymSpellEx Design

Features

  • Very fast
  • Word suggestions
  • Word correction
  • Multiple languages supported - The algorithm, and the implementation are language independent
  • Extendable - Edit distance and data stores can be implemented to extend library functionalities

Main Components

1. Tokenizer

This interface can be implemented to provide a different tokenizer for the library, a simple core tokenizer is provided.

export interface Tokenizer {
    tokenize(input: string): Array<Token>;
}
Enter fullscreen mode Exit fullscreen mode

2. Edit Distance

This interface can be implemented to provide different algorithms to use to calculate edit distance between two words.

interface EditDistance {
    name: String;
    calculateDistance(source: string, target: string): number;
}
Enter fullscreen mode Exit fullscreen mode

Built-in Edit Distance Algorithms

  • Damerau–Levenshtein distance

3. Data Store

This interface can be implemented to provide additional method to store data other than built-in stores (Memory, Redis)

Data store should handle storage for these 2 data types:

  • Terms: List data structure to store terms and retrieve it by index
  • Entries: Hash Table data structure to store dictionary entries and retrieve data by term (Key)

Data store should also handle storage for multiple languages and switch between them, check implemented data stores.

export interface DataStore {
    name: string;
    initialize(): Promise<void>;
    isInitialized(): boolean;
    setLanguage(language: string): Promise<void>;
    pushTerm(value: string): Promise<number>;
    getTermAt(index: number): Promise<string>;
    getTermsAt(indexes: Array<number>): Promise<Array<string>>;
    getEntry(key: string): Promise<Array<number>>;
    getEntries(keys: Array<string>): Promise<Array<Array<number>>>;
    setEntry(key: string, value: Array<number>): Promise<boolean>;
    hasEntry(key: string): Promise<boolean>;
    maxEntryLength(): Promise<number>;
    clear(): Promise<void>;
}
Enter fullscreen mode Exit fullscreen mode

Built-in data stores

  • Memory: Stores data in memory, using array structure for terms and high speed hash table (megahash) to manage dictionary entries

May be limited by node process memory limits, which can be overridden

  • Redis: Stores data into Redis database using list structure to store terms and hash to store dictionary data

Very efficient way to train and store data, it will allow accessing by multiple processes and/or nodes/machines, training data one time on centralized distributed store, with enough memory data can be updated in production using different namespace on redis without interruptions, also dumping and migrating data will be easy.

Redis data store uses LUA scripting to efficiently set entries using multiple Redis commands on server side, by defining custom command hSetEntry

async initialize(): Promise<void> {
        await this._redis.defineCommand('hSetEntry', {
            numberOfKeys: 2,
            lua:
                `
                local olen = redis.call("hget", "${this._configNamespace}", ARGV[2])
                local value = redis.call("hset", KEYS[1], KEYS[2], ARGV[1])
                local nlen = #KEYS[2]
                if(not olen or nlen > tonumber(olen)) then
                  redis.call("hset", "${this._configNamespace}", ARGV[2], nlen)
                end
                return value
                `
        });
        this._initialized = true;
    }
Enter fullscreen mode Exit fullscreen mode

Usage

Training

For single term training you can use add function:

import {SymSpellEx, MemoryStore} from 'symspell-ex';

const LANGUAGE = 'en';
// Create SymSpellEx instnce and inject store new store instance
symSpellEx = new SymSpellEx(new MemoryStore());
await symSpellEx.initialize();
// Train data
await symSpellEx.add("argument", LANGUAGE);
await symSpellEx.add("computer", LANGUAGE);
Enter fullscreen mode Exit fullscreen mode

For multiple terms (Array) you can use train function:

const terms = ['argument', 'computer'];
await symSpellEx.train(terms, 1, LANGUAGE);
Enter fullscreen mode Exit fullscreen mode

Searching

search function can be used to get multiple suggestions if available up to the maxSuggestions value

Arguments:

  • input String (Wrong/Invalid word we need to correct)
  • language String (Language to be used in search)
  • maxDistance Number, optional, default = 2 (Maximum distance for suggestions)
  • maxSuggestions Number, optional, default = 5 (Maximum suggestions number to return)

Return: Array<Suggetion> Array of suggestions

Example

await symSpellEx.search('argoments', 'en');
Enter fullscreen mode Exit fullscreen mode

Example Suggestion Object:

{
  "term": "argoments",
  "suggestion": "arguments",  
  "distance": 2,
  "frequency": 155
}
Enter fullscreen mode Exit fullscreen mode

Correction

correct function can be used to get the best suggestion for input word or sentence in terms of edit distance and frequency

Arguments:

  • input String (Wrong/Invalid word we need to correct)
  • language String (Language to be used in search)
  • maxDistance Number, optional, default = 2 (Maximum distance for suggestions)

Return: Correction object which contains original input and corrected output string, with array of suggestions

Example

await symSpellEx.correct('Special relatvity was orignally proposed by Albert Einstein', 'en');
Enter fullscreen mode Exit fullscreen mode

Returns this Correction object:

This output is totally depending on the quality of the training data that was push into the store

{
  "suggestions": [],
  "input": "Special relatvity was orignally proposed by Albert Einstein",
  "output": "Special relativity was originally proposed by Albert Einstein"
}
Enter fullscreen mode Exit fullscreen mode

Github Repository

https://github.com/m-elbably/symspell-ex

Conclusion

Spell-checking is a widely used feature in software applications. Simple spell-checking algorithms are not efficient and can be slow.

There are many popular edit distance algorithms such as Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, and Jaro-Winkler distance.

Naive approach involves checking each word in the dictionary to find the word with the smallest edit distance to the misspelled word, which is computationally expensive.

Peter Norvig Algorithm is a spell-checking algorithm that generates all possible terms with a given edit distance from the query term and searches them in a dictionary to find the correct spelling.

Symmetric Delete spelling correction algorithm simplifies the process of generating possible spellings and searching in a dictionary by only using deletion as an operation, resulting in being much faster and language-independent.

References

💖 💪 🙅 🚩
melbably
Mohamed El-Bably

Posted on May 3, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related