String Matching Algorithm Uses, Interactive Implementations, & Free Source Code
aruna-x
Posted on November 4, 2021
Ahhh autocorrect. How many times has it changed a four-letter expletive to "duck"? Though when autocorrect works as planned, it enables us to have smoother, more intuitive experiences with technology rather than hindering our free expression. At the heart of autocorrect is a well-intentioned string matching algorithm. There are many such tools, including:
- username and password hash matching
- search engines
- autocorrect
- spell checkers
- email spam filters
- plagiarism detection software
- bioinformatics and DNA sequencing tools
- quiz games!
Exact String Matching
There are two types of string matching: exact and fuzzy. Exact string matching is precisely as it sounds. Only identical strings pass the test, so to speak.
Something similar to this simple implementation seems useful for username and password hash matching. (Note: I've made this case sensitive for simplicity.)
Click the green play button to run this code. To edit code, create a replit account, fork this code, and have at it.
After pressing the green play button, you can feed the function your own strings in this console! Try entering: isExactMatch("string", "ring")
But perhaps we don't want to match whole strings. If we wanted to search large data for some exact substring query, we could redefine our criteria: exact common substring matches of length 4 or more found anywhere within either string, let's say. Then "apple" and "grappled" would pass.
The below implementation is called Longest Common Substring. Let's make it case insensitive. And if you've found this blog looking for a version that won't just check the first n characters (mystifyingly abundant online), but will return a match for any substring found anywhere within either string (way more useful imho), you're in luck:
Replace "4" on line 4 in the expression "end - beg > 4" with any number that allows your test data to succeed robustly.
Common substring has its limitations. For example, it fails "receipt vs reciept", a common typo. We'll circle back to this a bit later in this article.
There are indeed more powerful algorithms like the Boyer–Moore string-search algorithm, which avoids searching strings one character at a time. Instead, to increase efficiency, it explores the string being searched by jumping past ranges of characters and performs tail-first matching on the query string itself (which is assumed to be shorter). Fancy.
There's also the Meyers diff algorithm, used to highlight diffs in Github!
But for now, we'll move on to fuzzy string matching. Maybe I'll circle back to follow up on the Boyer–Moore string-search algorithm and the Meyer's diff algorithm in future updates.
Fuzzy String Matching
Google search queries often include typos.
Autocorrect can helpfully suggest adding back the "f" in "shift" in a work email. Bioinformatics tools can find gene mutations by detecting slight changes from original sequences. And spam filters can catch variations of common red flag phrases, despite spammers' best attempts at obfuscation.
Fuzzy string matching does some heavy lifting here. With fuzzy string matching (also referred to as inexact string matching or approximate string matching) we can probabilistically and algorithmically find most likely matches.
Here I've chosen to implement Levenshtein distance, the most common example of Edit distance. People often use these terms interchangeably, though there are other Edit distances.
Levenshtein Distance is in essence quite simple. It represents the minimum number of insertions, deletions, and substitutions it takes to make one string match another. To calculate the distance we use a matrix encoded with all possible operations on all possible substrings starting from the beginnings. This allows us to find and use the minimums on each operation dynamically.
This implementation uses a threshold of < 3. You can change that on line 25 after forking or copying.
According to my research, Levenshtein distance is considered the gold standard for fuzzy string matching. It hasn't been improved in about 50 years. For a comprehensive explanation, I highly recommend Understanding the Levenshtein Distance Equation for Beginners by Ethan Nam.
Even given its cachet, Levenshtein distance also has limitations. Unlike common substring it will pass "receipt vs reciept", but it will fail "Mt Whitney vs Mount Whitney" which common substring handles beautifully. Let's talk about that below.
Final Thoughts
A few weeks ago I co-created a kawaii-style quiz game called "Cookie-Loving Monster In Danger!" that uses the tech mentioned above. (No affiliation with Sesame Street or Jeopardy!) In order to get a functional version of string matching, I used all of these:
- removal of special characters using regex
- some simple logic to handle the edge case of query strings less than 3 characters
- longest common substring (at a threshold of >4)
- Levenshtein distance (at a threshold of <3)
Here is the final code. Try running it to see the test output, then test your own cases using the format stringAnalysis("string1", "string2"):
There are ways in which the above fails. It doesn't work with absolute accuracy.
However, this code worked well enough to make "Cookie-Loving Monster In Danger!" playable. So if you're curious to see it in action, hop over and play a game. If you win, there's a fun surprise in store. Or you can watch my walkthrough video here.
Follow Up
In the future, I'd be interested in creating my own implementations of the Boyer–Moore string-search algorithm and the Meyers diff algorithm, as I did with all of the above code snippets. I'd also be interested in improving the final code snippet by refactoring and further optimizing time and space complexity. I'd include a dictionary of common reasonable substitutions (like "2" and "two"). Then I'd take into account the probability of the occurrence of letters, common misspellings, and words in context (given actual usage).
Inspiration for the latter of these improvements comes from How to Write a Spelling Corrector by Peter Norvig. Well worth the read.
Posted on November 4, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 4, 2021