Autocomplete Algorithms
Khaled Elmorsy
Posted on December 1, 2022
Disclaimer
First, let's clear something up: the key to fast autocomplete is in data structures.
When naïve ol' me set out to implement an autocomplete feature in the website I'm building, I searched for 'autocomplete algorithms', so this post is for fellow code compatriots who think the same way.
1. The Problem
We all know what autocomplete outputs, but let's lay out some conventions that we'll use as we go along.
Autocomplete's primary function:
Given a query string q of length m, find all strings s in a defined set S in which q is a substring of s.
For any brute force enthusiasts out there already rolling up their sleeves and preparing some for loops:
The solution should scale at most sublinearly with the number of strings. AKA, don't check all the strings...
2. Arrays
2.1 Naïve Search
First, the simplest approach we can use, which doesn't follow the added rule we imposed on ourselves, is a straightforward traversal of the string set S, checking whether q is a substring of each string s.
function autocomplete(query, strings) {
const matches = []
for (let string of strings) {
if (string.includes(query)) matches.push(string)
}
return matches
}
With this approach, we have to visit all n strings and do up to k · m character comparisons for each one (with k the string length and m the query length), making it O(n · k · m).
Don't let the example fool you. It may not seem like a lot of operations, but scale it to an array of 100,000 display names, each up to 50 characters long, and autocompleting just a 4 character query could need up to 20 million comparisons. And we need to run this and feed the output off a server while the user is typing (given a small debounce, but still...).
So we know we need to significantly cut down the number of operations we have to do, and we can actually achieve this with arrays too.
2.2 Prefix Binary Search
Instead of traversing the whole string set, we can use a binary search to cut down on the number of comparisons, but we need to address a major caveat.
Binary search relies on having a sorted search space to be able to discard half of it each time. The issue here is that we can't just sort the strings because we don't have a basis of comparison that accounts for the presence of substrings.
| Strings | apple | cake | table |
| --- | --- | --- | --- |

| Query | able |
| --- | --- |
In the alphabetically sorted string set above, a binary search would check both 'cake' and 'apple', but what we really want to match is 'table' since it contains our query. This fact isn't represented above though, so we need a way to do that.
Also note that we need a way to capture multiple matches, still without fully traversing the complete set. We'll see that this is fairly simple, almost as easy as going for a walk 😉.
2.2.1 Finish my Sentences
Let's dumb down our primary function definition into something more basic that only matches strings that start with our query, i.e. the query is the string's prefix. So if our query was "cake", it could match "cake recipes" and "cake calories" but not "chocolate cake" or "Why don't I like cake" (you should go get that checked).
In that case, our basis of comparison is a left to right character match. This means that we can sort the string set according to each character from left to right, AKA alphabetically.
Now we can run a binary search and match a string with our query. And since we want multiple matches and we know that our string set is sorted alphabetically, i.e. according to their prefixes, we can simply walk up and down the array from our match, grabbing each additional match and stopping when our full query isn't the prefix of the next string.
function autocomplete(query, strings) {
const sortedStrings = strings.sort((a, b) => (a < b ? -1 : 1)); // Sort strings alphabetically
const matchIndex = findIndex(query);
if (matchIndex === -1) return [];
const matches = [sortedStrings[matchIndex]];
return matches.concat(farmMatches(query, matchIndex));
// Binary search for one of the matches.
function findIndex(query, start = 0, end = sortedStrings.length - 1) {
if (start > end) return -1;
const midpoint = start + ((end - start) >> 1); // Bitshift for a quick floored halve
// String.startsWith(), since we're checking the beginning of the match
const string = sortedStrings[midpoint];
if (string.startsWith(query)) return midpoint;
[start, end] =
query < sortedStrings[midpoint]
? [start, midpoint - 1]
: [midpoint + 1, end];
return findIndex(query, start, end);
}
/* Walk up and down the array from our current match looking for more matches.
When we come across a non match stop. */
function farmMatches(query, index) {
const matches = [];
for (let i = index + 1; i < sortedStrings.length; i++) {
const string = sortedStrings[i];
if (!string.startsWith(query)) break;
matches.push(string);
}
for (let i = index - 1; i >= 0; i--) {
const string = sortedStrings[i];
if (!string.startsWith(query)) break;
matches.push(string);
}
return matches;
}
}
Now, to match all the strings in S where q is a prefix, our time complexity breaks down into two main parts.
First, the binary search, which is at worst O(log n) string visits, where finding out if the query is a string's prefix is O(m). That gives O(m · log n).
Then, for walking up and down the array from the initial match, we get O(z · m), where z is the number of matches in the string set.
This reduces to O(m · log n + z · m), or O(m · (log n + z)).
You can observe this first hand, minus the character comparisons, by trying a few queries in the 10k word autocomplete, particularly queries that have no matches and queries that have several.
If we had 1000 times the user base of the 100k display name example from earlier, we'd only need to visit at worst about log₂(100,000,000) ≈ 27 strings before finding our first match and starting the adjacent walk.
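As a quick sanity check on those numbers, here's the back-of-the-envelope arithmetic (the string count and lengths are the assumptions carried over from the earlier example):
const totalStrings = 100_000 * 1_000; // 100 million display names
const naiveComparisons = totalStrings * 50 * 4; // ~20 billion character comparisons
const binarySearchChecks = Math.ceil(Math.log2(totalStrings)); // 27 strings visited before the first match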
Right now, while we have a very performant algorithm, we're still limited to matching the prefixes of the strings. To generalize it to any inner substring, we won't change how we search, but what we search for.
2.3 Suffix Arrays
Suffix: A letter or group of letters added at the end of a word to make a new word -Cambridge Dictionary
Replace "word" with "string" (and "letters" with "characters") and we get the key to substring autocomplete. But we won't be adding to our strings; we'll generate all the possible suffixes of each string and add them to our string set with a reference to the original string.
For example, for the string "hello" we generate the following suffixes:
| Prefix | Suffix |
| --- | --- |
|  | hello |
| h | ello |
| he | llo |
| hel | lo |
| hell | o |
We then link all of these suffixes with "hello", so that if any of them are matched with our current method, say a user types "el" which matches with "ello", then "hello" is added to the actual matches.
We'll use a map/dictionary and IDs to link suffixes to their respective strings to save on memory. And to work around colliding suffixes (e.g. 'cake' and 'joke' both have 'ke' as a suffix), we'll store an array of string IDs at each suffix.
This array of suffixes is aptly named a Suffix Array, and when created for multiple strings, it's a Generalized Suffix Array. For the sake of brevity, we won't repeat 'generalized' each time.
2.3.1 Procedure
Preprocessing
const strings = ['test', 'cake', 'joke', 'best', 'hello', 'suffix', 'etc'] // ;)
// 1. Declare object/dictionary to store ID:string pairs
const stringMap = {}
// 2. Declare an object to store Suffix:ID[] pairs
const suffixMap = {} // We'll see why an object soon
// 3. Iterate over the string set and
for (let string of strings) {
// 1. Generate a unique ID based on the state of stringMap
const id = Object.values(stringMap).length
// 2. Add the string to stringMap with the ID as the key
stringMap[id] = string
// 3. Generate all suffixes - we can also just loop without storing them in an array
const suffixes = [...string].map((_,i) => string.slice(i))
// 4.Iterate over the suffixes to insert them
for (let suffix of suffixes) {
// 0. If our suffix isn't a key in suffixMap, add it with an empty array as its value
suffixMap[suffix] ||= []
// 1. Push the ID into the suffix's id array - With objects, this is O(1)
suffixMap[suffix].push(id)
}
}
// 4. Convert suffixMap to an array of {suffix, ids: id[]} objects
const suffixArray = Object.entries(suffixMap).map(([suffix, ids]) => ({suffix, ids}))
// 5. Sort the suffixArray according to the suffixes
const sortedSuffixArray = suffixArray.sort((a,b) => a.suffix < b.suffix ? -1 : 1)
Note that storing all of the suffixes as full strings is unnecessarily inefficient. In actual suffix array implementations, we only need to store the suffix's position in the string, and in our multi-string generalized model, we also store a string ID. That way, we can use these two indices to generate any suffix in our model. This is discussed and visually demonstrated in section 5.2.
Preprocessing Time Complexity
Looking at the complexity of constructing a suffix array for n strings of average length k:
- Each string of length k has k suffixes.
- There are n · k suffixes in total.
- Sorting n · k suffixes with an efficient algorithm like merge sort is O(n · k · log(n · k)) comparisons.
- Suffix lengths scale with k, so comparing two suffixes, which is linear in their length, is O(k).
So the final time complexity of building a suffix array is O(n · k² · log(n · k)). This would not be performant on large sets with long strings.
Fortunately, there are more efficient algorithms that can build suffix arrays in linear time, such as DC3, essentially reducing preprocessing time complexity to O(n · k).
Matching
Matching is similar to the prefix binary search example from earlier, since we're essentially prefix matching our query with suffixes instead of strings.
The only difference is that we're first collecting string ID's from each matched suffix then mapping those IDs to their relevant strings.
// 1.a Find one match with a binary search
const firstMatch = binarySearch(query, sortedSuffixArray)
// 1.b Then traverse up and down the array from that match to get the rest of the matches. We only return an array of IDs in this step.
const matchIDs = farmMatches(query, firstMatch, sortedSuffixArray)
// 2. Remove duplicate IDs
const uniqueIDs = [...new Set(matchIDs)]
// 3. Map the IDs to their respective strings
const matchedStrings = uniqueIDs.map(id => stringMap[id])
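The binarySearch and farmMatches helpers above aren't spelled out. Here's a minimal sketch of what they could look like, adapted from the earlier prefix search to work on the {suffix, ids} entries (the signatures are assumptions based on how they're called):
function binarySearch(query, suffixArray, start = 0, end = suffixArray.length - 1) {
  if (start > end) return -1; // No suffix has the query as a prefix
  const midpoint = start + ((end - start) >> 1);
  const { suffix } = suffixArray[midpoint];
  if (suffix.startsWith(query)) return midpoint;
  return query < suffix
    ? binarySearch(query, suffixArray, start, midpoint - 1)
    : binarySearch(query, suffixArray, midpoint + 1, end);
}
function farmMatches(query, matchIndex, suffixArray) {
  if (matchIndex === -1) return [];
  // Start with the IDs stored at the first match
  const ids = [...suffixArray[matchIndex].ids];
  // Walk down, then up, collecting IDs until the query stops being a prefix
  for (let i = matchIndex + 1; i < suffixArray.length; i++) {
    if (!suffixArray[i].suffix.startsWith(query)) break;
    ids.push(...suffixArray[i].ids);
  }
  for (let i = matchIndex - 1; i >= 0; i--) {
    if (!suffixArray[i].suffix.startsWith(query)) break;
    ids.push(...suffixArray[i].ids);
  }
  return ids;
}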
Time Complexity
Since we didn't inherently change our matching algorithm, and just tacked on the ID mapping, matching a query breaks down to:
- Finding the first match: O(m · log(n · k))
  - Very fast. Even 100 billion suffixes need at worst ~37 comparisons.
- Farming matches around the first match: O(z · m), where z is the number of matched suffixes.
- Removing duplicates: O(z)
- Mapping match IDs to strings: O(z)
  - The number of unique matches is, at worst, the number of matched suffixes.
This reduces to: O(m · (log(n · k) + z))
We can see that matching is very efficient, at worst scaling linearly with the number of suffixes that match our query since the binary search component is so fast.
2.3.2 Interactive Example
Here's the same example from before, now using suffix arrays.
Once again, try different queries in the 10k word (now 28k suffix) example and get a feel for how it scales in different cases, like no matches vs. several.
2.3.3 Insertion and Deletion
We won't go too deep into inserting and deleting strings here but they're pretty straightforward.
To insert a string, generate an ID and the string's suffixes. Binary search for each suffix in the suffix array. If you find it, push the ID to its ID array; if not, insert it into the suffix array at the position returned by the binary search.
To delete a string, query all its suffixes and remove its ID from each ID array; if an ID array becomes empty, remove its suffix entirely. Then remove the ID and string from the string map.
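Here's a rough sketch of both operations using the stringMap / sortedSuffixArray structures from the preprocessing example. The binarySearchWithPosition helper is a hypothetical variant of the binary search that also reports where a missing suffix should be inserted:
// Assume IDs were assigned sequentially during preprocessing
let nextId = Object.keys(stringMap).length;
function insertString(string) {
  const id = nextId++;
  stringMap[id] = string;
  for (let i = 0; i < string.length; i++) {
    const suffix = string.slice(i);
    const { index, found } = binarySearchWithPosition(suffix, sortedSuffixArray);
    if (found) sortedSuffixArray[index].ids.push(id);
    else sortedSuffixArray.splice(index, 0, { suffix, ids: [id] });
  }
}
function deleteString(id) {
  const string = stringMap[id];
  for (let i = 0; i < string.length; i++) {
    const suffix = string.slice(i);
    const { index, found } = binarySearchWithPosition(suffix, sortedSuffixArray);
    if (!found) continue;
    const entry = sortedSuffixArray[index];
    entry.ids = entry.ids.filter((storedId) => storedId !== id);
    // If no string uses this suffix anymore, remove it entirely
    if (!entry.ids.length) sortedSuffixArray.splice(index, 1);
  }
  delete stringMap[id];
}
// Exact-match binary search that returns the insertion index when the suffix is missing
function binarySearchWithPosition(suffix, suffixArray) {
  let start = 0;
  let end = suffixArray.length - 1;
  while (start <= end) {
    const mid = start + ((end - start) >> 1);
    if (suffixArray[mid].suffix === suffix) return { index: mid, found: true };
    if (suffix < suffixArray[mid].suffix) end = mid - 1;
    else start = mid + 1;
  }
  return { index: start, found: false };
}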
Depending on how the suffix array is implemented and the algorithm used, it could be more efficient or even necessary to rebuild it completely with each change.
2.3.4 Are Suffix Arrays a Viable Solution?
Yes! Matching a query scales really well with the size of the overall suffix array, and scales linearly with the number of matches.
We do have to preprocess the data, but even that is speedy using efficient algorithms.
But can we query our data and get results even faster, regardless of the size of the string set? We sure can (but only marginally faster).
3. Trie
The key data structure that makes this possible is a Trie (pronounced Try) which is a type of tree data structure where keys/edges between nodes represent the next character from the current position.
To query a trie, we start at the root node, and traverse to the next node by following the edge/key for the next character in our string.
When we reach the final node in our query's trace, leaf nodes, connected through an edge with a special character ('$' in our case), represent the ends of strings. Their value is either the string itself or its ID key in a separate map.
3.1 Building a Trie
Building basic tries is a straightforward process:
/**
* A basic trie built from an array of strings.
* Further strings and string arrays can be inserted to it.
*
* The autocomplete method matches strings with the query as a prefix.
*/
function Trie(strings) {
const root = Node();
insertStringArray(strings);
/**
* Trie node with a value property and children object
*
* Values are reserved for leaf nodes that denote
* the end of a string, and they store the string itself.
*
* Children keys represent the next relative character and point to
* the relevant nodes.
*/
function Node(value) {
const children = {}
return {value, children}
}
/**
* Traverse the trie from root along the string characters until
* the edge for the next character isn't found, or the string is
* fully traced.
*
* returns the final traced node, and the remaining string ('' if fully traced)
*/
function trace(string, node = root) {
if (!node) return
return trace(string.slice(1), node.children[string[0]] || null) ||
{node, remainingString: string}
}
/**
* Insert string into the trie by tracing the portion of the string already
* in the trie, then adding nodes and edges for each remaining character.
*
* After the string is fully inserted, add a final node on the '$' edge with
* the string as its value, marking the end of that path.
*/
function insertString(string){
let {node, remainingString} = trace(string);
for (let char of remainingString) {
const nextNode = Node();
node.children[char] = nextNode;
node = nextNode;
}
// Add a leaf node to represent the end of a string's path
node.children['$'] = Node(string)
}
// Iterate through the array and insert each string
function insertStringArray(stringArray) {
stringArray.forEach(insertString)
}
/**
* Matches all strings in the trie that contain the query string
* as their prefix.
*
* Returns an array of the matched strings (empty if none are found).
*/
function autocomplete(string) {
const {node, remainingString} = trace(string);
// If the full query isn't matched, return no matches
if (remainingString) return [];
const matches = [];
// Visit all children nodes and add leaf node values. Can be done in a few ways.
const stack = [node];
while (stack.length) {
const currentNode = stack.pop();
/* Only leaf nodes, which are end of strings, have values.
If we visit one, add that string to the matches */
if (currentNode.value) matches.push(currentNode.value)
// Add the current node's children to the stack
const childNodes = Object.values(currentNode.children)
stack.push(...childNodes)
}
return matches;
}
return { root, autocomplete, insertString, insertStringArray }
}
Trie (😉) different autocomplete queries, or make more tries.
Removing Strings
To remove a string from a trie, we determine the longest singular branch from the 'end of string' node towards the root, where each node has only one child. This is to avoid affecting any other string paths.
To do this, we trace the string normally from the root. At each character edge, we check if the next node only has 1 child:
- If it does, we mark that edge to be deleted.
- If there's already a marked ancestor edge, then don't change it.
- If the node has multiple children, unmark any marked ancestor edge and continue tracing.
Remove Method Code
function remove(string) {
let currentNode = root;
let parentToTrim; // Parent node to delete the edge from
let edgeToRemove;
for (let char of string) {
// Get the next node in the string path, making current node the parent
const nextNode = currentNode.children[char];
// We can only remove the next node if it has no other branches
if (Object.values(nextNode.children).length > 1) {
// If the next node has multiple children, we can't trim from above it
[parentToTrim, edgeToRemove] = [null, null];
} else {
/* If it's okay to remove the child, then only update the parent node
if it was previously reset */
parentToTrim ||= currentNode;
edgeToRemove ||= char;
}
currentNode = nextNode;
}
/* If we can't delete any node we visit during the trace then only
delete the 'end of string' node */
if (parentToTrim) {
delete parentToTrim.children[edgeToRemove];
} else {
delete currentNode.children['$'];
}
}
3.2 Interactive Example
Here's an example I made to visually illustrate tries and autocompleting with them.
The source code is slightly different at parts to account for visualizing the results (adding query statuses to the nodes). The example also uses a string map with IDs to reduce clutter.
3.3 Performance and Caveats
Performance
The best thing about using tries is that inserting, removing, and tracing a string of length k is O(k).
Building off of that, creating a trie from a set of n strings where the average string length is k is O(n · k).
Autocompleting a query of length m, assuming the path exists in the trie, involves:
Tracing the query:
- O(m)
Visiting each leaf node that descends from the trace result:
- The average leaf depth is the average string length k, so the upper bound complexity to traverse to each leaf is O(k − m), where m ≤ k.
- The number of leaf nodes to visit is the number of matches z.
- Time complexity is thus O(z · (k − m)), which at worst is O(z · k).
Assuming m ≤ k, autocompleting has an upper bound of O(m + z · k).
Caveats
There are two main caveats to using basic tries as our autocomplete solution. First and foremost, they can only do prefix autocomplete. Also, there are a lot of extra nodes that take up unnecessary storage space and traversal operations.
To address the prefix limitation, we can include suffixes in our model, and to reduce the node count, we'll compress all redundant node chains, where there's a straight chain from an inner node to a leaf with no inner branches.
4. Suffix Trees
Suffix trees are conceptually similar to tries, but they also include all of the suffixes of each string, and their branches are compressed with no redundant chains.
Suffix trees allow us to build highly scalable and relatively space efficient* autocomplete models.
Similar to suffix arrays, a suffix tree of multiple strings is a Generalized Suffix Tree, and also like suffix arrays, we won't keep repeating 'generalized'.
* Not the most space efficient. This is discussed in section 5
4.1 Building a Suffix Tree
A key aspect of suffix tree creation and manipulation is the branching and compression logic. This is slightly more involved than with basic tries, and it affects how we query the tree and insert into it.
Querying the Tree
With edges being compressed multi-character strings, we need to change our querying logic slightly. The goal is still to traverse the tree in linear time depending on the length of the query.
We can do that by checking the edges for varying prefixes of the query, iteratively adding another character and checking again. Once we match an edge, we repeat with the remaining query at the next node.
function query(string, node = root) {
let queryString = '';
for (let char of string) {
// Iteratively add characters to the query string
queryString += char;
// Each time, check if an edge matches that prefix
const nextNode = node.children[queryString];
// If an edge exists, recurse on its node with the remaining string
if (nextNode) {
const remainingString = string.slice(queryString.length);
return query(remainingString, nextNode);
}
}
/* Return both the last matched node and the remaining
unmatched string (blank if it was a fully match) */
return { node, remainingString: string };
}
For a query of length m, this approach does indeed perform at O(m), since we simply iterate over the query, performing object property retrievals each time.
Inserting a Suffix
Let's expand a bit on insertion with compressed branches. First, our primary goal is to:
A. Have a path of edges and nodes that contains the full suffix/string we're inserting.
B. Ensure that the path ends at a leaf node containing the parent string's ID.
We can do this by:
- Querying the suffix in our tree.
- If the trace is incomplete:
- Insert an edge and node for the remaining string at the last traced node.
- If an edge from the last node shares a prefix with the remaining string, split that edge on the common prefix and insert under it.
- If the path ends on an inner node:
- Extend a leaf node from the end of the path, with a terminating character as its edge.
- Store the parent string ID at the leaf node at the end of the path.
We won't include the terminating character '$' in our edges except for paths that end on inner nodes, such as with the suffix 'e' in the tree shown below. This is just to make the tree easier to look at and faster to grasp at a glance, without losing functionality.
Suffix Tree of 'tree'
Translating this process to code, we get:
function insert(string, id) {
// Trace the string/suffix
let {node, remainingString} = query(string)
/* If there's no path for the full string,
Add or split an edge at the last node in the trace */
if(remainingString) {
node = addEdge(node, remainingString)
}
// If the path ends at an inner node, add a terminating leaf
let leaf = Object.values(node.children).length
? (node.children['$'] ||= Node([]))
: node;
// Add the string ID to the leaf node
leaf.value ||= [];
leaf.value.push(id);
function addEdge(node, string) {
/* Add a new edge to the main node if we can't split an edge
at a common prefix */
let targetParent = node;
let newEdge = string;
// Iterate over the node's edges, looking for one to split.
for(let [edge, subtree] of Object.entries(node.children)){
// Check for the longest common prefix between the edge and string
const lcp = getLCP(edge,string)
// If a (non blank) LCP exists, split the edge at the lcp
if (lcp) {
// Add a new edge & node for the common prefix
const prefixNode = Node();
node.children[lcp] = prefixNode;
// Move the original subtree under the new prefix node
const remainingEdge = edge.slice(lcp.length); // Remove the lcp from the original edge
prefixNode.children[remainingEdge] = subtree;
delete node.children[edge]; // Delete the original edge
/* If the common prefix is the complete string, then the prefix node
is the end of the path, and we don't need to add a new node */
if (lcp === string) return prefixNode
/* Otherwise, designate the prefix node as the one to add to,
and remove the lcp from the edge to add */
targetParent = prefixNode;
newEdge = string.slice(lcp.length);
break // Stop checking edges
}
}
/* Add a new edge/node at the designated node
to complete the string path */
const endNode = Node();
targetParent.children[newEdge] = endNode;
return endNode;
}
}
getLCP()
function getLCP(...strings) {
let lcp = strings[0] || '';
// Traverse through strings and update the LCP each time
for(let string of strings.slice(1)) {
for (let [i, char] of [...lcp].entries() ) {
/* When the LCP and current string deviate stop matching
trim the LCP and move on to the next string */
if (char !== string[i]) {
lcp = lcp.slice(0, i)
break;
}
}
if(!lcp) break // Avoid unnecessary iterations when there's no LCP
}
return lcp
}
Autocompleting
Just like with tries, to autocomplete a query, we first trace it in the tree, then from the end of that trace, grab all the descendent leaf nodes.
Unlike with tries, with compressed branches there's a chance that our query's path will land in the middle of an edge, e.g. query: el, edge: ello. In that case, we consider the edge a match, since it fully contains what remains of our query as a prefix.
function autocomplete(string) {
// Try to fully match the string with full edges
const { node, remainingString } = query(string);
let searchRoot;
// If already fully matched, search from the end of the path
if (!remainingString) {
searchRoot = node;
} else {
/* If not fully matched, try to complete the match using part
of an edge from the last node */
for (let [edge, childNode] of Object.entries(node.children)) {
// Get the common prefix between the edge and remaining string
const lcp = getLCP(edge, remainingString);
if (!lcp) continue; // Move on if they don't share a prefix
/* If the prefix is the complete remaining string, then the
string is fully matched and we can search from that child */
if (lcp === remainingString) {
searchRoot = childNode;
}
break; // Stop checking edges
}
}
if (!searchRoot) return [];
return getAllMatches(searchRoot); // Search for leaf nodes
}
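getAllMatches isn't shown above. Here's a minimal sketch of it, assuming leaf values store arrays of string IDs (as in the insert example) and that a stringMap of ID:string pairs, like the one from the suffix array section, is available:
function getAllMatches(startNode) {
  const ids = [];
  const stack = [startNode];
  // Depth-first traversal of the subtree below the matched node
  while (stack.length) {
    const node = stack.pop();
    // Only leaf nodes hold values (arrays of string IDs)
    if (node.value) ids.push(...node.value);
    stack.push(...Object.values(node.children));
  }
  // Deduplicate, since a string can match through several of its suffixes
  return [...new Set(ids)].map((id) => stringMap[id]);
}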
These are the main notable differences between suffix trees and basic tries that we'll discuss here. Deletion is also different with compressed branches, but I'll leave that up to your ingenuity and research 😉.
4.3 Example
Here's an interactive suffix tree example, feel free to try different string sets and queries.
4.4 Performance
Inserting a string of length k into a suffix tree means adding k suffixes, each needing a traversal that scales linearly with k, hence O(k) each. Thus, adding one string has a time complexity of O(k²), and adding n strings is O(n · k²).
Querying a string of length m is O(m), which is similar to tries and is very performant.
Autocompleting a string involves querying that string and then searching for all descendent leaf nodes. This has a time complexity of O(m + z), where z is the number of matches.
Compressed branches also allow us to store and traverse fewer nodes during our search, saving memory and reducing the constant time factor in average scenarios.
Efficient Algorithms
The performance discussed above is for naïve suffix tree construction. Algorithms have been developed that build suffix trees in linear time, or O(n · k) for n strings of average length k. A notable and well-known example is Ukkonen's algorithm.
5. Suffix Trees vs. Suffix Arrays
Now, having a handle on two solid methods to implement an autocomplete model, which is more viable in real world scenarios? Ultimately, it all boils down to speed and size.
5.1 Speed
Looking at the operations we've discussed up to this point, we can try to draw out any perceivable differences between the two models.
For a model of n strings of average length k, queried with a string of length m, where z is the number of matches:

| Task | Suffix Tree | Suffix Array | Victor | Comment |
|---|---|---|---|---|
| Preprocess | O(n · k) | O(n · k) | Suffix Tree | Both in linear time with efficient algorithms for capped string lengths |
| Insert | O(k²) | O(k² · log(n · k)) | Suffix Tree | Depends on the implementation |
| Autocomplete | O(m + z) | O(m · (log(n · k) + z)) | Suffix Tree | Both ultimately realistically scale with z |
In practice, both models can be built in linear time with the total number of characters, and modified relatively easily.
For autocompletion, the difference in searching for the initial query, O(m) vs. O(m · log(n · k)), is marginal, since the logarithmic factor grows at a negligible rate with the total number of suffixes.
Thus using either suffix trees or suffix arrays is viable for a performant autocomplete model, with suffix trees giving a very slight edge in speed which would be hardly distinguishable in practical circumstances.
5.2 Storage
With the illustrative & visual implementations of suffix trees and arrays, the space complexity of both models is at worst the total length of all stored suffixes, O(n · k²).
In practice, the suffixes and prefixes aren't stored as strings at each edge/value; they're mapped to indices that denote the parent string and the location in that string, and those indices are stored instead. This significantly reduces storage complexity to a linear O(n · k).
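To make that concrete, here's a tiny sketch of index-based suffix array entries; the exact shape is an illustration, not a fixed format:
const strings = { 0: 'cake', 1: 'joke' };
const suffixArray = [
  // ...one entry per suffix, sorted by the text of the suffix it points to
  { id: 0, offset: 2 }, // 'ke' in 'cake'
  { id: 1, offset: 2 }, // 'ke' in 'joke'
  // ...
];
// Any suffix can be rebuilt on demand from its two indices
const suffixOf = ({ id, offset }) => strings[id].slice(offset);
suffixOf(suffixArray[0]); // 'ke'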
While both models now scale linearly, suffix trees tend to grow reasonably larger than suffix arrays in the worst case. This is because of their inner nodes.
We can deduce this by first considering the fact that both structures must at least store the indices for each of the n · k possible suffixes.
With suffix arrays, there's an array element for each suffix, so n · k elements. In suffix trees, each suffix terminates at a leaf node, so there are n · k leaves, but suffix trees can also have up to n · k − 1 inner nodes, making them grow at a worst case rate of roughly 2 · n · k nodes.
Thus, suffix arrays require much less space than suffix trees, since they only store references to the suffixes with no extraneous data.
6. Your Autocomplete Model
Ultimately, the approach you choose will depend on your needs coupled with the latest, most space and time efficient algorithms.
Suffix arrays in particular currently need a lot less space while still providing blazingly fast autocomplete on large data sets so they're a great choice that's easy to implement.
And, not to leave you empty handed, I made autokomplete, a fast and robust autocomplete module built around suffix arrays.
It uses DC3 to construct the array in linear time, and can match even faster than the performance discussed above using an extra optimization. Instead of walking up and down comparing suffixes around the first binary search match, it uses two modified binary searches to find the first and last matching suffixes, making it possible to just slice, map, and return the range in between.
This is slightly less efficient with fewer matches, but significantly faster with a large number of matches, since we can skip comparing each matching suffix with the query.
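Here's a hedged sketch of that range-based idea (not autokomplete's actual source): two binary searches locate the boundaries of the block of matching suffixes, and everything in between is returned without any further prefix checks.
function matchRange(query, suffixes) {
  // 'suffixes' is a sorted array of suffix strings
  const lowerBound = (target) => {
    let start = 0;
    let end = suffixes.length;
    while (start < end) {
      const mid = start + ((end - start) >> 1);
      if (suffixes[mid] < target) start = mid + 1;
      else end = mid;
    }
    return start;
  };
  const first = lowerBound(query);
  // '\uffff' acts as a sentinel larger than any character expected in the strings
  const last = lowerBound(query + '\uffff');
  return suffixes.slice(first, last); // Every suffix with `query` as a prefix
}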
Try out the module and feel free to contribute to it or recommend additional features. I have a couple in mind already.