Succinct Tries For Multiple Languages and Non-Latin Based Characters
Jeremy Jackson
Posted on May 8, 2024
Succinct Tries For Multiple Languages
If you're here, I'm assuming it's because you're already familiar with this blog post by Steve Hanov (or at least his succinct trie code) and you want to make it more flexible so that it can be used with more than just the latin A-Z character set.
Steve actually mentions how to do this in a reply on his blog post:
Dan:
The algorithm was written with characters from a-z in mind. To allow more characters, you only need to change the Trie.encode and FrozenTrie.getNodeByIndex functions. They represent a-z as number from 0..26 which fit into 6 bits.
(There is no need to touch the ORD and CHR functions. They are not related to the alphabet used).
He mentions the functions that need to be modified, but he doesn't say exactly what to modify, so let's clear that up here, yeah?
The code
Let's start by grabbing the code to look at. You can find the code at the following links:
- Directly from Steve's site: https://www.hanovsolutions.com/trie/Bits.js
- A copy just in case that happens to be down when you read this: https://gist.github.com/jeremyjackson89/668f5a66916618dce76d452f04192ca6
Modify the regex
Before diving into how it all works, let's start with making sure the characters you want to add can be inserted into the trie.
On line 689 you'll find the regex that's used to ensure that strings being inserted into the trie are latin based characters, A-Z.
var regex = /^[a-z]+$/;
for ( var i = 0; i < words.length; i++ ) {
// To save space, our encoding handles only the letters a-z. Ignore
// words that contain other characters.
var word = words[i].toLowerCase();
if ( word.match( /^[a-z]+$/ ) ) {
trie.insert( word );
}
}
You'll want to adjust that to allow other characters. For example, if you want to include Japanese hiragana, you'd modify it like so:
var regex = /^[a-zぁ-ゟ]+$/;
You'll also want to remove the "only letters a-z" comment (since that's no longer true) and update the .match
method to use the regex
variable:
var regex = /^[a-zぁ-ゟ]+$/;
for (var i = 0; i < words.length; i++) {
var word = words[i].toLowerCase();
if (word.match(regex)) {
trie.insert(word);
}
}
That gets us through the regex check and allows us to insert the words, but we still need to do some work so we can store them properly.
Understanding how the data is stored
As is, the library still works by assuming that the words being encoded only contain the letters A-Z. Because of that assumption, values are stored within the trie using 6 bits.
The first 5 bits are for values associated with the letters a-z. The character codes for the letters a-z (all lowercase) fit in the range of 0-25 (which fits into 5 bits) when using the character code for "a" as a base. You can see this being done in Trie.encode
:
var a = ("a").charCodeAt(0);
this.apply( function( node ) {
var value = node.letter.charCodeAt(0) - a;
if ( node.final ) {
value |= 0x20;
}
bits.write( value, 6 );
});
In the above code, the value of a
in 97
. Therefore, a - a = 0
. Now, "z".charCodeAt(0)
is 122
, so 122 - 97 = 25
.
The 6th and final bit is used as a flag to determine if that letter at that point in the trie marks the end of a word (this is explained well in Steve's blog post).
The Trie.encode
function is one of the functions Steve mentioned needing modification. What we want to do is expand the number of bits (bit-width) used in order to accommodate the character set of our target language.
Determining the bit-width needed
You'll need a min
and max
value based on the character codes.
For the min
you can either:
- determine a new
min
value from the character set - continue using the character code of
"a"
as themin
value and subtracting that from the value of your additional characters
I'd recommend determining a new min
since it'll reduce the overall file size since you'll need less bits to store data.
For example, "ぁ".charCodeAt(0)
is 12353
and "ゟ".charCodeAt(0)
is 12447
. If you used "a" here your values would be pretty large. 12447 - 97 = 12350
, so you'd need 14 bits to accommodate that number and another 1 bit as a flag, so you'd need 15 total. However, if you used "ぁ" as a min
your range becomes much smaller. 12447 - 12353 = 94
. For 94, you'd need at least 7 bits (and then the bit for the flag giving you a total of 8).
Now, you could hardcode these values if your use case allows it, but if you need something more flexible you'll need a way to get the min
and max
values.
Your options to do this are:
- pass this data in when generating your trie if you happen to know it
- extrapolate that data from the given word list you're passing in (you'll need to check your current
min
andmax
values against character code values as you iterate over words)
The first approach is going to be the fastest.
Either way, once you know those values you can determine the bit-width like so:
var highestCharCode = maxCharacterValue - minCharacterValue;
var bitWidth = highestCharCode.toString(2).length + 1;
Before we update Trie.encode
you'll also want to know the integer value of bit-width - 1
. To determine that in JavaScript you can do 2**(bitWidth - 1)
. For example, our bit-width is currently 8
, so 2**(8-1)
= 128
. You subtract the 1 here because in Trie.encode
you want to know the max value of space needed for your character range and that last bit that you're removing is the "end of word" flag.
Updating Trie.encode
First, let's update the call to trie.encode
(just after where we added words to the trie):
var highestCharCode = maxCharacterValue - minCharacterValue;
var bitWidth = highestCharCode.toString(2).length + 1;
var finalBitValue = 2**(bitWidth - 1);
// Encode the trie.
var trieData = trie.encode(bitWidth, finalBitValue, minCharacterValue);
And then we can update the Trie.encode
function to look like this:
encode: function(bitWidth, finalBitValue, minCharacterValue)
{
// Write the unary encoding of the tree in level order.
var bits = new BitWriter();
bits.write( 0x02, 2 );
this.apply( function( node ) {
for( var i = 0; i < node.children.length; i++ ) {
bits.write( 1, 1 );
}
bits.write( 0, 1 );
});
// Write the data for each node, using bitWidth bits for node. 1 bit stores
// the "final" indicator. The other N bits store one of the letters
// of the alphabet.
var a = ("a").charCodeAt(0);
this.apply( function( node ) {
var value = node.letter.charCodeAt(0) - minCharacterValue;
if ( node.final ) {
value |= finalBitValue;
}
bits.write( value, bitWidth );
});
return bits.getData();
}
Now that the data is storing correctly in the trie, lets update the output JSON so that once we have the dictionary we'll be able to load and read from it using the right data.
Updating the output JSON
We'll add two new properties to the JSON output: bitWidth
and minCharacterValue
.
var output;
output = '{\n "nodeCount": ' + trie.getNodeCount() + ",\n";
output += ' "bitWidth": ' + bitWidth + ',\n';
output += ' "minCharacterValue": ' + minCharacterValue + ',\n';
output += ' "directory": "' + directory.getData() + '",\n';
output += ' "trie": "' + trieData + '"\n';
output += "}\n";
These will be used to find the correct positions to look at and what values are associated with what characters without having to hardcode any values.
Updating FrozenTrie.getNodeByIndex
This is the last function that Steve mentions updating.
The provided lookup
function in the code creates a new instance of FrozenTrie
like so:
var ftrie = new FrozenTrie( json.trie, json.directory, json.nodeCount);
What we want to do is also pass in our new values:
var ftrie = new FrozenTrie(json.trie, json.directory, json.nodeCount, json.bitWidth, json.minCharacterValue);
Let's update FrozenTrie
, FrozenTrie.init
, and FrozenTrie.encode
to use our new JSON values:
function FrozenTrie(data, directoryData, nodeCount, bitWidth, minCharacterValue) {
this.init(data, directoryData, nodeCount, bitWidth, minCharacterValue);
}
FrozenTrie.prototype = {
init: function(data, directoryData, nodeCount, bitWidth, minCharacterValue)
{
this.data = new BitString( data );
this.directory = new RankDirectory( directoryData, data,
nodeCount * 2 + 1, L1, L2 );
// The position of the first bit of the data in 0th node. In non-root
// nodes, this would contain bitWidth-bit letters.
this.letterStart = nodeCount * 2 + 1;
this.bitWidth = bitWidth;
this.minCharacterValue = minCharacterValue;
},
/**
Retrieve the FrozenTrieNode of the trie, given its index in level-order.
This is a private function that you don't have to use.
*/
getNodeByIndex: function( index )
{
// retrieve the n-bit letter.
var final = this.data.get( this.letterStart + index * this.bitWidth, 1 ) === 1;
var letter = String.fromCharCode(
this.data.get( this.letterStart + index * this.bitWidth + 1, this.bitWidth - 1 ) +
minCharacterValue);
var firstChild = this.directory.select( 0, index+1 ) - index;
// Since the nodes are in level order, this nodes children must go up
// until the next node's children start.
var childOfNextNode = this.directory.select( 0, index + 2 ) - index - 1;
return new FrozenTrieNode( this, index, letter, final, firstChild,
childOfNextNode - firstChild );
},
Conclusion
These changes will allow you to use this succinct trie structure without having to hardcode values. I wrote this because when I was trying to do this, I didn't see any information about what to do outside of latin based character usage until I saw Steve's response on his blog.
Hopefully you've found this helpful, easy to understand, and I've helped prevent some headaches.
Posted on May 8, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.