Succinct Tries For Multiple Languages and Non-Latin Based Characters

jeremyjackson89

Jeremy Jackson

Posted on May 8, 2024

Succinct Tries For Multiple Languages and Non-Latin Based Characters

Succinct Tries For Multiple Languages

If you're here, I'm assuming it's because you're already familiar with this blog post by Steve Hanov (or at least his succinct trie code) and you want to make it more flexible so that it can be used with more than just the latin A-Z character set.

Steve actually mentions how to do this in a reply on his blog post:

Dan:
The algorithm was written with characters from a-z in mind. To allow more characters, you only need to change the Trie.encode and FrozenTrie.getNodeByIndex functions. They represent a-z as number from 0..26 which fit into 6 bits.
(There is no need to touch the ORD and CHR functions. They are not related to the alphabet used).

He mentions the functions that need to be modified, but he doesn't say exactly what to modify, so let's clear that up here, yeah?

The code

Let's start by grabbing the code to look at. You can find the code at the following links:

Modify the regex

Before diving into how it all works, let's start with making sure the characters you want to add can be inserted into the trie.

On line 689 you'll find the regex that's used to ensure that strings being inserted into the trie are latin based characters, A-Z.

var regex = /^[a-z]+$/;
for ( var i = 0; i < words.length; i++ ) {
    // To save space, our encoding handles only the letters a-z. Ignore
    // words that contain other characters.
    var word = words[i].toLowerCase();
    if ( word.match( /^[a-z]+$/ ) ) {
        trie.insert( word );
    }

}
Enter fullscreen mode Exit fullscreen mode

You'll want to adjust that to allow other characters. For example, if you want to include Japanese hiragana, you'd modify it like so:

var regex = /^[a-zぁ-ゟ]+$/;
Enter fullscreen mode Exit fullscreen mode

You'll also want to remove the "only letters a-z" comment (since that's no longer true) and update the .match method to use the regex variable:

var regex = /^[a-zぁ-ゟ]+$/;
for (var i = 0; i < words.length; i++) {
    var word = words[i].toLowerCase();
    if (word.match(regex)) {
        trie.insert(word);
    }
}
Enter fullscreen mode Exit fullscreen mode

That gets us through the regex check and allows us to insert the words, but we still need to do some work so we can store them properly.

Understanding how the data is stored

As is, the library still works by assuming that the words being encoded only contain the letters A-Z. Because of that assumption, values are stored within the trie using 6 bits.

The first 5 bits are for values associated with the letters a-z. The character codes for the letters a-z (all lowercase) fit in the range of 0-25 (which fits into 5 bits) when using the character code for "a" as a base. You can see this being done in Trie.encode:

var a = ("a").charCodeAt(0);
this.apply( function( node ) {
    var value = node.letter.charCodeAt(0) - a;
    if ( node.final ) {
        value |= 0x20;
    }

    bits.write( value, 6 );
});
Enter fullscreen mode Exit fullscreen mode

In the above code, the value of a in 97. Therefore, a - a = 0. Now, "z".charCodeAt(0) is 122, so 122 - 97 = 25.

The 6th and final bit is used as a flag to determine if that letter at that point in the trie marks the end of a word (this is explained well in Steve's blog post).

The Trie.encode function is one of the functions Steve mentioned needing modification. What we want to do is expand the number of bits (bit-width) used in order to accommodate the character set of our target language.

Determining the bit-width needed

You'll need a min and max value based on the character codes.

For the min you can either:

  • determine a new min value from the character set
  • continue using the character code of "a" as the min value and subtracting that from the value of your additional characters

I'd recommend determining a new min since it'll reduce the overall file size since you'll need less bits to store data.

For example, "ぁ".charCodeAt(0) is 12353 and "ゟ".charCodeAt(0) is 12447. If you used "a" here your values would be pretty large. 12447 - 97 = 12350, so you'd need 14 bits to accommodate that number and another 1 bit as a flag, so you'd need 15 total. However, if you used "ぁ" as a min your range becomes much smaller. 12447 - 12353 = 94. For 94, you'd need at least 7 bits (and then the bit for the flag giving you a total of 8).

Now, you could hardcode these values if your use case allows it, but if you need something more flexible you'll need a way to get the min and max values.

Your options to do this are:

  • pass this data in when generating your trie if you happen to know it
  • extrapolate that data from the given word list you're passing in (you'll need to check your current min and max values against character code values as you iterate over words)

The first approach is going to be the fastest.

Either way, once you know those values you can determine the bit-width like so:

var highestCharCode = maxCharacterValue - minCharacterValue;
var bitWidth = highestCharCode.toString(2).length + 1;
Enter fullscreen mode Exit fullscreen mode

Before we update Trie.encode you'll also want to know the integer value of bit-width - 1. To determine that in JavaScript you can do 2**(bitWidth - 1). For example, our bit-width is currently 8, so 2**(8-1) = 128. You subtract the 1 here because in Trie.encode you want to know the max value of space needed for your character range and that last bit that you're removing is the "end of word" flag.

Updating Trie.encode

First, let's update the call to trie.encode (just after where we added words to the trie):

var highestCharCode = maxCharacterValue - minCharacterValue;
var bitWidth = highestCharCode.toString(2).length + 1;
var finalBitValue = 2**(bitWidth - 1);

// Encode the trie.
var trieData = trie.encode(bitWidth, finalBitValue, minCharacterValue);
Enter fullscreen mode Exit fullscreen mode

And then we can update the Trie.encode function to look like this:

encode: function(bitWidth, finalBitValue, minCharacterValue)
{
    // Write the unary encoding of the tree in level order.
    var bits = new BitWriter();
    bits.write( 0x02, 2 );
    this.apply( function( node ) {
        for( var i = 0; i < node.children.length; i++ ) {
            bits.write( 1, 1 );
        }
        bits.write( 0, 1 );
    });

    // Write the data for each node, using bitWidth bits for node. 1 bit stores
    // the "final" indicator. The other N bits store one of the letters
    // of the alphabet.
    var a = ("a").charCodeAt(0);
    this.apply( function( node ) {
        var value = node.letter.charCodeAt(0) - minCharacterValue;
        if ( node.final ) {
            value |= finalBitValue;
        }

        bits.write( value, bitWidth );
    });

    return bits.getData();
}
Enter fullscreen mode Exit fullscreen mode

Now that the data is storing correctly in the trie, lets update the output JSON so that once we have the dictionary we'll be able to load and read from it using the right data.

Updating the output JSON

We'll add two new properties to the JSON output: bitWidth and minCharacterValue.

var output;

    output = '{\n    "nodeCount": ' + trie.getNodeCount() + ",\n";

    output += '    "bitWidth": ' + bitWidth + ',\n';
    output += '    "minCharacterValue": ' + minCharacterValue + ',\n';
    output += '    "directory": "' + directory.getData() + '",\n';

    output += '    "trie": "' + trieData + '"\n';
    output += "}\n";
Enter fullscreen mode Exit fullscreen mode

These will be used to find the correct positions to look at and what values are associated with what characters without having to hardcode any values.

Updating FrozenTrie.getNodeByIndex

This is the last function that Steve mentions updating.

The provided lookup function in the code creates a new instance of FrozenTrie like so:

var ftrie = new FrozenTrie( json.trie, json.directory, json.nodeCount);
Enter fullscreen mode Exit fullscreen mode

What we want to do is also pass in our new values:

var ftrie = new FrozenTrie(json.trie, json.directory, json.nodeCount, json.bitWidth, json.minCharacterValue);
Enter fullscreen mode Exit fullscreen mode

Let's update FrozenTrie, FrozenTrie.init, and FrozenTrie.encode to use our new JSON values:

function FrozenTrie(data, directoryData, nodeCount, bitWidth, minCharacterValue) {
    this.init(data, directoryData, nodeCount, bitWidth, minCharacterValue);
}

FrozenTrie.prototype = {
    init: function(data, directoryData, nodeCount, bitWidth, minCharacterValue)
    {
        this.data = new BitString( data );
        this.directory = new RankDirectory( directoryData, data, 
                nodeCount * 2 + 1, L1, L2 );

        // The position of the first bit of the data in 0th node. In non-root
        // nodes, this would contain bitWidth-bit letters.
        this.letterStart = nodeCount * 2 + 1;

        this.bitWidth = bitWidth;
        this.minCharacterValue = minCharacterValue;
    },

    /**
       Retrieve the FrozenTrieNode of the trie, given its index in level-order.
       This is a private function that you don't have to use.
      */
    getNodeByIndex: function( index )
    {
        // retrieve the n-bit letter.
        var final = this.data.get( this.letterStart + index * this.bitWidth, 1 ) === 1;
        var letter = String.fromCharCode(
                this.data.get( this.letterStart + index * this.bitWidth + 1, this.bitWidth - 1 ) + 
                minCharacterValue);
        var firstChild = this.directory.select( 0, index+1 ) - index;

        // Since the nodes are in level order, this nodes children must go up
        // until the next node's children start.
        var childOfNextNode = this.directory.select( 0, index + 2 ) - index - 1;

        return new FrozenTrieNode( this, index, letter, final, firstChild,
                childOfNextNode - firstChild );
    },
Enter fullscreen mode Exit fullscreen mode

Conclusion

These changes will allow you to use this succinct trie structure without having to hardcode values. I wrote this because when I was trying to do this, I didn't see any information about what to do outside of latin based character usage until I saw Steve's response on his blog.

Hopefully you've found this helpful, easy to understand, and I've helped prevent some headaches.

💖 💪 🙅 🚩
jeremyjackson89
Jeremy Jackson

Posted on May 8, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related