Go and Trie!

krasun

Dmytro Krasun

Posted on March 7, 2021

Go and Trie!

A trie is an effective data structure that could be an excellent addition to your problem-solving toolbox. Let's see which areas the trie usage is more applicable than a standard set or map implementation. And what other benefits you can get from using it.

Aperitif

I like the name trie more than prefix tree or digital tree, although prefix tree is the most accurate definition. The name trie comes from retrieval but is pronounced as "try," not "tree." The idea was to distinguish the trie from the tree.

Try to guess what a trie is. I built trie from 4 words: apple, banana, bastard, ban:

trie example

I highly recommend to use a visualization tool from the University of San Francisco to play with a trie.

Trie stores keys that are represented as a sequence (usually a string) compactly. Then you can use it as a set or a hash map to query keys. In addition to that, you get an opportunity to query keys and values by the key prefix.

Naive implementation

I do not focus on the optimal algorithm. The goal of this article is to get you familiar with trie and when you can use it.

Let's design a desired API:

// Trie represents a Trie data structure.
//
// https://en.wikipedia.org/wiki/Trie
type Trie interface {
    // Insert inserts a word into the trie and returns true
    // if the word already exists.
    Insert(word string) bool
    // Contains returns true if an exact match of the word is found, otherwise false.
    Contains(word string) bool
    // SearchByPrefix finds and returns words by prefix.
    SearchByPrefix(prefix string) []string
    // StartsWith returns true if there is any word in the trie that starts
    // with the given prefix.
    StartsWith(prefix string) bool
    // Size returns the number of keys in the tree.
    Size() int
}
Enter fullscreen mode Exit fullscreen mode

It will be enough to implement these two functions: Insert and Contains, to grasp the idea. And it will be a reasonable basis for the following improvements.

I start by implementing generic rune trie. In Go rune is a superset of ASCII and represents Unicode code points. Think of it as char type in Java or C, but for Unicode.

So, there is Insert function without any optimizations, just the first naive implementation:

// runeTrie is a rune-wise trie implementation.
type runeTrie struct {
    root *runeNode
    size int
}

type runeNode struct {
    children map[rune]*runeNode
    last     bool
}

// NewRuneTrie creates a rune-wise new trie.
func NewRuneTrie() Trie {
    return &runeTrie{root: &runeNode{make(map[rune]*runeNode), false}}
}

// Insert inserts a word into the trie and returns true
// if the word already exists.
func (t *runeTrie) Insert(word string) bool {
    exists := true
    current := t.root
    for _, letter := range word {
        n, ok := current.children[letter]
        if !ok {
            exists = false

            n = &runeNode{make(map[rune]*runeNode), false}
            current.children[letter] = n
        }

        current = n
    }
    current.last = true

    if !exists {
        t.size++
    }

    return exists
}
Enter fullscreen mode Exit fullscreen mode

The algorithm is simple:

  1. Set current node value to the root node.
  2. For each letter of the word:
    1. Check if the current node contains a letter and if it does not have it, create a new node for the letter and attach it to the current node.
    2. But if it includes the letter, get the node under this letter and set it as the current node. Return to 2.
  3. Mark the last processed node as the end of the word.

Contains function is much simpler and seems straightforward:

// Contains returns true if an exact match of the word is found, otherwise false.
func (t *runeTrie) Contains(word string) bool {
    n, _ := t.nodeByPrefix(word)

    return n != nil && n.last
}

func (t *runeTrie) nodeByPrefix(prefix string) (*runeNode, rune) {
    current := t.root
    var r rune
    for _, letter := range prefix {
        n, ok := current.children[letter]
        if !ok {
            return nil, 0
        }

        current = n
        r = letter
    }

    return current, r
}
Enter fullscreen mode Exit fullscreen mode

We traverse a tree by letters until the last node.

Let's benchmark this implementation. Go has tools to write simple benchmark tests:

func BenchmarkRuneTrieInsert(b *testing.B) {
    var r bool
    for i := 0; i < b.N; i++ {
        t := NewRuneTrie()
        for _, w := range words {
            r = t.Insert(w)
        }
    }

    benchmarkResult = r
}
Enter fullscreen mode Exit fullscreen mode

And then we run the benchmark:

$ go test -bench=. -benchmem -benchtime=1000x
goos: darwin
goarch: amd64
op         0 allocs/op
BenchmarkRuneTrieInsert-8               1000          7196 ns/op        6984 B/op        129 allocs/op
BenchmarkRuneTrieContains-8             1000           517 ns/op           0 B/op          0 allocs/op
BenchmarkWordHashSetInsert-8            1000          1406 ns/op        1100 B/op          4 allocs/op
BenchmarkWordHashSetContains-8          1000           178 ns/op           0 B/op          0 allocs/op
PASS
Enter fullscreen mode Exit fullscreen mode

Go bench tool does not count stack allocations, so I see that Contains function does not contain any heap allocations, which is excellent.

Trie structure enables us to add one more function that is not feasible for a regular hash map - SearchByPrefix:

// Finds and returns words by prefix.
func (t *runeTrie) SearchByPrefix(prefix string) []string {
    node, r := t.nodeByPrefix(prefix)

    return search(node, r, []rune(prefix[:len(prefix)-1]))
}

func (t *runeTrie) nodeByPrefix(prefix string) (*runeNode, rune) {
    current := t.root
    var r rune
    for _, letter := range prefix {
        n, ok := current.children[letter]
        if !ok {
            return nil, 0
        }

        current = n
        r = letter
    }

    return current, r
}

func search(currentNode *runeNode, currentRune rune, prefix []rune) []string {
    words := []string{}
    if currentNode == nil {
        return words
    }

    newPrefix := append(prefix, currentRune)
    if currentNode.last {
        words = append(words, string(newPrefix))
    }

    for letter, node := range currentNode.children {
        newWords := search(node, letter, newPrefix)
        words = append(words, newWords...)
    }

    return words
}
Enter fullscreen mode Exit fullscreen mode

The function is not optimized, but it shows that this is possible and not so complex. It calls the function defined earlier nodeByPrefix.

Applications

Autocomplete

Since the trie allows to search words by prefix, it is a good candidate for autocompletion algorithms.

Routing

fasthttp library for Go uses trie as an internal implementation for routing. It is not only the fastest (probably) implementation of the routing, but it also does not produce any garbage.

You can dive deeper if you wish.

Other options

If you need:

  1. A set in which you might need to perform a lookup by the prefix.
  2. Compactly store a lot of similar data.
  3. You build the structure once and query it after.

Trie might be a good candidate for you.

For example, you build a set of domains, including subdomains in memory at the application startup, and then only query it. You might consider a trie for this task.

But anyway, always check your performance assumptions with benchmarks.

If you do not need to lookup by prefix and your data structure is changing a lot, a hash map or a hash set is a better alternative.

Known Trie implementations

I found two exciting libraries with an exemplary implementation of trie:

  1. https://github.com/siongui/go-succinct-data-structure-trie
  2. https://github.com/celzero/gotrie

You can also use mine, but it is not optimized and created for learning purposes

💖 💪 🙅 🚩
krasun
Dmytro Krasun

Posted on March 7, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Go and Trie!
algorithms Go and Trie!

March 7, 2021