Dictionary Compression
Kevin Cox
Posted on March 2, 2022
A certain 5-letter word game has gotten really popular recently. A peek at the source shows an interesting choice for implementing the dictionary of allowed guesses and the answer list.
- The list of secret words is just sitting in the code, and the word of the day is picked based on the user’s local date.
- The words are a simple list in the JS.
- The answers are just another list.
The first point is clever. It means that they don’t need to update the backend every day (or have logic on the backend) and it automatically switches at midnight for each user, not some random time of day based on the server’s midnight.
The other two surprised me. I would have expected at least some simple domain-specific compression. Also, I would have expected the answers list to just be pointers into the dictionary, rather than a separate list (which requires the validation code to check both lists every time you submit a word).
I thought to myself how I would implement this, and from there started wondering about how to optimize it. A post about compressing the dictionary to fit on a Game Boy then added fuel to the fire, and eventually I cracked. I spent a fun day trying different things, and here are my findings. All the code backing this study is open source and can be viewed here.
Note: For this analysis I am using this archive of the original Wordle. The dictionary and answers have changed over time. I don’t expect any changes to make much difference to this analysis. For all general-purpose compression I am using zstd since it is a very high quality compressor.
Requirements
My requirements are pretty simple. I’ll assume a browser target, and I’m going to see how best to get the data there. I’m going to assume I can send binary data, and I’ll also approximate decoder size to ensure that I am not just moving data into the decoder. However, in practice, this wasn’t an issue for any of my solutions.
Measuring code size is very difficult due to WASM overhead and Rust error-handling overhead. To normalize for this I took the code size of the smallest implementation (the bitmap, at 1347 B) and the size of the same implementation with no possibility of panic (234 B), and I subtract the difference. This means that the code sizes seen in this post will be 1113 B less than the actual compressed size, which should roughly account for the overhead of the Rust panic infrastructure. I think this makes sense because unless you are implementing just this one function in WASM you will have to pay this cost anyway, and it isn’t storing any domain-specific data. I also didn’t feel like implementing each one with unsafe to avoid the chance of panic due to slice bounds checking. This is not very scientific and mostly makes sure that no implementation results in a huge code size.
I also want to ensure that the encoding is feasible for interactive use. I’ll target 50 ms to leave lots of room in the 100 ms threshold for other work. In practice this isn’t an issue for any of my solutions, by a couple orders of magnitude. All times in this article were taken by measuring the average time to check all possible valid or invalid words in a random order on an AMD 5950X. This is not particularly scientific because the branch predictor will quickly learn to always say yes or no, and the timings were only run once, but since I am so far under my time requirement it doesn’t seem important.
Baseline
First, let’s see how the simplest options work. The game accepts 12 972 words in total: 2315 possible answers and a “dictionary” of 10 657 other guessable words. In the game these are JSON encoded as two separate lists. Let’s see how big they are. For “both” I took the best-case scenario of simply concatenating the two lists, with not even a newline between them.
Data | Uncompressed | zstd default (3) | zstd max (19)
---|---|---|---
answers | 18.088 KiB | 7.604 KiB | 6.266 KiB
dictionary | 83.259 KiB | 31.739 KiB | 18.070 KiB
both | 101.346 KiB | 39.985 KiB | 24.379 KiB
Let’s be honest. Less than 25 KiB for your app data is not much these days. But let’s see how much better we can do.
Single List
First, let’s try putting the words into a single list. This should allow for better compression because all the words are in sorted order, not the randomized order of the answers. (Notice that the answers only compressed down to about ⅓ of their size whereas the full dictionary compressed down to almost ⅕. I assume that a lot of this is the unpredictable ordering.) However, at the same time, I’m also dropping the JSON encoding and just concatenating all the 5-letter words. This saves three characters (",") on every entry, which is a ⅜ savings (although these characters likely compress very well).
As for lookup, I’m doing a binary search. It isn’t necessary to hit my performance target but significantly speeds up my exhaustive testing. Plus the added code size is tiny.
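A sketch of what that lookup could look like (not the exact code from the repository; it assumes the data is just the sorted 5-letter words concatenated):

// Sketch: membership check against the concatenated, sorted word list.
fn contains(dictionary: &[u8], word: &[u8; 5]) -> bool {
    debug_assert!(dictionary.len() % 5 == 0);
    let (mut lo, mut hi) = (0, dictionary.len() / 5);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let entry = &dictionary[mid * 5..mid * 5 + 5];
        match entry.cmp(word.as_slice()) {
            std::cmp::Ordering::Equal => return true,
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
        }
    }
    false
}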
Results
- Accepted in 126 ns on average.
- Rejected in 128 ns on average.
- Size: 63.340 KiB
- zstd default: 30.570 KiB
- zstd max: 23.443 KiB
- Code: 319 B
While our size dropped significantly from removing the JSON, the compression became less effective and we ended up saving less than 2%. If you add 2 bytes per answer to regain the answer list you basically break even.
Note: An interesting optimization would be to restrict answers selection so that instead of 2 bytes to select an arbitrary answer you only use 1 byte (or even less). It should be fairly easy to devise a scheme that humans couldn’t easily use to bias their choice even though it would make some answer orders impossible.
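One hypothetical shape such a scheme could take (purely illustrative, not something the game or my code does): derive a 256-word window of the dictionary from the day number and store only a 1-byte offset into that window per answer.

// Hypothetical 1-byte-per-answer selection. `offsets` holds one byte per day,
// and the window start is derived from the day number, so a human choosing the
// offsets can't easily steer the answer toward a particular word.
fn answer_for_day<'a>(dictionary: &'a [[u8; 5]], offsets: &[u8], day: usize) -> &'a [u8; 5] {
    // An arbitrary multiplicative stride scatters the windows across the list.
    let window_start = day.wrapping_mul(7919) % dictionary.len();
    &dictionary[(window_start + offsets[day] as usize) % dictionary.len()]
}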
More Efficient Encoding
The first obvious thing is that all the words only use letters in the a-z range. We don’t need to spend a full byte per character; we only need log_2(26) ≈ 4.7 bits per character, or 23.5 bits per word. I’ll ignore the half a bit and use 24 bits, or 3 bytes per word.
The logic is quite simple. Instead of storing the word itself you store the index of the word in the list of all possible words. For example aaaaa is 0 and zzzzz is 26^5 - 1. The code is below:
fn index(word: [u8; 5]) -> u32 {
    // Interpret the word as a 5-digit base-26 number.
    let mut i: u32 = 0;
    for c in word {
        debug_assert!((b'a'..=b'z').contains(&c));
        i *= 26;
        i += (c - b'a') as u32;
    }
    i
}
Our dictionary is now a sorted list of 3-byte word indexes.
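Packing is then just a matter of writing the low 3 bytes of each index (a sketch, assuming big-endian storage so that byte order matches numeric order and the list stays sorted):

// Sketch: encode the sorted word list as 3-byte big-endian word indexes.
fn encode(words: &[[u8; 5]]) -> Vec<u8> {
    let mut out = Vec::with_capacity(words.len() * 3);
    for &word in words {
        // index() is the base-26 conversion from above; the result always
        // fits in 24 bits because 26^5 - 1 = 11 881 375.
        out.extend_from_slice(&index(word).to_be_bytes()[1..]);
    }
    out
}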
Results
- Accepted in 120 ns on average.
- Rejected in 128 ns on average.
- Size: 38.004 KiB
- zstd default: 38.013 KiB
- zstd max: 37.521 KiB
- Code: 440 B
We get the ⅗ size we were expecting, however zstd completely fails to find any patterns! The compressed size is actually larger than before, and at the default setting zstd doesn’t compress it at all. This is surprising because there is typically repetition 3 bytes apart at the start of every word, but I guess zstd doesn’t notice that. gzip also struggles, but zopfli does manage to crunch it down to 34.314 KiB. That is still significantly larger than letting the compressor see the ASCII directly.
More Efficient Encoding with Index
Instead of storing each word in 3 bytes we can use the first 2 characters as an index and compress the remaining 3 characters into 2 bytes. This saves us 1 byte per word but takes up space for the index.
Luckily the most common 2-letter prefix, “co”, only has 220 words, which means we only need one byte per prefix in our index, or 26^2 = 676 bytes total.
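Roughly, a lookup adds up the per-prefix counts to find where the prefix’s suffixes start, then scans that small range (a sketch of the idea, not the exact repository code):

// Sketch: `counts` holds one byte per 2-letter prefix (26 * 26 = 676 entries),
// `suffixes` holds one 2-byte big-endian value per word for the last 3 letters.
fn contains(counts: &[u8; 676], suffixes: &[u8], word: [u8; 5]) -> bool {
    let prefix = (word[0] - b'a') as usize * 26 + (word[1] - b'a') as usize;
    let suffix =
        ((word[2] - b'a') as u16 * 26 + (word[3] - b'a') as u16) * 26 + (word[4] - b'a') as u16;

    // Sum the counts of all earlier prefixes to find where this prefix's words start.
    let start: usize = counts[..prefix].iter().map(|&c| c as usize).sum();
    let len = counts[prefix] as usize;

    suffixes[start * 2..(start + len) * 2]
        .chunks_exact(2)
        .any(|s| u16::from_be_bytes([s[0], s[1]]) == suffix)
}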
Results
- Accepted in 450 ns on average.
- Rejected in 226 ns on average.
- Size: 25.996 KiB
- zstd default: 20.479 KiB
- zstd max: 17.721 KiB
- Code: 671 B
This is much smaller than before, and the compression seems to find more patterns again, but it is significantly slower. There are two main reasons:
- We sum up the values in the index. This is good for space but bad for speed. This could be fixed with a pre-processing pass to resolve the offsets before doing the lookups.
- I didn’t bother implementing a binary search for the values. I didn’t think it would help for a max of 220 entries, but maybe it would.
Bitmap
What if instead of storing the indexes we stored one bit for every possible word: 1 if it is an allowed word and 0 otherwise? This requires 26^5 bits (about 1.4 MiB) and should give very fast lookups.
We can use the same logic for converting the word into an index, then just check that bit in the table.
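The check itself is tiny (a minimal sketch; the bit order within each byte is an arbitrary choice here):

// Sketch: test a word against the 26^5-bit bitmap, one bit per possible word.
fn contains(bitmap: &[u8], word: [u8; 5]) -> bool {
    let i = index(word) as usize;
    bitmap[i / 8] & (1u8 << (i % 8)) != 0
}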
Results
- Accepted in 8 ns on average.
- Rejected in 7 ns on average.
- Size: 1450.364 KiB
- zstd default: 23.408 KiB
- zstd max: 20.506 KiB
- Code: 234 B
This method is incredibly fast, which makes it a great option if you were hosting the data on a server. The general-purpose compression also does a great job, making this option quite small on the wire.
Delta-encoding
Another option is that instead of storing the words, we just store the change from the previous word. This takes advantage of the fact that in the sorted list the first character is almost always the same as the previous word’s, the second character is usually the same, and so on. The way I decided to do this was by storing the difference in indexes, which works well because we can enumerate the possible words. Another option would be storing a count of how many characters differ from the previous word, followed by the replacement characters.
Unfortunately, the distance between words varies wildly: from adjacent words like “charr”, “chars” and “chart” to distant words like “xerus” and “xoana”, which are 164 068 possible words apart! That means we would need 17.4 bits to store the largest distance. However, since most gaps are much smaller it makes the most sense to use a variable-length encoding.
In this case, I use a simple varint encoding where a set top bit means that there are more bytes to add on. In this scheme, 3 bytes give us 21 usable bits, which is large enough for our largest gap.
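A lookup then walks the varints, accumulating deltas until it reaches or passes the target index (a sketch of the scheme described above; the exact bit layout in the repository may differ, and it assumes the first entry stores the first word’s absolute index):

// Sketch: scan varint-encoded deltas, accumulating until we reach or pass
// `target`. Each byte contributes 7 bits (least-significant group first);
// a set top bit means more bytes follow.
fn scan_deltas(deltas: &[u8], target: u32) -> bool {
    let (mut current, mut delta, mut shift) = (0u32, 0u32, 0u32);
    for &byte in deltas {
        delta |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 != 0 {
            shift += 7;
            continue;
        }
        current += delta;
        if current == target {
            return true;
        }
        if current > target {
            return false;
        }
        delta = 0;
        shift = 0;
    }
    false
}

fn contains(deltas: &[u8], word: [u8; 5]) -> bool {
    scan_deltas(deltas, index(word))
}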
Results
- Accepted in 7 372 ns on average.
- Rejected in 10 167 ns on average.
- Size: 16.961 KiB
- zstd default: 14.229 KiB
- zstd max: 14.237 KiB
- Code: 309 B
This solution is slow because it sequentially scans all the words until it finds a match. This is because we can’t seek in the variable-length encoding, and you need to sum all previous values to find the current index. However, at 10 µs it is still well within our target time. This could also be improved easily by building a 2 or even 3 character prefix index on the client, or the client could just construct a bitmap.
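For example, expanding the delta stream into the bitmap from earlier is a single pass (a sketch, using the same assumed varint layout as above), trading roughly 1.4 MiB of memory for constant-time lookups afterwards:

// Sketch: build the 26^5-bit bitmap from the varint delta stream.
fn build_bitmap(deltas: &[u8]) -> Vec<u8> {
    let mut bitmap = vec![0u8; 26usize.pow(5) / 8];
    let (mut current, mut delta, mut shift) = (0u32, 0u32, 0u32);
    for &byte in deltas {
        delta |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 != 0 {
            shift += 7;
            continue;
        }
        current += delta;
        bitmap[current as usize / 8] |= 1u8 << (current as usize % 8);
        delta = 0;
        shift = 0;
    }
    bitmap
}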
Delta-encoding with Index
This was an idea that in retrospect doesn’t make much sense. It is basically the same as the previous idea except that I inserted a jump table for the first letter at the front. This speeds up lookup as you don’t need to scan every earlier word. I thought that it would also be smaller because I didn’t have to include the first letter in the list, but since I am doing delta encoding it doesn’t make a difference: the distance between taken and tools is the same as the distance between aken and ools.
Note: This is basically the same approach that Alexander Pruss used for his Game Boy implementation. On the Game Boy it is likely beneficial to ship the index up front in the ROM instead of building it at runtime in valuable RAM.
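A sketch of the indexed variant, reusing the scan_deltas helper from the previous section (the 27-entry jump-table layout is my assumption, not necessarily what the repository does):

// Sketch: `starts` has 27 entries so starts[l]..starts[l + 1] bounds the
// delta section for first letter `l`. Only the last 4 letters are encoded,
// so the target is an index into the 26^4 possible suffixes.
fn contains_indexed(starts: &[u32; 27], deltas: &[u8], word: [u8; 5]) -> bool {
    let letter = (word[0] - b'a') as usize;
    let target = word[1..]
        .iter()
        .fold(0u32, |acc, &c| acc * 26 + u32::from(c - b'a'));
    let section = &deltas[starts[letter] as usize..starts[letter + 1] as usize];
    scan_deltas(section, target)
}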
Results
- Accepted in 461 ns on average.
- Rejected in 523 ns on average.
- Size: 16.990 KiB
- zstd default: 14.251 KiB
- zstd max: 14.258 KiB
- Code: 545 B
This is insignificantly larger, both before and after compression.
The speed-up is significant, about 20x faster. This makes sense because instead of scanning half of all words on average, you only need to scan half of the words with the same first letter. I’ll assume this isn’t exactly 13x due to letter distribution.
Bloom Filter
This is the reason why I started the project. I thought I would be able to over-fit a bloom filter to the problem. However, doing the math showed that this was unlikely, and experimenting confirmed my suspicion. To get no false positives I needed to use a ¼ MiB filter. This could maybe be reduced slightly by tweaking some constants to over-fit the data, but it seemed very unlikely that a small enough filter could be found for this to be an efficient approach.
One intrinsic downside of the filter is that you can’t iterate the entries. This means that you can’t ship your answers as 16-bit indexes into the valid-word list; you need to use full 24-bit possible-word indexes. However, even if you want to store a year of answers it is a small extra cost (only 365 extra bytes).
I still think this idea has value even if it doesn’t fit my goal of no false positives at a reasonable size. If you can compromise on requirements it could be a good choice. With a 16 KiB filter you can get a 1:127 false-positive rate; with 8 KiB you can get 1:11. If the goal is to prevent accidental submissions from typos and to stop people just guessing “aeiou” as their first word, then these may be reasonable tradeoffs. The word list is full of strange words like “abcee” anyway, so I don’t think anyone will notice a couple of false positives.
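For reference, the membership test is just k probes into the bit array. A minimal sketch using a double-hashing construction; the hash and constants are illustrative, not the ones used in my experiment:

// Sketch: Bloom filter check. Two FNV-style hashes are combined as
// h1 + i * h2 to derive K probe positions into the bit array.
fn maybe_contains(filter: &[u8], word: [u8; 5]) -> bool {
    const K: u64 = 7;
    let bits = (filter.len() * 8) as u64;

    let mut h1: u64 = 0xcbf2_9ce4_8422_2325;
    let mut h2: u64 = 0x100_0000_01b3;
    for &c in &word {
        h1 = (h1 ^ c as u64).wrapping_mul(0x100_0000_01b3);
        h2 = (h2 ^ c as u64).wrapping_mul(0xcbf2_9ce4_8422_2325);
    }

    // Every probed bit must be set; a single clear bit means "definitely not a word".
    (0..K).all(|i| {
        let bit = (h1.wrapping_add(i.wrapping_mul(h2)) % bits) as usize;
        filter[bit / 8] & (1u8 << (bit % 8)) != 0
    })
}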
Results
- Accepted in 24 ns on average.
- Rejected in 17 ns on average.
- Size: 256.000 KiB
- zstd default: 80.713 KiB
- zstd max: 65.793 KiB
- Code: 574 B
It is very fast (and I could have used a faster hash) but if you want no false-positives this approach isn’t great.
Conclusion
The best approach was delta encoding at just over 14 KiB after compression. This is a significant reduction over the simple approach at about 25 KiB. Of course, it is easy to argue that this isn’t a worthwhile optimization compared to the simplicity of embedding the lists in the JavaScript and checking membership using dictionary.includes(guess) || answers.includes(guess).