Roman
Posted on June 26, 2023
While working on my note-taking application Heaplist, when I got to saving data to the database, I started using uuid4 identifiers to identify records, which usually look something like this when represented as a string:
32dca18531a1435480461f99837a5b1d
For some reasons, I didn't really like using uuid:
- it is a rather long string of 32 characters, and I will need to show it to users sometimes
- 6 bits in uuid4 are not used, they are constants, wasteful
constant bits in uuid4:
uuid.uuid4().bytes[6] & 0b11110000 # == 64
uuid.uuid4().bytes[8] & 0b11000000 # == 128
I decided to explore other options. Guid implementations like ulid, ksuid have some popularity. But they did not suit me either, mainly due to the fact that they include a timestamp.
My choice is completely random 128-bit identifiers, which I encode to a base62 string. Why base62? They contain only letters and numbers, so they can be selected by double-clicking (and, for example, base64 is not). Identifiers have become shorter, now they are 22-character strings, for example:
4xT8QKx8f3BwZP06VKSEMy
But there is a problem! Encoding to a number system that is not a power of two is a rather slow process, due to the fact that it is involves the division operation. For decoding (base62 to u128) there is no such problem, since there will be no division, only multiplication and addition.
So, first let's try the naive encoder implementation written in Rust:
const BASE62_N: usize = 62;
const BASE62_DIGITS: &[u8; BASE62_N] =
b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
// how many base62 numbers do we need to store u128?
// you can calculate this number like this, python: math.ceil(math.log(2**128, 62))
const U128_BASE62_ENCODED_LEN: usize = 22;
pub fn u128_to_base62_naive(mut n: u128) -> String {
// padding occurs from lower digits to higher ones, unfilled high digits will remain zero
let mut b62_str = vec![b'0'; U128_BASE62_ENCODED_LEN];
let mut i = U128_BASE62_ENCODED_LEN;
while n > 0 {
i -= 1;
let digit_index = (n % BASE62_N as u128) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
n /= BASE62_N as u128;
}
// the variable is guaranteed to contain a valid utf-8 string, there is no need to additionally check it
unsafe { String::from_utf8_unchecked(b62_str) }
}
In fact, this is a conversion from a decimal number system to a number system with base 62. To do this, we divide the original number by a new base into which we translate, the remainder of the division is the next digit in the new number system, we divide the quotient again, and so on in a cycle .
Let's look at the speed of the algorithm, the benchmark result is (i5-4690 3.50GHz): bench_u128_to_base62_naive 1000000: 32.86Mib/s, 2153250.28/s
Sad numbers, right? For comparison, base64 encoder works at 309.35Mib/s
, hex encoder 645.34Mib/s
.
Why is that? Since 62 is not a power of two, for conversion we have to re-divide the entire 128-bit number on each cycle, each bit of the original number affects each subsequent digit of the result. Let's look at the assembler generated by the Rust compiler godbolt, here you can see that two functions are used for division:
__umodti3
__udivti3
These are the functions implemented in runtime (llvm) for division of 128-bit numbers (designations "u" - unsigned, "ti" - 128 bit). And of course they will be relatively slow.
We need to simplify work for the processor. As you know, any number in the base BASE
can be represented as: HIGH * BASE^R + LOW
, for example decimal 1111222 = 1111 * 10^3 + 222
, what does this give us? We divide both parts by 10^3
by integer division, we get 1111 = 1111
, the remainder of the division will be 222 = 222
, so this representation allows us, when converting between bases, to divide not the whole number but only its part. As you probably already guessed, we will replace the division of 128-bit numbers with the division of 64-bit numbers.
So, first we need to choose the power to which we will raise 62
, the optimal one will be R
, under which the condition 62^R <= 2^64 < 62^(R+1)
is satisfied, it can be calculated by the formula:
math.floor(math.log(2**64, 62)) # == 10
Now let's write a new implementation in Rust, it looks like this:
const BASE62_N: usize = 62;
const BASE62_DIGITS: &[u8; BASE62_N] =
b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const U128_BASE62_ENCODED_LEN: usize = 22;
const BASE62_POW_10: u128 = (BASE62_N as u128).pow(10);
pub fn u128_to_base62(mut n: u128) -> String {
let mut b62_str = vec![b'0'; U128_BASE62_ENCODED_LEN];
// `nlow` will generate the lower 10 digits in base62
let mut nlow = (n % BASE62_POW_10) as u64;
let mut i = U128_BASE62_ENCODED_LEN;
// convert number in three steps:
// 0000000000001111111111
// + 0022222222220000000000
// + 3300000000000000000000
// = 3322222222221111111111
for _ in 0..2 {
// here is the same algorithm for converting a number to a new base, but now using 64-bit division
for _ in 0..10 {
i -= 1;
let digit_index = (nlow % BASE62_N as u64) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
nlow /= BASE62_N as u64;
}
// move on to the next block of digits
n /= BASE62_POW_10;
nlow = (n % BASE62_POW_10) as u64;
}
// finish by converting the 2 remaining digits
for _ in 0..2 {
i -= 1;
let digit_index = (nlow % BASE62_N as u64) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
nlow /= BASE62_N as u64;
}
unsafe { String::from_utf8_unchecked(b62_str) }
}
Let's look at the result assembler godbolt, now, the div
instruction is used for 64-bit division. And in optimized build it will be replaced by multiplication and shifts.
As a result it was __umodti3 x 22, __udivti3 x 22
, became __umodti3 x 3, __udivti3 x 2, div x 44
. This is what the benchmark shows: bench_u128_to_base62 1000000: 112.18Mib/s, 7351959.26/s
. Speed up by 3.5 times!
Ready.
and yes, I've tried to replace the u64 division with the u32 division, there was no significant increase in performance, it was about 2% faster
My links
I'm making my perfect todo, note-taking app Heaplist, you can try it here heaplist.app
And I have a twitter @rsk
Posted on June 26, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.