jD91mZM2
Posted on February 9, 2018
Hello! Today I decided to stop feeling like Base64 is black magic, and go really learn it. You can see the result here. I learned a lot of interesting things, so I'll share it with you here.
Because of mixed backgrounds I'll make sections for everything so you can skip anything you already know.
Bitwise operators
To properly learn Base64, you need to learn bitwise operators. I'm sure you've seen stuff like 50 & 1 == 0
somewhere before. What are these weird half-boolean operators doing on the integers?!
As you know, computers work with binary. But the relationship is closer than you think. What bitwise operators do is flip certain bits.
Each binary digit (0 or 1) is called a bit.
Bitwise And: &
. This calculates the integer result of two integers where each bit is the same. In other words, it sees which bits number one and number two has. So it's called bitwise and. Examples:
42 // <- Binary: 101010
& 15 // <- Binary: 001111
= 10 // <- Binary: 001010
5 // <- Binary: 101
& 6 // <- Binary: 110
= 4 // <- Binary: 100
Bitwise Shift: <<
and >>
. Basically "shifts" all numbers to left or right. These can be compared to multiplying by 10^n in decimal. In fact, it's exactly the same as multiplying by 2^n. When you multiply something by 2, the compiler sometimes optimizes it to bit-shifting by one.
16 // <- Binary: 10000
<< 1
= 32 // <- Binary: 100000
42 // <- Binary: 101010
>> 3
= 5 // <- Binary: 101
Byte order
As you may know, an int
is a 32-bit integer and a byte
is an 8-bit integer.
Since 32 (and all other types of integers) is a multiple of 8, you could technically convert a bigger type of integer to multiple bytes and then back... right?
Like say 666, 0000001010011010
, could be stored as 00000010
and 10011010
?
Turns out, of course, I wasn't the first one to think of this. In fact, in memory they're represented like this too. In C you could simply cast a int*
to a byte*
and then access the specific bytes inside the integer.
What you may find (it depends), is that they are in the reverse order of what I just showed (10011010
first, not last). This is called little endian, where as what I showed is big endian. Big endian means the "big end" (the byte that makes the largest impact, just like changing the one in 100 makes a larger impact than changing the one in 10) is first.
Putting it all together
Normal binary data is stored in 8 bits, which means it can go to 256.
64 happened to be a good amount of characters that is a power of 2.
The reason it's a power of two is quite brilliant: It can be converted to by simply reading 6 bits instead of 8.
Say we have the following: "hello".
In bytes, this is 104 101 108 108 111
.
In binary: 01101000 01100101 01101100 01101100 01101111
.
Now we make this into one integer. Before we do that though, there's a problem. This quickly gets unbelievably large.
We solve this using a convenient fact: 4 6-bit integers has the same bit length as 3 8-bit ones. That's also why it's padded to 4: 6*4 = 8*3. This is really convenient for the programmer when decoding, so he/she doesn't have to bounds check valid base64.
Because of this, we only have to worry about 3 bytes at a time (4 when decoding).
for bytes in input.chunks(3) {
// Fun fact, you *could* do this, but it's more fun to use bit shifting.
// let buf = mem::transmute::<&[u8], u32>(bytes).to_be();
let mut buf = 0;
buf += (bytes.get(0).cloned().unwrap_or(0) as u32) << 8*2;
buf += (bytes.get(1).cloned().unwrap_or(0) as u32) << 8*1;
buf += bytes.get(2).cloned().unwrap_or(0) as u32;
The reason for .get(n).cloned().unwrap_or(0)
is basically because indexing can crash. The input doesn't have to be dividable by 3.
Now we have
011010000110010101101100
and
011011000110111100000000
.
Note: The last number gets padded out with 0s.
Second step now sounds really simple: Re-split the numbers every 6 bytes and insert =
if it runs out before it's a multiple of 4.
for i in 0..4 {
let j = (3-i) * 6;
let buf = ((buf >> j) & 0x3F) as u8;
The first part is actually rather simple but unecessarily verbose. It counts down from 24, 6 steps at a time. It could probably have been written like for j in (0..=24).rev().step_by(6)
, but I need i
later.
But next comes the fun part:
Let's imagine j
is 18.
011010000110010101101100
>> 24
= 011010000110
011010000110
& 000000111111 // <- 0x3F in binary.
= 000110
See the pattern? In a way, we basically just took a substring... of the integer! The bitwise shift allowed us to fetch a value down, and the bitwise and allowed us to specify which values we cared about.
Now we have
011010 000110 010101 101100
and
011011 000110 111100 000000
.
Finally, let's insert the value, or =
if out of bounds.
if bytes.len() >= i {
output.push(char(buf));
} else {
output.push('=');
}
}
}
Note: I didn't include the definition of char
function. It basically just maps a value to a base64 character, like 0
to A
.
We now have
a G V s
and
b G 8 =
.
aGVsbG8=
Posted on February 9, 2018
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.