File Compression: Understanding the Huffman Coding Algorithm
Phyllipe Bezerra
Posted on January 26, 2023
Hello! First of all, I just want to make clear that this article is the result of me trying to understand something new. Feel free to suggest or point out any necessary corrections!
What is Huffman Encoding?
Huffman encoding is a lossless data compression algorithm that assigns variable-length code "words" to fixed-length input characters based on their frequencies of occurrence. These words are chosen so that more frequent characters get shorter words and less frequent characters get longer words.
Why Variable-length encoding?
Before that, let's understand the problem with fixed-length encoding. It's quite simple:
Fixed-length encoding wastes space
See, for example, the string "test tested test": since "test" appears three times, the same few characters show up over and over, yet a fixed-length encoding spends the same number of bits on every single character, so we end up wasting space.
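To put a number on that, here is a quick check of the fixed-length cost, assuming the usual 8 bits per character (as in ASCII):

```python
s = "test tested test"
print(len(s) * 8)  # 16 characters * 8 bits = 128 bits, no matter how often "test" repeats
```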
The solution
And here comes the variable-length approach:
- Encodings of different lengths for different characters
- Assign shorter encodings to frequently occurring characters
For example, take the word "test". First, let's check the characters' frequencies of occurrence:
| Characters | Frequency |
|---|---|
| t | 2 |
| e | 1 |
| s | 1 |
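If you want to check these frequencies quickly, a couple of lines of Python will do it (this is just a helper, not part of the compressor itself):

```python
from collections import Counter

word = "test"
print(Counter(word))  # Counter({'t': 2, 'e': 1, 's': 1})
```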
Now create a linked list in ascending order of frequency. To keep things simple, each list node will have a character field, a frequency field, a pointer to the next list node, and left and right pointers that we will use to build a binary tree soon. So it would be something like this:
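In code, a rough Python sketch of such a node and the sorted insertion could look like this (the names Node and sorted_insert are just the ones I'm using here, not set in stone):

```python
class Node:
    """A list/tree node: a character, its frequency, a next pointer, and left/right children."""
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.next = None   # next node in the linked list
        self.left = None   # left child, used once we start building the tree
        self.right = None  # right child, used once we start building the tree


def sorted_insert(head, node):
    """Insert a node into the list, keeping ascending frequency order (ties go first)."""
    if head is None or node.freq <= head.freq:
        node.next = head
        return node
    current = head
    while current.next is not None and current.next.freq < node.freq:
        current = current.next
    node.next = current.next
    current.next = node
    return head


# Build the sorted list for "test": s(1) -> e(1) -> t(2)
head = None
for char, freq in [('t', 2), ('e', 1), ('s', 1)]:
    head = sorted_insert(head, Node(char, freq))
```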
Now we need to remove and "merge" the two nodes with the lowest frequencies, forming a new node that will be their parent. The parent node will have an asterisk (*) as its character and a frequency equal to the sum of its children's frequencies. Also, its left and right pointers will point to the lowest-frequency child and the second-lowest-frequency child, respectively. For this step, with our list, we would get something like this:
And then we can put this tree at the tail of our list, and it will look like this:
We repeat the previous step until only one node, the root of the tree, is left in the list (in our example, that happens when the head becomes a parent node, i.e. a node whose character is an asterisk). So at the end of the second iteration our list will look like this:
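Continuing the sketch from before, the merge step and the loop could look roughly like this (one detail: I re-insert the new parent back into the list in sorted order, with ties going first, instead of strictly appending it at the tail; for this example either way gives a valid tree):

```python
def merge_lowest(head):
    """Remove the two lowest-frequency nodes and re-insert their new parent node."""
    first, second = head, head.next               # the list is sorted, so these are the two lowest
    parent = Node('*', first.freq + second.freq)  # the parent gets '*' and the summed frequency
    parent.left, parent.right = first, second
    head = second.next                            # detach both children from the list
    first.next = second.next = None
    return sorted_insert(head, parent)


# Repeat until a single node, the root of the tree, is left in the list.
while head.next is not None:
    head = merge_lowest(head)
root = head
```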
At this point, we can already get the encoding for each character. To get the encoding we walk from the root node down to the leaves, tracking the direction: every time we go left we append a 0, and every time we go right we append a 1. For example:
To get the encoding for the character 's' we go left and then left again, so its encoding is 00. For the character 'e' we go left and then right, so its encoding is 01. For the character 't' we only need to go right, so its encoding is 1. At the end we get the following encodings:
| Characters | Frequency | Encoding |
|---|---|---|
| t | 2 | 1 |
| e | 1 | 01 |
| s | 1 | 00 |
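Still building on the sketch above, walking the tree to collect these encodings could look like this (with equal frequencies the exact 0s and 1s depend on how ties are broken, but the lengths always come out the same):

```python
def collect_codes(node, path="", codes=None):
    """Walk from the root to every leaf, appending '0' for left and '1' for right."""
    if codes is None:
        codes = {}
    if node.left is None and node.right is None:  # a leaf holds a real character
        codes[node.char] = path
    else:
        collect_codes(node.left, path + "0", codes)
        collect_codes(node.right, path + "1", codes)
    return codes


codes = collect_codes(root)
print(codes)  # {'s': '00', 'e': '01', 't': '1'}
```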
As you can see, we have already achieved the two rules mentioned earlier:
- Encodings of different lengths for different characters
- Assign shorter encodings to frequently occurring characters
Now let's write down the plain binary (8-bit ASCII) value of our string "test" and its encoded value:
- Binary value: 01110100 01100101 01110011 01110100 (length: 32 bits)
- Encoded value: 1 01 00 1 (length: 6 bits)
As you can see, we got a length reduction of about 81%!
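Just to double check those numbers, here are a few lines of Python using the encodings from the table above:

```python
word = "test"
codes = {'t': '1', 'e': '01', 's': '00'}  # the encodings from the table above

ascii_bits = "".join(format(ord(c), "08b") for c in word)  # fixed-length: 8 bits per character
huffman_bits = "".join(codes[c] for c in word)             # variable-length Huffman encoding

print(len(ascii_bits), len(huffman_bits))                  # 32 6
print(f"{1 - len(huffman_bits) / len(ascii_bits):.0%}")    # 81% fewer bits
```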
And that's my basic comprehension of the Huffman encoding algorithm. Next I will probably be coding a file compressor using the Huffman encoding approach, and I will leave the GitHub repo link here once it's done.
In the meantime, feel free to suggest or point out any necessary corrections!