The Hitchhiker's Guide to Binary-to-Text Encoding

favr

Patrick Favre

Posted on April 2, 2023


Whether for debugging, data serialization, cryptography or ID generation, binary-to-text encoding is an important tool for most developers: it represents binary data as a sequence of printable characters. Whether you currently need to select a specific encoding or just want to understand the basic properties of each, this article will give you an overview.

One thing all of these encodings have in common is that they require more space than the underlying binary data. How much depends on the encoding and the size of its alphabet. Another important property is "human-readability": if you want to understand the underlying value at a glance, it will be much easier with hex than with base64. Also don't forget padding, which is required if a single character does not represent exactly 1, 2, 4 or 8 bits, and which affects the output length. Finally, you need to consider how readily available implementations of the chosen encoding are, especially if you want to send the data to different systems using different tech stacks.
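To make the relation between alphabet size and space overhead concrete, here is a minimal sketch. It assumes each output character occupies one 8-bit byte; the function names `efficiency` and `max_chars` are mine and purely illustrative.

```python
import math


def efficiency(alphabet_size: int) -> float:
    """Fraction of each 8-bit output character that carries payload bits."""
    return math.log2(alphabet_size) / 8


def max_chars(bit_length: int, alphabet_size: int) -> int:
    """Estimated upper bound of characters needed for a value of the given bit length (no padding)."""
    return math.ceil(bit_length / math.log2(alphabet_size))


for name, size in [("binary", 2), ("hex", 16), ("base36", 36), ("base64", 64)]:
    print(f"{name:7} {efficiency(size):6.1%}  64 bit -> up to {max_chars(64, size)} chars")
```

The efficiency percentages in the property tables of this article follow the same formula: log2 of the alphabet size divided by 8.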

Encodings

Binary

Binary, also known as base-2 encoding, is the simplest and most fundamental binary-to-text encoding. It represents data using only two symbols: 0 and 1. In binary encoding, each byte (consisting of 8 bits) is directly translated into a sequence of eight 0s and 1s.

Binary encoding is best suited for situations where readability is not a primary concern, such as number encoding and debugging purposes. Although it is not widely used for general text encoding due to its verbosity, binary remains an essential building block in understanding more complex binary-to-text encoding schemes.

Property Value
Efficiency 12.5 % (1 bit/char), 1 bit segments
32/64/128 bit 1-32/1-64/1-128 chars
Padding false
Const. Out. Len. false
Suited for number encoding, debugging
Alphabet 01
Known Usages none
Standardization none
Popularity implementations: common, usage: not common
Example 11010011 01111000 01101100 10010011 01111110 01111111 00111000
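A per-byte binary encoding could be sketched like this. The helper names `to_binary` and `from_binary` are illustrative, and separating the bytes with spaces is just one possible convention.

```python
def to_binary(data: bytes) -> str:
    """Render every byte as eight 0/1 characters, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


def from_binary(text: str) -> bytes:
    """Parse space-separated groups of 0/1 characters back into bytes."""
    return bytes(int(group, 2) for group in text.split())


sample = bytes([0xD3, 0x78])
print(to_binary(sample))  # 11010011 01111000
```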

Octal

Octal, or base-8 encoding, represents data using eight distinct symbols: 0 through 7. Each octal digit represents 3 bits. Because 8 bits do not divide evenly into groups of 3, three bytes (24 bits) map to exactly eight octal digits; in practice the data is usually treated as a number and converted as a whole.

Octal encoding is particularly well-suited for number encoding applications, such as the Unix chmod command, which uses octal notation to represent file permissions. While not as prevalent as some others, octal remains a useful and compact representation for certain use cases, especially in contexts where base-8 arithmetic is more convenient or intuitive.

Property Value
Efficiency 37.5 % (3 bit/char), 24 bit segments
32/64/128 bit 1-11/1-22/1-43 chars
Padding false
Const. Out. Len. false
Suited for number encoding
Alphabet 01234567
Known Usages chmod
Popularity implementations: common, usage: not common
Standardization none
Example 703767722333074323
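As a rough illustration, the following sketch converts a byte string to octal by treating it as one big-endian number, and also shows how each octal digit of a chmod mode maps to three permission bits. The sample bytes are taken from the hex example in this article; `permission_string` is a hypothetical helper, not part of any standard tool.

```python
data = bytes.fromhex("387f7e936c78d3")

# Treat the bytes as one unsigned big-endian integer and print it in base 8.
octal = format(int.from_bytes(data, "big"), "o")
print(octal)  # 703767722333074323


def permission_string(mode: int) -> str:
    """Expand a chmod-style mode such as 0o755 into the familiar rwx notation."""
    parts = []
    for shift in (6, 3, 0):  # owner, group, others
        digit = (mode >> shift) & 0b111  # one octal digit equals three bits
        parts.append("".join(flag if digit & (4 >> i) else "-" for i, flag in enumerate("rwx")))
    return "".join(parts)


print(permission_string(0o755))  # rwxr-xr-x
```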

Decimal

Decimal, or base-10 encoding, represents data using 0 through 9. In decimal encoding, bytes are treated as integer values and then converted to their corresponding decimal representation.

Decimal encoding is particularly suited for number encoding and single-byte representation applications. Due to its familiarity and ease of understanding, decimal encoding is often employed in contexts where readability is important, and the data being represented consists primarily of numerical values.

Property Value
Efficiency 41.5 % (3.32 bit/char)
32/64/128 bit 1-10/1-20/1-39 chars
Padding false
Const. Out. Len. false
Suited for number encoding
Alphabet 0123456789
Known Usages single byte representations
Popularity implementations: common, usage: not common
Standardization none
Example 15902780311763155
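The sketch below shows decimal used as a number encoding and why its output length is not constant: leading zero bytes vanish, so the decoder has to know the original length. The function names are my own, and big-endian byte order is an assumption.

```python
def to_decimal(data: bytes) -> str:
    """Interpret the bytes as an unsigned big-endian integer and print it in base 10."""
    return str(int.from_bytes(data, "big"))


def from_decimal(text: str, length: int) -> bytes:
    """Decoding needs the original byte length, because leading zeros are not encoded."""
    return int(text).to_bytes(length, "big")


data = bytes.fromhex("387f7e936c78d3")
print(to_decimal(data))                # 15902780311763155
print(to_decimal(b"\x00\x00\x01"))     # 1 (the two leading zero bytes are lost)
print(" ".join(str(b) for b in data))  # per-byte form: 56 127 126 147 108 120 211
```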

Hex

Hexadecimal, often abbreviated as "hex" or referred to as base-16 encoding, is a widely used binary-to-text encoding method that represents data using sixteen distinct symbols: 0-9 and A-F (or a-f) for the digits 10 through 15. In hex encoding, each byte (8 bits) is divided into two groups of 4 bits each, with each group being converted into a single hex digit.

Hexadecimal encoding is particularly suited for number and byte-string encoding applications. It is widely used in various contexts, such as UUIDs, cryptographic keys, and color codes in web design, among others. Hex encoding has been standardized by RFC 4648, which provides guidelines on how this encoding method should be used and implemented in various applications.

Property Value
Efficiency 50 % (4 bit/char), 8 bit segments
32/64/128 bit 8/16/32 chars
Padding false
Const. Out. Len. true
Suited for number & byte-string encoding
Alphabet 0123456789abcdef
Known Usages UUIDs, color codes, cryptographic keys, ...
Popularity implementations: very common, usage: very common
Standardization RFC 4648
Example 387f7e936c78d3
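In Python, hex encoding is available directly on the `bytes` type. The short sketch below also parses a web color code; `parse_color` is a hypothetical helper for illustration only.

```python
data = bytes.fromhex("387f7e936c78d3")

encoded = data.hex()          # always exactly two characters per byte
print(encoded, len(encoded))  # 387f7e936c78d3 14

# Decoding accepts upper- and lowercase digits alike.
assert bytes.fromhex("387F7E936C78D3") == data


def parse_color(code: str) -> tuple:
    """Split a '#rrggbb' color code into its red, green and blue byte values."""
    raw = bytes.fromhex(code.lstrip("#"))
    return raw[0], raw[1], raw[2]


print(parse_color("#ff8800"))  # (255, 136, 0)
```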

Base26

Base26 encoding, also known as alphabetic encoding, represents data using the 26 letters of the English alphabet (A-Z).

It is particularly suited for number encoding applications and may be useful in scenarios where the encoding output should only contain alphabetic characters. However, it is not widely adopted, and there is no known standardization or specific use case for this encoding method.

Property Value
Efficiency 58.8 % (4.70 bit/char)
32/64/128 bit 7/14/28 chars
Padding false
Const. Out. Len. true
Suited for byte-string encoding
Alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ
Known Usages none
Popularity implementations: not common, usage: not common
Standardization none
Example EIQYWQEAJRFF

Base32

Base32 represents data using a set of 32 distinct characters, typically consisting of uppercase letters A-Z and digits 2-7. This encoding scheme is designed to be more human-readable and resistant to errors when compared to other schemes like base64, while still offering a relatively compact representation of data.

This encoding method is particularly well-suited for scenarios where data needs to be case-insensitive, easy to read, or less prone to transcription errors. Base32 has been standardized by RFC 4648 but has several variations.

Property Value
Efficiency 62.5 % (5 bit/char), 40 bit segments
32/64/128 bit 7+1/13+3/26+6 chars (+padding)
Padding true
Const. Out. Len. true
Suited for byte-string encoding
Alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ234567
Known Usages none
Popularity implementations: common, usage: not common
Standardization RFC 4648
Variations z-base-32, Crockford's Base32, base32hex, Geohash
Example HB7X5E3MPDJQ
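A small sketch with Python's standard `base64` module, using the same sample bytes as the hex example. It shows the RFC 4648 padding and the case-insensitive decoding mentioned above; stripping the padding is a common convention, but whether a receiver accepts it is an assumption you need to check.

```python
import base64

data = bytes.fromhex("387f7e936c78d3")

padded = base64.b32encode(data).decode("ascii")
unpadded = padded.rstrip("=")  # many systems omit the padding
print(padded)    # HB7X5E3MPDJQ====
print(unpadded)  # HB7X5E3MPDJQ

# casefold=True lets the decoder accept lowercase input as well.
decoded = base64.b32decode("hb7x5e3mpdjq====", casefold=True)
```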

Base36

Base36 represents data using a set of 36 distinct characters, consisting of both the 26 lowercase letters of the English alphabet (a-z) and the 10 Arabic numerals (0-9). This encoding scheme aims to provide a more compact and human-readable representation of data while still offering a balance between efficiency and readability.

Base36 encoding is particularly suited for applications that involve encoding large integers, such as unique identifiers or URL slugs.

Property Value
Efficiency 64.6 % (5.17 bit/char)
32/64/128 bit 1-7/1-13/1-25 chars
Padding false
Const. Out. Len. false
Suited for big integer encoding
Alphabet 0123456789abcdefghijklmnopqrstuvwxyz
Known Usages Reddit URL slugs
Popularity implementations: common, usage: not common
Standardization none
Example 4cl2cf404wj
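Python can decode base36 with the built-in `int(text, 36)`, but has no built-in encoder, so here is a minimal sketch using repeated division. The function name `encode_base36` is illustrative.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode_base36(number: int) -> str:
    """Encode a non-negative integer with repeated division by 36."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


value = int.from_bytes(bytes.fromhex("387f7e936c78d3"), "big")
print(encode_base36(value))    # 4cl2cf404wj
print(int("4cl2cf404wj", 36))  # 15902780311763155
```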

Base58

Base58 encoding represents data using a set of 58 distinct characters, consisting of uppercase letters A-Z, lowercase letters a-z, and the digits 1-9, excluding visually similar characters such as '0', 'O', 'I', and 'l'. This encoding scheme aims to provide a compact and human-readable representation of data while minimizing the risk of transcription errors.

While base58 encoding is not standardized, it has gained popularity in the cryptocurrency and distributed systems communities.

Property Value
Efficiency 73.2 % (5.86 bit/char)
32/64/128 bit 6/11/22 chars
Padding false
Const. Out. Len. false
Suited for big integer encoding
Alphabet 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
Known Usages Bitcoin, IPFS
Popularity implementations: not common, usage: not common
Standardization none
Variations Flickr short URLs
Example 39BQ5CdzFL
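Since base58 is not standardized, the following is only a sketch. It assumes the Bitcoin alphabet listed above and the Bitcoin convention of writing one '1' for each leading zero byte; other variants may differ in both respects.

```python
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    chars = []
    while number:
        number, remainder = divmod(number, 58)
        chars.append(B58[remainder])
    # Leading zero bytes would vanish in the integer, so they are kept as '1'.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(chars))


def b58decode(text: str) -> bytes:
    number = 0
    for char in text:
        number = number * 58 + B58.index(char)
    leading_zeros = len(text) - len(text.lstrip("1"))
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_zeros + body


print(b58encode(bytes.fromhex("387f7e936c78d3")))  # 39BQ5CdzFL
```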

Base64

Base64 encoding is one of the most widely used binary-to-text encodings. It utilizes a set of 64 distinct characters, which includes uppercase letters A-Z, lowercase letters a-z, digits 0-9, and two additional characters, typically '+' and '/' (or '-' and '_' for the URL-safe variant). Padding is represented as '='. This encoding scheme aims to provide a compact and universally compatible representation of data, allowing it to be safely transmitted or embedded in various environments.

Base64 encoding is particularly suited for applications that involve encoding byte strings, such as embedding images in HTML or transmitting binary data over text-based protocols like email. It is standardized in RFC 4648, with several variations defined in other RFCs, making it a widely recognized and supported encoding method across different platforms and programming languages.

Property Value
Efficiency 75 % (6 bit/char), 24 bit segments
32/64/128 bit 6+2/11+1/22+2 chars (+padding)
Padding true
Const. Out. Len. true
Suited for byte-string encoding
Alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(url-safe) ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
Known Usages practically everywhere
Popularity implementations: very common, usage: very common
Standardization RFC 4648 (previously RFC 3548)
Variations RFC 4880 (ASCII Armor), RFC 1421, RFC 2152, RFC 3501, bcrypt radix64
Example OH9+k2x40w
OH9-k2x40w (url-safe)
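The difference between the standard and the URL-safe alphabet can be seen with Python's `base64` module and the sample bytes from the hex example. `decode_unpadded` is a hypothetical helper for input whose '=' padding was stripped.

```python
import base64

data = bytes.fromhex("387f7e936c78d3")

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")
print(standard)  # OH9+k2x40w==
print(url_safe)  # OH9-k2x40w==


def decode_unpadded(text: str) -> bytes:
    """Restore the '=' padding that is often stripped in URLs, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


print(decode_unpadded("OH9-k2x40w") == data)  # True
```

Only the characters for the values 62 and 63 differ between the two alphabets, which is why most inputs look identical in both variants.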

Ascii85

Ascii85, also known as Base85 encoding, uses a set of 85 printable ASCII characters (in the original Adobe variant, '!' through 'u') to encode 4 bytes as 5 characters. Some variants add a few extra characters, for example 'z' as a shortcut for an all-zero group or '<~' and '~>' as delimiters.

Ascii85 encoding is often used in environments where binary data needs to be represented in the most compact way.

Property Value
Efficiency 80.1 % (6.41 bit/char)
32/64/128 bit 1-5/2-10/4-20 chars
Padding false
Const. Out. Len. false
Suited for byte-string encoding
Alphabet !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
Known Usages Git, IPv6, Adobe PDF and PostScript
Popularity implementations: not common, usage: not common
Variations 32/Z85 ZeroMQ, ZMODEM Pack-7 encoding, btoa, Adobe, RFC 1924
Example 3.HC@Cj=D
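Python's `base64` module ships an Ascii85 implementation, so a quick sketch needs no third-party code. The options shown (`adobe=True` for the '<~' and '~>' delimiters) belong to that module; other variants such as Z85 use a different alphabet and are not covered here.

```python
import base64

data = bytes.fromhex("387f7e936c78d3")

plain = base64.a85encode(data)
framed = base64.a85encode(data, adobe=True)  # adds Adobe-style delimiters
zeros = base64.a85encode(b"\x00" * 4)        # an all-zero group collapses to 'z'

print(plain)   # b'3.HC@Cj=D'
print(framed)  # b'<~3.HC@Cj=D~>'
print(zeros)   # b'z'
```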

Base122

Base122 is an experimental encoding that uses both printable and non-printable characters to maximize space efficiency. It can be used in any binary-to-text embedding context where the text encoding is UTF-8. There are reference implementations in JavaScript and C by the original author, with some options in Python, Java and Rust.

Property Value
Efficiency 86.6 % (6.93 bit/char)
32/64/128 bit ?
Padding false
Const. Out. Len. false
Suited for embedding blobs in HTML (experimental)
Alphabet full 7bit minus some reserved chars (UTF-8 compatible)
Known Usages none
Popularity implementations: not common, usage: not common
Example ��v�~� (non-printable characters, might not render correctly)

Encoding while Compressed

More bits per character always means smaller output, right? Not necessarily. While the encoded character sequence is sometimes used directly, it is often sent compressed rather than just encoded, particularly over HTTP. Since compression algorithms do not always behave as intuitively as one might think, I tested the different encodings with different data types to see how they behave:

Chart showing how well different encodings compress

For this experiment I used gzip and the following data:

  • a JPEG (42.2 kB, 31.1 kB compressed)
  • Android LogCat output (887.8 kB, 51.8 kB compressed)
  • random data (1024 bytes, 1047 bytes compressed)

The data is first encoded with the various schemes and then compressed. The chart shows how much bigger the result is compared to compressing the raw data directly (lower is better).

Interestingly, hex fares best with real-world data, producing considerably smaller output than the higher-density encodings like ascii85 and base64. This is probably due to its smaller, dictionary-friendly alphabet.

The full test suite can be found here.
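If you want to try the idea on your own data, the experiment can be approximated with a few lines of Python. This is only a rough sketch and not the test suite mentioned above: the encoder selection, the `compressed_overhead` helper and the synthetic log-like sample are my assumptions, so the numbers will differ from the chart.

```python
import base64
import gzip

ENCODERS = {
    "hex": lambda d: d.hex().encode("ascii"),
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "ascii85": base64.a85encode,
}


def compressed_overhead(data: bytes) -> dict:
    """Size of gzip(encode(data)) relative to gzip(data) for each encoding."""
    baseline = len(gzip.compress(data))
    return {
        name: len(gzip.compress(encode(data))) / baseline
        for name, encode in ENCODERS.items()
    }


# Synthetic, highly repetitive sample standing in for log output.
sample = b"2023-04-02 12:00:00 D/Example: request finished in 42 ms\n" * 2000
for name, ratio in compressed_overhead(sample).items():
    print(f"{name:8} {ratio:.2f}x")
```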

Conclusion

Don't get overwhelmed by the sheer number of options to choose from. If you do not have a specific requirement for the output character set or length, then in most cases it makes sense to stick to a common option like base64 and not worry too much about things like space efficiency. I also recommend checking the quantity and quality of available implementations before settling on a specific encoding, because there is nothing more annoying than subtle incompatibilities.
