go

Decrypt Go: varint

huizhou92

huizhou92

Posted on September 20, 2024

Decrypt Go: varint

Recently, I discovered that the Go standard library includes a built-in implementation of varint, found in encoding/binary/varint.go. This implementation is similar to the varint used in protobuf. Using the Golang standard library's varint source code, we will systematically learn and review the concept of varint.

If you're familiar with protobuf, you probably already know that all integer types (except fixed types like fixed32 and fixed64) are encoded using varint.

varint mainly solves two issues:

  1. Space Efficiency: Take uint64 as an example, representing values as large as 18,446,744,073,709,551,615. In most real-world scenarios, however, our integer values are much smaller. If your system needs to process values as low as 1, you'd still use 8 bytes to represent this value in transmission, wasting space since most bytes store no useful data. varint encoding uses a variable-length byte sequence to represent integers, reducing the space required for smaller values.

  2. Compatibility: varint allows us to handle integers of different sizes without altering the encoding/decoding logic. This means fields can be upgraded from smaller types (like uint32) to larger ones (like uint64) without breaking backward compatibility.

This article will dive into Golang varint implementation, exploring its design principles and how it addresses the challenges of encoding negative numbers.

This article was first published under the Medium MPP plan. Follow me on Medium if you're a Medium user.

The Design Principles of varint

varint is designed based on simple principles:

  • 7-bit Grouping: The binary representation of an integer is divided into 7-bit groups. From the least significant bit to the most significant bit, every 7-bit group becomes a unit.
  • Continuation Bit: A flag bit is added before each 7-bit group, forming an 8-bit byte. If more bytes follow, the flag bit is set to 1; otherwise, it’s set to 0. For example, the integer uint64(300) has a binary representation of 100101100. Dividing this into two groups—10 and 0101100—and adding flag bits results in two bytes: 00000010 and 10101100, which is the varint encoding of 300. Compared to uint64, which uses 4 bytes, varint reduces the storage by 75%. list1: uint64 to varint
func main() {  
    v := uint64(300)  
    bytes := make([]byte, 0)  
    bytes = binary.AppendUvarint(bytes, v)  
    fmt.Println(len(bytes))  
    for i := len(bytes); i > 0; i-- {  
       fmt.Printf("%08b ", bytes[i-1])  
    }  
}
Enter fullscreen mode Exit fullscreen mode

Figure 1: varint encoding

varint for Unsigned Integers

The Go standard library provides two sets of varint functions: one for unsigned integers (PutUvarint, Uvarint) and another for signed integers (varint, Putvarint).

Let's first look at the unsigned integer varint implementation:

list2: go src PutUvarint

func PutUvarint(buf []byte, x uint64) int {
    i := 0
    for x >= 0x80 {
        buf[i] = byte(x) | 0x80
        x >>= 7
        i++
    }
    buf[i] = byte(x)
    return i + 1
}
Enter fullscreen mode Exit fullscreen mode

There is a very important constant in the code: 0x80, which corresponds to the binary code 1000 0000. This constant is very important for the logic that follows:

  1. x >= 0x80: This checks if x requires more than 7 bits for representation. If it does, x needs to be split.
  2. byte(x) | 0x80: This applies a bitwise OR with 0x80 (1000 0000), ensuring the highest bit is set to 1 and extracting the lowest 7 bits of x.
  3. x >>= 7: Shift x right by 7 bits to process the next group.
  4. buf[i] = byte(x): When the loop ends, the highest bits are all zeros, so no further action is needed.

Uvarint is the reverse of PutUvarint.

It should be noted that: varint splits integers into 7-bit groups, meaning large integers may face inefficiencies. For example, uint64's maximum value requires 10 bytes instead of the usual 8 (64/7 ≈ 10).

Encoding Negative Numbers: Zigzag Encoding

Though varint is efficient, it doesn’t account for negative numbers. In computing, numbers are stored as two’s complement, which means a small negative number might have a sizeable binary representation.

For example, -5 in 32-bit form is represented as 11111111111111111111111111111011, requiring 5 bytes in varint encoding

Go uses zigzag encoding to solve this problem:

  • For positive numbers n, map them to 2n.
  • For negative numbers -n, map them to 2n-1. This way, positive and negative numbers alternate without conflict, hence the name zigzag encoding. For example, after zigzag encoding int32(-5), the value becomes 9 (00000000000000000000000000001001), which varint can represent with just 1 byte.

Here’s the Golang implementation:

list3: go src Putvarint

func Putvarint(buf []byte, x int64) int {
    ux := uint64(x) << 1
    if x < 0 {
        ux = ^ux
    }
    return PutUvarint(buf, ux)
}
Enter fullscreen mode Exit fullscreen mode

From the code, we can see that for the implementation of varint for signed integers, the Go standard library breaks it down into two steps:

  1. First, the integer is converted using ZigZag encoding.
  2. Then, the converted value is encoded using varint.

For negative numbers, there is an extra step: ux = ^ux. This part might be confusing—why does this transformation result in 2n - 1?

We can roughly deduce the process, assuming we have an integer -n:

  1. First, the original value is shifted left, then inverted. This can be viewed as: first invert the value, then shift left, and finally add 1. This results in 2*(~(-n)) + 1.
  2. the two’s complement of a negative number is the bitwise inversion of its absolute value plus 1. So, how do we derive the absolute value from the two’s complement? There is a formula: |A| = ~A + 1.
  3. Substituting this formula into the first step: 2*(n - 1) + 1 = 2n - 1. This perfectly matches the ZigZag encoding for negative numbers (mathematics is indeed excellent).

Figure 2: ZigZag encoding

In the Go standard library, calling PutUvarint only applies varint encoding, while calling PutVarint first applies ZigZag encoding and then varint encoding.

In protobuf, if the type is int32, int64, uint32, or uint64, only varint encoding is used. However, for sint32 and sint64, ZigZag encoding is applied first, followed by varint encoding.

When varint Is Not Suitable

Despite its benefits, varint isn't ideal for all scenarios:

  • Large integers: varint can be less efficient than fixed-length encoding for huge numbers.
  • Random data access: Since varint uses variable lengths, indexing specific integers directly is challenging.
  • Frequent mathematical operations: varint-encoding data requires decoding before operations, potentially affecting performance.
  • Security-sensitive applications: varint encoding may leak information about the original integer's size, which could be unacceptable in secure environments.
💖 💪 🙅 🚩
huizhou92
huizhou92

Posted on September 20, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related