Why benchmarks are lying to you, and hash maps are not always faster than arrays
DejanGegic
Posted on August 28, 2023
In this article we are going to take a look at a LeetCode problem that took me down the rabbit hole of Go benchmarks, and microbenchmarks in general. We will see how easy it is to mislead yourself with microbenchmarks and, of course, how to avoid it.
LeetCode
A few days ago I challenged myself to go through the NeetCode Roadmap and figure out how much I could optimize each task for both memory and CPU. For those who aren't familiar with the NeetCode Roadmap, it's basically an ordered list of LeetCode problems that you should solve in order to master a specific topic.
Two Sum Problem
If you're familiar with the Two Sum problem, feel free to skip to the Solutions section.
The question that caught my eye was the simple Two Sum task.
Given a slice nums and a target value target, return the indices of the two numbers such that they add up to target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
Example:
Input: nums = [2, 7, 11, 15], target = 9
Output: [0, 1] (because nums[0] + nums[1] == 9)
Solutions
[Screenshot: LeetCode submission results, in order: CPU improved, CPU original, Memory]
I wrote two different solutions for this problem (from the bottom up): one optimized for memory and one for CPU. The second one also has a slightly improved version, which we are going to take a look at later.
So, let's kick it off with the first one.
Solution 1: Memory optimized
LeetCode Submission results:
Memory: 3.7 MB
CPU: 28 ms
func twoSum(nums []int, target int) []int {
	// Start at the first element,
	// and iterate until the second to last element.
	for i := 0; i < len(nums)-1; i++ {
		// Start at the element after i,
		// and iterate until the last element.
		for j := i + 1; j < len(nums); j++ {
			// If the sum of the two elements is equal to the target,
			// return the indices.
			if nums[i]+nums[j] == target {
				return []int{i, j}
			}
		}
	}
	return nil
}
This is a civilized way to brute force the problem.
It essentially iterates over the slice using two nested loops and checks whether the sum of each pair of elements equals the target. Its time complexity is still O(n^2), but its memory complexity is O(1), which means it uses no additional memory no matter how large the input is (loading the slice into memory is not counted).
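As a quick sanity check, here is how the function behaves on the example input. This is a minimal snippet of mine (not from the original article), assuming twoSum lives in the same package:

```go
package main

import "fmt"

func main() {
	nums := []int{2, 7, 11, 15}
	// Expected output: [0 1], since nums[0]+nums[1] == 9.
	fmt.Println(twoSum(nums, 9))
}
```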
Solution 2: CPU optimized
As I have already mentioned, this solution has 2 versions, the original one and the improved one. Benchmarking is going to be done on the improved version, but I will also include the original one for comparison.
The original solution
LeetCode Submission results:
Memory: 4.3 MB
CPU: 7 ms
func TwoSumHash(nums []int, target int) []int {
	// Initialize a map to store the values and their indices.
	var hMap = make(map[int]int)
	// Iterate over the slice.
	for i, n := range nums {
		// If the difference between the target
		// and the current element is in the map, return the indices.
		if _, ok := hMap[target-n]; ok {
			return []int{hMap[target-n], i}
		}
		// If not, add the current element to the map.
		hMap[n] = i
	}
	return []int{}
}
This might look a little daunting at first, but let me explain.
Basically, what we are using the Hash Map for is storing each element we have already seen, mapped to its index. For example, if the target is 9 and the current element is 2, we store 2 → 0 in the map and move on.
Then, for each next element, we check whether the difference between the target (9) and the current element is already in the map. If it is, we return the index stored in the map together with the index of the current element. In our example, when we reach 7, we look up 9 - 7 = 2, find it at index 0, and return [0, 1].
This gives us a time complexity of O(n) and a memory complexity of O(n), which means that the time it takes to execute the function is linearly proportional to the size of the input, but so is the memory it uses.
The improved solution
LeetCode Submission results:
Memory: 4.3 MB
CPU: 3 ms
All changes are commented in the code.
func TwoSumHash(nums []int, target int) []int {
	var hMap = make(map[int]int)
	// Declared outside of the loop to avoid re-declaring. Mutating is 👑
	var complement int
	for i, n := range nums {
		// Calculated once, before the if statement, so we don't
		// compute target-n again in the lookup and the return.
		complement = target - n
		if j, ok := hMap[complement]; ok {
			return []int{j, i}
		}
		hMap[n] = i
	}
	return []int{}
}
The first optimization is a no-brainer. We declare the complement variable outside of the loop so it isn't re-declared on each iteration; instead, we mutate the existing value.
The second optimization is a little more subtle. We calculate complement once, before the if statement, instead of computing target - n separately in the lookup and in the return. And before you ask: yes, it does improve performance. But don't expect it to be a game changer.
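If you want to verify that claim yourself, a sub-benchmark like the following would do it. This is my sketch, not code from the article; it assumes the two versions have been renamed TwoSumHashOriginal and TwoSumHashImproved (hypothetical names) so they can coexist in one package:

```go
import "testing"

func BenchmarkTwoSumHashVersions(b *testing.B) {
	nums := []int{2, 7, 11, 15}
	// Benchmark the original version (hypothetical name).
	b.Run("Original", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			TwoSumHashOriginal(nums, 9)
		}
	})
	// Benchmark the improved version (hypothetical name).
	b.Run("Improved", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			TwoSumHashImproved(nums, 9)
		}
	})
}
```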
Benchmarks
Now we have come to the fun part. Let's benchmark these solutions and see how they perform.
If you want to follow along: for benchmarking I used the standard testing package along with this command:
go test -bench=. -benchmem -benchtime=10s
Here, -bench=. runs every benchmark in the package, -benchmem adds the memory columns to the output, and -benchtime=10s runs each benchmark for roughly ten seconds instead of the default one second.
Understanding benchmark results
| Benchmark | Iterations | Time per iteration | Memory per iteration | Allocations per op |
| --- | --- | --- | --- | --- |
| TwoSumHash | 94,230,505 | 123.6 ns/op | 16 B/op | 1 allocs/op |
- Benchmark - The name of the benchmark function.
- Iterations - The number of times the benchmark function was executed (higher is better).
- Time per iteration - The average time it took to execute the benchmark function, in nanoseconds (lower is better).
- Memory per iteration - The average memory allocated per iteration, in bytes (lower is better).
- Allocations per op - The number of memory allocations per iteration (lower is better).
Keep in mind that an "iteration" here is not the same as a loop iteration inside your code. Go's benchmarking harness runs the given function in a loop, and an "iteration" refers to a single execution of that function.
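In other words, the harness hands your Benchmark* function a value b.N and times the whole loop. A minimal skeleton (my sketch, not the article's benchmark code, which comes below) looks like this:

```go
import "testing"

// The testing package picks b.N for you, growing it until the
// loop runs long enough for a statistically stable measurement.
func BenchmarkExample(b *testing.B) {
	for i := 0; i < b.N; i++ {
		TwoSumHash([]int{2, 7, 11, 15}, 9)
	}
}
```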
LeetCode data set
Now that we know how to read the benchmark results, let's take a look at the results for the LeetCode data set.
Slice: [2,7,11,15]
Target: 9
I have taken the liberty of inserting commas into some output values for the sake of readability.
BenchmarkAllTwoSum/TwoSumHash-16 94,230,505 123.6 ns/op 16 B/op 1 allocs/op
BenchmarkAllTwoSum/TwoSum-16 194,767,664 59.57 ns/op 16 B/op 1 allocs/op
As we can see, the array solution is almost exactly twice as fast as the hash map solution. This is in stark contrast to the LeetCode submission results, where the hash map solution is almost 10 times (9.33 times) faster than the array solution.
Why is this happening? How can a hash map be slower than an array?
The answer is simple. It's not.
The problem lies in our dataset. The LeetCode data set is too small to get any meaningful results. Hence, the results are misleading.
For easier understanding, here is the benchmark code:
var inputSlice = []int{2, 7, 11, 15}
var target = 9

// Run benchmarks.
func BenchmarkAllTwoSum(b *testing.B) {
	b.Run("TwoSum", benchmarkTwoSum)
	b.Run("TwoSumHash", benchmarkTwoSumHash)
}

// These helpers don't match the Benchmark* naming pattern,
// so `go test` won't pick them up directly; they are wrapped
// in the BenchmarkAllTwoSum function instead.
func benchmarkTwoSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		TwoSum(inputSlice, target)
	}
}

func benchmarkTwoSumHash(b *testing.B) {
	for i := 0; i < b.N; i++ {
		TwoSumHash(inputSlice, target)
	}
}
What I'd like to focus on are the inputSlice and target variables. As it is impractical to write out a larger dataset by hand, I used a function that generates a random slice of integers in a given range. Here is the function:
import (
	"crypto/rand"
	"math/big"
)

func CreteIntSlice(length int, maxValue int) []int {
	var result = make([]int, 0, length)
	reader := rand.Reader
	for i := 0; i < length; i++ {
		// The error from rand.Int is deliberately ignored;
		// good enough for generating test data.
		random, _ := rand.Int(reader, big.NewInt(int64(maxValue)))
		result = append(result, int(random.Int64()))
	}
	return result
}
Don't pay too much attention to this function; it's ugly, but it works. The important part is that it generates a random slice of integers in a given range: length is the length of the slice, and maxValue determines the range of the random numbers. For example, if maxValue is 100, the random numbers will be between 0 (inclusive) and 100 (exclusive).
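As a side note, crypto/rand is overkill for generating benchmark inputs. A plainer variant using math/rand is faster and perfectly adequate for test data; here is a sketch of mine (the name is hypothetical, this is not the author's code):

```go
import "math/rand"

// createIntSliceFast is a hypothetical, simpler take on CreteIntSlice.
// math/rand isn't cryptographically secure, but that's irrelevant for
// benchmark inputs, and it avoids the big.Int machinery entirely.
func createIntSliceFast(length, maxValue int) []int {
	result := make([]int, 0, length)
	for i := 0; i < length; i++ {
		// rand.Intn returns a uniform value in [0, maxValue).
		result = append(result, rand.Intn(maxValue))
	}
	return result
}
```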
Larger data set
Now that we know what the problem is, let's fix it. Here is the new benchmark code:
var inputSlice = CreteIntSlice(1_000, 1000)
var target = 143

func BenchmarkAllTwoSum(b *testing.B) {
	b.Run("TwoSumHash", benchmarkTwoSumHash)
	b.Run("TwoSum", benchmarkTwoSum)
}
Tip: You can use underscores in numbers to make them more readable. For example, 1_000 is the same as 1000.
Now we have a slice of 1000 random integers between 0 and 1000, and a target value of 143. Let's see how the benchmarks perform now.
TwoSumHash 295,315 43,681 ns/op 20,985 B/op 21 allocs/op
TwoSum 753,201 13,869 ns/op 16 B/op 1 allocs/op
Now, the array solution still wins, but there's something interesting happening here. Besides the Hash solution being ~3.1 times slower, there is also a huge difference in memory usage. The Hash solution uses ~1,311 times more memory than the array solution, as well as 21 times more allocations. This is, of course, because the map has to allocate new buckets and grow as we insert elements into it.
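As an aside, if those allocations bothered us, we could hint the final size to the runtime when creating the map. This is a sketch of mine, not one of the article's solutions:

```go
func TwoSumHashPresized(nums []int, target int) []int {
	// Pre-sizing the map to len(nums) lets the runtime allocate
	// the buckets up front instead of growing and rehashing
	// repeatedly as elements are inserted.
	hMap := make(map[int]int, len(nums))
	for i, n := range nums {
		if j, ok := hMap[target-n]; ok {
			return []int{j, i}
		}
		hMap[n] = i
	}
	return []int{}
}
```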
But these results make the hash map, which is supposed to be faster, look even worse. Remember O(n) vs O(n^2)? Well, it's time to put that to the test.
An even larger data set
Let's take it up a notch. Here is the new benchmark code:
var inputSlice = util.CreteIntSlice(1_000_000, 100_000)
var target = 548_888
Yeah, we are going to use a slice of 1 million random integers between 0 and 100 thousand, and a target value of 548_888. Let's see how the benchmarks perform now.
TwoSumHash 134 86,174,212 ns/op 10,576,811 B/op 2,919 allocs/op
TwoSum 1 340,888,900,172 ns/op 0 B/op 0 allocs/op
Now we have finally given the Hash Map enough space to stretch its legs. And it did. The Hash Map came in at ~86 milliseconds, while the array took ~340.9 seconds. That's ~3,955 times faster. On smaller data sets, the Hash Map was 2 to 3 times slower than its simpler counterpart, but on this large data set, it's almost 4,000 times faster. That's a huge difference.
Now, we have to be fair: the Hash Map solution allocated around 10.5 MB, while the 16 B of the array approach didn't even register in the benchmark. But if you are working with large data sets, you probably have enough memory to spare.
Conclusion
Be aware of the data you run your benchmarks on. For most use cases and side projects, the O(n^2) array solution is going to be more than enough. But if you are working with large data sets, you should definitely consider using a Hash Map.
Also, keep in mind that microbenchmarks often don't reflect real-world performance. Besides the fact that you're testing on a local machine with other processes running, you have to take into account the wider architecture of the system you're building.
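One more classic microbenchmark pitfall worth knowing about: if a benchmarked call's result is never used, the compiler is sometimes free to optimize part of the work away. A common defensive pattern, sketched below assuming the benchmark helpers from earlier, is to store the result in a package-level variable:

```go
// A package-level "sink" prevents the compiler from concluding
// that the result of TwoSum is unused and eliding the call.
var sink []int

func benchmarkTwoSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = TwoSum(inputSlice, target)
	}
}
```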