Measuring performance using BenchmarkDotNet - Part 1
Tony Knight
Posted on March 15, 2021
We all must build fast software, right? Right? It’s true that microservices tend to introduce latencies - stateless functions mean a whole lot more network calls, and you can wave goodbye to data locality. But a microservice still depends on its own code being fast, or at least fast enough.
In the past we’ve relied on profilers, stopwatches, dedicated performance teams, and sometimes plain old complaints from the field. All of these methods require some form of measurement; unfortunately they tend to measure “big picture” performance that lacks detail, often without concrete scenarios. This gets very expensive very quickly.
Very often, you just want to measure the code’s performance without the baggage of dependencies. You might have a critical piece of code that absolutely must meet certain performance criteria. Measuring such microcode can obviously be done with profilers - dotTrace and ANTS, to name just two. The problem is that they bring their own baggage as well and, worse, can’t easily be relied upon in a CI pipeline. So how can you measure microcode performance in CI? Unit tests are a terrible idea, so what else is there? Step forward BenchmarkDotNet.
Measure your code’s performance with benchmarks at near zero cost. All you need is:
BenchmarkDotNet from NuGet
We’ll talk about how to write simple benchmarks, how to run them and how to interpret the results.
No prizes are sought for best efficiency here. Please do not take this as a reference implementation of Fibonacci!
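The implementation under test is not reproduced in this text. As a stand-in, here is a minimal sketch of the kind of method being measured, assuming a lazy generator built on yield return (which the article later suspects as the source of its allocations). The names FibonacciSketch and First are hypothetical, not taken from the demo project.

```csharp
using System.Collections.Generic;

public static class FibonacciSketch
{
    // Hypothetical stand-in for the method under test.
    // Lazily yields the first `count` Fibonacci numbers, starting at 0.
    public static IEnumerable<long> First(int count)
    {
        long current = 0, next = 1;
        for (var i = 0; i < count; i++)
        {
            yield return current;
            (current, next) = (next, current + next);
        }
    }
}
```

For example, string.Join(", ", FibonacciSketch.First(8)) gives "0, 1, 1, 2, 3, 5, 8, 13". A long is plenty for the Count values used in this article, but it would overflow for much larger counts.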
To answer the scaling question, we would implement a benchmark, run it and analyse the results. Skipping forward, a rendered benchmark report would look something like the one below:
What do all the headers actually mean?
| The column | What it means |
| --- | --- |
| Method | The name of the code under test; a single benchmark may have several methods under test, e.g. for different scenarios. This value is lifted directly from your benchmark code. |
| Count | An arbitrary parameter: in this case the number of Fibonacci numbers generated by the method under test. |
| Mean, Error, StdDev | Execution time statistics. Note that these can be given down to nanoseconds, depending on how fast your code is. Low is best. |
| Q1, Q3 | Quartile execution time statistics: note the time units. Low is best. |
| Op/s | The number of operations executed per second for the method/parameter combination. High is good. |
| Rank | The relative ranking of each method/parameter combination, fastest first. Low is best. |
| Gen 0/1/2 | The number of garbage collections per generation, reported per 1,000 operations. |
| Allocated | The bytes allocated per operation, across all generations. |
Note the header information in the report! It gives details on the OS, CPU, .NET version, JIT method and GC configuration. Always benchmark like-for-like!
OK… what do those numbers really mean?
Let’s look at each value of Count, which we’re using here to get the first Count numbers of the Fibonacci sequence.
Where Count is 1 the mean execution time is 103.4 nanoseconds. That’s 0.1 microseconds, or 0.0001 milliseconds. I like that: nice and fast.
Where Count is 13 (yes, the parameters themselves follow Fibonacci!) the mean time is 407.2 ns: four times what Count=1 is, yet the Count is 13 times bigger. I’ll take that, for now.
Where Count is 34 the mean time is 1,077.9 ns, or 1.078 microseconds, or just over 0.001 milliseconds. That’s taking 2.6 times as long as Count = 13. Let’s compare against Count = 1: Count is 34 times bigger, yet takes 10 times the time. I’ll take that too.
If we plot Count against the time ratio we see this:
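In other words, time used does not grow in proportion to Count. If it did, the lines would be parallel.
So the benchmarks are showing that the implementation scales reasonably well over this range. It's not constant time, but the time grows more slowly than Count does: a pleasant surprise. Bear in mind that a fixed per-call overhead produces the same picture, so three data points can't settle the asymptotic complexity on their own.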
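The ratios quoted above can be reproduced with a few lines of arithmetic. The sketch below hard-codes the three (Count, mean) pairs from the report; the class and member names are hypothetical.

```csharp
using System;

public static class ScalingCheck
{
    // (Count, mean time in ns) pairs quoted from the report above.
    public static readonly (int Count, double MeanNs)[] Samples =
    {
        (1, 103.4), (13, 407.2), (34, 1077.9),
    };

    // How many times slower a sample is than the Count = 1 baseline.
    public static double TimeRatio(int index) =>
        Samples[index].MeanNs / Samples[0].MeanNs;

    // How many times larger the sample's Count is than the baseline's.
    public static double CountRatio(int index) =>
        (double)Samples[index].Count / Samples[0].Count;

    // Writes one line per sample so the two ratios can be compared side by side.
    public static void Print()
    {
        for (var i = 0; i < Samples.Length; i++)
            Console.WriteLine(
                $"Count={Samples[i].Count}: count x{CountRatio(i):F1}, time x{TimeRatio(i):F1}");
    }
}
```

For Count = 13 the time ratio comes out at roughly 3.9, and for Count = 34 at roughly 10.4, against count ratios of 13 and 34.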
If you're not satisfied with the performance results, simply make your changes, re-run the benchmarks & re-analyse. That's it.
You haven’t mentioned the memory yet, have you?
Trust me, I’m getting to that.
Pay particular attention to memory usage. Garbage collections and memory allocations are as important as sheer speed!
Count=1 used 128 bytes.
Count=13 used 312 bytes.
Count=34 used 744 bytes.
If we plot Count against the allocation growth ratios, we see this:
This means the memory used isn’t constant either: the memory used for Count=34 is greater than the memory used for Count=1. Again, it grows more slowly than Count does: 34 times the numbers cost just under 6 times the memory. To my mind this is OK, but not great: we need more investigation. The allocations are probably incurred by yield return, but do we want to sacrifice the readability? Probably not, but in any case we’re getting new perspectives on our code. This is a good thing.
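Another way to read the same three figures is as a marginal cost: how many extra bytes each extra Fibonacci number costs between two rows of the report. A small sketch, with a hypothetical helper name:

```csharp
public static class AllocationCheck
{
    // Extra bytes allocated per extra item between two rows of the report.
    public static double BytesPerExtraItem(int countA, long bytesA, int countB, long bytesB) =>
        (double)(bytesB - bytesA) / (countB - countA);
}
```

BytesPerExtraItem(1, 128, 13, 312) is about 15.3 and BytesPerExtraItem(13, 312, 34, 744) is about 20.6. One possible reading, offered as an assumption rather than a measured fact, is a fixed overhead per call plus a fairly steady per-item cost.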
What other rendered reports can you get?
You can output a Markdown version of your report, along with many other formats; the Markdown output is GitHub inspired.
You can use the following attributes to output the many different types of rendered reports:
Charting is supported through the R project. As R is a world in itself, I’m going to skip the subject.
If you want charts, consider importing the rendered CSV into Excel. The CsvExporter attribute will generate a CSV with the data you need.
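The attribute list itself is missing from this text. As an illustration, the sketch below decorates a benchmark class with a handful of BenchmarkDotNet's exporter attributes. The class ExportedBenchmarks and its empty method are hypothetical, and this selection is not necessarily the one used in the demo project.

```csharp
using BenchmarkDotNet.Attributes;

[MarkdownExporterAttribute.GitHub] // GitHub-flavoured Markdown report
[HtmlExporter]                     // standalone HTML page
[CsvExporter]                      // summary table as CSV, e.g. for Excel
[JsonExporterAttribute.Full]       // full measurement data as JSON
public class ExportedBenchmarks
{
    // Hypothetical placeholder so the class has something to run.
    [Benchmark]
    public void DoNothing() { }
}
```

The rendered files are written under a BenchmarkDotNet.Artifacts folder next to where the benchmarks are run.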
Full code example
What does the benchmark code look like using BenchmarkDotNet? It might surprise you to see how simple it is.
BenchmarkDotNet relies on declarative code over which it will reflect. Leaving aside the class attributes (more on those later), note that the [Params] attribute sits over Count, the parameter from the report above; likewise [Benchmark] sits over Fibonacci().
You’ll notice that the benchmarks have a return type of void and do not have any assertions. Remember: we’re not proving functional correctness here, we’re measuring resource usage.
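The original listing is not reproduced in this text, so here is a minimal sketch of what such a benchmark class could look like. Count, Fibonacci(), [Params] and [Benchmark] come from the article. The class name, the [MemoryDiagnoser] attribute, the three parameter values (taken from the rows discussed above) and the Generate helper are assumptions.

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser] // assumption: this is what adds the Gen 0/1/2 and Allocated columns
public class FibonacciBenchmarks
{
    // Each value produces its own row in the report.
    [Params(1, 13, 34)]
    public int Count { get; set; }

    // No assertions: we are measuring resource usage, not correctness.
    [Benchmark]
    public void Fibonacci()
    {
        // Enumerate fully so the lazy sequence actually does its work.
        foreach (var _ in Generate(Count)) { }
    }

    // Hypothetical stand-in for the code under test.
    private static IEnumerable<long> Generate(int count)
    {
        long current = 0, next = 1;
        for (var i = 0; i < count; i++)
        {
            yield return current;
            (current, next) = (next, current + next);
        }
    }
}
```

BenchmarkDotNet sets Count before each run, so the method body only ever refers to the property.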
Show me the code!
I’ve created a simple BenchmarkDotNet implementation here:
There’s only the one C# project in there - benchmarkdotnetdemo.csproj - that contains the minimal files.
BenchmarkDotNet will only work if the console project is built with a Release configuration, that is with code optimisations applied. Running in Debug will result in a run-time error.
This is the Program.cs file, and like all C# console apps you need an entry point:
This one-line-to-rule-them-all will perform all command line parsing, all help, all benchmark execution and all report generation.
One point here is .FromAssembly(typeof(Program).Assembly) - this informs BenchmarkDotNet of its benchmark search scope. Benchmarks are internally discovered by reflection - you’ll see soon enough.
NOTE: If you were to run the project without any command line arguments, BenchmarkDotNet will assume an interactive CLI.
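The Program.cs listing is missing from this text. Based on the calls named above, .FromAssembly(typeof(Program).Assembly) and .Run(args), the entry point could be sketched as follows. Treat it as a reconstruction rather than the project's exact file.

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    // Hands command line parsing, benchmark discovery, execution
    // and report generation over to BenchmarkDotNet.
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
```

Run it with the Release configuration, for example dotnet run -c Release, optionally followed by -- --filter * to run every benchmark without the interactive prompt.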
.Run(args) returns a sequence of report objects made up of the same data used for the rendered reports: I’ve excluded them for simplicity. If you want to run benchmarks and fail CI builds when performance dips, they are your first place to look.
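As a rough sketch of that idea, the helper below walks the summaries returned by .Run(args) and turns them into a process exit code. The CiGate class and the nanosecond budget are hypothetical. A real pipeline would more likely compare against a stored baseline than a single fixed threshold.

```csharp
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Reports;

public static class CiGate
{
    // Returns 1 if any benchmark failed to produce results or its mean time
    // exceeds the (hypothetical) budget in nanoseconds; otherwise returns 0.
    public static int ExitCode(IEnumerable<Summary> summaries, double maxMeanNanoseconds)
    {
        var failed = false;
        foreach (var summary in summaries)
        {
            foreach (var report in summary.Reports)
            {
                var stats = report.ResultStatistics; // null when no results were produced
                if (stats is null || stats.Mean > maxMeanNanoseconds)
                {
                    Console.Error.WriteLine($"Failed or over budget: {report.BenchmarkCase.DisplayInfo}");
                    failed = true;
                }
            }
        }
        return failed ? 1 : 0;
    }
}
```

An entry point declared as static int Main(string[] args) could then return CiGate.ExitCode(BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args), 2_000), so that a non-zero exit code fails the build.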
Create a new benchmark
There is a file called SimpleBenchmark.cs. Let’s have a look.
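The file's contents are not included in this text. Going by the project description (a benchmark that does nothing, a simple addition and a simple multiplication), a sketch could look like the one below. The class name, method names and fields are all hypothetical.

```csharp
using BenchmarkDotNet.Attributes;

public class SimpleBenchmarksSketch
{
    private int _a = 3;
    private int _b = 4;

    // Results go to a property so the JIT cannot discard the arithmetic as dead code.
    public int Result { get; private set; }

    // The absolute minimum: measures an empty method.
    [Benchmark]
    public void DoNothing() { }

    [Benchmark]
    public void Add() => Result = _a + _b;

    [Benchmark]
    public void Multiply() => Result = _a * _b;
}
```

Every public method marked [Benchmark] becomes its own row in the report. No registration is needed beyond the class being public and discoverable in the assembly.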
Just for completeness: note the similar declarations to those in SimpleBenchmarks.cs. In this case, we’re adding a [Params] parameter to support benchmark permutations.
Without going into too much detail, BenchmarkDotNet will attempt to run your benchmarks many, many times over to settle on mean and median values.
When you run the benchmarks you may first be confused by just how many iterations are involved, so let’s give a simplistic explanation. Modern OSs are preemptive multitaskers; CPUs have pipelines, caches and instruction reordering features; .NET itself has the JIT compiler. This means that no single execution of code can be relied upon to give a canonical result.
This is part of the reason why unit tests are terrible for benchmarking! They only run once and incur their own (unaccounted) overheads.
BenchmarkDotNet will run warm-up iterations before it can take representative values. These show up as various stages: OverheadJitting and WorkloadJitting, WorkloadPilot, OverheadWarmup and OverheadActual.
JIT comes at a cost: the first time any .NET code executes it must first be JIT compiled. The more complex the code the higher the JIT cost, usually showing as CPU and time costs. As we’re interested only in runtime performance, these steps eliminate JIT costs from measurements.
In the same vein, other warm-up steps are run to eliminate other “once only” costs, for instance to warm up the CPU caches.
After these steps have completed, BenchmarkDotNet will iterate these operations to yield the final statistics; these are shown as WorkloadActual steps.
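To make the warm-up-then-measure idea concrete, here is a deliberately naive, hand-rolled sketch using Stopwatch. It is not how BenchmarkDotNet is implemented, which also measures and subtracts its own overhead, picks invocation counts in a pilot stage and applies proper statistics. NaiveTimer and its parameters are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class NaiveTimer
{
    // Runs `action` unmeasured a few times, then returns the mean
    // duration of the measured runs in nanoseconds.
    public static double MeanNanoseconds(Action action, int warmups = 5, int iterations = 20)
    {
        // Warm-up: pays the JIT and cold-cache costs before any timing starts.
        for (var i = 0; i < warmups; i++) action();

        var samples = new double[iterations];
        for (var i = 0; i < iterations; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            // Convert Stopwatch ticks to nanoseconds.
            samples[i] = stopwatch.ElapsedTicks * 1_000_000_000.0 / Stopwatch.Frequency;
        }
        return samples.Average();
    }
}
```

Even this much shows why a single timed call is misleading: drop the warm-up loop and the first sample will usually be far slower than the rest.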
If you want more detail, please refer to BenchmarkDotNet’s own documentation. In these code samples we’re using the default Throughput strategy for microbenchmarking.
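If the defaults do not suit, the strategy and iteration counts can be stated explicitly on the benchmark class. The sketch below assumes the SimpleJob attribute's warmupCount and iterationCount parameters as found in recent BenchmarkDotNet versions. The class and the chosen numbers are hypothetical.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

// Throughput is already the default strategy; it is spelled out here
// only so the warm-up and iteration counts can sit alongside it.
[SimpleJob(RunStrategy.Throughput, warmupCount: 5, iterationCount: 15)]
public class TunedBenchmarks
{
    // Hypothetical placeholder benchmark.
    [Benchmark]
    public void DoNothing() { }
}
```

Fewer iterations shorten the run at the cost of noisier statistics, so it is worth checking the Error and StdDev columns after changing them.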
How long does it take?
It depends ;) Simple calculations, such as in the demo project, will run in under a minute. Adding permutations (such as with [Params]) will linearly increase the benchmarking time, as each parameter will be benchmarked in its own right.
With that in mind, it’s quite clear that resource-hungry algorithms, benchmarked with a large variety of parameters, will take a considerable amount of time.
Don’t expect to parallelise BenchmarkDotNet: it runs benchmarks sequentially. Thread context switching is itself a cost and extremely difficult to compensate for.
What have we learned?
We’ve seen how to get BenchmarkDotNet
We’ve seen how to integrate it in a simple console application
We’ve seen the minimum work needed to build benchmarks
We’ve had a taste of the reports and inferences we can gain from BenchmarkDotNet
A simple demonstration of the superlative BenchmarkDotNet and its integration into GitHub Actions.
Measuring code performance is self-evidently a vital discipline in software engineering and yet is so often skipped, usually for false economies. BenchmarkDotNet makes this essential task simplicity itself, with a syntax and style that's immediately intuitive to anyone versed in unit testing.
Just exercise your code in a declarative way, include it in your CI pipeline, and enjoy the results.
This project just demonstrates the basics: the .NET project, the CI pipeline and the resultant reports.
The Benchmarks
The absolute minimum function that can be benchmarked - it does nothing.
A simple addition metric, again of minimal complexity.
A simple multiplication metric, again of minimal complexity.
Benchmarking a Fibonacci implementation, measuring the computation time for the first N Fibonacci numbers.