Tips for writing performant code
Demetre
Posted on December 7, 2021
Why performant code is all about memory
Increasing the performance of a given piece of code is not as simple as throwing more RAM, more GHz, or more cores at it by cranking the cloud provider's knob to 11. Decreasing the runtime of a serial program or improving the per-core scaling of a parallel program requires not only a deeper understanding of the underlying hardware, but also writing code with that hardware in mind.
While it is true that the number of transistors has continued to double roughly every two years, per-core frequency has been tapering off since around 2010, and CPUs' single-core performance has started to follow the same tapering-off trend. In contrast, the number of logical cores per processor has doubled every few years. In recent history, AMD's Ryzen line has pushed the top of the line from 4 cores / 8 threads to 32 cores / 64 threads.
While the number of transistors and logical cores continues to climb every year, unfortunately, the speed of memory accesses has not. The gap between the speed at which the CPU can process numbers and the speed at which it can fetch data from cache or main memory has grown every year. For any modern machine, the time spent pulling data from memory vastly overshadows the time spent doing useful computation. This is why, when trying to improve the run-time performance of code, it is usually not reducing the number of computations that yields the largest improvement. Instead, the biggest wins come from changing how data is laid out and accessed, and in particular from techniques that make the best use of the cache.
An easy-to-understand generalization of modern memory is that there are two levels: a small but very fast cache, and a slow but much larger main memory (DRAM). When something is first loaded from DRAM to be used by the CPU, it is also placed into the cache, evicting something else to make room. This simple model is nice, but in reality there are multiple cache levels and multiple policies for choosing what to evict, and things like hardware prefetching and cache lines also come into play. Those topics are well outside the scope of this article, although interesting to read about nonetheless; the nice and simple two-level model is the one we're going to use. For reference, reading from cache is roughly an order of magnitude faster than reading from main memory: single-digit nanoseconds (ns) versus ~100 ns, although the exact numbers depend entirely on the specific hardware. While a programmer can't directly choose what goes into cache and what gets evicted, they can still develop a very good idea of it. Staying cognizant of what is likely to be in cache and avoiding slow memory accesses is the key to achieving a sizable speedup.
This entire article is written with C in mind, but the main ideas transfer to most other languages.
Foreword
Before trying to engineer code to run faster, the underlying algorithm should always be looked at first. There is almost no reason to optimize an O(n^2) algorithm when an O(n) one exists. While there are cases, especially when dealing with small amounts of data, where the O(n^2) version might be faster due to implementation details, this article targets algorithms running on data much larger than that. Nothing in this article will help make selection sort faster than merge sort on sufficiently large arrays.
Ways to increase cache hits
The whole name of the game for improving serial performance is increasing cache hits, or, put another way, reducing how often something has to be loaded from main memory. Either way, the end goal is to reduce the amount of time (cycles) it takes to get data into the CPU. The two major ways to do this are:
Temporal locality and
Spatial locality.
Before moving on, we should add a small but important twist to the simple memory model. When something is loaded from slow memory into cache, the unit of transfer is not a single byte or word but an entire cache line, typically 64 or 128 bytes. When the program requests a single int and that int is in slow memory, the entire cache line containing it is brought into cache.
Temporal locality
Reuse data that is already in cache.
This means that once a value is loaded in for the first time, it should be reused as much as possible. It should be mentioned that it's never possible to avoid slow memory accesses entirely, since at some point everything has to be loaded in for the first time, even when considering cache lines.
As an example, compare the following code snippets:
for person in persons:
    person.first_name.lower()

// lots of unrelated memory loads here

for person in persons:
    person.last_name.lower()

vs

for person in persons:
    person.first_name.lower()
    person.last_name.lower()
While this is quite a simplified example, the idea is visible. For a large persons array, each person will have to be loaded twice in the first version but only once in the second. While a good compiler might see this and fuse the loops together, one can't always rely on compiler optimizations to reorder large blocks of code.
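In C, the fused version might look something like the sketch below; the person struct and the to_lower_str helper are hypothetical, only there to make the example concrete:

#include <ctype.h>
#include <stddef.h>

struct person {
    char first_name[64];
    char last_name[64];
};

/* Hypothetical helper: lower-cases a NUL-terminated string in place. */
static void to_lower_str(char *s)
{
    for (; *s; s++)
        *s = (char)tolower((unsigned char)*s);
}

/* Fused loop: each person is pulled into cache once, and both of its
   fields are processed while that data is still resident. */
void lower_names(struct person *persons, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        to_lower_str(persons[i].first_name);
        to_lower_str(persons[i].last_name);
    }
}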
Spatial locality
Operate on data that is stored close together in memory and would therefore likely have been brought into cache together.
Because data is brought into cache at the granularity of a cache line, adjacent data is brought into cache as well. When iterating through an array, it is therefore much better to access elements linearly than to jump around at random.
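A classic way to see this is traversal of a 2D array. C stores arrays in row-major order, so in the sketch below (the size N and the grid array are arbitrary choices for illustration), sum_rows walks memory linearly while sum_cols strides through it, touching a new cache line on nearly every access and running noticeably slower on large arrays:

#define N 4096

static double grid[N][N];

/* Row-major traversal: consecutive iterations touch adjacent addresses,
   so each cache line that is brought in gets fully used. */
double sum_rows(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    return sum;
}

/* Column-major traversal of the same data: each access jumps N doubles
   ahead, so nearly every access lands on a different cache line. */
double sum_cols(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    return sum;
}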
While most loops are linear by default, one must not forget about the instructions of the code itself. They also need to be brought into cache and read before they can be executed. While this effect is small, having many function calls requires many jumps in instruction memory, creating more potentially slow memory accesses. This is why some compilers will automatically inline functions, saving the function call overhead and the potential slow memory access. Modern CPUs also do branch prediction, which somewhat mitigates the performance hit of function calls and logical branching statements. Again, though, one can't always rely on compiler optimizations when trying to squeeze out every last bit of performance.
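As a small, hypothetical illustration, C also lets the programmer hint that a short, hot function should be inlined, which avoids the call-and-jump overhead described above (compilers at -O2 and higher will often make this decision on their own):

/* A short, frequently called helper. Marking it static inline hints the
   compiler to paste its body at each call site, avoiding the jump to a
   separate function elsewhere in instruction memory. */
static inline long dot3(const long a[3], const long b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}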
Parallel Processing
"But why can't you just add more cores to the code to make it run faster? Why bother spending time to make serial code faster?"
Without fast serial code, the parallel code won't be fast either. All the same principles for making serial code fast apply to parallel code as well, and without those fundamentals already in place, the parallel code will never be as fast as it could be. There are extra things to keep in mind when dealing with parallel code: explaining how to decompose data, decompose tasks, schedule threads, and everything else required to write incredibly fast parallel code would fill an entire book. Instead, this article mentions just a few basic things to keep in mind before trying to turn serial code into parallel code.
Amdahl's Law
Amdahl's law gives a formula for the theoretical speedup in the run time of a task or program. Not all parts of a program can be parallelized, so even with infinite cores and perfectly parallelized code, there is still a minimum run time dictated by the serial parts of the program. This is why focusing on writing fast serial code is the important first step. Even if 90% of the program can be parallelized, with infinite cores it can still only achieve a 10x speedup; if 75% can be parallelized, the best possible speedup is 4x. Well-written serial code can easily achieve a 4x speedup by itself, and it is easier to write, easier to debug, and easier to understand.
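Concretely, Amdahl's law says that with a parallelizable fraction P and N cores, the speedup is 1 / ((1 - P) + P / N); as N grows, the bound approaches 1 / (1 - P). A minimal sketch of that arithmetic (the function name is just for illustration):

#include <stdio.h>

/* Amdahl's law: speedup with n cores when a fraction p of the
   run time can be parallelized. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* With infinite cores the p/n term vanishes, leaving 1 / (1 - p). */
    printf("p = 0.90, infinite cores: at most %.1fx\n", 1.0 / (1.0 - 0.90));  /* 10.0x */
    printf("p = 0.75, infinite cores: at most %.1fx\n", 1.0 / (1.0 - 0.75));  /*  4.0x */
    printf("p = 0.90, 8 cores:        about  %.2fx\n", amdahl_speedup(0.90, 8));
    return 0;
}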
Starting with a slow serial implementation and trying to parallelize it will never give results as good as writing very good serial code from the start and parallelizing that.
Embarrassingly parallel
This is almost the opposite of the idea behind Amdahl's law. Amdahl's law says that even in the best case one can only achieve a modest speedup multiplier. Embarrassingly parallel problems are ones in which little to no effort is required to parallelize the problem effectively. These problems are sometimes also called perfectly parallel or pleasingly parallel. Their opposite is inherently serial problems, which can't be meaningfully parallelized at all.
It's worth learning which types of problems are embarrassingly parallel, because they're the low-hanging fruit when trying to decrease run time. These problems are normally characterized by having no interdependencies between tasks: the data can be split easily, and no thread-to-thread communication is required. Password cracking, 3D video rendering, and numerical integration are all examples of this class of problems.
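Numerical integration shows how little effort "embarrassingly parallel" can mean in practice. The sketch below approximates pi by integrating 4 / (1 + x^2) over [0, 1] with the midpoint rule using OpenMP; because every slice of the integral is independent, a single pragma parallelizes the loop (assuming a compiler with OpenMP support, e.g. building with gcc -fopenmp):

#include <stdio.h>

int main(void)
{
    const long   n = 100000000;      /* number of slices */
    const double h = 1.0 / n;        /* width of each slice */
    double sum = 0.0;

    /* Each iteration only reads i and h and adds to its own partial sum,
       so the work splits across threads with no communication needed. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;    /* midpoint of the i-th slice */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approximately %.10f\n", sum * h);
    return 0;
}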
Parallelizing isn't free
For a given problem, even when a large part of it can be parallelized, and parallelized well, there is still an added runtime cost to parallelizing it. Only in rare circumstances is no synchronization between threads required. In C/C++, locking and unlocking mutexes has a cost; even when the mutex isn't contended by other threads and fewer (operating) system calls are made, there is still additional overhead. There are other ways to synchronize threads: semaphores, signals, and barriers. These synchronization mechanisms are often required to ensure program correctness and avoid race conditions, and they also add developer overhead, which is one more thing to consider. There is no free lunch when parallelizing part of an algorithm.
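A minimal pthreads sketch of that overhead: the shared counter below is protected by a mutex, so every increment pays for a lock/unlock pair even when no other thread is contending for it (the thread count and iteration count are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS  1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);    /* synchronization cost on every pass */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}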
False Sharing
When two threads load the same cache line into their respective caches and only read its values, there is no issue. However, when one thread writes to any part of that cache line, the entire line is marked invalid and written back to main memory. Whenever another thread then reads that invalidated cache line, it has to be re-read from slow memory, even if the thread is accessing a different part of the line. This issue is called false sharing, and it causes an almost artificial increase in run time. It commonly shows up when an array is shared between threads: if each thread writes to a different index of the array, no synchronization is required, but data access times will be very bad because the entire cache line has to be reloaded from slow memory every time. The issue can be worked around by giving each thread its own local variable instead of sharing the array, or by padding the array with dummy values so that the values each thread touches are at least one cache line apart.
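A minimal sketch of the padding workaround, assuming a 64-byte cache line (a common size, though it depends on the hardware): each thread writes only to its own counter, and the padding keeps those counters on separate cache lines so one thread's writes never invalidate the line another thread is using.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define CACHE_LINE  64   /* assumed cache line size in bytes */

/* The padding pushes each counter onto its own cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].value++;   /* private line: no false sharing, no locks */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        total += counters[i].value;
    printf("total = %ld\n", total);
    return 0;
}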
Measure scaling
In an ideal world, run time decreases linearly with the number of threads or cores added. Real-world use cases rarely reflect this ideal speedup. When measuring the speedup achieved by adding cores, there are two ways to measure it:
- Strong scaling
This is when the amount of data being worked on stays the same while the number of cores increases, and
- Weak scaling
This is when the amount of data grows with the number of cores being added.
These scaling numbers are by no means guaranteed to be the same, and it's worth investigating both of them when comparing a parallel version of a program to the serial version.
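As a rough sketch (the function names are just for illustration), both measurements boil down to an efficiency number computed from wall-clock timings, where 1.0 means perfect scaling:

/* Strong scaling: the same total problem run on 1 core vs. n cores. */
double strong_scaling_efficiency(double t1, double tn, int n)
{
    return t1 / (n * tn);   /* 1.0 means the run time dropped by a factor of n */
}

/* Weak scaling: the problem size grows in proportion to the core count. */
double weak_scaling_efficiency(double t1, double tn)
{
    return t1 / tn;         /* 1.0 means the run time stayed flat as the work grew */
}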
Things not mentioned
While these topics are interesting and worth investigating in their own right, they're simply too complex for this article.
Topics left for further reading include:
Register optimization and other compiler optimizations,
The different levels of gcc optimization, e.g. -O1, -O2, -O3,
Loop unrolling,
Other multi-processor models such as distributed systems and GPUs,
Calculating efficiency, machine balance, and machine peak,
Branch prediction on modern CPUs, and
Hardware prefetching (or why iterating through an array doesn't normally cause cache misses).