Advanced Thread Safety in C++
Paul J. Lucas
Posted on July 13, 2023
Introduction
C++ supports writing programs where parts of the code run concurrently using threads. When writing such programs, you have to take additional steps to ensure that data shared among threads can not cause race conditions. Typically, race conditions are avoided by proper use of mutexes and locks. However, in high-performance code, mutexes can sometimes be too costly, especially with high-contention data. C++ supports alternatives.
Note: what follows will not describe how to create or join threads since those things are largely straightforward. Instead, what follows will describe how to write programs safely in a multithreaded program.
A Brief History
In the late 1960s, early 1970s, programs (running on computers such as the PDP-7 and later the PDP-11) were:
- Single-threaded. (While nascent “threads” appeared in 1966, POSIX threads didn’t arrive until 1995.)
- Executed sequentially. (That is, machine code instructions generated by compilers were executed one at a time and in the same order as the original statements in your program were written.)
- Running in flat (non-hierarchical) memory. (CPU caches didn’t exist until the late 1970s and didn’t go mainstream until 1993 with the Intel Pentium.)
While CPU performance was increasing rapidly, memory performance lagged:
Performance Mitigation Tactics
In order not to have overall performance constrained by memory, CPU designers had to employ mitigation tactics of:
- Memory caching.
- Instruction parallelization.
- Speculative execution.
- Multiple CPUs/cores.
All of these tactics are done entirely by the hardware. You as the programmer have no way either to know or really influence what’s going on under the hood.
Additionally, compiler implementors also employ their own mitigation tactics of:
- Constant folding.
- Copy elision.
- Statement hoisting.
- Instruction scheduling.
- Loop unrolling.
- ... and many more.
The compiler is free to do any optimization so long as it makes no change to the observable behavior of your program (the “as-if rule”).
As long as your program has only a single thread, you can remain blissfully unaware of either the hardware or compiler tactics being employed. However, once you have multiple threads, you must become acutely aware of what’s going on under the hood.
Atomic
When doing multithreaded programming, you’ll hear the term “atomic” used. What, exactly, is meant by “atomic” anyway? There are three related, but distinct, meanings:
A value operation (read or write) completes with no possible intervening operation by another thread, e.g., writing single-byte values on any CPU, or 16- or 32-bit values on a 32-bit CPU, or 64-bit values on a 64-bit CPU.
An updated value is visible to other CPUs.
Multiple value operations complete with no possible intervening operation by another thread, e.g., updating A and B together (“transactional”).
“Atomic” always means #1; it usually also means #2; or all three.
A (Bad) Example
Consider the following code where the main thread spawns another. Both threads run so long as shutdown
remains false
. To shut down, the main thread sets shutdown
to true
. In response, both while
loops will exit and the main thread will join the other thread. Unfortunately, the code is wrong. (More unfortunately, I’ve seen such code in production.)
bool shutdown;
void thread_1() {
while ( !shutdown ) {
// ...
}
}
int main() {
std::thread t1{ thread_1 };
while ( !shutdown ) {
// ...
if ( some_condition )
shutdown = true;
}
t1.join();
}
It’s wrong because shutdown
is shared by multiple threads improperly. Some programmers mistakenly think along the lines of:
“It’s just a
bool
, a single byte, so of course it’s thread-safe because you can’t read or write half a byte, so it must be atomic.”
The problem is that while that’s actually true, it’s insufficient. A bool
is atomic only by meaning #1; it’s not atomic by meaning #2. That is, just because a bool
is updated by one thread does not mean that updated value is visible to other threads (running on other CPUs).
A (Bad) Fix: volatile
I’ve also seen code where programmers knew they had to do something more to make such code thread-safe. Unfortunately, that something turned out to be:
volatile bool shutdown; // Thread-safe now? No!
That is, they inserted volatile
because they kind-of understand what it does, but not what it’s for. The attempted use of volatile
to make something thread-safe is always wrong in C++.
To understand what
volatile
does and what it’s for, see here.
A Better Fix: Mutexes & Locks
The classic way to fix this code (correctly) would be to use a mutex:
bool shutdown;
std::mutex shutdown_mux;
void thread_1() {
while ( true ) {
// ...
std::lock_guard<std::mutex> lock{ shutdown_mux };
if ( shutdown )
break;
}
}
int main() {
std::thread t1{ thread_1 };
while ( true ) {
// ...
if ( some_condition ) {
std::lock_guard<std::mutex> lock{ shutdown_mux };
shutdown = true;
break;
}
}
t1.join();
}
The mutexes and locks make shutdown
be atomic by meaning #1 and meaning #2. While not necessary for this program, mutexes and locks also make things atomic by meaning #3 as well, so they’re the most general, one-size-fits-all tool for making programs thread-safe. However, C++ offers alternative tools that provide a more custom fit especially useful in high-performance code.
A Word of Caution
Writing safe multithreaded code even with ordinary mutexes and locks is hard to get right, even for experts. Writing safe multithreaded code using advanced thread-safety techniques is even harder! Before considering them, you should:
Know that using efficient algorithms matters far more than locking technique. For example, an N⋅log(N) algorithm with slower mutexes will still very likely be more performant than an N2 algorithm with faster locking (or lock-free) techniques.
Measure to see if your code is spending too much time locking. (Remember: results are CPU-dependent!)
Only if locking takes a significant percentage of time, then consider the following techniques.
An Even Better Fix: std::atomic
While use of mutexes and locks is correct, they’re fairly heavy-weight. In the case of using a bool
, a much lighter-weight fix is:
std::atomic<bool> shutdown; // Thread-safe now? Yes!
Use of std::atomic<T>
does not make shutdown
atomic by meaning #1 (it already is); it makes it atomic by meaning #2.
The std::atomic<T>
class is specialized for all built-in types (bool
, char
, int
, etc.) and all pointer types (T*
for any T
). It can even be used for your own type T
, but only when:
-
T
is trivially copyable (that is, ifmemcpy()
would correctly copyT
). -
sizeof(T)
≤ 16 (typically).
When T
is an integral or pointer type, std::atomic<T>
has conventional operators overloaded for it:
std::atomic<int> x{ 0 };
x = 1; // x.store(1)
int i = x; // x.load()
++x; // x.fetch_add(1)
x++; // x.fetch_add(1)
--x; // x.fetch_sub(1)
x--; // x.fetch_sub(1)
x += 1; // x.fetch_add(1)
x -= 1; // x.fetch_sub(1)
x &= 1; // x.fetch_and(1)
x |= 1; // x.fetch_or(1)
x ^= 1; // x.fetch_xor(1)
The operators are shorthands for the member functions shown in the comments. While the operators are convenient, I always use the member functions to make it obvious in code that the variable being manipulated is a std::atomic
.
Incidentally, this:
x = x + 1; // not the same as either ++x or x += 1
is not thread-safe because the read of x
and the write to x
are not atomic by meaning #3.
Lock-Free Atomics
Even though std::atomic<T>
offers better performance over mutexes and locks, it’s not guaranteed to. For a type T
on a specific CPU under specific conditions, is_lock_free()
and is_always_lock_free()
check. Those conditions include:
Accessing unaligned values, e.g., reading an
int
at a memory address that’s not evenly divisible bysizeof(int)
(assuming the CPU can even do it at all without generating a bus error).Runtime CPU dispatching, e.g., a particular program during initialization may detect whether the CPU supports the
cmpxchg16b
instruction and, if not, may be forced to use locks.
The only type that’s guaranteed to be always lock-free on all CPUs under all conditions is std::atomic_flag
that’s a special-case of std::atomic<bool>
. (An example using std::atomic_flag
is shown later.)
Memory Barriers
Atomic (by any meaning) is not enough to ensure thread-safety because the order of memory operations does not necessarily match the order of statements in a program. Why not? Because the hardware, the compiler, or both, may reorder things to improve performance. The “as-if” rule holds only from the perspective of a single thread.
Memory barriers (aka, “fences”) help ensure thread-safety by selectively prohibiting reordering of memory operations across the barrier. They also provide some synchronization among threads. C++ provides std::memory_order
.
The sometimes confusing thing about memory barriers is that they’re about controlling the order of operations, not the operations themselves.
memory_order_seq_cst
: Sequential Consistency
This is the safest memory order which is why it’s the default. For example, the signature of std::atomic<T>::fetch_add()
is:
T fetch_add( T arg, std::memory_order order
= std::memory_order_seq_cst ) noexcept;
It’s also the least efficient because it establishes a global memory ordering (bottleneck) across all threads. In specific cases, more efficient memory orders can be used.
memory_order_relaxed
: No Ordering
This is the least safe memory order in that it does not guarantee operation order or synchronization, but still guarantees modification order. What good is that? A typical use-case is incrementing reference counts:
template<typename T>
class shared_ptr {
public:
// ...
private:
struct counted_obj {
// ...
std::atomic<size_t> _count;
T _obj;
};
counted_obj *const _co;
void _inc_ref() noexcept {
_co->_count.fetch_add( 1, std::memory_order_relaxed );
}
// ...
};
At this point, you might ask something along the lines of:
“If there’s no synchronization, how can this possibly work? Couldn’t thread 1 increment the value from N to N+1 and thread 2 also increment the old value from N to N+1?”
The answer is no for two reasons:
Each increment is still atomic (by meanings #1 and #2) so each thread always sees the latest value. What
memory_order_relaxed
does is allow the hardware or the compiler to reorder other operations having nothing to do with_count
either before or after its increment to improve overall performance, not the performance of_count
specifically.Because
_count
is part of a private data structure, it’s guaranteed that no actions are ever conditionally taken by another thread based on its current value. Said another way, when only performing an increment and not altering the control flow in any thread, then use ofmemory_order_relaxed
is safe.
An example of when memory_order_relaxed
is not safe is when decrementing reference counts. (More later.)
Now that you hopefully understand memory_order_relaxed
, you should never use it unless you can prove your use of it is correct and it actually significantly improves performance. Correct use of memory_order_relaxed
is very hard to do.
memory_order_acquire
& memory_order_release
These memory orders are safer and always used in pairs:
memory_order_release
(used withstore()
) is used to “publish” information: no accesses can be reordered after.memory_order_acquire
(used withload()
) is used to “subscribe” to information: no accesses can be reordered before.
In this example, shared_data
is shared among two threads and data_ready
that acts as a semaphore:
int shared_data;
std::atomic<int> data_ready{ 0 };
void producer() {
shared_data = 42;
data_ready.store( 1, std::memory_order_release );
}
void consumer() {
while ( data_ready.load( std::memory_order_acquire ) == 0 )
;
assert( shared_data == 42 ); // always true
}
int main() {
std::thread t1{ producer };
std::thread t2{ consumer };
t1.join();
t2.join();
}
The
producer
setsshared_data
then signals that the data is ready (or “publishes” it). The write toshared_data
“happens before” the write todata_ready
because the use ofmemory_order_release
guarantees that the write toshared_data
can not be reordered afterdata_ready
.Meanwhile, the
consumer
busy waits for the signal ondata_ready
, then safely readsshared_data
. The read fromdata_ready
“happens before” the read fromshared_data
because the use ofmemory_order_acquire
guarantees that the read fromshared_data
can not be reordered beforedata_ready
.
Things to note:
-
shared_data
itself need not be atomic. (This is particularly useful for data that can not be made atomic either because it’s too big or is not trivially copyable.) - You can set any amount of data, then “publish” all of it simultaneously.
At this point, you might ask whether busy waiting is efficient. In general, it’s efficient when the waits are short — shorter than the time it would take to lock and unlock a mutex.
A general spinlock (a kind of busy waiting) can be implemented using std::atomic_flag
:
std::atomic_flag mutex = ATOMIC_FLAG_INIT;
class spin_lock {
public:
explicit spin_lock( std::atomic_flag *mutex ) noexcept :
_mutex{ mutex }
{
while ( !_mutex->test_and_set( std::memory_order_acquire ) )
;
}
~spin_lock() noexcept {
_mutex->clear( std::memory_order_release );
}
private:
std::atomic_flag *const _mutex;
};
memory_order_consume
This memory order is a special case of memory_order_acquire
that allows operations to be dependency-ordered that can be more efficient on weakly ordered CPUs (ARM, PowerPC; but not x86). From the earlier example:
int shared_data;
std::atomic<int> data_ready{ 0 };
There’s no actual dependency between shared_data
and data_ready
other than what’s in our minds — which the compiler has no way to know. To create an actual dependency that the compiler can use to keep things in dependency order, we can instead do:
std::atomic<int*> data_ready{ nullptr };
void producer() {
static int shared_data;
shared_data = 42;
data_ready.store( &shared_data, std::memory_order_release );
}
void consumer() {
while ( (data_ready.load( std::memory_order_consume )) == nullptr )
;
assert( *data_ready == 42 ); // always true
}
Now, we’ve created an actual dependency using a pointer and a pointed-to value in that the value of the pointer has to be loaded before the pointer is dereferenced. The compiler can preserve this order without using an additional and more expensive memory barrier.
memory_order_acq_rel
This memory order is memory_order_acquire
and memory_order_release
combined into one used for read-modify-write operations: no accesses can be reordered before or after. A typical use-case is decrementing reference counts:
template<typename T>
class shared_ptr {
public:
// ...
~shared_ptr() noexcept {
if ( _co->_count.fetch_sub( 1, std::memory_order_acq_rel ) == 1 )
delete _co;
}
// ...
};
The reason memory_order_relaxed
can’t be used here is because a use of the reference counted object in another thread could be reordered to be after the delete
happens resulting in undefined behavior.
std::atomic_thread_fence()
Using std::atomic
with memory_order_relaxed
is atomic without a barrier; using std::atomic_thread_fence()
is a barrier without a specific atomic. It’s useful if you want to update several atomic values together. For example, instead of doing:
atomic<int> x, y;
x.store( 1, std::memory_order_release );
y.store( 2, std::memory_order_release );
you can do:
x.store( 1, std::memory_order_relaxed );
y.store( 2, std::memory_order_relaxed );
std::atomic_thread_fence( std::memory_order_release );
But there is another difference. Given:
std::atomic<int> sync{ 0 };
// (1) load & store operations
sync.store( 1, std::memory_order_release );
// (2) store operations
No operations in (1) will be reordered after the store-release of sync
into (2); however, store operations in (2) can be reordered before the release into (1).
std::atomic_thread_fence()
imposes stronger ordering guarantees than an operation on an atomic with the same memory order. Given:
// (1) load & store operations
std::atomic_thread_fence( std::memory_order_release );
// (2) load operations
sync.store( 1, std::memory_order_relaxed );
// (3) store operations
No operations in (1) will be reordered after all subsequent store operations into (3) and store operations in (3) can not be reordered before the fence into (1). However, since memory_order_release
controls store operations only, load operations in (2) can be reordered before the release into (1).
The rules for memory_order_acquire
are similar except it controls load operations only and enforces ordering in the opposite direction. The differences are subtle. In general, an acquire or release associated with a particular atomic is preferred.
Compare-and-Swap (CAS)
Compare-and-swap (CAS), as its name suggests, is both compare and swap operations done together atomically (by all atomic meanings). Conceptually, it’s implemented as:
template<typename T>
bool atomic<T>::compare_exchange( T &expected, T desired ) {
if ( this->_value == expected ) {
this->_value = desired;
return true;
}
expected = this->_value;
return false;
}
except done atomically. The idea is that you check the value of an atomic variable and:
- If it’s the value you expect (or hope for), then (and only then) set it to a new desired value; or:
- If it’s not the value you expect (or hope for), it means some other thread changed the value, so do nothing. You are free to reattempt setting the value.
There are actually two flavors: compare_exchange_weak()
and compare_exchange_strong()
. (More on the difference later.)
I don’t know why the C++ committee named these functions with “compare exchange” instead of the conventional “compare and swap.” I’ll continue to use the “CAS” abbreviation regardless.
For example, we can reimplement the previous spin_lock
class using CAS:
std::atomic<bool> mutex;
class spin_lock {
public:
explicit spin_lock( std::atomic<bool> *mutex ) noexcept :
_mutex{ mutex }
{
bool flag = false;
while ( !_mutex->compare_exchange_weak( flag, true ) )
flag = false; // must reset
}
~spin_lock() noexcept {
_mutex->store( false );
}
private:
std::atomic<bool> *const _mutex;
};
Using CAS, we check the value of _mutex
:
If it’s
false
, it means it’s currently unlocked, so set it totrue
to indicate it’s now locked and returntrue
.-
If it’s
true
, it means it’s currently locked (by another thread), so do not set the value and returnfalse
.Note that in this case, the first parameter of
flag
is an in/out parameter and it’s set to the atomic’s current value (here,true
) even though we don’t care. This forces us to resetflag
tofalse
prior to another attempt.
Lock-Free Operations
One of the primary use-cases for CAS is that it allows you to implement lock-free operations on data structures. For example, part of a lock-free stack
implementation might be:
template<typename T>
class stack {
public:
struct node {
node *_next;
T _data;
explicit node( T &&data ) : _data{ std::move( data ) } { }
};
void push( T &&data ) {
node<T> *new_node = new node<T>{ std::move( data ) };
new_node->_next = _head.load( std::memory_order_relaxed );
while ( !_head.compare_exchange_weak(
new_node->_next, new_node,
std::memory_order_release,
std::memory_order_relaxed ) );
}
private:
std::atomic<node<T>*> _head;
};
After a new node is created, we try to update _head
:
If it’s still equal to
new_node->_next
(the original value of_head
), update_head
to point tonew_node
using the memory order specified by the third argument.If it’s not equal, it means another thread snuck in and updated
_head
to point to different new node, so do nothing and reattempt. (Note thatnew_node->_next
has been updated to be the new_head
pointing to the different node using the memory order specified by the fourth argument.)
compare_exchange_weak()
vs compare_exchange_strong()
The existence of compare_exchange_weak()
implies there’s a compare_exchange_strong()
— and there is. The difference between them is:
compare_exchange_strong()
fails (returnsfalse
) only if the current value is not the expected value.compare_exchange_weak()
may also fail if there’s a “spurious failure.”
A “spurious failure” is quirk on weakly ordered CPUs (e.g., ARM and PowerPC; but not x86) where the operation fails for reasons other than the value not being the expected value.
So if compare_exchange_strong()
never fails spuriously, why does compare_exchange_weak()
exist? Before answering that question, let’s look at a conceptual implementation of them both:
enum class cas_result {
EQUAL,
NOT_EQUAL,
SPURIOUS_FAILURE
};
template<typename T>
cas_result cas_weak_impl( atomic<T> *p, T &expected, T desired );
template<typename T>
bool atomic<T>::compare_exchange_weak( T &expected, T desired ) {
return cas_weak_impl( this, expected, desired ) == cas_result::EQUAL;
}
template<typename T>
bool atomic<T>::compare_exchange_strong( T &expected, T desired ) {
cas_result cr;
do {
cr = cas_weak_impl( this, expected, desired );
} while ( cr == cas_result::SPURIOUS_FAILURE );
return cr == cas_result::EQUAL;
}
Assume there’s only cas_weak_impl()
that implements a weak version of CAS:
compare_exchange_weak()
is a simple wrapper aroundcas_weak_impl()
.compare_exchange_strong()
, however, wraps it with a loop that effectively filters out spurious failures. The important thing to remember here is that there’s a loop.
Given that, the benefits of compare_exchange_weak()
are:
Spurious failures tend not to happen all that often.
When your code is using a loop anyway, the weak version will yield better performance on weakly ordered CPUs (ARM, PowerPC; but not x86 — but it’s no worse).
Detects the ABA Problem (on weak CPUs) — more later.
So then why does compare_exchange_strong()
exist?
If you have a loop only to filter out spurious failures, don’t: use
compare_exchange_strong()
.But if you have a loop anyway, use
compare_exchange_weak()
.However, if handling a spurious failure is expensive (for example, if you have to discard and reconstruct a new object), use
compare_exchange_strong()
.But
compare_exchange_strong()
doesn’t detect the ABA Problem (more later).
Given all that, you might wonder when would you ever not have a loop for compare_exchange_strong()
? One example is implementing try_lock()
:
bool my_try_lock( std::atomic<bool> *mutex ) noexcept {
bool flag = false;
return mutex->compare_exchange_strong( flag, true );
}
The “ABA Problem”
The ABA Problem can be illustrated as follows. Suppose thread 1 performs the following steps:
- Read a memory location (value is “A”).
- Do some work.
- Read the same memory location again (value is still “A”).
- Compare the original and recent values (they’re equal).
- Conclusion: nothing has changed.
The problem is suppose thread 2 performs the following steps while thread 1 is doing its step 2:
- Write “B” to the same memory location.
- Do some work.
- Write “A” to the same memory location.
By the time thread 1 does its step 3, it reads “A” and believes nothing has changed — even though it has. You might now ask:
“If the value of “A” is the same, why does it matter?”
The answer is: sometimes it doesn’t — but sometimes it does.
Consider the stack<T>::push()
code from earlier (repeated here for convenience):
void push( T &&data ) {
node<T> *new_node = new node<T>{ std::move( data ) };
new_node->_next = _head.load( std::memory_order_relaxed );
while ( !_head.compare_exchange_weak(
new_node->_next, new_node,
std::memory_order_release,
std::memory_order_relaxed ) );
}
and the following illustration:
State (1) shows the initial conditions where A and B are nodes on the stack, _head
(H) points to A, and new_node
(N) has its _next
also point to A. The dotted box contains _head
and new_node->_next
that are being compared. State (3) shows the desired final state where _head
points to new_node
and new_node->_next
points to A.
But what if another thread sneaks in before the while
loop is entered, pushes a new node X shown by state (2), then immediately pops it? Both _head
and new_node->_next
still point to A, so _head
will be set to new_node
— which is correct. In this case, the ABA Problem isn’t actually a problem.
But now consider what the code for stack<T>::pop()
might be:
T pop() {
node<T> *first = _head.load( std::memory_order_relaxed );
while ( first != nullptr &&
!_head.compare_exchange_weak(
first, first->_next,
std::memory_order_release,
std::memory_order_relaxed ) );
if ( first == nullptr )
throw empty_stack_exception{};
T ret_val{ std::move( first->_data ) };
delete first;
return ret_val;
}
and the following illustration:
State (1) shows the initial conditions where A, B, and C are nodes on the stack and _head
(H) and first
(F) both point to A. The dotted box contains _head
and first
that are being compared. State (5) shows the desired final state where _head
points to B.
But what if another thread sneaks in before the while
loop is entered, pops A and B shown by state (2), then pushes A shown by state (3)? Both _head
and first
still point to A, so _head
will be set to first->_next
— which is wrong! Why? Originally, first
pointed to A whose _next
pointed to B, hence first->_next
(the desired argument in the compare) is B. But B was deleted in (2) so we’ll end up in state (4) with _head
being a dangling pointer to B that was deleted. In this case, the ABA Problem is actually a problem! (It wasn’t a problem for push()
because the desired value of new_node
could never become stale.)
Even worse, there’s no easy fix for this. In this case, the problem is that part of the desired value expression can change. While first
stays the same, what first->_next
points to can change. Detecting ABA Problems is hard, even for experts.
How can this be fixed?
- Give up and just use a mutex and locks.
- Implement versioned pointers.
- Implement hazard pointers.
Versioned Pointers
A versioned pointer is an ordinary pointer plus an additional “version number” such that every time the value of the pointer changes, the version number is incremented. Conceptually, something like:
template<typename T>
class vers_ptr {
public:
vers_ptr() noexcept : _ptr{ nullptr }, _vers{ 0 } { }
vers_ptr( T *ptr ) noexcept : _ptr{ ptr }, _vers{ 1 } { }
vers_ptr& operator=( T *ptr ) noexcept {
set( ptr );
return *this;
}
// ...
private:
T *_ptr;
uintptr_t _vers; // must be sizeof(T*) for cmpxchg16b
void set( T *ptr ) noexcept {
_ptr = ptr;
++_vers;
}
};
static_assert( sizeof( vers_ptr<void> ) == 16 );
Then use vers_ptr<node>
instead of node<T*>
:
struct node {
vers_ptr<node> _next;
// ...
};
T pop() {
vers_ptr<node> first = _head.load( std::memory_order_relaxed );
while ( first != nullptr &&
!_head.compare_exchange_weak(
first, first->_next,
std::memory_order_release,
std::memory_order_relaxed ) ) ;
// ...
}
// ...
std::atomic<vers_ptr<node>> _head;
This would work because, even though the _ptr
part of first
would still point to A, its _vers
part would be different, so _head
and first
would not compare equal, so _head
would not be set to B, the stale value of first->_next
.
The caveats of vers_ptr
, however, are:
This will work only if the CPU supports double-pointer-wide (16 bytes on a 64-bit CPU) CAS (which, for example, x86_64 does via the
cmpxchg16b
instruction).As mentioned, a specialization of
std::atomic<T>
requiresT
to be trivially copyable — which is whyoperator=(vers_ptr<T> const&)
can’t be implemented.
Cache Lines
For many CPUs, the L1 cache is “chunked” into cache lines typically ranging from 16-64K in size each. To read a given memory location from main memory into the cache, the location and the surrounding chunk-sized bytes are all read in together. Similarly, to write a given memory location from the cache into main memory, the entire cache line is written. For code that exhibits locality of reference, the chunking yields a performance gain. However, in some cases, it can yield a performance loss.
Consider a lock-free queue
:
template<typename T>
class queue {
// ...
std::atomic<node*> _head;
std::atomic<node*> _tail;
};
Assume that your code has one thread pushing items onto the tail of the queue (repeatedly updating _tail
) and a second thread popping items from the head of the queue (repeatedly updating _head
).
In the code as given, _head
and _tail
will very likely reside on the same cache line. This means updating one will invalidate the entire cache line adding otherwise unnecessary contention for the other.
In a case such as this, you want _head
and _tail
to reside on different cache lines so updating one doesn’t affect the other. You can actually achieve that by using alignas
and hardware_destructive_interference_size
:
template<typename T>
class queue {
// ...
alignas(std::hardware_destructive_interference_size) std::atomic<node*> _head;
alignas(std::hardware_destructive_interference_size) std::atomic<node*> _tail;
};
This will waste a little bit of memory, but ensure that _head
and _tail
reside on different cache lines.
Conclusion
To summarize:
Your algorithm matters far more than locking technique.
Thread-safety in general and using atomics and barriers directly specifically is very hard to get right, even for experts.
Before using advanced thread-safety techniques, measure to see if your code is spending too much time locking. (Remember: results are CPU-dependent!)
Only if locking takes a significant percentage of time, then consider atomics and barriers.
Atomics and barriers are two sides of the same coin.
Compare-and-swap (CAS) is a fundamental building-block.
Be aware of the ABA Problem.
Be aware of cache lines.
Posted on July 13, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.