Go deadlock detection: delock library
Iñigo Etxaniz
Posted on December 5, 2023
Introduction
In the Go programming world, managing concurrency is a part of everyday life. Go makes concurrent programming relatively straightforward with its built-in primitives like goroutines and channels. However, when it comes to locking structures to ensure thread safety, things can get a bit tricky.
Typically, we use a pointer to a structure and consistently call its methods to maintain thread safety. This approach works well for most scenarios, but complex algorithms demand careful consideration to avoid concurrency issues, especially deadlocks.
While working on a large-scale project for a caching implementation in an authentication service, I encountered an intriguing problem. Occasionally, and quite rarely, the system would hang. Race conditions were unlikely culprits, as I diligently used mutexes within structures. This left me suspecting deadlocks. But how could I confirm this and, more importantly, identify where they were happening? That's where the idea for delock began: a tool specifically designed to detect and pinpoint deadlocks in Go applications.
Understanding Deadlocks in Go
Deadlocks in Go, or in any programming language for that matter, are akin to a gridlock in traffic – everything comes to a standstill. A deadlock occurs when two or more goroutines are waiting on each other to release resources, leading to a situation where none of them can proceed.
Go Runtime Deadlock Detection
This example, albeit a bit contrived, effectively illustrates a fundamental deadlock scenario. In the Execute method, we acquire a mutex lock and then call the same method recursively without ever releasing the lock. This leads to an attempt to acquire the lock again while it's still held by the same goroutine, causing a deadlock.
Though this is a simple and somewhat unrealistic example, it mirrors real-world scenarios where deadlocks occur due to improper lock handling. It’s a classic case of a process indefinitely waiting on itself. The Go runtime is adept at catching such blatant deadlocks in a single-threaded context, where the entire application grinds to a halt. This example serves as a reminder of how subtle mistakes in concurrency management can lead to significant issues, and why it's crucial to code with caution when dealing with locks.
package main

import (
	"log"
	"sync"
	"time"
)

type LockExample struct {
	mu    sync.Mutex
	Value int
}

func (l *LockExample) Execute() {
	l.mu.Lock()
	defer l.mu.Unlock()
	time.Sleep(100 * time.Millisecond)
	l.Value++
	log.Printf("value: %d", l.Value)
	l.Execute()
}

func main() {
	var l LockExample
	l.Execute()
}
Upon running the simple deadlock example with standard sync.Mutex, the Go runtime effectively detects and reports a deadlock. Here's what happens:
- The log statement 2023/12/05 12:38:02 value: 1 indicates that the Execute function ran once and incremented Value.
- Immediately after, the program crashes with a fatal error: all goroutines are asleep - deadlock! This error is thrown by the Go runtime when it detects that all goroutines are waiting on each other, with no possibility of resuming.
- The stack trace provided points to where the deadlock occurs, highlighting the recursive call to Execute that attempts to reacquire an already held lock.
This result demonstrates the Go runtime's ability to detect simple deadlocks where the entire program is stuck waiting for locks to be released. It serves as a practical example of how easy it is to fall into a deadlock trap and the importance of careful lock management.
go run ./examples/simple_std
2023/12/05 12:38:02 value: 1
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [sync.Mutex.Lock]:
sync.runtime_SemacquireMutex(0x485040?, 0x0?, 0xc00007e050?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0000120e0)
/usr/local/go/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:90
main.(*LockExample).Execute(0xc0000120e0)
/home/inigo/Documents/Projects/delock/examples/simple_std/main.go:15 +0x4c
main.(*LockExample).Execute(0xc0000120e0)
/home/inigo/Documents/Projects/delock/examples/simple_std/main.go:21 +0x12a
main.main()
/home/inigo/Documents/Projects/delock/examples/simple_std/main.go:26 +0x1f
exit status 2
Limitations in Complex Scenarios
In real-world programming, especially in services, it's common to have multiple goroutines performing different tasks. In such environments, a deadlock might not stop the entire application, making it more challenging to detect and resolve.
To illustrate this, let's modify our previous example to fit into a broader context:
package main

import (
	"log"
	"sync"
	"time"
)

type LockExample struct {
	mu    sync.Mutex
	Value int
}

func (l *LockExample) GetValue() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.Value
}

func (l *LockExample) Execute() {
	l.mu.Lock()
	defer l.mu.Unlock()
	time.Sleep(100 * time.Millisecond)
	l.Value++
	log.Printf("value: %d", l.Value)
	l.Execute()
}

func main() {
	var l LockExample
	go func() {
		l.Execute()
	}()
	readValueChan := make(chan int)
	go func() {
		readValueChan <- l.GetValue()
	}()
	for {
		timeoutChan := make(chan int)
		go func() {
			time.Sleep(1 * time.Second)
			timeoutChan <- 0
		}()
		select {
		case <-timeoutChan:
			log.Printf("waiting...")
		case <-readValueChan:
			log.Printf("value: %d", l.GetValue())
			return
		}
	}
}
In this version, we run Execute in a separate goroutine, simulating a background process in a service. Another goroutine attempts to read the Value field, while the main routine periodically checks if the value has been read.
Output and Analysis
Running this code produces an output like this:
2023/12/05 12:42:25 value: 1
2023/12/05 12:42:26 waiting...
...
The first log indicates that Value incremented once. However, the subsequent "waiting..." logs reveal the main goroutine is active but cannot read Value again due to the deadlock in Execute. This setup demonstrates a subtle deadlock that does not halt the entire program, exemplifying how deadlocks in multi-threaded applications can be more elusive and challenging to detect.
This scenario highlights the need for tools like delock, which can help identify and diagnose deadlocks in more complex, real-world applications.
Detecting Deadlocks with Delock
In our previous example, the Go runtime couldn't detect the deadlock because other parts of the program were still running. Now, let's see how delock changes the game.
We modify our code to use delock.Mutex instead of sync.Mutex. This small change allows delock to detect and report deadlocks even in complex scenarios where the Go runtime might not.
package main

import (
	"log"
	"time"

	"github.com/ietxaniz/delock"
)

type LockExample struct {
	mu    delock.Mutex
	Value int
}

func (l *LockExample) GetValue() int {
	lockID, err := l.mu.Lock()
	if err != nil {
		panic(err)
	}
	defer l.mu.Unlock(lockID)
	return l.Value
}

func (l *LockExample) Execute() {
	lockID, err := l.mu.Lock()
	if err != nil {
		panic(err)
	}
	defer l.mu.Unlock(lockID)
	time.Sleep(100 * time.Millisecond)
	l.Value++
	log.Printf("value: %d", l.Value)
	l.Execute()
}

func main() {
	var l LockExample
	go func() {
		l.Execute()
	}()
	readValueChan := make(chan int)
	go func() {
		readValueChan <- l.GetValue()
	}()
	for {
		timeoutChan := make(chan int)
		go func() {
			time.Sleep(1 * time.Second)
			timeoutChan <- 0
		}()
		select {
		case <-timeoutChan:
			log.Printf("waiting...")
		case <-readValueChan:
			log.Printf("value: %d", l.GetValue())
			return
		}
	}
}
In the modified code, the standard locking mechanism:
l.mu.Lock()
defer l.mu.Unlock()
is transformed into:
lockID, err := l.mu.Lock()
if err != nil {
	panic(err)
}
defer l.mu.Unlock(lockID)
Here's what's happening:
- lockID: This is a unique identifier for the lock instance within the delock library. It's used internally to track and manage stack data associated with each lock operation.
- err: If the Lock() method takes longer than the defined timeout, it returns an error. This error contains valuable stack trace information indicating where the deadlock might be occurring.
For those concerned about verbosity, fold-ninja, a VSCode extension I developed, can streamline the appearance in the editor. The expanded code block:
lockID, err := l.mu.Lock()
if err != nil {
	panic(err)
}
defer l.mu.Unlock(lockID)
is neatly collapsed to:
lockID, err := l.mu.Lock()
defer l.mu.Unlock(lockID)
This makes the code more manageable and less cluttered while maintaining the functionality and robustness provided by delock.
Output and Analysis
When we run this code:
2023/12/05 12:52:42 value: 1
2023/12/05 12:52:43 waiting...
panic: Deadlock detected
Unlike before, delock detects the deadlock situation and panics, providing detailed stack trace information. This shows exactly where the deadlock occurs, demonstrating delock's ability to uncover hidden issues in multi-threaded applications.
By using delock, developers can gain insights into complex deadlock situations, making it a powerful tool for debugging and ensuring smooth concurrency management in Go services.
Tackling More Complex Deadlock Scenarios
Before we dive into a more intricate deadlock scenario, let's recall the primary function of the delock tool: delock is designed to help us identify and understand deadlocks in our Go applications, especially in complex cases where traditional Go runtime detection might fall short. Its ability to provide detailed stack trace information is invaluable in diagnosing and resolving these challenging issues.
Now, let's explore a scenario involving parent-child relationships within data structures, where deadlocks can be more elusive and difficult to detect.
Not all deadlock issues are as straightforward as the ones we've explored so far. In real-world applications, especially those involving intricate parent-child relationships between data structures, deadlocks can be more elusive. Let's delve into a more challenging example, where two interconnected data structures, each with its own lock, interact in ways that could potentially lead to deadlock situations.
In this scenario, we'll examine how parent and child structures, both equipped with mutexes, can create complex locking sequences that might result in deadlocks under certain conditions. This complexity showcases the subtleties of deadlock issues in sophisticated program architectures.
The following code demonstrates the same scenario using the delock library. While delock is instrumental in detecting deadlocks, it's important to note that it's primarily designed for debugging and testing purposes rather than production use. The additional processing delock performs when acquiring and releasing locks might introduce overhead not suitable for a production environment.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"github.com/ietxaniz/delock"
)

type Child struct {
	mu     delock.RWMutex
	value  string
	id     int
	parent *Parent
}

type Parent struct {
	mu         delock.RWMutex
	items      []*Child
	lastID     int
	numChanges int
}

func (p *Parent) Add(value string) {
	lockID, err := p.mu.Lock()
	if err != nil {
		panic(err)
	}
	defer p.mu.Unlock(lockID)
	time.Sleep(1 * time.Millisecond)
	p.lastID++
	if p.items == nil {
		p.items = make([]*Child, 0)
	}
	p.items = append(p.items, &Child{value: value, id: p.lastID, parent: p})
}

func (p *Parent) GetChildByID(childID int) *Child {
	lockID, err := p.mu.RLock()
	if err != nil {
		panic(err)
	}
	defer p.mu.RUnlock(lockID)
	for _, item := range p.items {
		if item.GetID() == childID {
			return item
		}
	}
	return nil
}

func (p *Parent) GetNumChanges() int {
	lockID, err := p.mu.RLock()
	if err != nil {
		panic(err)
	}
	defer p.mu.RUnlock(lockID)
	return p.numChanges
}

func (p *Parent) IncreaseNumChanges() {
	lockID, err := p.mu.RLock()
	if err != nil {
		panic(err)
	}
	defer p.mu.RUnlock(lockID)
	time.Sleep(1 * time.Millisecond)
	p.numChanges++
}

func (c *Child) GetID() int {
	lockID, err := c.mu.RLock()
	if err != nil {
		panic(err)
	}
	defer c.mu.RUnlock(lockID)
	time.Sleep(1 * time.Millisecond)
	return c.id
}

func (c *Child) SetValue(value string) {
	lockID, err := c.mu.Lock()
	if err != nil {
		panic(err)
	}
	defer c.mu.Unlock(lockID)
	time.Sleep(1 * time.Millisecond)
	c.value = value
	c.parent.IncreaseNumChanges()
}

func main() {
	var p Parent
	if len(os.Args) < 3 {
		println("Usage: complex <n> <t>")
		return
	}
	n, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}
	t, err := strconv.Atoi(os.Args[2])
	if err != nil {
		panic(err)
	}
	go func() {
		for i := 0; i < n; i++ {
			time.Sleep(time.Duration(t) * time.Millisecond)
			p.Add(fmt.Sprintf("child%d", i))
			go func(id int, sleepMilliseconds int) {
				for {
					time.Sleep(time.Duration(sleepMilliseconds) * time.Millisecond)
					child := p.GetChildByID(id)
					if child != nil {
						child.SetValue(fmt.Sprintf("child%d", id))
					}
				}
			}(i, t)
		}
	}()
	lastReadNumChanges := 0
	go func() {
		for {
			time.Sleep(1 * time.Second)
			lastReadNumChanges = p.GetNumChanges()
		}
	}()
	for {
		time.Sleep(1 * time.Second)
		println(lastReadNumChanges)
	}
}
It's important to note that while this example runs smoothly under normal, unstressed conditions, it can lead to a deadlock under high-stress scenarios. This behavior can be observed by adjusting the number of runner goroutines (n) and the sleep duration in milliseconds (t) in each step. When n is low and t is high, the program appears to function correctly. However, increasing n while decreasing t creates conditions ripe for a deadlock, demonstrating how system load and operation timing can impact the occurrence of deadlocks in complex systems.
With the complex example in mind, we'll run the code using various parameters to simulate different stress levels. By altering the number of goroutines (n) and the sleep duration (t), we aim to observe the program's behavior under varying conditions. Sometimes, the change counter might increase steadily; other times, it may get stuck, raising suspicions of a deadlock. To confirm these suspicions, we'll switch to using delock. This will help us determine not only if a deadlock is occurring but also under which specific conditions. Once detected, we'll delve into the code to unravel the reason behind the deadlock, enhancing our understanding of these intricate scenarios.
go run ./examples/complex_std/ 50 50
0
105
395
780
780
780
780
780
780
go run ./examples/complex_std/ 50 100
28
28
136
300
496
752
1035
1407
1787
2171
2552
2937
50 tasks with 50 ms of sleep time seem to have problems, while 50 tasks with 100 ms of sleep time seem to work properly. So let's execute 50 tasks with 10 ms sleep time using the delock version.
export DELOCK_TIMEOUT=10000 && go run ./examples/complex 50 10
0
...
0
panic: Deadlock detected
* * * * * READ LOCK * * * * * * * *
goroutine 52 [running]:
github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)
/home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65
main.(*Parent).GetChildByID(0xc0000b0120, 0x3)
/home/inigo/Documents/Projects/delock/examples/complex/main.go:42 +0x38
main.main.func1.1(0x0?, 0xa)
/home/inigo/Documents/Projects/delock/examples/complex/main.go:119 +0x55
created by main.main.func1 in goroutine 18
/home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54
* * * * * READ LOCK * * * * * * * *
goroutine 14 [running]:
github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)
/home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65
main.(*Parent).IncreaseNumChanges(0xc0000b0120)
/home/inigo/Documents/Projects/delock/examples/complex/main.go:68 +0x2a
main.(*Child).SetValue(0xc00007c300, {0xc000220020, 0x6})
/home/inigo/Documents/Projects/delock/examples/complex/main.go:95 +0xb1
main.main.func1.1(0x0?, 0xa)
/home/inigo/Documents/Projects/delock/examples/complex/main.go:124 +0x15a
created by main.main.func1 in goroutine 18
/home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54
...
exit status 2
I've omitted several lines from the output because numerous locks were acquired simultaneously in this scenario. However, I retained two critical outputs for analysis. The first one demonstrates typical behavior, but the second raises a red flag, particularly at examples/complex/main.go:95. This line is our key to understanding the deadlock. Let's examine the related code segment, with line numbers for reference:
87 func (c *Child) SetValue(value string) {
88 	lockID, err := c.mu.Lock()
89 	if err != nil {
90 		panic(err)
91 	}
92 	defer c.mu.Unlock(lockID)
93 
94 	c.value = value
95 	c.parent.IncreaseNumChanges()
96 }
On line 95, c.parent.IncreaseNumChanges() is called while still holding the child's lock. This is a very common mistake: a method that holds one lock calls into a method that acquires another. Here, SetValue holds the child's write lock while trying to read-lock the parent, while GetChildByID holds a read lock on the parent and, via GetID, waits for the child's lock. With a pending writer (Add) queued on the parent, new read locks are not granted either, so neither goroutine can proceed. This is a deadlock. (Incidentally, IncreaseNumChanges mutates numChanges while holding only a read lock, which is itself a data race worth fixing.)
This demonstrates how delock can effectively help us find deadlocks in our code.
Solution to the Found Deadlock Issue
Upon discovering the deadlock, two potential solutions come to mind. The first is straightforward, while the second aligns more with my preferred approach:
Straightforward Solution
func (c *Child) SetValue(value string) {
	lockID, err := c.mu.Lock()
	if err != nil {
		panic(err)
	}
	c.value = value
	parent := c.parent
	c.mu.Unlock(lockID)
	parent.IncreaseNumChanges()
}
This solution works by explicitly releasing the child's lock before calling IncreaseNumChanges. However, it risks future errors: if the code is later modified, for example by adding an early return, the lock may not be released on all paths.
Preferred Solution
func (c *Child) setValueSafe(value string) *Parent {
	lockID, err := c.mu.Lock()
	if err != nil {
		panic(err)
	}
	defer c.mu.Unlock(lockID)
	c.value = value
	return c.parent
}

func (c *Child) SetValue(value string) {
	parent := c.setValueSafe(value)
	parent.IncreaseNumChanges()
}
This approach achieves the same goal but retains the defer statement for unlocking. This makes the code more robust against future changes and ensures that the lock is always properly released.
With these corrections applied, the program's output now aligns with our expectations, as evidenced in the terminal:
go run ./examples/complex_solved 100 1
0
404
846
1299
1763
2203
4440
7563
10670
13750
16838
19926
23012
26104
29207
32306
35417
Repository and Examples
The delock library is hosted on GitHub and can be found at https://github.com/ietxaniz/delock. The repository includes an examples folder, showcasing various scenarios:
- simple and simple_std: demonstrating deadlock detection with delock and the Go standard library, respectively.
- simple2 and simple2_std: highlighting a scenario where Go's standard deadlock detection might fail, but delock succeeds.
- complex, complex_std, and complex_solved: exploring more intricate deadlock situations with varying outcomes depending on the detection method used and whether the deadlock issue is resolved.
To run these examples, clone the repository and execute the desired example from the root folder using the command go run pathToExample. For instance: go run ./examples/complex.
Open Invitation for Collaboration
I welcome any suggestions, feedback, or contributions to the delock project. Whether it's improving existing features, proposing new ones, or enhancing documentation and examples, your input is valuable. Feel free to raise issues, submit pull requests, or reach out for discussions on potential enhancements.