Go deadlock detection: delock library

ietxaniz

Iñigo Etxaniz

Posted on December 5, 2023

Go deadlock detection: delock library

Introduction

In the Go programming world, managing concurrency is a part of everyday life. Go makes concurrent programming relatively straightforward with its built-in primitives like goroutines and channels. However, when it comes to locking structures to ensure thread safety, things can get a bit tricky.

Typically, we use a pointer to a structure and consistently call its methods to maintain thread safety. This approach works well for most scenarios, but complex algorithms demand careful consideration to avoid concurrency issues, especially deadlocks.

While working on a large-scale project for a caching implementation in an authentication service, I encountered an intriguing problem. Occasionally, and quite rarely, the system would hang. Race conditions were unlikely culprits, as I diligently used mutexes within structures. This left me suspecting deadlocks. But how could I confirm this and, more importantly, identify where they were happening? That's where the idea for delock began – a tool specifically designed to detect and pinpoint deadlocks in Go applications.

Understanding Deadlocks in Go

Deadlocks in Go, or in any programming language for that matter, are akin to a gridlock in traffic – everything comes to a standstill. A deadlock occurs when two or more goroutines are waiting on each other to release resources, leading to a situation where none of them can proceed.

Go Runtime Deadlock Detection

This example, albeit a bit contrived, effectively illustrates a fundamental deadlock scenario. In the Execute method, we acquire a mutex lock and then call the same method recursively without ever releasing the lock. This leads to an attempt to acquire the lock again while it's still held by the same goroutine, causing a deadlock.

Though this is a simple and somewhat unrealistic example, it mirrors real-world scenarios where deadlocks occur due to improper lock handling. It’s a classic case of a process indefinitely waiting on itself. The Go runtime is adept at catching such blatant deadlocks in a single-threaded context, where the entire application grinds to a halt. This example serves as a reminder of how subtle mistakes in concurrency management can lead to significant issues, and why it's crucial to code with caution when dealing with locks.

package main

import (
    "log"
    "sync"
    "time"
)

type LockExample struct {
    mu    sync.Mutex
    Value int
}

func (l *LockExample) Execute() {
    l.mu.Lock()
    defer l.mu.Unlock()

    time.Sleep(100 * time.Millisecond)
    l.Value++
    log.Printf("value: %d", l.Value)
    l.Execute()
}

func main() {
    var l LockExample
    l.Execute()
}
Enter fullscreen mode Exit fullscreen mode

Upon running the simple deadlock example with standard sync.Mutex, the Go runtime effectively detects and reports a deadlock. Here's what happens:

  • The log statement 2023/12/05 12:38:02 value: 1 indicates that the Execute function ran once and incremented Value.
  • Immediately after, the program crashes with a fatal error: all goroutines are asleep - deadlock!. This error is thrown by the Go runtime when it detects that all goroutines are waiting on each other, with no possibility of resuming.
  • The stack trace provided points to where the deadlock occurs, highlighting the recursive call to Execute that attempts to reacquire an already held lock.

This result demonstrates the Go runtime's ability to detect simple deadlocks where the entire program is stuck waiting for locks to be released. It serves as a practical example of how easy it is to fall into a deadlock trap and the importance of careful lock management.

go run ./examples/simple_std
2023/12/05 12:38:02 value: 1
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [sync.Mutex.Lock]:
sync.runtime_SemacquireMutex(0x485040?, 0x0?, 0xc00007e050?)
        /usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0000120e0)
        /usr/local/go/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:90
main.(*LockExample).Execute(0xc0000120e0)
        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:15 +0x4c
main.(*LockExample).Execute(0xc0000120e0)
        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:21 +0x12a
main.main()
        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:26 +0x1f
exit status 2
Enter fullscreen mode Exit fullscreen mode

Limitations in Complex Scenarios

In real-world programming, especially in services, it's common to have multiple goroutines performing different tasks. In such environments, a deadlock might not stop the entire application, making it more challenging to detect and resolve.

To illustrate this, let's modify our previous example to fit into a broader context:

package main

import (
    "log"
    "sync"
    "time"
)

type LockExample struct {
    mu    sync.Mutex
    Value int
}

func (l *LockExample) GetValue() int {
    l.mu.Lock()
    defer l.mu.Unlock()

    return l.Value
}
func (l *LockExample) Execute() {
    l.mu.Lock()
    defer l.mu.Unlock()

    time.Sleep(100 * time.Millisecond)
    l.Value++
    log.Printf("value: %d", l.Value)
    l.Execute()
}

func main() {
    var l LockExample
    go func() {
        l.Execute()
    }()
    readValueChan := make(chan int)
    go func() {
        readValueChan <- l.GetValue()
    }()
    for {
        timeoutChan := make(chan int)
        go func() {
            time.Sleep(1 * time.Second)
            timeoutChan <- 0
        }()

        select {
        case <-timeoutChan:
            log.Printf("waiting...")
        case <-readValueChan:
            log.Printf("value: %d", l.GetValue())
            return
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

In this version, we run Execute in a separate goroutine, simulating a background process in a service. Another goroutine attempts to read the Value field, while the main routine periodically checks if the value has been read.

Output and Analysis

Running this code produces an output like this:

2023/12/05 12:42:25 value: 1
2023/12/05 12:42:26 waiting...
...
Enter fullscreen mode Exit fullscreen mode

The first log indicates that Value incremented once. However, the subsequent "waiting..." logs reveal the main goroutine is active but cannot read Value again due to the deadlock in Execute. This setup demonstrates a subtle deadlock that does not halt the entire program, exemplifying how deadlocks in multi-threaded applications can be more elusive and challenging to detect.

This scenario highlights the need for tools like delock, which can help identify and diagnose deadlocks in more complex, real-world applications.

Detecting Deadlocks with Delock

In our previous example, the Go runtime couldn’t detect the deadlock because other parts of the program were still running. Now, let's see how delock changes the game.

We modify our code to use delock.Mutex instead of sync.Mutex. This small change allows delock to detect and report deadlocks even in complex scenarios where the Go runtime might not.

package main

import (
    "log"
    "time"

    "github.com/ietxaniz/delock"
)

type LockExample struct {
    mu    delock.Mutex
    Value int
}

func (l *LockExample) GetValue() int {
    lockID, err := l.mu.Lock()
    if err != nil {
        panic(err)
    }
    defer l.mu.Unlock(lockID)

    return l.Value
}
func (l *LockExample) Execute() {
    lockID, err := l.mu.Lock()
    if err != nil {
        panic(err)
    }
    defer l.mu.Unlock(lockID)

    time.Sleep(100 * time.Millisecond)
    l.Value++
    log.Printf("value: %d", l.Value)
    l.Execute()
}

func main() {
    var l LockExample
    go func() {
        l.Execute()
    }()
    readValueChan := make(chan int)
    go func() {
        readValueChan <- l.GetValue()
    }()
    for {
        timeoutChan := make(chan int)
        go func() {
            time.Sleep(1 * time.Second)
            timeoutChan <- 0
        }()

        select {
        case <-timeoutChan:
            log.Printf("waiting...")
        case <-readValueChan:
            log.Printf("value: %d", l.GetValue())
            return
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

In the modified code, the standard locking mechanism:

l.mu.Lock()
defer l.mu.Unlock()
Enter fullscreen mode Exit fullscreen mode

is transformed into:

lockID, err := l.mu.Lock()
if err != nil {
    panic(err)
}
defer l.mu.Unlock(lockID)
Enter fullscreen mode Exit fullscreen mode

Here's what's happening:

  • lockID: This is a unique identifier for the lock instance within the delock library. It's used internally to track and manage stack data associated with each lock operation.
  • err: If the Lock() method takes longer than the defined timeout, it returns an error. This error contains valuable stack trace information indicating where the deadlock might be occurring.

For those concerned about verbosity, using fold-ninja, a VSCode extension I developed, can streamline the appearance in the editor. The expanded code block:

lockID, err := l.mu.Lock()
if err != nil {
    panic(err)
}
defer l.mu.Unlock(lockID)
Enter fullscreen mode Exit fullscreen mode

is neatly collapsed to:

lockID, err := l.mu.Lock()
defer l.mu.Unlock(lockID)
Enter fullscreen mode Exit fullscreen mode

This makes the code more manageable and less cluttered while maintaining the functionality and robustness provided by delock.

Output and Analysis

When we run this code:

2023/12/05 12:52:42 value: 1
2023/12/05 12:52:43 waiting...
panic: Deadlock detected
Enter fullscreen mode Exit fullscreen mode

Unlike before, delock detects the deadlock situation and panics, providing detailed stack trace information. This shows exactly where the deadlock occurs, demonstrating delock's ability to uncover hidden issues in multi-threaded applications.

By using delock, developers can gain insights into complex deadlock situations, making it a powerful tool for debugging and ensuring smooth concurrency management in Go services.

Tackling More Complex Deadlock Scenarios

Before we dive into a more intricate deadlock scenario, let's recall the primary function of the delock tool: delock is designed to help us identify and understand deadlocks in our Go applications, especially in complex cases where traditional Go runtime detection might fall short. Its ability to provide detailed stack trace information is invaluable in diagnosing and resolving these challenging issues.

Now, let's explore a scenario involving parent-child relationships within data structures, where deadlocks can be more elusive and difficult to detect.

Not all deadlock issues are as straightforward as the ones we've explored so far. In real-world applications, especially those involving intricate parent-child relationships between data structures, deadlocks can be more elusive. Let's delve into a more challenging example, where two interconnected data structures, each with its own lock, interact in ways that could potentially lead to deadlock situations.

In this scenario, we'll examine how parent and child structures, both equipped with mutexes, can create complex locking sequences that might result in deadlocks under certain conditions. This complexity showcases the subtleties of deadlock issues in sophisticated program architectures.

The following code demonstrates the same scenario using the delock library. While delock is instrumental in detecting deadlocks, it's important to note that it's primarily designed for debugging and testing purposes rather than production use. The additional processing delock performs when acquiring and releasing locks might introduce overhead not suitable for a production environment.

package main

import (
    "fmt"
    "os"
    "strconv"
    "time"

    "github.com/ietxaniz/delock"
)

type Child struct {
    mu     delock.RWMutex
    value  string
    id     int
    parent *Parent
}

type Parent struct {
    mu         delock.RWMutex
    items      []*Child
    lastID     int
    numChanges int
}

func (p *Parent) Add(value string) {
    lockID, err := p.mu.Lock()
    if err != nil {
        panic(err)
    }
    defer p.mu.Unlock(lockID)

    time.Sleep(1 * time.Millisecond)
    p.lastID++
    if p.items == nil {
        p.items = make([]*Child, 0)
    }
    p.items = append(p.items, &Child{value: value, id: p.lastID, parent: p})
}

func (p *Parent) GetChildByID(childID int) *Child {
    lockID, err := p.mu.RLock()
    if err != nil {
        panic(err)
    }
    defer p.mu.RUnlock(lockID)

    for _, item := range p.items {
        if item.GetID() == childID {
            return item
        }
    }

    return nil
}

func (p *Parent) GetNumChanges() int {
    lockID, err := p.mu.RLock()
    if err != nil {
        panic(err)
    }
    defer p.mu.RUnlock(lockID)

    return p.numChanges
}

func (p *Parent) IncreaseNumChanges() {
    lockID, err := p.mu.RLock()
    if err != nil {
        panic(err)
    }
    defer p.mu.RUnlock(lockID)

    time.Sleep(1 * time.Millisecond)

    p.numChanges++
}

func (c *Child) GetID() int {
    lockID, err := c.mu.RLock()
    if err != nil {
        panic(err)
    }
    defer c.mu.RUnlock(lockID)

    time.Sleep(1 * time.Millisecond)

    return c.id
}

func (c *Child) SetValue(value string) {
    lockID, err := c.mu.Lock()
    if err != nil {
        panic(err)
    }
    defer c.mu.Unlock(lockID)

    time.Sleep(1 * time.Millisecond)
    c.value = value
    c.parent.IncreaseNumChanges()
}

func main() {
    var p Parent
    if len(os.Args) < 3 {
        println("Usage: complex <n> <t>")
        return
    }
    n, err := strconv.Atoi(os.Args[1])
    if err != nil {
        panic(err)
    }
    t, err := strconv.Atoi(os.Args[2])
    if err != nil {
        panic(err)
    }
    go func() {
        for i := 0; i < n; i++ {
            time.Sleep(time.Duration(t) * time.Millisecond)
            p.Add(fmt.Sprintf("child%d", i))
            go func(id int, sleepMilliseconds int) {
                for {
                    time.Sleep(time.Duration(sleepMilliseconds) * time.Millisecond)
                    child := p.GetChildByID(id)
                    if child != nil {
                        child.SetValue(fmt.Sprintf("child%d", id))
                    }
                }
            }(i, t)
        }
    }()

    lastReadedNumChanges := 0

    go func() {

        for {
            time.Sleep(1 * time.Second)
            lastReadedNumChanges = p.GetNumChanges()
        }
    }()

    for {
        time.Sleep(1 * time.Second)
        println(lastReadedNumChanges)
    }
}
Enter fullscreen mode Exit fullscreen mode

It's important to note that while this example runs smoothly under normal, unstressed conditions, it can lead to a deadlock under high-stress scenarios. This behavior can be observed by adjusting the number of runner goroutines (n) and the sleep duration in milliseconds (t) in each step. When n is low and t is high, the program appears to function correctly. However, increasing n while decreasing t creates conditions ripe for a deadlock, demonstrating how system load and operation timing can impact the occurrence of deadlocks in complex systems.

With the complex example in mind, we'll run the code using various parameters to simulate different stress levels. By altering the number of goroutines (n) and the sleep duration (t), we aim to observe the program's behavior under varying conditions. Sometimes, the Value might increase steadily; other times, it may get stuck, raising suspicions of a deadlock. To confirm these suspicions, we'll switch to using delock. This will help us determine not only if a deadlock is occurring but also under which specific conditions. Once detected, we'll delve into the code to unravel the reason behind the deadlock, enhancing our understanding of these intricate scenarios.

go run ./examples/complex_std/ 50 50
0
105
395
780
780
780
780
780
780

go run ./examples/complex_std/ 50 100
28
28
136
300
496
752
1035
1407
1787
2171
2552
2937
Enter fullscreen mode Exit fullscreen mode

50 tasks with 50ms of sleep time seems to have problems while 50 tasks with 100ms of sleep time seems to work properly. So lets execute 50 tasks with 10ms sleep time using delock version.

export DELOCK_TIMEOUT=10000 && go run ./examples/complex 50 10
0
...
0
panic: Deadlock detected


* * * * * READ LOCK * * * * * * * *
goroutine 52 [running]:
github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)
        /home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65
main.(*Parent).GetChildByID(0xc0000b0120, 0x3)
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:42 +0x38
main.main.func1.1(0x0?, 0xa)
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:119 +0x55
created by main.main.func1 in goroutine 18
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54


* * * * * READ LOCK * * * * * * * *
goroutine 14 [running]:
github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)
        /home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65
main.(*Parent).IncreaseNumChanges(0xc0000b0120)
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:68 +0x2a
main.(*Child).SetValue(0xc00007c300, {0xc000220020, 0x6})
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:95 +0xb1
main.main.func1.1(0x0?, 0xa)
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:124 +0x15a
created by main.main.func1 in goroutine 18
        /home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54

...

exit status 2
Enter fullscreen mode Exit fullscreen mode

"I've omitted several lines from the output because numerous locks were acquired simultaneously in this scenario. However, I retained two critical outputs for analysis. The first one demonstrates typical behavior, but the second raises a red flag, particularly at examples/complex/main.go:95. This line is our key to understanding the deadlock. Let's examine the related code segment, with line numbers for reference:"

87     func (c *Child) SetValue(value string) {
88      lockID, err := c.mu.Lock()
89      if err != nil {
90          panic(err)
91      }
92      defer c.mu.Unlock(lockID)
93
94      c.value = value
95      c.parent.IncreaseNumChanges()
96     }
Enter fullscreen mode Exit fullscreen mode

Line 95, c.parent.IncreaseNumChanges() is called while holding a lock. This is a very common mistake. The problem is that IncreaseNumChanges also acquires a lock. So, if IncreaseNumChanges is called while holding a lock, it will never be able to acquire the lock and will wait forever. This is a deadlock.

So, this demonstrates how this tool can effectively help us find deadlocks in our code.

Solution to the Found Deadlock Issue

Upon discovering the deadlock, two potential solutions come to mind. The first is straightforward, while the second aligns more with my preferred approach:

Straightforward Solution

func (c *Child) SetValue(value string) {
    lockID, err := c.mu.Lock()
    if err != nil {
        panic(err)
    }

    c.value = value
    parent := c.parent
    c.mu.Unlock(lockID)
    parent.IncreaseNumChanges()
}
Enter fullscreen mode Exit fullscreen mode

This solution works by explicitly unlocking before exiting the function. However, it risks future errors if the code is modified and the lock is not properly released in all paths.

Preferred Solution

func (c *Child) setValueSafe(value string) *Parent {
    lockID, err := c.mu.Lock()
    if err != nil {
        panic(err)
    }
    defer c.mu.Unlock(lockID)

    c.value = value
    return c.parent
}

func (c *Child) SetValue(value string) {
    parent := c.setValueSafe(value)
    parent.IncreaseNumChanges()
}
Enter fullscreen mode Exit fullscreen mode

This approach achieves the same goal but retains the defer statement for unlocking. This makes the code more robust against future changes and ensures that the lock is always properly released.

With these corrections applied, the program's output now aligns with our expectations, as evidenced in the terminal:

go run ./examples/complex_solved 100 1
0
404
846
1299
1763
2203
4440
7563
10670
13750
16838
19926
23012
26104
29207
32306
35417
Enter fullscreen mode Exit fullscreen mode

Repository and Examples

The delock library is hosted on GitHub and can be found at https://github.com/ietxaniz/delock. The repository includes an examples folder, showcasing various scenarios:

  • simple and simple_std: Demonstrating deadlock detection with delock and the Go standard library, respectively.
  • simple2 and simple2_std: Highlighting a scenario where Go's standard deadlock detection might fail, but delock succeeds.
  • complex, complex_std, and complex_solved: Exploring more intricate deadlock situations with varying outcomes depending on the detection method used and whether the deadlock issue is resolved.

To run these examples, clone the repository and execute the desired example from the root folder using the command: go run pathToExample. For instance: go run ./examples/complex.

Open Invitation for Collaboration

I welcome any suggestions, feedback, or contributions to the delock project. Whether it's improving existing features, proposing new ones, or enhancing documentation and examples, your input is valuable. Feel free to raise issues, submit pull requests, or reach out for discussions on potential enhancements.

💖 💪 🙅 🚩
ietxaniz
Iñigo Etxaniz

Posted on December 5, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related