Discover how to use pprof, Go's profiling tool, to tackle blocking issues in Go applications. This guide outlines steps to pinpoint and remedy performance bottlenecks stemming from goroutine synchronization and shared memory access.
Goroutines and channels are powerful concurrency primitives that enable developers to write highly performant and scalable applications in Go. However, improperly synchronized access to shared memory between goroutines can lead to race conditions and deadlocks that manifest as blocking issues. Tracking down the root cause of blocking in complex applications with many goroutines can be quite challenging. This is where pprof, Go's built-in profiling tool, comes in handy.
By generating CPU and heap profiles, pprof allows us to understand how the application executes over time and to identify the hot code paths that dominate execution. When goroutines appear blocked for no obvious reason, it usually indicates contention over shared data structures. The key is to study the execution flows during the blocked period at goroutine granularity to pinpoint the exact synchronization issue.
In this post, we will demonstrate a step-by-step workflow to diagnose hard-to-root-cause goroutine blocking problems using pprof execution profiles.
We have a sample application with several active goroutines processing requests from a shared queue. At random times, the processing seems to halt even though the queue keeps growing.
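The application's code isn't part of this post, so here is a minimal sketch of its shape to follow along with; all names (Queue, Request, Enqueue, Dequeue, Len) are illustrative rather than taken from the real code:

```go
package queue

import "sync"

// Request is a unit of work placed on the shared queue.
type Request struct{ ID int }

// Queue is a FIFO shared between producer and worker goroutines,
// guarded by a single RWMutex.
type Queue struct {
	mu    sync.RWMutex
	items []Request
}

// Enqueue appends a request under the write lock.
func (q *Queue) Enqueue(r Request) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, r)
}

// Dequeue removes the oldest request; ok is false when the queue is empty.
func (q *Queue) Dequeue() (r Request, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return Request{}, false
	}
	r = q.items[0]
	q.items = q.items[1:]
	return r, true
}

// Len is polled frequently by workers under the read lock.
func (q *Queue) Len() int {
	q.mu.RLock()
	defer q.mu.RUnlock()
	return len(q.items)
}
```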
To simulate this, we created a benchmark test that drives high load through the system and randomly injects an artificial delay of a few seconds while accessing the shared queue. This allowed us to reproduce the blocking issue deterministically.
The benchmark is designed to simulate production load by spinning up 150 concurrent goroutines that send requests into the shared queue at a high rate. The goroutines processing queue items still get blocked intermittently even as requests pile up.
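The benchmark isn't reproduced verbatim either; a rough sketch building on the Queue above (it needs the math/rand, sync, testing, and time imports) looks something like this, with the 150-goroutine fan-out and the occasional multi-second hold of the queue's read lock being the parts that matter:

```go
func BenchmarkQueueLoad(b *testing.B) {
	q := &Queue{}
	var wg sync.WaitGroup

	// 150 producer goroutines hammer the shared queue concurrently.
	for g := 0; g < 150; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < b.N; i++ {
				q.Enqueue(Request{ID: id})

				// Occasionally hold the read lock for a few seconds to
				// simulate the blocking observed in production.
				if rand.Intn(1000) == 0 {
					q.mu.RLock()
					time.Sleep(2 * time.Second)
					q.mu.RUnlock()
				}
			}
		}(g)
	}

	// Worker goroutines drain the queue while the producers run.
	done := make(chan struct{})
	for w := 0; w < 8; w++ {
		go func() {
			for {
				select {
				case <-done:
					return
				default:
					q.Dequeue()
				}
			}
		}()
	}

	wg.Wait()
	close(done)
}
```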
The first step is to capture a CPU profile during an instance of prolonged blocking to understand what the goroutines are actually executing:
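Assuming the benchmark reproduction above, something along these lines writes the profile to cpu.prof (the second form is an alternative for a live service that already exposes net/http/pprof on port 6060):

```sh
# Run the benchmark long enough to hit the blocking and record a CPU profile.
go test -bench=BenchmarkQueueLoad -benchtime=30s -cpuprofile=cpu.prof .

# Alternative for a running service with net/http/pprof enabled:
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
```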
This generates a CPU profile that samples the stacks of the actively running goroutines every 10ms for the window in which blocking was observed. The profile data is written to the cpu.prof file.
We can now visualize the profile using go tool pprof, both on the command line and through its web UI in a browser such as Chromium. This provides a high-level overview of the hot functions and call graphs:
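Assuming the profile landed in cpu.prof, the two usual entry points are the interactive terminal session and the browser-based web UI:

```sh
# Interactive terminal session; `top -cum` lists the hottest functions by cumulative time.
go tool pprof cpu.prof

# Web UI with flame graph and call graph views, served on localhost:8080.
go tool pprof -http=:8080 cpu.prof
```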
The CPU profile summarizes the hottest functions, which cumulatively consumed over 320ms during the profiled window. We can immediately see that contention in acquiring the RWMutex reader-writer lock guarding the queue is responsible for the bulk of the time - over 50%.
go tool pprof also allows interactive exploration of the profile. We can drill down into the caller subgraphs to visualize the full calling context at each point. Focusing on the hot region around this lock acquisition lets us zoom into the specific goroutines and caller contexts involved:
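Inside the interactive session, a drill-down along these lines narrows the view to the lock-acquisition path (Enqueue is the illustrative writer from the earlier sketch, not the real function name):

```sh
(pprof) top10 -cum     # hottest call paths by cumulative time
(pprof) peek RWMutex   # callers and callees of the RWMutex acquisition
(pprof) list Enqueue   # annotated source for the hot writer path
(pprof) web Enqueue    # render just the call-graph subgraph around Enqueue
```

The per-goroutine detail described next (which goroutine is parked on which lock) comes from pairing the CPU profile with a goroutine dump taken during the stall, for example /debug/pprof/goroutine?debug=2 when net/http/pprof is enabled.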
This shows that goroutine 832, which is trying to acquire the write lock, is parked, while goroutine 843 has successfully obtained the read lock.
The profile analysis clearly shows that concurrent access to the shared queue allows the reader goroutines to make progress while writes get blocked.
This confirms that improperly synchronized RWMutex usage is the root cause of the blocking, with the write operations bottlenecked behind the readers.
Now that we have identified the concurrency bug using pprof execution profiling, we can fix it by serializing all writes to the shared queue through a separate mutex. This prevents the reads from starving queued writes.
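A sketch of that fix against the hypothetical Queue from earlier (names still illustrative): writers serialize on a dedicated mutex, so at most one writer at a time contends with the readers on the underlying RWMutex.

```go
// Queue with writes serialized through a separate mutex.
type Queue struct {
	writeMu sync.Mutex   // serializes all writers
	mu      sync.RWMutex // still guards the underlying slice
	items   []Request
}

// Enqueue takes the write-serialization mutex first, so queued writers
// line up here instead of all piling onto the RWMutex at once.
func (q *Queue) Enqueue(r Request) {
	q.writeMu.Lock()
	defer q.writeMu.Unlock()

	q.mu.Lock()
	q.items = append(q.items, r)
	q.mu.Unlock()
}
```

Keeping the critical section under mu as small as possible (just the append) is what actually shortens the window in which readers and the single active writer can block each other.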
We also optimized the hot code areas detected in the profile analysis to minimize lock contention. Profiling the improved code shows significantly reduced CPU consumption around synchronization primitives.
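One way to quantify that improvement is to re-run the same benchmark against the fixed code and diff the two profiles with pprof's -diff_base flag, which reports per-function deltas relative to the original profile:

```sh
# Profile the fixed code under the same load.
go test -bench=BenchmarkQueueLoad -benchtime=30s -cpuprofile=cpu_fixed.prof .

# Compare against the original profile; `top` then shows per-function deltas.
go tool pprof -diff_base=cpu.prof cpu_fixed.prof
```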
Takeaways
With these learnings, we can prevent and troubleshoot similar blocking issues in large-scale Go applications. pprof helps uncover performance bottlenecks that are otherwise very hard to root-cause!