Tracing Go Routine Blocking Issues with pprof Execution Profiles

Discover how to use pprof, Go's profiling tool, to tackle blocking issues in Go applications. This guide outlines steps to pinpoint and remedy performance bottlenecks stemming from goroutine synchronization and shared memory access.

Goroutines and channels are powerful concurrency primitives that enable developers to write highly performant and scalable applications in Go. However, improperly synchronized access to shared memory between goroutines can lead to race conditions and deadlocks that manifest as blocking issues. Tracking down the root cause of blocking in complex applications with many goroutines can be quite challenging. This is where pprof, Go's built-in profiling tool, comes in handy.
By generating CPU and heap profiles, pprof lets us understand how the application executes over time and identify the hot code paths that dominate execution. When goroutines appear blocked for no obvious reason, it usually indicates contention over shared data structures. The key is to study the execution flows during the blocked period at goroutine granularity to pinpoint the exact synchronization issue.
In this post, we will demonstrate a step-by-step workflow to diagnose hard-to-root-cause goroutine blocking problems using pprof execution profiles.

Reproducing the Issue

We have a sample application with several active goroutines processing requests from a shared queue. At random times, the processing seems to halt even though the queue keeps growing.
To simulate this, we created a benchmark test that drives high load through the system and injects an artificial delay of a few seconds at random points while accessing the shared queue. This allowed us to reproduce the blocking issue deterministically.
The benchmark simulates production load by spinning up 150 concurrent goroutines sending requests into the shared queue at a high rate. The goroutines processing queue items periodically get blocked even as requests keep piling up.
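
A condensed sketch of such a benchmark is shown below. The sharedQueue type, the delay probability, and the 2-second sleep are illustrative assumptions standing in for the real application's code, not the code that was actually profiled:


package queue

import (
  "math/rand"
  "sync"
  "testing"
  "time"
)

// sharedQueue is a hypothetical stand-in for the application's shared queue.
type sharedQueue struct {
  mu    sync.RWMutex
  items []int
}

// push simulates the production write path; the occasional multi-second sleep
// while holding the lock is the injected artificial delay.
func (q *sharedQueue) push(v int) {
  q.mu.Lock()
  defer q.mu.Unlock()
  if rand.Intn(500) == 0 {
    time.Sleep(2 * time.Second)
  }
  q.items = append(q.items, v)
}

func BenchmarkQueueUnderLoad(b *testing.B) {
  q := &sharedQueue{}
  var wg sync.WaitGroup
  for i := 0; i < 150; i++ { // 150 concurrent producers, mirroring production load
    wg.Add(1)
    go func() {
      defer wg.Done()
      for n := 0; n < b.N; n++ {
        q.push(n)
      }
    }()
  }
  wg.Wait()
}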

Capturing Execution Profile

The first step is to capture a CPU profile during an instance of prolonged blocking to understand what the goroutines are actually executing:


package main

import (
  "log"
  "os"
  "runtime/pprof"
)

func main() {

  // logic to detect blocking

  f, err := os.Create("cpu.prof")
  if err != nil {
    log.Fatal("could not create CPU profile: ", err)
  }
  defer f.Close()

  if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal("could not start CPU profile: ", err)
  }
  defer pprof.StopCPUProfile()

  // ... rest of the code
}

This generates a CPU profile that records stack traces of actively running goroutines every 10ms for the window during which blocking was detected. The profile data is written to the cpu.prof file.
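
If the service is long-running, the same profile can alternatively be captured over HTTP by importing net/http/pprof, which registers the /debug/pprof/ handlers. A minimal sketch, with localhost:6060 as an arbitrary choice of listen address:


package main

import (
  "log"
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
  // Expose the profiling endpoints on a side port; the application's real
  // request-processing work would run alongside this.
  go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
  }()

  select {} // stand-in for the rest of the application
}


A 30-second CPU profile can then be pulled from the running process with:

$ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30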

Analyzing the Profile

We can now explore the profile using go tool pprof, both on the command line and in its web-based graphical interface (go tool pprof -http=:8080 cpu.prof opens the interactive views in a browser). This provides a high-level overview of the hot functions and call graphs:


$ go tool pprof cpu.prof
(pprof) top
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
  162.50ms 50.78% 50.78%   162.50ms 50.78%  sync.(*RWMutex).Lock
      80ms    25% 75.78%       80ms    25%  runtime.gopark
      40ms  12.5% 88.28%       40ms  12.5%  main.processRequest
      20ms  6.25% 94.53%       20ms  6.25%  sync.(*RWMutex).Unlock
      10ms  3.13% 97.66%       10ms  3.13%  sync.(*Mutex).Unlock
       8ms  2.50%   100%        8ms  2.50%  runtime.schedule

The CPU profile provides a summarized view of the hottest functions across the 320ms of sampled time. We can immediately see that acquiring the RWMutex reader-writer lock guarding the queue accounts for the bulk of it, over 50%.

go tool pprof allows interactive exploration of the profile. We can drill down into the caller subgraphs to visualize the full calling context at each point. Focusing on the hot region around this lock acquisition lets us zoom in on the specific goroutines and their caller context:


goroutine 832 [running]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:297
# here the goroutine is parked/blocked

goroutine 819 [runnable]:
sync.(*RWMutex).Unlock(0xc00010c3a0)
        /usr/local/go/src/sync/rwmutex.go:199 +0x48

goroutine 843 [running]:
sync.(*RWMutex).RLock(0xc00010c200)
        /usr/local/go/src/sync/rwmutex.go:137 +0x90

This shows that goroutine 832, which is trying to acquire the write lock, is parked, while goroutine 843 is able to obtain the read lock.

Identifying the Root Cause

The profile analysis clearly shows that under concurrent access to the shared queue, the reader goroutines keep making progress while the writes get blocked.
This confirms that improperly synchronized RWMutex usage is the root cause of the blocking, with the write operations bottlenecked behind the readers.
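
To make this failure mode concrete, here is a small standalone illustration (not taken from the application) of a writer stalling on Lock while readers hold RLock for long stretches:


package main

import (
  "fmt"
  "sync"
  "time"
)

func main() {
  var mu sync.RWMutex

  // Three readers grab the read lock and hold it while doing slow work.
  for i := 0; i < 3; i++ {
    go func() {
      mu.RLock()
      defer mu.RUnlock()
      time.Sleep(3 * time.Second) // slow work performed under RLock
    }()
  }

  time.Sleep(100 * time.Millisecond) // let the readers acquire the lock first

  start := time.Now()
  mu.Lock() // parks here until every reader releases its RLock
  fmt.Printf("writer acquired the lock after %v\n", time.Since(start))
  mu.Unlock()
}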

Addressing the Problem

Now that we have identified the concurrency bug using pprof execution profiling, we can fix this by serializing all writes to the shared queue through a separate mutex. This will prevent the reads from blocking queued writes.
We also optimized the hot code areas detected in the profile analysis to minimize lock contention. Profiling the improved code shows significantly reduced CPU consumption around synchronization primitives.
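
Since the original queue code is not shown here, the snippet below is only one possible shape of that fix: a dedicated writeMu serializes enqueues, the RWMutex is held just long enough for the append or the copy, and any slow per-request work happens before either lock is taken. The fixedQueue and request types are illustrative assumptions:


package queue

import "sync"

type request struct{ payload string }

// fixedQueue is a hypothetical reworked queue: writeMu serializes writers,
// mu guards items for readers, and locks are held only briefly.
type fixedQueue struct {
  mu      sync.RWMutex // guards items for readers
  writeMu sync.Mutex   // serializes writers
  items   []request
}

func (q *fixedQueue) enqueue(r request) {
  // Any expensive preparation of r happens here, outside the locks.
  q.writeMu.Lock()
  q.mu.Lock()
  q.items = append(q.items, r) // critical section is just the append
  q.mu.Unlock()
  q.writeMu.Unlock()
}

func (q *fixedQueue) snapshot() []request {
  q.mu.RLock()
  defer q.mu.RUnlock()
  out := make([]request, len(q.items))
  copy(out, q.items)
  return out
}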

Takeaways

  • pprof CPU profiles give visibility into the exact code paths and goroutines involved during a blocking episode
  • Interactive exploration of the profile helps uncover contention on shared data structures
  • Look out for blocked and running goroutines contending on the same data structures
  • Careful synchronization is needed to prevent complex blocking and deadlocks in Go

With these learnings, we can prevent and troubleshoot similar blocking issues in large-scale Go applications. pprof helps uncover performance bottlenecks that are otherwise very hard to root-cause!
