Discover how to use pprof, Go's profiling tool, to tackle blocking issues in Go applications. This guide outlines steps to pinpoint and remedy performance bottlenecks stemming from goroutine synchronization and shared memory access.
Goroutines and channels are powerful concurrency primitives that enable developers to write highly performant and scalable applications in Go. However, improperly synchronized access to shared memory between goroutines can lead to race conditions and deadlocks that manifest as blocking issues. Tracking down the root cause of blocking in complex applications with many goroutines can be quite challenging. This is where pprof, Go's built-in profiling tool, comes in handy.
By generating CPU and heap profiles, pprof allows us to understand how the application executes over time and to identify the hot code paths that dominate execution. When goroutines appear blocked for no obvious reason, it usually indicates contention over shared data structures. The key is to study the execution flows during the blocked period at goroutine granularity to pinpoint the exact synchronization issue.
In this post, we will demonstrate a step-by-step workflow to diagnose hard-to-root-cause goroutine blocking problems using pprof execution profiles.
We have a sample application with several active goroutines processing requests from a shared queue. At random times, the processing seems to halt even though the queue keeps growing.
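The application's code isn't part of this post, so here is a minimal sketch of its shape to follow along with; all names (Queue, Request, Enqueue, Dequeue, Len) are illustrative rather than taken from the real code:

```go
package queue

import "sync"

// Request is a unit of work placed on the shared queue.
type Request struct{ ID int }

// Queue is a FIFO shared between producer and worker goroutines,
// guarded by a single RWMutex.
type Queue struct {
	mu    sync.RWMutex
	items []Request
}

// Enqueue appends a request under the write lock.
func (q *Queue) Enqueue(r Request) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, r)
}

// Dequeue removes the oldest request; ok is false when the queue is empty.
func (q *Queue) Dequeue() (r Request, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return Request{}, false
	}
	r = q.items[0]
	q.items = q.items[1:]
	return r, true
}

// Len is polled frequently by workers under the read lock.
func (q *Queue) Len() int {
	q.mu.RLock()
	defer q.mu.RUnlock()
	return len(q.items)
}
```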
To simulate this, we created a benchmark test that drives high load through the system and randomly injects an artificial delay of a few seconds while accessing the shared queue. This allowed us to reproduce the blocking issue deterministically.
The benchmark is designed to simulate production load by spinning up 150 concurrent goroutines that send requests into the shared queue at a high rate. The goroutines processing queue items still get blocked intermittently even as requests pile up.
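The benchmark isn't reproduced verbatim either; a rough sketch building on the Queue above (it needs the math/rand, sync, testing, and time imports) looks something like this, with the 150-goroutine fan-out and the occasional multi-second hold of the queue's read lock being the parts that matter:

```go
func BenchmarkQueueLoad(b *testing.B) {
	q := &Queue{}
	var wg sync.WaitGroup

	// 150 producer goroutines hammer the shared queue concurrently.
	for g := 0; g < 150; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < b.N; i++ {
				q.Enqueue(Request{ID: id})

				// Occasionally hold the read lock for a few seconds to
				// simulate the blocking observed in production.
				if rand.Intn(1000) == 0 {
					q.mu.RLock()
					time.Sleep(2 * time.Second)
					q.mu.RUnlock()
				}
			}
		}(g)
	}

	// Worker goroutines drain the queue while the producers run.
	done := make(chan struct{})
	for w := 0; w < 8; w++ {
		go func() {
			for {
				select {
				case <-done:
					return
				default:
					q.Dequeue()
				}
			}
		}()
	}

	wg.Wait()
	close(done)
}
```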
The first step is to capture a CPU profile during an instance of prolonged blocking to understand what the goroutines are actually executing:
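Assuming the benchmark reproduction above, something along these lines writes the profile to cpu.prof (the second form is an alternative for a live service that already exposes net/http/pprof on port 6060):

```sh
# Run the benchmark long enough to hit the blocking and record a CPU profile.
go test -bench=BenchmarkQueueLoad -benchtime=30s -cpuprofile=cpu.prof .

# Alternative for a running service with net/http/pprof enabled:
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
```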
This generates a CPU profile that samples the stacks of the actively running goroutines every 10ms for the window in which blocking was observed. The profile data is written to the cpu.prof file.
We can now visualize the profile using go tool pprof, both on the command line and through its web UI in a browser such as Chromium. This provides a high-level overview of the hot functions and call graphs:
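Assuming the profile landed in cpu.prof, the two usual entry points are the interactive terminal session and the browser-based web UI:

```sh
# Interactive terminal session; `top -cum` lists the hottest functions by cumulative time.
go tool pprof cpu.prof

# Web UI with flame graph and call graph views, served on localhost:8080.
go tool pprof -http=:8080 cpu.prof
```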
The CPU profile summarizes the hottest functions, which cumulatively consumed over 320ms during the profiled window. We can immediately see that contention in acquiring the RWMutex reader-writer lock guarding the queue is responsible for the bulk of the time - over 50%.
go tool pprof also allows interactive exploration of the profile. We can drill down into the caller subgraphs to visualize the full calling context at each point. Focusing on the hot region around this lock acquisition lets us zoom into the specific goroutines and caller contexts involved:
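Inside the interactive session, a drill-down along these lines narrows the view to the lock-acquisition path (Enqueue is the illustrative writer from the earlier sketch, not the real function name):

```sh
(pprof) top10 -cum     # hottest call paths by cumulative time
(pprof) peek RWMutex   # callers and callees of the RWMutex acquisition
(pprof) list Enqueue   # annotated source for the hot writer path
(pprof) web Enqueue    # render just the call-graph subgraph around Enqueue
```

The per-goroutine detail described next (which goroutine is parked on which lock) comes from pairing the CPU profile with a goroutine dump taken during the stall, for example /debug/pprof/goroutine?debug=2 when net/http/pprof is enabled.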
This shows that goroutine 832, which is trying to acquire the write lock, is parked, while goroutine 843 has successfully obtained the read lock.
The profile analysis clearly shows that concurrent access to the shared queue allows the reader goroutines to make progress while writes get blocked.
This confirms that improperly synchronized RWMutex usage is the root cause of the blocking, with the write operations bottlenecked behind the readers.
Now that we have identified the concurrency bug using pprof execution profiling, we can fix it by serializing all writes to the shared queue through a separate mutex. This prevents the reads from starving queued writes.
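A sketch of that fix against the hypothetical Queue from earlier (names still illustrative): writers serialize on a dedicated mutex, so at most one writer at a time contends with the readers on the underlying RWMutex.

```go
// Queue with writes serialized through a separate mutex.
type Queue struct {
	writeMu sync.Mutex   // serializes all writers
	mu      sync.RWMutex // still guards the underlying slice
	items   []Request
}

// Enqueue takes the write-serialization mutex first, so queued writers
// line up here instead of all piling onto the RWMutex at once.
func (q *Queue) Enqueue(r Request) {
	q.writeMu.Lock()
	defer q.writeMu.Unlock()

	q.mu.Lock()
	q.items = append(q.items, r)
	q.mu.Unlock()
}
```

Keeping the critical section under mu as small as possible (just the append) is what actually shortens the window in which readers and the single active writer can block each other.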
We also optimized the hot code areas detected in the profile analysis to minimize lock contention. Profiling the improved code shows significantly reduced CPU consumption around synchronization primitives.
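One way to quantify that improvement is to re-run the same benchmark against the fixed code and diff the two profiles with pprof's -diff_base flag, which reports per-function deltas relative to the original profile:

```sh
# Profile the fixed code under the same load.
go test -bench=BenchmarkQueueLoad -benchtime=30s -cpuprofile=cpu_fixed.prof .

# Compare against the original profile; `top` then shows per-function deltas.
go tool pprof -diff_base=cpu.prof cpu_fixed.prof
```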
Takeaways
With these learnings, we can prevent and troubleshoot similar blocking issues in large-scale Go applications. pprof helps uncover performance bottlenecks that are otherwise very hard to root-cause!