This is Part 2 of the pprof series. If you’re new to pprof, start with Part 1: Quick Start.
Last week I had a service that was “slow.” The recommendation endpoint was taking around 28ms per request. Not terrible, but not great either. Management wanted it faster.
I could have guessed. Maybe the database? Maybe JSON serialization? Maybe string operations? All reasonable guesses. All probably wrong.
So I profiled it. The results surprised me.
The Setup
Before we dive into the data, here’s what I was working with: a content recommendation service that analyzes articles and finds similar ones. The /recommend endpoint was the slow one. It needs to:
- Load an article
- Compare it with 200 other articles
- Calculate similarity scores
- Return the top 10 matches
Simple enough. But 28ms seemed high for what should be straightforward operations.
package main

import (
    "compress/gzip"
    "crypto/sha256"
    "encoding/json"
    "fmt"
    "log"
    "math"
    "net/http"
    _ "net/http/pprof"
    "sort"
    "strings"
    "sync"
    "time"
)

type Article struct {
    ID      string   `json:"id"`
    Title   string   `json:"title"`
    Content string   `json:"content"`
    Author  string   `json:"author"`
    Tags    []string `json:"tags"`
    Views   int      `json:"views"`
    Likes   int      `json:"likes"`
}

type AnalyticsResult struct {
    ArticleID       string   `json:"article_id"`
    ReadingTime     int      `json:"reading_time_minutes"`
    Complexity      float64  `json:"complexity_score"`
    Keywords        []string `json:"keywords"`
    RelatedArticles []string `json:"related_articles"`
    Compressed      string   `json:"compressed_preview"`
    Signature       string   `json:"signature"`
    ProcessingMs    int64    `json:"processing_ms"`
}

var (
    articles []Article
    cache    = make(map[string]AnalyticsResult)
    cacheMu  sync.RWMutex
)

func init() {
    topics := []string{
        "Golang has excellent concurrency primitives with goroutines and channels",
        "Database indexing strategies can significantly improve query performance",
        "Microservices architecture provides flexibility but introduces complexity",
        "Machine learning models require careful feature engineering",
        "Container orchestration with Kubernetes enables scalable deployment",
    }
    authors := []string{"Alice", "Bob", "Charlie", "Diana", "Eve"}
    for i := 0; i < 200; i++ {
        content := ""
        for j := 0; j < 5+i%10; j++ {
            content += topics[j%len(topics)] + ". "
        }
        articles = append(articles, Article{
            ID:      fmt.Sprintf("article-%d", i),
            Title:   fmt.Sprintf("Article %d: Advanced Topics", i),
            Content: content,
            Author:  authors[i%len(authors)],
            Tags:    []string{"golang", "programming", "architecture"},
            Views:   1000 + i*100,
            Likes:   50 + i*5,
        })
    }
}

func recommendHandler(w http.ResponseWriter, r *http.Request) {
    articleID := r.URL.Query().Get("id")
    if articleID == "" {
        articleID = "article-0"
    }
    var article *Article
    for i := range articles {
        if articles[i].ID == articleID {
            article = &articles[i]
            break
        }
    }
    if article == nil {
        http.Error(w, "article not found", http.StatusNotFound)
        return
    }
    start := time.Now()
    recommendations := findSimilarArticles(*article, 10)
    processingMs := time.Since(start).Milliseconds()
    response := map[string]interface{}{
        "article_id":      articleID,
        "recommendations": recommendations,
        "processing_ms":   processingMs,
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(response)
}

// Problem: N² similarity comparison
func findSimilarArticles(article Article, limit int) []string {
    type similarity struct {
        id    string
        score float64
    }
    var similar []similarity
    for _, other := range articles {
        if other.ID == article.ID {
            continue
        }
        score := calculateSimilarity(article, other)
        similar = append(similar, similarity{other.ID, score})
    }
    sort.Slice(similar, func(i, j int) bool {
        return similar[i].score > similar[j].score
    })
    result := make([]string, 0, limit)
    for i := 0; i < limit && i < len(similar); i++ {
        result = append(result, similar[i].id)
    }
    return result
}

// Problem: Expensive similarity calculation
func calculateSimilarity(a1, a2 Article) float64 {
    // Create word sets (inefficient - done on every call)
    words1 := strings.Fields(strings.ToLower(a1.Content))
    words2 := strings.Fields(strings.ToLower(a2.Content))
    // Count common words (nested loops - O(n*m))
    common := 0
    for _, w1 := range words1 {
        for _, w2 := range words2 {
            if w1 == w2 {
                common++
                break
            }
        }
    }
    wordScore := float64(common) / math.Max(float64(len(words1)), float64(len(words2)))
    return wordScore * 100
}

func main() {
    http.HandleFunc("/recommend", recommendHandler)
    log.Println("Server starting on :8080")
    log.Println("pprof available at http://localhost:8080/debug/pprof/")
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Collecting the Profile
While hitting the endpoint with load, I collected a 30-second CPU profile:
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
Then opened it:
go tool pprof -http=:9090 cpu.prof
The web UI gives you multiple views. Let’s walk through each one.
The Flame Graph - Your First Stop
The flame graph is where you start. It shows the call stack horizontally, with width representing time spent. Here’s what mine looked like:

Immediately obvious: there’s a massive block in the middle. The wide sections are where time goes. Everything else is noise.
Looking closer at the flame graph, I can see:
- runtime.systemstack takes up half the width - that’s Go runtime overhead
- net/http handling is there but not dominant
- There’s a chunk of my code (main.* functions) but it’s not the widest part
The flame graph gives you intuition. But you need numbers to know what to fix.
The Top View - Raw Numbers
Switch to the Top view in the web UI or run:
go tool pprof -top cpu.prof
Here’s what I got:
File: pprof2
Type: cpu
Time: 2026-01-30 14:26:44 CET
Duration: 30.13s, Total samples = 1.62s ( 5.38%)
Showing nodes accounting for 1.62s, 100% of 1.62s total
flat flat% sum% cum cum%
0.68s 41.98% 41.98% 0.68s 41.98% syscall.syscall
0.46s 28.40% 70.37% 0.46s 28.40% runtime.kevent
0.10s 6.17% 76.54% 0.10s 6.17% runtime.pthread_cond_wait
0.10s 6.17% 82.72% 0.10s 6.17% runtime.pthread_kill
0.05s 3.09% 85.80% 0.05s 3.09% runtime.madvise
0.05s 3.09% 88.89% 0.05s 3.09% runtime.pthread_cond_signal
0.03s 1.85% 90.74% 0.09s 5.56% main.calculateSimilarity
0.02s 1.23% 91.98% 0.02s 1.23% runtime.stealWork
0.01s 0.62% 92.59% 0.01s 0.62% internal/runtime/atomic.(*Uint32).Load (inline)
0.01s 0.62% 93.21% 0.01s 0.62% runtime.asyncPreempt
0.01s 0.62% 93.83% 0.01s 0.62% runtime.findObject
0.01s 0.62% 94.44% 0.14s 8.64% runtime.findRunnable
0.01s 0.62% 95.06% 0.03s 1.85% runtime.goschedImpl
0.01s 0.62% 95.68% 0.01s 0.62% runtime.memclrNoHeapPointers
0.01s 0.62% 96.30% 0.01s 0.62% runtime.memequal
0.01s 0.62% 96.91% 0.01s 0.62% runtime.pollWork
0.01s 0.62% 97.53% 0.02s 1.23% runtime.scanobject
0.01s 0.62% 98.15% 0.01s 0.62% runtime.typePointers.next
0.01s 0.62% 98.77% 0.01s 0.62% sort.partition_func
0.01s 0.62% 99.38% 0.03s 1.85% strings.Fields
0.01s 0.62% 100% 0.01s 0.62% syscall.syscall6
0 0% 100% 0.64s 39.51% bufio.(*Writer).Flush
Wait. My code (main.calculateSimilarity) shows only 0.03s flat (1.85%) but 0.09s cumulative (5.56%). What does this mean?
Understanding Flat vs Cumulative
This is critical. Let me break it down:
- Flat (self) time: Time spent in the function’s own code
- Cumulative time: Time spent in the function + everything it calls
Look at main.calculateSimilarity:
- Flat: 0.03s (1.85%) - barely any time in the function itself
- Cumulative: 0.09s (5.56%) - significant time including callees
This tells me the function is fast, but something it calls is slow.
Now look at syscall.syscall:
- Flat: 0.68s (41.98%)
- Cumulative: 0.68s (41.98%)
Flat equals cumulative. This function IS the work. It’s not delegating. Most of that is I/O - writing HTTP responses back to clients.
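If the distinction still feels abstract, here’s a tiny standalone program you can profile yourself (nothing here comes from the recommendation service - the names are made up for illustration). parent shows almost no flat time because it only delegates; work shows flat roughly equal to cumulative because its loop is the work:

package main

import "fmt"

// parent only delegates: low flat time, high cumulative time.
func parent() int {
    return work() + work()
}

// work burns the CPU itself: flat time roughly equals cumulative time.
func work() int {
    sum := 0
    for i := 0; i < 50_000_000; i++ {
        sum += i % 7
    }
    return sum
}

func main() {
    fmt.Println(parent())
}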
Finding the Real Bottleneck
Let’s look at cumulative time sorted:
go tool pprof -top -cum cpu.prof
File: pprof2
Type: cpu
Time: 2026-01-30 14:26:44 CET
Duration: 30.13s, Total samples = 1.62s ( 5.38%)
Showing nodes accounting for 1.62s, 100% of 1.62s total
flat flat% sum% cum cum%
0 0% 0% 0.77s 47.53% net/http.(*conn).serve
0.68s 41.98% 41.98% 0.68s 41.98% syscall.syscall
0 0% 41.98% 0.64s 39.51% bufio.(*Writer).Flush
0 0% 41.98% 0.64s 39.51% internal/poll.(*FD).Write
0 0% 41.98% 0.64s 39.51% internal/poll.ignoringEINTRIO (inline)
0 0% 41.98% 0.64s 39.51% net.(*conn).Write
0 0% 41.98% 0.64s 39.51% net.(*netFD).Write
0 0% 41.98% 0.64s 39.51% net/http.(*response).finishRequest
0 0% 41.98% 0.64s 39.51% net/http.checkConnErrorWriter.Write
0 0% 41.98% 0.64s 39.51% runtime.systemstack
0 0% 41.98% 0.64s 39.51% syscall.Write (inline)
0 0% 41.98% 0.64s 39.51% syscall.write
0.46s 28.40% 70.37% 0.46s 28.40% runtime.kevent
0 0% 70.37% 0.44s 27.16% runtime.gcStart.func4
0 0% 70.37% 0.44s 27.16% runtime.netpoll
0 0% 70.37% 0.44s 27.16% runtime.startTheWorldWithSema
0 0% 70.37% 0.15s 9.26% runtime.schedule
0.01s 0.62% 70.99% 0.14s 8.64% runtime.findRunnable
0 0% 70.99% 0.13s 8.02% runtime.mcall
0 0% 70.99% 0.12s 7.41% runtime.park_m
0 0% 70.99% 0.11s 6.79% runtime.gcBgMarkWorker.func2
0 0% 70.99% 0.11s 6.79% runtime.gcDrain
0 0% 70.99% 0.10s 6.17% main.findSimilarArticles
0 0% 70.99% 0.10s 6.17% main.recommendHandler
0 0% 70.99% 0.10s 6.17% net/http.(*ServeMux).ServeHTTP
0 0% 70.99% 0.10s 6.17% net/http.HandlerFunc.ServeHTTP
0 0% 70.99% 0.10s 6.17% net/http.serverHandler.ServeHTTP
0 0% 70.99% 0.10s 6.17% runtime.gcDrainMarkWorkerDedicated (inline)
Now we’re getting somewhere. main.findSimilarArticles has 0.10s cumulative (6.17%), and calculateSimilarity has 0.09s cumulative (5.56%). This means findSimilarArticles calls calculateSimilarity, and that’s where most of its time goes.
Let’s look at the call relationship:
go tool pprof cpu.prof
(pprof) peek findSimilarArticles
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
0.10s 100% | main.recommendHandler
0 0% 0% 0.10s 6.17% | main.findSimilarArticles
0.09s 90.00% | main.calculateSimilarity
0.01s 10.00% | sort.Slice
----------------------------------------------------------+-------------
This shows:
- findSimilarArticles is called by recommendHandler (100% of calls in this profile)
- 90% of its time is spent in calculateSimilarity
- 10% is spent sorting the results
Looking at the Source Code
Now let’s see what calculateSimilarity actually does:
go tool pprof -list=main.calculateSimilarity cpu.prof
ROUTINE ======================== main.calculateSimilarity in pprof2.go
30ms 90ms (flat, cum) 5.56% of Total
. 10ms 389:func calculateSimilarity(a1, a2 Article) float64 {
. . 390: // Create word sets (inefficient)
. 10ms 391: words1 := strings.Fields(strings.ToLower(a1.Content))
. 20ms 392: words2 := strings.Fields(strings.ToLower(a2.Content))
. . 393:
. . 394: // Count common words (nested loops)
. . 395: common := 0
. . 396: for _, w1 := range words1 {
30ms 40ms 397: for _, w2 := range words2 {
. 10ms 398: if w1 == w2 {
. . 399: common++
. . 400: break
. . 401: }
. . 402: }
. . 403: }
Line 391: 10ms in strings.Fields and strings.ToLower
Line 392: 20ms doing the same thing again
Line 397-398: 50ms in nested loops comparing strings
Total: 90ms cumulative. That’s 5.56% of the profile’s CPU time - but remember, the profile only captured 1.62s of actual CPU work during the 30-second window.
The Graph View - Seeing Relationships
The graph view shows the call relationships visually:

You can see:
- main.calculateSimilarity is called from findSimilarArticles
- It calls strings.ToLower and strings.Fields repeatedly
- The edges show how much time flows through each call path
The thickness of edges matters. Thick edge = lots of time.
What About strings.Fields?
Let’s look deeper at what strings.Fields is doing:
go tool pprof -list=strings.Fields cpu.prof
ROUTINE ======================== strings.Fields in strings/strings.go
10ms 30ms (flat, cum) 1.85% of Total
. . 384:func Fields(s string) []string {
. . 385: // First count the fields.
. . 386: // This is an exact count if s is ASCII, otherwise it is an approximation.
. . 387: n := 0
. . 388: wasSpace := 1
. . 389: // setBits is used to track which bits are set in the bytes of s.
. . 390: setBits := uint8(0)
. . 391: for i := 0; i < len(s); i++ {
. . 392: r := s[i]
. . 393: setBits |= r
10ms 10ms 394: isSpace := int(asciiSpace[r])
. . 395: n += wasSpace & ^isSpace
. . 396: wasSpace = isSpace
. . 397: }
. . ...
. 10ms 404: a := make([]string, n)
. . ...
. 10ms 418: a[na] = s[fieldStart:i]
strings.Fields is scanning the entire string character by character. It’s doing work. And we’re calling it twice per similarity comparison - once for each article.
How many times is calculateSimilarity called? Let’s check findSimilarArticles:
go tool pprof -list=main.findSimilarArticles cpu.prof
ROUTINE ======================== main.findSimilarArticles in pprof2.go
0 100ms (flat, cum) 6.17% of Total
. . 356:func findSimilarArticles(article Article, limit int) []string {
. . ...
. . 364: // Compare with every other article (N²)
. . 365: for _, other := range articles {
. . 366: if other.ID == article.ID {
. . 367: continue
. . 368: }
. . 369:
. 90ms 370: score := calculateSimilarity(article, other)
. . 371: similar = append(similar, similarity{other.ID, score})
. . 372: }
. . 373:
. . 374: // Sort by similarity
. 10ms 375: sort.Slice(similar, func(i, j int) bool {
We have 200 articles. For each recommendation request, we compare the target article against all 200. That’s 200 calls to calculateSimilarity. Each call does strings.Fields twice.
200 articles × 2 calls to strings.Fields = 400 calls to strings.Fields per request.
No wonder it’s slow.
Filtering the Noise
The top view shows a lot of runtime functions. What if we only want to see our code?
go tool pprof cpu.prof
(pprof) focus=main
(pprof) top20
Active filters:
focus=main
Showing nodes accounting for 0.14s, 8.64% of 1.62s total
flat flat% sum% cum cum%
0.03s 1.85% 1.85% 0.09s 5.56% main.calculateSimilarity
0.01s 0.62% 7.41% 0.03s 1.85% strings.Fields
0 0% 8.64% 0.10s 6.17% main.findSimilarArticles
0 0% 8.64% 0.04s 2.47% main.main
0 0% 8.64% 0.10s 6.17% main.recommendHandler
Now we only see functions related to our code. The percentages changed because we’re filtering, but the relationships remain the same.
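If you prefer staying out of the interactive shell, the same filter can be passed as a flag (recent pprof versions accept focus on the command line):

go tool pprof -top -focus=main cpu.prof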
The Complete Picture
Let’s trace the entire call path:
main.recommendHandler (6.17% of profile)
→ main.findSimilarArticles (6.17% of profile)
→ main.calculateSimilarity (5.56% of profile)
→ strings.ToLower + strings.Fields - called 400x per request
→ nested loop comparisons (rest of time)
The problem is algorithmic. We’re doing O(N²) work:
- N articles to compare
- For each comparison, we process the text of both articles
- Text processing involves multiple passes (ToLower, Fields)
- Then we do nested loops to find common words (another O(M×N) where M, N are word counts)
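The direction of the fix follows directly from that: tokenize each article once, up front, and compare word sets with map lookups instead of nested loops. Here’s a rough sketch of what I mean - hypothetical code, not the final version (that comes in Part 4). It assumes the strings and math imports already in the sample, and note that a set-based score dedupes repeated words, so the numbers won’t exactly match the original formula:

// wordSet tokenizes content once; call this at startup, not on every comparison.
func wordSet(content string) map[string]struct{} {
    set := make(map[string]struct{})
    for _, w := range strings.Fields(strings.ToLower(content)) {
        set[w] = struct{}{}
    }
    return set
}

// similarityFromSets counts shared words with map lookups instead of nested loops.
func similarityFromSets(s1, s2 map[string]struct{}) float64 {
    common := 0
    for w := range s1 {
        if _, ok := s2[w]; ok {
            common++
        }
    }
    return float64(common) / math.Max(float64(len(s1)), float64(len(s2))) * 100
}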
Other Endpoints to Profile
The /recommend endpoint only exercises findSimilarArticles and calculateSimilarity. If we profiled the /analyze or /author-stats endpoints, we’d see calculateComplexity appear in the profile too.
Looking at the source code, calculateComplexity has the same pattern of inefficiency:
func calculateComplexity(content string) float64 {
    words := strings.Fields(content) // First pass
    // ...
    sentences := strings.Split(content, ".") // Second pass
    for i, sentence := range sentences {
        wordCount := float64(len(strings.Fields(sentence))) // Third pass, in a loop!
        // ...
    }
}
Three passes over the same data. If this function shows up in your profile, you’d want to refactor it to do a single pass.
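For reference, a one-pass version might look like the sketch below. It’s hypothetical - the actual scoring formula is elided above behind the // ... comments, so this only shows the counting collapsed into a single scan:

// countWordsAndSentences walks content exactly once instead of three times.
func countWordsAndSentences(content string) (words, sentences int) {
    inWord := false
    for i := 0; i < len(content); i++ {
        switch c := content[i]; {
        case c == '.':
            sentences++
            inWord = false
        case c == ' ' || c == '\t' || c == '\n':
            inWord = false
        default:
            if !inWord {
                words++
            }
            inWord = true
        }
    }
    return words, sentences
}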
The Performance Impact
The profile shows 1.62s of CPU time over 30.13 seconds. That’s only 5.38% of wall-clock time captured. Why so low?
- Most time was idle (waiting for requests between load generator calls)
- Profile samples at 100Hz (every 10ms)
- Our requests are short, so samples are sparse
- A lot of time is spent in I/O (writing responses), which shows as syscall time
The captured 1.62s represents the actual CPU work being done during those 30 seconds of profiling. Of that:
- ~42% was syscall overhead (writing HTTP responses)
- ~28% was runtime.kevent (network polling)
- ~6% was our recommendation logic
What to Optimize?
Based on this profile, here’s what I’d fix, in order:
1. Fix calculateSimilarity (5.56% of CPU)
- Pre-process article text once, not on every comparison
- Store lowercased words, not raw content
- Use a map for lookups instead of nested loops
2. Fix calculateComplexity (when it appears in profiles)
- Single pass through the text
- Don’t call strings.Fields in a loop
3. Consider caching
- Similarity scores don’t change if articles don’t change
- Could cache the results
The first fix could significantly reduce the time spent in our code. The syscall and runtime overhead is harder to eliminate - that’s just the cost of serving HTTP requests.
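On the caching point (item 3), a minimal read-through cache around findSimilarArticles could look like the sketch below. The recCache map and recCacheMu mutex are hypothetical additions - the cache/cacheMu pair already declared in the sample stores AnalyticsResult values, so it doesn’t fit here - and it reuses the sync import and ignores invalidation entirely:

var (
    recCache   = make(map[string][]string) // article ID -> recommendation IDs
    recCacheMu sync.RWMutex
)

func cachedRecommendations(article Article, limit int) []string {
    recCacheMu.RLock()
    recs, ok := recCache[article.ID]
    recCacheMu.RUnlock()
    if ok {
        return recs
    }

    // Two goroutines may compute the same result concurrently; for a cache
    // that's an acceptable simplification.
    recs = findSimilarArticles(article, limit)

    recCacheMu.Lock()
    recCache[article.ID] = recs
    recCacheMu.Unlock()
    return recs
}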
How CPU Profiling Works
Now that we’ve seen real data, let’s talk about how pprof actually works.
Sampling, Not Tracing
CPU profiling uses statistical sampling. Every 10ms (100Hz by default), the Go runtime:
- Interrupts the program
- Captures the current program counter (PC)
- Walks the stack to get the full call chain
- Records this in a hash table
- Resumes execution
If the same stack appears multiple times, its counter increments. After profiling ends, you have a statistical distribution of where time was spent.
This approach has minimal overhead (~1-5%) because it’s not instrumenting every function call. It’s sampling.
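The same sampler is available outside of net/http/pprof. For a CLI tool, a batch job, or a quick experiment, you can drive it yourself with runtime/pprof - a minimal sketch, with the output file name being an arbitrary choice:

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof") // arbitrary output path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil { // starts the sampler
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile() // flushes the collected samples

    doWork()
}

// doWork stands in for whatever you want to measure.
func doWork() {
    sum := 0
    for i := 0; i < 100_000_000; i++ {
        sum += i % 13
    }
    _ = sum
}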
What Gets Counted
Only on-CPU time is counted. If your goroutine is:
- Blocked on a channel: not counted
- Waiting for I/O: not counted
- Sleeping: not counted
- Waiting on a mutex: not counted
CPU profiling shows where your program burns CPU cycles, not where it waits.
Sample Accuracy
With 100Hz sampling (10ms intervals), you get one sample per 10ms of CPU time. Over 30 seconds, that’s up to 3,000 samples. My profile had 1.62s of samples, so about 162 samples captured.
For a function to show up reliably, it needs to be on the stack when samples are taken. Functions that run for less than 10ms might be underrepresented or missing entirely.
This is why you profile for 30-60 seconds, not 5 seconds. More samples = more accurate picture. If your profile shows very few samples (like mine with only 5.38% captured), you might need to generate more load during profiling.
Inlining Complications
Go’s compiler aggressively inlines functions. This affects profiles:
func outer() {
    inner() // This might be inlined
}

func inner() {
    // Do work
}
If inner() is inlined, the profile shows the time in outer(), not inner(). The profiler tries to reconstruct the logical call stack using inline metadata, but it’s not perfect.
Small functions that get inlined might not appear in profiles at all.
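If you suspect this is happening to a function you care about, you can pin it down while investigating. The //go:noinline directive keeps a function as its own stack frame (building with -gcflags=-l disables inlining program-wide, which is heavier-handed). Applied to the inner function from the example above:

//go:noinline
func inner() {
    // With the directive, inner keeps its own frame and shows up
    // in the profile under its own name instead of inside outer.
    // Do work
}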
Common Patterns I’ve Seen
After profiling dozens of Go services, here are the common CPU bottlenecks:
Pattern 1: String Operations in Loops
// Shows up as time in strings.* functions
for _, item := range items {
    result += item // string concatenation
}
Fix: Use strings.Builder or pre-allocate.
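A sketch of the strings.Builder version, assuming items is a []string as in the snippet above (and the strings import):

// join builds the result in one growing buffer instead of
// allocating a new string on every += iteration.
func join(items []string) string {
    var b strings.Builder
    for _, item := range items {
        b.WriteString(item)
    }
    return b.String()
}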
Pattern 2: JSON Marshaling in Hot Paths
// Shows up as time in encoding/json
for _, item := range items {
    json.Marshal(item) // expensive
}
Fix: Marshal once, or use a more efficient encoding.
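The marshal-once variant is usually as simple as encoding the whole slice in one call. A sketch - Item is a placeholder type, and it assumes the encoding/json and net/http imports:

func writeItems(w http.ResponseWriter, items []Item) error {
    // One Marshal call for the whole slice instead of one per element.
    payload, err := json.Marshal(items)
    if err != nil {
        return err
    }
    _, err = w.Write(payload)
    return err
}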
Pattern 3: Repeated Work
// Calling the same function with same input repeatedly
for _, request := range requests {
    processData(getData()) // getData() called every time
}
Fix: Hoist invariants out of loops.
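The hoisted version of the same loop (getData and processData are the same placeholders as above):

// getData's result doesn't depend on the loop, so fetch it once.
data := getData()
for range requests {
    processData(data)
}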
Pattern 4: Reflection-Heavy Code
// Shows up as time in reflect.* functions
for _, item := range items {
    structToMap(item) // uses reflection
}
Fix: Generate code, use concrete types, or cache reflection results.
What the Profile Doesn’t Tell You
CPU profiles don’t show:
Waiting time: If your goroutine is blocked, CPU profiling won’t help. Use block profiling or tracing.
Memory allocations: CPU profile might show runtime.mallocgc, but heap profiling is better for this.
Lock contention: Use mutex profiling for this.
I/O bottlenecks: CPU shows syscalls, but not what happens during the syscall.
GC pauses: CPU shows GC work, but not stop-the-world pauses. Use tracing for this.
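As a side note: block and mutex profiles are off by default, so they have to be switched on at startup before they report anything. A sketch for the sample service (sampling rates chosen arbitrarily; needs the runtime import):

func init() {
    // Sample roughly one blocking event per millisecond spent blocked.
    runtime.SetBlockProfileRate(1_000_000)
    // Record roughly 1 in 100 mutex contention events.
    runtime.SetMutexProfileFraction(100)
}

With net/http/pprof already imported, the data then shows up under /debug/pprof/block and /debug/pprof/mutex.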
Next Steps
We’ve found the bottleneck in our application code: calculateSimilarity is doing too much work, too often. The N² algorithm with repeated string processing is the main culprit. In Part 3, we’ll look at memory profiling to see if we’re also wasting memory. In Part 4, we’ll fix these issues and compare profiles to measure improvement.
But the process is always the same:
- Collect profile under realistic load
- Look at flame graph for intuition
- Check top view for numbers
- Use list to see source code
- Understand flat vs cumulative time
- Fix the biggest issue first
- Profile again
Don’t guess. Measure. The profile will surprise you every time.
Exercises for You
If you want to practice, try this with the sample application from this article:
- Run go run pprof2.go
- Generate load (fish syntax): for i in (seq 1 100); curl -s http://localhost:8080/recommend > /dev/null; end
- Collect a CPU profile while generating load
- Find the three most expensive functions (by cumulative time)
- Use list to see where time is spent in each
- Make a hypothesis about what’s slow
- Fix one thing
- Profile again and compare with -base
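If you haven’t used -base before, the comparison looks roughly like this, assuming you saved the before and after profiles under these names:

go tool pprof -top -base=cpu_before.prof cpu_after.prof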
Do this a few times, and reading profiles becomes second nature.
In the next part, we’ll dive into memory profiling. Because CPU time is one thing, but memory is another beast entirely.
← Previous: Part 1: Quick Start
Tags: #golang #pprof