How to Profile Rust Code with perf

The CPU is busy, but you don't know where

Your Rust program feels sluggish. You added a loop to process a million items, and now the UI freezes for a second. You suspect the loop, but you're not sure which line inside it is the culprit. You tried adding println! debug statements, but that changes the timing and clutters the output. You guessed at a bottleneck, optimized it, and the program is still slow.

Guessing is slow. You need data. You need to see where the CPU is actually spending its time.

Profiling gives you a map of execution. It shows which functions consume the most CPU time and which call paths lead to them. With the right events, it can also show where cache misses and branch mispredictions occur. perf is the standard tool for this on Linux. It uses the kernel's perf_events subsystem to read the CPU's hardware performance counters and samples your program with low overhead.

Sampling vs tracing: why `perf` wins for CPU

Profilers fall into two categories. Tracing (instrumenting) profilers record every function entry and exit. They give you exact call counts, but the instrumentation can slow your program down by an order of magnitude or more. The slowdown changes the behavior of your code. Timings shift. Race conditions appear or disappear. The profile no longer describes the program you ship.

Sampling profilers take a different approach. They interrupt your program periodically and ask, "What are you doing right now?" Each interrupt, say one every millisecond, yields a snapshot of the call stack. Over time, these snapshots build a histogram. Functions that appear in 90% of the snapshots are hot. Functions that appear in 1% are cold.
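The histogram idea can be sketched as a small simulation. This is not how perf is implemented; it is a toy model under stated assumptions: the Span type, the sample_timeline function, and the "parse" and "log" workload names are all hypothetical, and the timeline is a fixed list rather than a running program.

```rust
use std::collections::HashMap;

/// One stretch of simulated execution: which function is on the CPU and for how long.
/// (Hypothetical type, used only for this illustration.)
struct Span {
    function: &'static str,
    duration_us: u64,
}

/// Take a sample every `interval_us` microseconds and count which function
/// was running at each sample point.
fn sample_timeline(timeline: &[Span], interval_us: u64) -> HashMap<&'static str, u64> {
    assert!(interval_us > 0, "the sampling interval must be positive");
    let mut histogram = HashMap::new();
    let mut span_start = 0u64;
    let mut next_sample = 0u64;
    for span in timeline {
        let span_end = span_start + span.duration_us;
        // Every sample point that falls inside this span is attributed to its function.
        while next_sample < span_end {
            *histogram.entry(span.function).or_insert(0) += 1;
            next_sample += interval_us;
        }
        span_start = span_end;
    }
    histogram
}

fn main() {
    // 9 ms in "parse" followed by 1 ms in "log", sampled every 100 microseconds.
    let timeline = [
        Span { function: "parse", duration_us: 9_000 },
        Span { function: "log", duration_us: 1_000 },
    ];
    let histogram = sample_timeline(&timeline, 100);
    let total: u64 = histogram.values().sum();
    for (function, count) in &histogram {
        println!("{function}: {:.1}%", 100.0 * *count as f64 / total as f64);
    }
}
```

The sampler never records when a function was entered or left. It only counts how often each function was caught on the CPU, and that count converges on the function's share of the run time.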

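perf is a sampling profiler. By default it programs a hardware performance counter to overflow at a fixed rate and takes a sample on each overflow interrupt. The overhead is small, typically a few percent at the default sampling rate. Your program runs at near-native speed. The results reflect real-world behavior.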

Think of it like a speed camera on a highway. You don't track every car's every move. You snap a photo of the same stretch of road at fixed intervals. If you see the same truck in 90% of the photos, that truck is spending most of its time on that stretch. If you see a sports car in 1% of the photos, it's zooming past. perf is the speed camera for your CPU.

Minimal setup: symbols and samples

perf reads function names from the binary's symbol table, but it needs debug info to map samples back to source lines and to recover inlined functions. The default release profile doesn't generate debug info (debug = false). You need to tell Cargo to generate it while still optimizing the code.

Add this to your Cargo.toml:

[profile.release]
# Keep debug symbols so perf can map samples to source lines.
# This increases binary size but is essential for profiling.
debug = true

Convention aside: This setup is often called a release build with debug info. Always profile release builds. Debug builds are much slower and have different hotspots, because the optimizer restructures the code. Profiling a debug build gives you data that doesn't apply to your shipped product.

Create a simple workload to test the workflow:

// src/main.rs

/// Simulate a hot loop with a hidden bottleneck.
fn main() {
    // Ten million items: enough samples to profile without needing many gigabytes of RAM.
    let data: Vec<u64> = (0..10_000_000).collect();
    
    // This loop looks simple, but the allocation inside is the killer.
    let mut results = Vec::new();
    for &item in &data {
        // Simulate work.
        let computed = item * 2 + 1;
        // Allocation happens here on every iteration.
        results.push(format!("Result: {}", computed));
    }
    
    println!("Processed {} items", results.len());
}

Build and record:

# Build the optimized binary with symbols.
cargo build --release

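# Record samples with call graphs. Plain -g unwinds with frame pointers, which
# optimized Rust code usually omits, so unwind using the debug info instead.
perf record --call-graph dwarf ./target/release/my_app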

# View the report in a terminal UI.
perf report

The perf record command runs your binary. The kernel programs a hardware performance counter to fire at a target rate, a few thousand samples per second by default (change it with -F). On each sample, the CPU pauses your app, saves the stack trace, and resumes. The result is a perf.data file in your current directory.

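perf report reads that file and launches a TUI. You see a list of functions sorted by their share of samples. When call graphs were recorded, the default view sorts by the Children column, which includes time spent in callees, so main and the runtime entry points sit at the top. Run perf report --no-children to sort by each function's own (self) time. The top entries in that view are your hottest functions.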

Trust the percentages. Optimize the top line, not the code you feel bad about.

Walking through the report

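In its default view, perf report shows the columns Overhead, Command, Shared Object, and Symbol. Overhead is the percentage of total samples attributed to that function. When call graphs were recorded, Overhead is split into Children (the function plus everything it calls) and Self (the function's own code). Pass -n to add a Samples column with the raw count. Symbol is the function name.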

Navigate with arrow keys. Press Enter on an entry to expand it or open its action menu. Press a to annotate the selected symbol. Press h to list every key binding. Press q to quit.

Convention aside: perf output includes standard library functions. You will see core::iter::..., alloc::..., and std::.... Don't ignore them. If an allocation routine, such as one of the RawVec growth functions under alloc::raw_vec, shows up high, your code is probably allocating or regrowing buffers more than it needs to. The standard library is just code. If it's hot, your usage pattern is the problem.
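As an illustration of changing the usage pattern, here is a minimal sketch that reserves a vector's capacity up front, so its backing buffer is allocated once rather than regrown as it fills. The build_results function is hypothetical, and the sketch assumes the workload can store the computed numbers instead of formatted strings.

```rust
/// Compute a result for every input item.
/// `Vec::with_capacity` reserves room for all results before the loop runs,
/// so `push` never has to grow the backing buffer.
fn build_results(data: &[u64]) -> Vec<u64> {
    let mut results = Vec::with_capacity(data.len());
    for &item in data {
        results.push(item * 2 + 1);
    }
    results
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let results = build_results(&data);
    println!(
        "Processed {} items (capacity {})",
        results.len(),
        results.capacity()
    );
}
```

Whether this helps in a given program is a question for the profiler: record again after the change and compare the share of samples in the allocation functions.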

Look at the call graph. Recording with -g or --call-graph saves the stack for each sample. perf report shows callers and callees. If a function such as process_batch calls compute_heavy, and compute_heavy is hot, the graph shows that relationship. You can see the path from main to the bottleneck.

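If a function shows near 0% but you know it's slow, remember that perf samples on-CPU time by default. A function that spends its time blocked on I/O, a lock, or a sleep accumulates wall-clock time without accumulating samples. A function that runs rarely can also be slow per call yet small in the overall profile. In those cases, you need a different tool. For CPU hotspots, sampling is usually enough.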

Realistic scenario: inlining and `perf annotate`

Compilers optimize aggressively. Rust's compiler inlines functions to eliminate call overhead. Inlining moves the function body into the caller. The function name disappears from the call stack. perf samples the instruction pointer. If the IP is inside main, but the code is actually compute_heavy, perf reports main.

This breaks the flat profile. You see main is hot, but main is just a few lines. The truth is hidden.

Use perf annotate to see the source and assembly. Run perf annotate after recording. Select the hot symbol. You see a mix of source lines and assembly instructions. The source lines show where the samples landed. If main is hot, annotate shows the inlined code inside main. You can see exactly which line is burning cycles.

// src/parser.rs

/// The kinds of token the parser produces.
#[derive(Debug)]
pub enum Token {
    Comment,
    Word(String),
}

/// Parse a line of input.
/// In an optimized build, this small function is likely inlined into its caller.
fn parse_line(input: &str) -> Option<Token> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return None;
    }

    // In this scenario, the profile points at this branch.
    if trimmed.starts_with('#') {
        return Some(Token::Comment);
    }

    // Allocates a new String for every word.
    Some(Token::Word(trimmed.to_string()))
}

/// Process a file line by line.
fn process_file(path: &str) -> Vec<Token> {
    let content = std::fs::read_to_string(path).expect("failed to read input file");
    content.lines().filter_map(parse_line).collect()
}

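When you profile this, perf report might show process_file as hot while parse_line is missing from the list, because it was inlined. perf annotate on process_file reveals the inlined parse_line code with per-line sample percentages. There you can see whether the starts_with check or another line, such as the to_string allocation, collects the samples. You can then optimize that specific line.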

If perf points at main, look closer. Inlining is hiding the truth.
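One way to make a suspected function show up under its own name is to ask the compiler not to inline it while you investigate. The sketch below applies Rust's #[inline(never)] attribute to a parse_line like the one above; the Token enum is an assumed definition, since the original snippet doesn't show it.

```rust
/// Assumed token type for this illustration.
#[derive(Debug, PartialEq)]
enum Token {
    Comment,
    Word(String),
}

/// `#[inline(never)]` asks the compiler to keep this function as a separate
/// symbol, so samples are attributed to `parse_line` instead of its caller.
#[inline(never)]
fn parse_line(input: &str) -> Option<Token> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return None;
    }
    if trimmed.starts_with('#') {
        return Some(Token::Comment);
    }
    Some(Token::Word(trimmed.to_string()))
}

fn main() {
    let input = "# a comment\nhello\n\nworld\n";
    let tokens: Vec<Token> = input.lines().filter_map(parse_line).collect();
    println!("{tokens:?}");
}
```

Preventing inlining changes the generated code and can make the function slower than it is in production. Treat the attribute as a temporary diagnostic and remove it once you have located the hotspot.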

Pitfalls and gotchas

Profiling introduces its own quirks. Watch for these common traps.

Debug builds lie. Profiling a plain cargo run, which produces an unoptimized debug build, gives misleading results. The optimizer hasn't run. The hotspots are different. The code structure is different. Always profile release builds with debug = true.

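Permissions. perf needs access to kernel performance counters. On many systems, an unprivileged perf record fails with a permission error because the kernel.perf_event_paranoid setting restricts access. Either run sudo perf record or lower the setting with sudo sysctl kernel.perf_event_paranoid=1. The value controls what unprivileged users may measure: 2 allows user-space measurements only, 1 also allows kernel measurements, and some distributions ship stricter defaults. A value of 2 or lower is enough for user-space profiling without root.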

Sampling buries rare events. If a function runs once per hour and takes 10 seconds of CPU time, it accounts for under 0.3% of an hour-long profile, and a shorter recording may not include it at all. Use tracing or manual timing for rare events. perf is for frequent hotspots.
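Manual timing for a rare event can be as small as a wrapper around std::time::Instant. The following is a minimal sketch: the timed helper and the rebuild_index job are hypothetical names standing in for your own infrequent slow path.

```rust
use std::time::{Duration, Instant};

/// Run `work` and return its result together with the wall-clock time it took.
fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}

/// Hypothetical rare job standing in for the infrequent slow path.
fn rebuild_index(items: u64) -> u64 {
    (0..items).map(|i| i % 7).sum()
}

fn main() {
    let (checksum, elapsed) = timed(|| rebuild_index(1_000_000));
    // Log to stderr so the measurement stays out of the program's normal output.
    eprintln!("rebuild_index took {elapsed:?} (checksum {checksum})");
}
```

Unlike a sampling profile, this measures wall-clock time, so it also captures time the job spends blocked rather than running on the CPU.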

Cache misses. CPU time isn't just instructions. The CPU also waits for memory. perf can sample on cache misses: run perf record -e cache-misses -g ./target/release/my_app. If the misses concentrate in one loop, that loop's data probably has poor locality. You're jumping around memory. Restructure your data to be cache-friendly. When a hot loop reads only one or two fields, a struct of arrays usually beats an array of structs. Prefer contiguous buffers over pointer-heavy structures.
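To make the layout question concrete, here is a sketch of the same data stored both ways. A loop that reads only the mass field walks a dense Vec<f32> in the struct-of-arrays layout, whereas in the array-of-structs layout it also pulls each particle's position through the cache. The Particle and Particles types and the function names are hypothetical.

```rust
/// Array-of-structs element: one particle's fields sit next to each other.
struct Particle {
    position: [f32; 3],
    mass: f32,
}

/// Struct-of-arrays layout: each field lives in its own contiguous buffer.
struct Particles {
    positions: Vec<[f32; 3]>,
    masses: Vec<f32>,
}

/// Convert the array-of-structs layout into the struct-of-arrays layout.
fn to_soa(particles: &[Particle]) -> Particles {
    Particles {
        positions: particles.iter().map(|p| p.position).collect(),
        masses: particles.iter().map(|p| p.mass).collect(),
    }
}

/// Strides over whole `Particle` values (16 bytes each) to read 4 bytes of mass.
fn total_mass_aos(particles: &[Particle]) -> f32 {
    particles.iter().map(|p| p.mass).sum()
}

/// Reads a dense buffer containing nothing but masses.
fn total_mass_soa(particles: &Particles) -> f32 {
    particles.masses.iter().sum()
}

fn main() {
    let aos: Vec<Particle> = (0..1_000)
        .map(|i| Particle { position: [i as f32, 0.0, 0.0], mass: 1.0 })
        .collect();
    let soa = to_soa(&aos);
    println!("first positions: {:?} {:?}", aos[0].position, soa.positions[0]);
    println!("AoS total mass: {}", total_mass_aos(&aos));
    println!("SoA total mass: {}", total_mass_soa(&soa));
}
```

Neither layout is better in general. If the hot loop uses every field of each element, the array-of-structs layout keeps them together. Measure with perf before and after restructuring.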

Convention aside: perf output can be noisy. The kernel and libc show up. Filter them out if they distract you. For example, perf report --dsos my_app limits the report to symbols from the named binary. Or use perf script to pipe output to other tools. The community often uses perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg to generate flame graphs. Both scripts come from Brendan Gregg's FlameGraph repository. This gives a visual stack chart. It's a popular convention for sharing profiles.

Decision: when to use `perf` vs alternatives

Profiling tools solve different problems. Pick the right one for your bottleneck.

Use perf when you need to find CPU bottlenecks in a release build. Use perf when you suspect inlining is hiding the true hotspot and need annotate to reveal source lines. Use perf when you want to measure cache misses or branch mispredictions with hardware counters.

Use cargo flamegraph when you want a visual stack flame chart instead of a terminal list. Flame graphs make it easy to see the proportion of time spent in different call paths. They are great for sharing with teammates who aren't comfortable with TUIs.

Use tracy or puffin when you need frame-by-frame timing in a game or UI loop. These tools integrate with your code. You mark regions manually. They show you exactly how long each frame took and which region caused a hitch. They are essential for real-time applications.

Use valgrind --tool=callgrind when you need exact instruction and call counts and don't mind the massive slowdown. Callgrind runs your program on a simulated CPU and counts every executed instruction. The counts are exact rather than statistical. It's slow, but it attributes cost to code that a sampling profile leaves in the noise. Use it when perf gives you incomplete data.

Use heaptrack or dhat when you need to profile memory allocations. perf shows CPU time. It doesn't show allocation sizes or lifetimes. If your bottleneck is memory pressure, use a heap profiler.

Pick the tool that matches the bottleneck. perf for CPU, valgrind for precision, tracy for frames, dhat for memory.

Where to go next

Profiling replaces guessing with measurement. You build an optimized binary with debug info, run it under perf record, and read the report to see where the CPU time goes. From there, perf annotate, flame graphs, and the other tools listed above cover the cases a flat report can't.