When the code feels slow but the benchmark lies
You've optimized your hot loop. You replaced a Vec with a pre-allocated buffer. You added #[inline] to a critical function. The benchmark still shows the same wall-clock time. You stare at the code, convinced the bottleneck is in the sorting algorithm, but the profiler points somewhere else entirely. This happens to everyone. Rust gives you control, but control means you have to look under the hood. perf and flamegraph are the tools that turn "I think this is slow" into "This function eats 40% of the CPU."
What perf and flamegraph actually do
perf is a Linux kernel tool. It samples the CPU. At a fixed rate, typically a few thousand times per second, it interrupts your program and asks, "What instruction are you executing right now?" It records the answer and the call stack. You get thousands of these samples. A flamegraph takes those samples and draws a bar chart. Wide bars mean the CPU spent a lot of time there. Tall stacks mean deep recursion or nested calls. The bottom of the graph is the entry point. The top is the leaf function doing the actual work. If a bar is wide, that function, or something it calls, is a hotspot.
Think of perf like a high-speed camera taking snapshots of a busy factory. You don't record a continuous video. You take a photo every millisecond. If you see the same worker in 80% of the photos, that worker is the bottleneck. The flamegraph is just a way to count those photos and display the results so you can spot the pattern instantly.
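To make the snapshot idea concrete, here is a toy sampler in Rust. It is only a sketch of the counting logic, not how perf is implemented: the Segment type, the sample function, and the parse/sort/write timeline are all made up for illustration.

```rust
use std::collections::BTreeMap;

/// One stretch of execution: which (hypothetical) function is on the CPU, and for how long.
struct Segment {
    function: &'static str,
    duration_us: u64,
}

/// Take a snapshot every `interval_us` microseconds and count how often
/// each function is the one running at that instant.
fn sample(timeline: &[Segment], interval_us: u64) -> BTreeMap<&'static str, u64> {
    assert!(interval_us > 0, "sampling interval must be positive");
    let total_us: u64 = timeline.iter().map(|s| s.duration_us).sum();
    let mut hits = BTreeMap::new();
    let mut now_us = 0;
    while now_us < total_us {
        // Walk the timeline to find the segment that covers `now_us`.
        let mut segment_end_us = 0;
        for segment in timeline {
            segment_end_us += segment.duration_us;
            if now_us < segment_end_us {
                *hits.entry(segment.function).or_insert(0) += 1;
                break;
            }
        }
        now_us += interval_us;
    }
    hits
}

fn main() {
    // Illustrative workload: 1 ms parsing, 8 ms sorting, 1 ms writing.
    let timeline = [
        Segment { function: "parse", duration_us: 1_000 },
        Segment { function: "sort", duration_us: 8_000 },
        Segment { function: "write", duration_us: 1_000 },
    ];
    for (name, count) in sample(&timeline, 250) {
        println!("{name}: {count} samples");
    }
}
```

With a 250 microsecond interval this takes 40 snapshots, and sort shows up in 32 of them. That is the worker who appears in 80% of the photos.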
Setting up the build
Cargo's release profile leaves out debug info by default to save space. perf can usually still recover function names from the symbol table, but it needs debug info to attribute samples to inlined functions and source lines, and to unwind stacks with DWARF. If the binary has been stripped as well, perf records the data, but the output is a wall of hex addresses like 0x7f3a2b1c. You can't read that.
Add debug = true to your [profile.release] in Cargo.toml. This tells rustc to keep the debug info while still optimizing the code. This is the standard convention for profiling builds. It increases the binary size, but it makes the data readable.
[profile.release]
# Keep debug info for perf. This makes the binary larger but enables readable stack traces.
debug = true
Build the application normally. The binary will contain both the optimized machine code and the debug info perf needs.
# Build with optimizations and debug symbols.
cargo build --release
Recording the data
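Run perf record to capture samples. The -g flag captures the call graph. Without -g, you get a list of functions, but you lose the context of who called whom. By default -g unwinds with frame pointers, which optimized Rust builds often omit. If the stacks come out truncated, record with --call-graph dwarf instead, or build with -C force-frame-pointers=yes. The -e instructions flag tells perf to sample on retired instructions. That shows where the program does the most work, but it underweights code that stalls on cache misses. To see where the time goes, use -e cycles or drop the flag and let perf pick its default event, which is CPU cycles on most hardware.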
# Record samples with call stacks.
# sudo is often required for kernel-level sampling.
# -g captures the stack trace. -e instructions samples by instruction count.
sudo perf record -g -e instructions -- ./target/release/your_binary
perf creates a file named perf.data in the current directory. This file contains the raw samples. It can be large. Do not commit perf.data to your repository. Add it to .gitignore.
Generating the graph
The perf.data file is binary. You need to convert it to a format the flamegraph script understands. The flamegraph repository provides Perl scripts for this. perf script reads perf.data and outputs a text stream. stackcollapse-perf.pl aggregates that stream.
perf script prints one multi-line record per sample: a header line followed by one line per stack frame. If the stack main->foo->bar appears 1000 times, you get 1000 such records. stackcollapse-perf.pl merges identical stacks and counts them. It turns those 1000 records into a single line: main;foo;bar 1000. This folded format is what flamegraph.pl expects as input, and it is far smaller than the raw stream.
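# Convert raw perf data to collapsed stack format.
# stackcollapse-perf.pl and flamegraph.pl come from the flamegraph repository and must be on your PATH.
# If perf.data was recorded with sudo, perf script may need sudo to read it.
perf script | stackcollapse-perf.pl > out.stack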
# Generate the SVG flamegraph.
flamegraph.pl out.stack > flamegraph.svg
Open flamegraph.svg in your browser. You'll see a colorful bar chart. The width of each bar is its share of the samples, which stands in for time. The height is the call stack depth.
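The folding step is easy to picture in code. The sketch below is not the Perl script. fold_stacks is a hypothetical helper that takes stacks already split into frames (root first) instead of parsing real perf script output.

```rust
use std::collections::BTreeMap;

/// Fold sampled stacks (root first, leaf last) into "a;b;c count" lines.
/// Identical stacks are merged and counted, as in the folded format.
fn fold_stacks(samples: &[Vec<&str>]) -> Vec<String> {
    let mut counts: BTreeMap<String, u64> = BTreeMap::new();
    for stack in samples {
        // The joined frame names are the key; every repeat bumps the counter.
        *counts.entry(stack.join(";")).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(stack, count)| format!("{stack} {count}"))
        .collect()
}

fn main() {
    // Illustrative input: 1000 samples in bar, 250 samples in foo itself.
    let mut samples = Vec::new();
    for _ in 0..1000 {
        samples.push(vec!["main", "foo", "bar"]);
    }
    for _ in 0..250 {
        samples.push(vec!["main", "foo"]);
    }
    for line in fold_stacks(&samples) {
        println!("{line}");
    }
}
```

The 1250 input stacks collapse to two lines, main;foo 250 and main;foo;bar 1000. A BTreeMap keeps the output sorted, which makes runs easy to diff.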
Reading the flamegraph
A flamegraph bar's width is the time spent in that function plus all functions it calls. The colors carry no meaning by default. They are picked from a warm palette just to tell neighboring frames apart. Time spent in the function itself, excluding its children, is called "self time." It shows up as the part of a bar's top edge that has no child bar sitting on it.
If a bar is wide but its children cover almost all of it, the function is just a wrapper. The bottleneck is deeper in the stack. If a wide bar has little or nothing on top of it, that function is doing the work. Look for wide bars at the top of the stacks. Those are the leaf functions burning your CPU.
Follow the width. The widest exposed bar is your enemy. If HashMap::insert is wide with little above it, the hash map is the bottleneck. If main is wide but fully covered by its children, look at the children. The work is happening in the functions main calls.
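Total time and self time can also be computed straight from folded stacks, which helps when the SVG is too crowded to eyeball. This is a minimal sketch: the Cost type, the costs function, and the sample counts are hypothetical, and the input is assumed to be already parsed into (stack, count) pairs.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Sample counts for one function.
#[derive(Debug, Default, PartialEq)]
struct Cost {
    /// Samples where the function is anywhere on the stack.
    total: u64,
    /// Samples where the function is the leaf frame.
    self_only: u64,
}

/// Compute total and self sample counts from folded stacks.
fn costs(folded: &[(&str, u64)]) -> BTreeMap<String, Cost> {
    let mut out: BTreeMap<String, Cost> = BTreeMap::new();
    for &(stack, count) in folded {
        let frames: Vec<&str> = stack.split(';').collect();
        // Count each function once per stack so recursion is not double counted.
        let unique: BTreeSet<&str> = frames.iter().copied().collect();
        for name in unique {
            out.entry(name.to_string()).or_default().total += count;
        }
        // Only the leaf frame was executing when the sample was taken.
        if let Some(leaf) = frames.last() {
            out.entry(leaf.to_string()).or_default().self_only += count;
        }
    }
    out
}

fn main() {
    // Made-up counts for illustration.
    let folded: [(&str, u64); 3] = [
        ("main;process_data;HashMap::insert", 700),
        ("main;process_data", 100),
        ("main", 200),
    ];
    for (name, cost) in costs(&folded) {
        println!("{name}: total={} self={}", cost.total, cost.self_only);
    }
}
```

Here main has a total of 1000 samples but only 200 of its own, so it is mostly a wrapper. HashMap::insert has 700 and 700, so it is a leaf doing real work.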
Realistic example: The hidden allocation
Consider a function that processes data. It looks simple. It iterates over a slice and inserts values into a hash map.
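use std::collections::HashMap;
/// Process data with a hidden allocation cost.
fn process_data(data: &[u32]) -> HashMap<u32, u32> {
    let mut map = HashMap::new();
    for &val in data {
        // The map starts empty, so it reallocates and rehashes whenever an insert outgrows its capacity.
        map.insert(val, val * 2);
    }
    map
}
fn main() {
    // Distinct keys, so the map actually grows (identical values would leave it with one entry),
    // and enough of them for perf to collect a useful number of samples.
    let data: Vec<u32> = (0..10_000_000).collect();
    let _ = process_data(&data);
}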
You might assume the loop is the bottleneck. The flamegraph tells a different story. Run the profiling pipeline. Open the SVG. You'll see a wide bar for HashMap::insert. Digging deeper, you'll see the table's growth path, with frames such as hashbrown::raw::RawTable::reserve_rehash (exact names vary by Rust version), taking a significant portion of the time.
The loop itself is cheap. Growing the table is doing the heavy lifting. Every time the hash map outgrows its capacity, it allocates a larger table, rehashes every entry into it, and frees the old one. The fix is to pre-allocate with HashMap::with_capacity or use a different structure. Fix the capacity, not the loop.
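The pre-allocation fix might look like the sketch below. process_data_preallocated is a hypothetical name, and sizing the table from data.len() assumes the keys are mostly distinct. With many duplicates it would reserve more memory than the map ends up needing.

```rust
use std::collections::HashMap;

/// Same work as `process_data`, but the table is sized once up front.
fn process_data_preallocated(data: &[u32]) -> HashMap<u32, u32> {
    // Reserve room for one entry per input value, so inserting them
    // never has to grow the table.
    let mut map = HashMap::with_capacity(data.len());
    for &val in data {
        map.insert(val, val * 2);
    }
    map
}

fn main() {
    // Distinct keys, so every insert adds a new entry.
    let data: Vec<u32> = (0..100_000).collect();
    let map = process_data_preallocated(&data);
    println!("{} entries", map.len());
}
```

Profile again after the change. If the growth frames are gone and the bar for HashMap::insert is narrower, the fix worked. If hashing now dominates, that is the next thing to look at.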
Pitfalls and gotchas
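LTO (Link Time Optimization) can thin out stack traces. LTO optimizes across crates and can inline functions that would otherwise show up as their own frames. If your flamegraph is missing functions you expected to see, LTO might be the culprit. Cargo's release profile does not enable cross-crate LTO by default, so this only matters if your project turns it on. In that case, disable LTO for profiling builds. Add lto = false to your profile.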
[profile.release]
debug = true
# Disable LTO to preserve symbol names for perf.
lto = false
perf access is restricted on some systems. The kernel parameter kernel.perf_event_paranoid controls what unprivileged users can do with perf. If you get "Permission denied" errors when running perf without sudo, check this setting. Running sudo sysctl -w kernel.perf_event_paranoid=1 usually fixes it until the next reboot. This is a system configuration issue, not a Rust issue.
Sampling rate matters. The default frequency is often 4000 Hz. This means 4000 samples per second. For very fast functions, this might be too low. You might miss short-lived hotspots. Increase the frequency with --freq 10000 or higher. Higher frequency gives more detail but increases overhead. For most applications, the default is sufficient.
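The arithmetic behind the sampling rate is worth a quick sketch. The helpers below are hypothetical and assume a single-threaded, CPU-bound run, so the result is a rough upper bound, not a count perf guarantees.

```rust
use std::time::Duration;

/// Time between two samples at the given frequency.
fn sample_interval(freq_hz: u32) -> Duration {
    Duration::from_secs(1) / freq_hz
}

/// Rough number of samples collected over `runtime` at `freq_hz`.
fn expected_samples(freq_hz: u32, runtime: Duration) -> u64 {
    (runtime.as_micros() * u128::from(freq_hz) / 1_000_000) as u64
}

fn main() {
    let runtime = Duration::from_millis(50);
    for freq_hz in [4_000, 10_000] {
        println!(
            "{freq_hz} Hz: one sample every {:?}, about {} samples in {:?}",
            sample_interval(freq_hz),
            expected_samples(freq_hz, runtime),
            runtime
        );
    }
}
```

At 4000 Hz a 50 ms run yields about 200 samples, and a function that takes 1% of the time gets about 2 of them. That is too few to trust, so either raise the frequency or make the workload run longer.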
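No symbols means no names. If you see a graph full of [unknown] frames or hex addresses, perf could not resolve the symbols. Check that the binary is not stripped and that debug = true is set in the profile you built. If functions you expected are missing entirely, they were probably inlined. Turn LTO off for profiling.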
Decision matrix
Use perf when you need low-overhead sampling on Linux and want the raw data for custom analysis. Use cargo flamegraph when you want a one-command setup that handles the perf to SVG pipeline automatically. Use tracy or puffin when you are building a game or real-time application and need frame-by-frame visualization with custom markers. Use valgrind callgrind when you need exact instruction counts and cache simulation, and can tolerate a massive slowdown.
A common shortcut is the cargo-flamegraph crate. It wraps perf record and does the stack collapsing and SVG rendering itself, so you don't need the Perl scripts. Install it with cargo install flamegraph. Then run cargo flamegraph in your project. It builds the release binary, runs perf, and writes flamegraph.svg to the current directory. Pass --open to open the SVG when it finishes, or --root if perf needs sudo. It does not add debug info on its own, so keep debug = true in [profile.release]. It's the fastest path from code to insight.
Match the tool to the timeline. Sampling for speed, tracing for detail.