How to profile Rust code

Profiling tells you where time actually goes, not where you think it goes. Learn the difference between sampling and instrumentation, how to use cargo flamegraph and samply, and how to read the picture you get back.

Where the time actually goes

You compile a Rust program with --release. You expect it to run at the speed of light. Instead, it crawls. You blame the standard library. You blame the allocator. The truth is usually simpler. A tight loop is allocating memory on every iteration. A recursive function is recalculating the same value. A network handler is blocking on a synchronous read. Rust gives you zero-cost abstractions, but it does not magically rewrite bad algorithms. You have to find the bottleneck first.

Profiling is how you find it. The most common approach is sampling. A sampling profiler runs your program and interrupts it hundreds or thousands of times per second. Each time it interrupts, it records the current call stack. After the program finishes, it counts how many samples each function appeared in. The function that shows up most often is consuming the most CPU time. Think of it like taking a photograph of a busy factory floor every second. If the assembly line appears in ninety percent of the photos, that is where the work is happening. Instrumentation profilers exist too. They wrap every function call to record exact entry and exit times. They are more precise, but the added overhead can slow your program enough to distort the results. For finding performance bottlenecks, sampling is usually the right tool.
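The counting step can be sketched in a few lines. This is a toy model of what a sampling profiler does after recording, not how any real tool is implemented; the Sample type, the inclusive_counts function, and the stack contents are all made up for illustration.

```rust
use std::collections::HashMap;

/// One sample: the call stack at the moment of the interrupt, outermost caller first.
type Sample = Vec<&'static str>;

/// Counts, for each function, how many samples contain it anywhere on the stack.
fn inclusive_counts(samples: &[Sample]) -> HashMap<&'static str, usize> {
    let mut counts = HashMap::new();
    for stack in samples {
        // Deduplicate so a recursive function is counted once per sample.
        let mut seen: Vec<&'static str> = Vec::new();
        for &frame in stack {
            if !seen.contains(&frame) {
                seen.push(frame);
                *counts.entry(frame).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    // Hypothetical samples, as if the profiler had interrupted four times.
    let samples: Vec<Sample> = vec![
        vec!["main", "process", "to_string"],
        vec!["main", "process", "to_string"],
        vec!["main", "process", "is_relevant"],
        vec!["main", "load"],
    ];
    let total = samples.len();
    let mut rows: Vec<_> = inclusive_counts(&samples).into_iter().collect();
    // Most frequently sampled functions first, ties broken by name.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    for (name, n) in rows {
        println!("{name:<12} {n}/{total} samples");
    }
}
```

Real profilers record addresses rather than names and resolve them afterwards, which is why the build settings in the next section matter.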

Trust the sample count. The widest bar on the graph is your problem.

Build for profiling, not just release

Before you run any profiler, your binary needs the right shape. Profiling a debug build is useless. Debug builds skip optimizations, keep integer overflow checks and debug assertions enabled, and generate completely different assembly. You will measure overhead that vanishes in production. You need release optimizations combined with debug symbols. Debug symbols map raw memory addresses back to function names. Without them, your profiler shows you hexadecimal gibberish instead of calculate_trajectory.

Add a custom profile to your Cargo.toml. This keeps your default release builds lean while giving you a dedicated profiling target.

# Cargo.toml
[profile.profiling]
# Inherits the release profile's settings, including opt-level = 3.
inherits = "release"
# Keeps DWARF debug info so profilers can resolve function names.
debug = true
# Keeps cross-crate link-time optimization off so fewer helpers are inlined away.
lto = false

Build with cargo build --profile profiling. Cargo writes the binary to target/profiling/. The lto = false line matters mainly if your release profile turns LTO on, since false is already the release default. Link-time optimization and aggressive inlining collapse small helper functions into their callers, so the flamegraph shows one massive function instead of a useful call tree. You can re-enable it later once you know where the hot spot lives. A custom profile with debug = true is also preferable to patching the default release profile. It keeps production binaries small while giving you full visibility during development.

Never profile a debug build. The numbers lie.
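One way to catch the mistake early is to have the program itself complain. The sketch below uses cfg!(debug_assertions), which Cargo enables by default in the dev profile and disables in release and in profiles that inherit from it. It is only a proxy: it does not inspect the optimization level, and the function name built_with_debug_assertions is invented for this example.

```rust
/// Reports whether this binary was compiled with debug assertions enabled.
/// This is a proxy for "debug build"; it does not check opt-level directly.
fn built_with_debug_assertions() -> bool {
    cfg!(debug_assertions)
}

fn main() {
    if built_with_debug_assertions() {
        eprintln!("warning: debug assertions are on; profile numbers will not reflect a release build");
    }
    // The workload you want to profile would run here.
}
```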

cargo flamegraph: the classic

cargo flamegraph is the standard tool on Linux and macOS. It wraps the operating system's native profiler (perf on Linux, DTrace on macOS) and outputs an interactive SVG. Install it globally with cargo install flamegraph. On Linux, you also need perf installed, and you may need to relax kernel restrictions so your user account can read CPU counters.

# Allows non-root users to access performance counters for profiling.
sudo sysctl -w kernel.perf_event_paranoid=-1

Run it against your profiling build. Pass command-line arguments after a double dash.

# Profiles the release binary with debug symbols.
cargo flamegraph --bin myapp --profile profiling
# Passes runtime arguments to the application after the separator.
cargo flamegraph --bin myapp --profile profiling -- --threads 4 --load heavy

When the program exits, you get a flamegraph.svg file. Open it in a browser. The graph looks like a stack of horizontal blocks. Each block represents a function. The width of the block shows how much CPU time that function consumed. Blocks stacked vertically show the call hierarchy. The top block is the function doing the actual work. The blocks below it are the callers.

Read the graph from the top down. Look for wide blocks near the top that have nothing stacked above them. Those functions are spending time in their own code rather than in callees. Look for wide blocks in places you did not expect. If a serialization library takes forty percent of your runtime, your program is probably doing too much work before it even reaches the math. Ignore tall, narrow stacks. They represent deep call chains that finish quickly. Ignore the very bottom layer. Runtime startup and main are always wide but never the bottleneck. Hover over any block to see the exact percentage. Click a block to zoom into its subtree. Use the Search control in the corner of the SVG, or Ctrl+F, to highlight a specific function name.
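The difference between a wide block at the bottom and a wide block with nothing on top can be made concrete. Flamegraph tools work from "folded" stacks, one line per unique stack in the form root;caller;leaf COUNT. The sketch below computes two numbers per function from such lines: total (samples where it appears anywhere) and own (samples where it is the leaf). The Weight struct, the weights function, and the sample data are hypothetical.

```rust
use std::collections::HashMap;

/// Sample counts for one function: `total` includes callees, `own` does not.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Weight {
    total: u64,
    own: u64,
}

/// Parses folded-stack lines of the form `root;caller;leaf COUNT`.
/// Lines that do not match are skipped.
fn weights(folded: &str) -> HashMap<String, Weight> {
    let mut out: HashMap<String, Weight> = HashMap::new();
    for line in folded.lines() {
        let Some((stack, count)) = line.trim().rsplit_once(' ') else { continue };
        let Ok(count) = count.parse::<u64>() else { continue };
        let frames: Vec<&str> = stack.split(';').collect();
        for (i, frame) in frames.iter().enumerate() {
            // Count a recursive frame only once per stack for `total`.
            if !frames[..i].contains(frame) {
                out.entry(frame.to_string()).or_default().total += count;
            }
        }
        // The last frame is the leaf: the block with nothing stacked on top.
        if let Some(leaf) = frames.last() {
            out.entry(leaf.to_string()).or_default().own += count;
        }
    }
    out
}

fn main() {
    // Hypothetical folded stacks; real ones come from the profiler's output.
    let folded = "main;process;to_string 30\nmain;process;is_relevant 50\nmain;load 20\n";
    let w = weights(folded);
    let all = w["main"].total as f64;
    let mut rows: Vec<_> = w.iter().collect();
    // Largest own weight first: these are the candidates worth zooming into.
    rows.sort_by(|a, b| b.1.own.cmp(&a.1.own).then(a.0.cmp(b.0)));
    for (name, wt) in rows {
        let own_pct = 100.0 * wt.own as f64 / all;
        let total_pct = 100.0 * wt.total as f64 / all;
        println!("{name:<12} own {own_pct:>5.1}%  total {total_pct:>5.1}%");
    }
}
```

In this data main has the largest total but no own samples, which is why the bottom layer is safe to ignore.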

Click the wide blocks. Zoom until you find the function you can actually change.

samply: cross-platform, easier on macOS and Windows

cargo flamegraph relies on system tools that behave differently across operating systems. On macOS it drives DTrace, which needs root and is limited by System Integrity Protection. Windows ships with neither perf nor DTrace. samply sidesteps this by handling the platform-specific sampling itself and sending the data to the Firefox Profiler UI. Install it with cargo install samply.

Build your profiling target first, then point samply at the binary.

# Compiles the binary with debug symbols and release optimizations.
cargo build --profile profiling
# Records samples and opens the Firefox Profiler interface automatically.
samply record ./target/profiling/myapp

The Firefox Profiler interface is more interactive than a static SVG. You can filter by thread, toggle between aggregated flamegraphs and time-ordered flamecharts, and inspect individual samples. samply also supports attaching to a running process. This is useful for debugging a long-running server that slows down under load.

# Attaches to a running process by PID and starts sampling immediately.
samply record -p 12345

Use samply when you need a consistent experience across Linux, macOS, and Windows.

A worked example: finding an allocation in a hot loop

You suspect a data processing function is slower than it should be. The code converts user IDs to strings inside a tight loop.

fn process(records: &[Record]) -> Vec<String> {
    let mut out = Vec::new();
    for r in records {
        // Allocates a new String on the heap for every single iteration.
        let key = r.user_id.to_string();
        if is_relevant(&key) {
            out.push(format!("{}: {}", key, r.payload));
        }
    }
    out
}

Run the profiler against a workload that exercises this function. The flamegraph shows process as a wide base. Stacked on top, <u64 as ToString>::to_string consumes thirty percent of the total time (assuming user_id is a u64). Above that, you see the formatting internals and __libc_malloc, the allocator entry point on glibc. The profiler is telling you exactly what is happening. The loop is requesting fresh heap memory on every iteration. The allocator is working hard to satisfy those requests.

The fix reuses a single buffer instead of allocating repeatedly.

use std::fmt::Write;

fn process(records: &[Record]) -> Vec<String> {
    // Pre-allocates the output vector to avoid reallocations during push.
    let mut out = Vec::with_capacity(records.len());
    // Reuses this buffer across all loop iterations.
    let mut key_buf = String::new();
    for r in records {
        key_buf.clear();
        // Formats into the buffer's existing allocation once it has grown.
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = write!(key_buf, "{}", r.user_id);
        if is_relevant(&key_buf) {
            out.push(format!("{}: {}", key_buf, r.payload));
        }
    }
    out
}

Run the profiler again. The to_string block disappears from the hot path, and malloc shrinks to the allocations for strings you actually keep. The CPU time shifts to the actual filtering logic. If the flamegraph looks identical, your hypothesis was wrong. Change one thing, measure again, and repeat.
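Before re-profiling, it is worth confirming that the rewrite still produces the same output. The sketch below puts both versions side by side under different names and compares them. The Record fields and the is_relevant rule are assumptions made so the example compiles; the article does not show the real definitions.

```rust
use std::fmt::Write;

// Hypothetical record type; the real definition is not shown in the article.
struct Record {
    user_id: u64,
    payload: String,
}

// Hypothetical filter standing in for the article's `is_relevant`.
fn is_relevant(key: &str) -> bool {
    key.ends_with('0')
}

/// The original version: a fresh String for every record.
fn process_allocating(records: &[Record]) -> Vec<String> {
    let mut out = Vec::new();
    for r in records {
        let key = r.user_id.to_string();
        if is_relevant(&key) {
            out.push(format!("{}: {}", key, r.payload));
        }
    }
    out
}

/// The fixed version: one key buffer reused across iterations.
fn process_reusing(records: &[Record]) -> Vec<String> {
    let mut out = Vec::with_capacity(records.len());
    let mut key_buf = String::new();
    for r in records {
        key_buf.clear();
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = write!(key_buf, "{}", r.user_id);
        if is_relevant(&key_buf) {
            out.push(format!("{}: {}", key_buf, r.payload));
        }
    }
    out
}

fn main() {
    let records: Vec<Record> = (0..1_000u64)
        .map(|i| Record { user_id: i, payload: format!("p{i}") })
        .collect();
    // Both versions must agree before the performance comparison means anything.
    assert_eq!(process_allocating(&records), process_reusing(&records));
    println!("both versions kept {} records", process_reusing(&records).len());
}
```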

Profile, fix, re-profile. The graph does not lie.

Common pitfalls

Profiling reveals truth, but only if you set it up correctly. Running a debug build gives you completely different assembly. The optimizer changes loop unrolling, inlining decisions, and register allocation. Always profile with release optimizations enabled. Missing function names in your output usually mean debug info was never generated or was stripped. Add debug = true to your profile and rebuild. A single massive block with no children means inlining collapsed your call tree. Lower the optimization level, disable LTO temporarily, or mark the function you care about with #[inline(never)] while you investigate.

Short workloads produce noisy data. Sampling profilers need thousands of samples to build a statistical picture. If your program finishes in fifty milliseconds, you will get a handful of samples that mean nothing. Wrap your hot function in a loop or generate a larger dataset until the run lasts at least a few seconds. CPU profilers only track time spent executing instructions. They do not measure time spent waiting for disk I/O or network responses. If your program is I/O-bound, a CPU profiler will show almost nothing. Use a tracing library or wall-clock measurements for I/O bottlenecks. Background processes on a busy machine compete for the CPU and add noise to your measurements. Run on an idle system or average multiple runs to smooth it out. Modern CPUs also scale their frequency dynamically. On Linux, set your CPU governor to performance mode before profiling so frequency changes do not skew your timings between runs.
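Stretching a short workload can be as simple as a loop with a time budget. The sketch below repeats a closure until a minimum wall-clock duration has passed and uses std::hint::black_box so the optimizer is less likely to discard the work. The run_for helper and hot_function are hypothetical stand-ins, and the three-second budget is just an example value.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Calls `work` repeatedly until at least `min_run` has elapsed.
/// Returns how many calls were made.
fn run_for<T>(min_run: Duration, mut work: impl FnMut() -> T) -> u64 {
    let start = Instant::now();
    let mut iterations = 0;
    loop {
        // black_box discourages the optimizer from throwing the result away.
        black_box(work());
        iterations += 1;
        if start.elapsed() >= min_run {
            return iterations;
        }
    }
}

// Hypothetical stand-in for the function you want to see in the profile.
fn hot_function(input: &[u64]) -> u64 {
    input.iter().map(|x| x.wrapping_mul(31)).fold(0, u64::wrapping_add)
}

fn main() {
    let input: Vec<u64> = (0..10_000).collect();
    // Run long enough for the profiler to collect a few thousand samples.
    let n = run_for(Duration::from_secs(3), || hot_function(black_box(&input)));
    eprintln!("ran hot_function {n} times");
}
```

The loop itself shows up in the profile as a thin frame around hot_function, which is easy to ignore when reading the graph.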

If the graph looks wrong, check your build flags before blaming the profiler.

When to reach for profiling vs benchmarking

Use criterion when you need to measure whether a specific code change improved performance. It runs your function many times under controlled conditions and reports statistical confidence intervals. Use cargo flamegraph when you need to understand where CPU time is going inside a slow program. The aggregated stack view highlights hot paths and unexpected library calls. Use samply when you work across multiple operating systems or prefer an interactive browser interface over static SVGs. Use tokio-console when debugging async applications where tasks stall, for example on channel backpressure or lock contention. Use valgrind --tool=cachegrind when you suspect memory access patterns are causing cache thrashing. It simulates a CPU cache hierarchy and reports miss counts deterministically, though the simulated cache is a model and not your exact hardware.

Pick the tool that matches the question. Measuring speed needs a benchmark. Finding bottlenecks needs a profiler.

Where to go next