How to use profile-guided optimization (PGO)

Enable Profile Guided Optimization in Rust by building with instrumentation, running your app to collect data, and rebuilding with the profile for faster execution.

When release mode isn't enough

You have optimized your Rust code. You switched to --release. You inlined the hot functions. You unrolled the loops. The profiler still points to the same function taking 40% of the time. You feel stuck. The compiler is doing its best, but it is guessing. It does not know which branch your users actually take. It does not know that 99% of the time, the input is valid. It does not know that the error path is cold dust.

The compiler lays out your code based on static analysis. It estimates branch probabilities with generic heuristics unless you tell it otherwise. It assumes every function might be called. It plays it safe. That safety costs performance. Profile Guided Optimization (PGO) replaces those guesses with hard data from your actual workload.

The compiler is smart, but it is blind to your users. PGO gives it eyes.

PGO in plain words

PGO stands for Profile Guided Optimization. It is a feedback loop between the compiler and your running program. The process has three phases. First, you build your binary with instrumentation. The compiler injects tiny counters into your code to track how often branches are taken and how often functions are called. Second, you run the instrumented binary with representative data. The binary generates a profile file containing the counts. Third, you rebuild the binary using that profile. The compiler reads the counts and rearranges your code to match reality.

Think of a chef preparing a kitchen. Without PGO, the chef sets up the station for every possible dish because they do not know what customers will order. The knife block is in the middle. The spices are on a high shelf. The prep bowls are scattered. The chef has to walk around to handle any order.

With PGO, the chef looks at the order tickets from last night. The data shows 80% of orders are steaks, 15% are salads, and 5% are desserts. The chef rearranges the kitchen. The steak knife and searing pan move to the primary station. The salad ingredients move to a secondary zone. The dessert setup goes to the back. The chef now moves efficiently because the layout matches the actual traffic pattern.

PGO does the same for your binary. It moves frequently executed code close together so the CPU cache stays happy. It reorders branches so the most likely path falls through without a jump. It separates rarely executed code, like error handling, into a different section so it does not pollute the cache.

The compiler stops guessing. It starts knowing.
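
To make the first two phases concrete, here is a hand-written sketch of the kind of bookkeeping instrumentation does. The BranchCounters struct and count_high_bytes function are hypothetical names invented for this illustration. Real instrumentation is inserted by the compiler and writes to a profile file; you never write these counters yourself.

```rust
/// Hypothetical stand-in for the counters an instrumented build maintains.
#[derive(Debug, Default)]
struct BranchCounters {
    taken: u64,
    not_taken: u64,
}

impl BranchCounters {
    /// Fraction of executions in which the branch was taken.
    fn taken_ratio(&self) -> f64 {
        let total = self.taken + self.not_taken;
        if total == 0 {
            0.0
        } else {
            self.taken as f64 / total as f64
        }
    }
}

/// Count bytes above 128 while recording which way the branch went.
fn count_high_bytes(input: &[u8], counters: &mut BranchCounters) -> usize {
    let mut count = 0;
    for &byte in input {
        if byte > 128 {
            counters.taken += 1;
            count += 1;
        } else {
            counters.not_taken += 1;
        }
    }
    count
}

fn main() {
    let mut counters = BranchCounters::default();
    let input = [0u8, 10, 200, 30, 40, 50, 60, 70, 80, 90];
    let hits = count_high_bytes(&input, &mut counters);
    println!("hits = {hits}, taken ratio = {:.2}", counters.taken_ratio());
}
```

The ratio is the piece of information the optimizer lacks in a plain release build.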

What changes under the hood

PGO affects three main areas of code generation. Understanding these helps you see why PGO matters.

Branch reordering is the most common win. CPUs have branch predictors that guess which way an if statement goes. If the predictor guesses wrong, the CPU flushes its pipeline and loses cycles. PGO tells the compiler which branch is likely. The compiler rearranges the code so the likely branch is the fall-through path. The CPU predictor sees a consistent pattern and guesses right almost every time.

Hot and cold splitting separates code by frequency. The compiler identifies "hot" code that runs often and "cold" code that runs rarely. It places hot code in a contiguous block. It moves cold code to a separate section. When the cold code is called, the CPU jumps to that section. This keeps the hot path compact. The CPU instruction cache holds more of the hot code, reducing cache misses. Error handling paths are classic cold code. PGO moves them out of the way.
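
For comparison, this is roughly what a manual hot and cold split looks like without PGO. The digit_value and reject functions are made up for this sketch. The #[cold] attribute tells the compiler a function is unlikely to be called, which is the kind of hint PGO derives from measured counts instead of from your judgment.

```rust
/// Error path, marked cold by hand.
/// `#[cold]` hints that calls are unlikely; `#[inline(never)]` keeps the
/// body out of the caller so the hot path stays compact.
#[cold]
#[inline(never)]
fn reject(byte: u8) -> Result<u8, String> {
    Err(format!("byte {byte:#04x} is not an ASCII digit"))
}

/// Hot path: convert an ASCII digit to its numeric value.
fn digit_value(byte: u8) -> Result<u8, String> {
    if byte.is_ascii_digit() {
        Ok(byte - b'0')
    } else {
        reject(byte)
    }
}

fn main() {
    println!("{:?}", digit_value(b'7'));
    println!("{:?}", digit_value(b'x'));
}
```

Hand annotations like this only work if your guess about what is cold is right. A profile replaces the guess with a measurement.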

Inlining decisions change based on call counts. The compiler might decide not to inline a function at a call site the profile shows is rarely reached, saving code size. Conversely, it might inline a function that looks too large by static heuristics when the profile shows the call site is hot, which can enable further optimizations in the caller.
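
Without a profile, the only way to influence these decisions is with attributes. The sketch below uses the standard #[inline(always)] and #[inline(never)] hints on hypothetical helpers (is_high, describe, count_high) to show the manual version of what PGO decides from call counts.

```rust
/// Small predicate on the hot path.
/// `#[inline(always)]` is a strong request to inline it at every call site.
#[inline(always)]
fn is_high(byte: u8) -> bool {
    byte > 128
}

/// Rarely used reporting helper.
/// `#[inline(never)]` keeps its body out of callers.
#[inline(never)]
fn describe(count: usize, total: usize) -> String {
    format!("{count} of {total} bytes were high")
}

/// Count the bytes for which `is_high` returns true.
fn count_high(input: &[u8]) -> usize {
    input.iter().filter(|&&b| is_high(b)).count()
}

fn main() {
    let data = [0u8, 129, 255, 128];
    println!("{}", describe(count_high(&data), data.len()));
}
```

These attributes are static and apply everywhere. PGO can make the same choice per call site, based on how often each one actually runs.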

PGO rearranges the binary to match reality, not theory.

Minimal example

Here is a minimal setup to enable PGO. You need a custom profile in Cargo.toml and a build workflow that captures and uses the data.

# Cargo.toml

[package]
name = "pgo-demo"
version = "0.1.0"
edition = "2021"

# Define a profile for PGO.
# Inherits release settings but adds PGO configuration.
[profile.perf]
inherits = "release"
# PGO often benefits from a single codegen unit to maximize optimization scope.
codegen-units = 1
# Cargo profiles have no PGO key of their own.
# The instrumentation and profile-use flags are passed to rustc
# through RUSTFLAGS at build time.

// src/main.rs

/// A function with a branch the compiler cannot predict statically.
/// Without a profile, the compiler has to guess how often it is taken.
fn process_data(input: &[u8]) -> usize {
    let mut count = 0;
    for byte in input {
        // Without PGO, the compiler estimates this branch with a heuristic.
        // With PGO, the compiler learns the actual distribution.
        if *byte > 128 {
            count += 1;
        }
    }
    count
}

fn main() {
    // Generate some data.
    let data = vec![0u8; 1024];
    
    // Process the data.
    // The instrumented build will count how often the branch is taken.
    let result = process_data(&data);
    
    println!("Result: {}", result);
}

The codegen-units = 1 setting is a convention in performance profiles. It forces the compiler to process the entire crate in one pass. This allows more aggressive inlining and optimization. PGO benefits from this because the profile data can guide decisions across the whole crate. The trade-off is slower build times.
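
One caveat about the main function above: it feeds process_data a buffer of zeros, so the branch is never taken and the profile would say so. A profiling run needs input shaped like real traffic. The sketch below generates a skewed buffer with a small hand-rolled generator so it needs no external crates. The Lcg type, the skewed_input function, and the 10% figure are all assumptions for illustration, not measurements.

```rust
/// Tiny linear congruential generator, used only to keep the sketch
/// free of external crates. Not suitable for anything security related.
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The high bits of an LCG are better distributed than the low bits.
        (self.0 >> 32) as u32
    }
}

/// Build `len` bytes where roughly `high_percent` percent exceed 128.
fn skewed_input(len: usize, high_percent: u32, seed: u64) -> Vec<u8> {
    let mut rng = Lcg(seed);
    (0..len)
        .map(|_| if rng.next_u32() % 100 < high_percent { 200 } else { 50 })
        .collect()
}

fn main() {
    // Assumed mix: about 10% of bytes take the branch.
    let data = skewed_input(1_000_000, 10, 42);
    let high = data.iter().filter(|&&b| b > 128).count();
    println!("{high} of {} bytes exceed 128", data.len());
}
```

In a real project, replaying recorded production input is a better choice than synthetic data when you have it.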

Run the binary. The profile is the gold.

Walk through the workflow

PGO requires a specific sequence of commands. You cannot just add a flag and build. You must generate the profile first.

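First, make sure llvm-profdata is available. The PGO flags themselves (-Cprofile-generate and -Cprofile-use) work on stable rustc, so you do not need nightly. The llvm-tools-preview rustup component ships an llvm-profdata that matches your compiler's LLVM version. rustup installs it inside the toolchain directory rather than on your PATH, so you may need to call it by its full path.

rustup component add llvm-tools-preview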

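Second, build the binary with instrumentation. The -Cprofile-generate flag creates a version of your binary that records branch and call counts. The directory after the equals sign is where the profiles will be written; target/perf is just the choice used in this walkthrough.

RUSTFLAGS="-Cprofile-generate=$(pwd)/target/perf" cargo build --profile=perf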

Third, run the instrumented binary. You must run it with data that represents your real workload. If you run it with empty input, the profile will be useless. If you run it with test data that differs from production, the profile will mislead the compiler.

./target/perf/pgo-demo

The instrumented binary writes raw profile data when it exits. The files have a .profraw extension and land in the directory given to -Cprofile-generate.
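
If the later steps fail, the first thing to check is whether the run produced any profiles at all. The sketch below lists .profraw files in a directory using only the standard library. The find_profraw helper and the target/perf path are assumptions for this example; point it at whatever directory your profiles are written to.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect every `.profraw` file directly inside `dir`, sorted by path.
fn find_profraw(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|ext| ext.to_str()) == Some("profraw") {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}

fn main() {
    // Hypothetical location; use the directory your profiles land in.
    let dir = Path::new("target/perf");
    match find_profraw(dir) {
        Ok(files) if files.is_empty() => {
            println!("no .profraw files in {}", dir.display());
        }
        Ok(files) => {
            for file in files {
                println!("{}", file.display());
            }
        }
        Err(err) => println!("cannot read {}: {err}", dir.display()),
    }
}
```

An empty result usually means the binary was not built with instrumentation or did not exit normally.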

Fourth, merge the profile data. This step is required even after a single run: rustc cannot read .profraw files directly, and llvm-profdata merge converts them into the indexed .profdata format. Merging also combines counts from multiple runs, which gives a more accurate picture because single runs can be noisy.

# Example merge command using llvm-profdata.
# Adjust paths based on your toolchain.
llvm-profdata merge -o default.profdata target/perf/*.profraw

Fifth, rebuild the binary using the merged profile. This step uses the data to optimize the code.

# Point rustc at the merged profile through RUSTFLAGS.
# An absolute path is the safer choice because cargo may run rustc
# from a different working directory.
export RUSTFLAGS="-Cprofile-use=$(pwd)/default.profdata"
cargo build --profile=perf

The final binary is optimized based on the profile. It is typically faster than the standard release build for the workload you profiled, but the only way to know is to measure both.
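
A small timing harness is enough for a first comparison. The sketch below assumes a recent stable toolchain (std::hint::black_box) and uses a hypothetical best_of helper. Build it once as a plain release binary and once with the profile applied, then compare the printed times on the same machine.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Same logic as the demo function: count bytes above 128.
fn process_data(input: &[u8]) -> usize {
    input.iter().filter(|&&b| b > 128).count()
}

/// Run `f` `runs` times and return the fastest wall-clock duration.
/// Returns `Duration::MAX` if `runs` is zero.
fn best_of<F: FnMut()>(runs: u32, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // Synthetic input; substitute data that resembles your real workload.
    let data: Vec<u8> = (0..1_000_000u32).map(|i| (i % 251) as u8).collect();
    let best = best_of(20, || {
        // black_box keeps the optimizer from deleting the measured work.
        black_box(process_data(black_box(&data)));
    });
    println!("best of 20 runs: {best:?}");
}
```

Taking the fastest of several runs filters out scheduler noise. For anything beyond a quick check, a dedicated benchmarking tool is a better fit.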

Treat the profile as a snapshot of truth. If the snapshot is wrong, the optimization is wrong.

Realistic example: The parser

PGO shines in code with many branches and unpredictable paths. Parsers are a classic example. A parser reads input and makes decisions based on the tokens. Most inputs are valid. Errors are rare. The compiler does not know this.

// src/parser.rs

/// Parse a command string into an action.
/// Most commands are valid. Errors are rare.
/// Without PGO, the compiler might lay out the match arms linearly.
/// With PGO, the compiler learns that "move" and "jump" are common.
fn parse_command(input: &str) -> Result<Command, &'static str> {
    match input.trim() {
        "move" => Ok(Command::Move),
        "jump" => Ok(Command::Jump),
        "attack" => Ok(Command::Attack),
        // This branch is cold. PGO moves it out of the hot path.
        _ => Err("Unknown command"),
    }
}

#[derive(Debug)]
enum Command {
    Move,
    Jump,
    Attack,
}

fn main() {
    let commands = vec!["move", "jump", "move", "attack", "jump"];
    
    for cmd in commands {
        match parse_command(cmd) {
            Ok(action) => println!("Executing: {:?}", action),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
}

In this example, the parse_command function has a match expression. The compiler sees four arms. Without profile data it has no reason to prefer one arm over another, so it orders the string comparisons by its own heuristics. The result may or may not test the common commands first, and the error arm may sit in the middle of the hot code.

PGO analyzes the profile. It sees that "move" and "jump" account for most calls (four of the five inputs in this toy run). It can arrange the code so those cases are checked early. It can move the error case to a cold section. The hot path becomes a tight sequence of checks. The CPU executes the common commands with minimal overhead.
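
Before trusting a profile, it helps to look at the mix of inputs you are about to feed the instrumented binary. The sketch below tallies how often each command appears in a workload. The command_mix helper is a name invented for this example and is not part of the parser.

```rust
use std::collections::BTreeMap;

/// Tally how often each distinct (trimmed) command appears in a workload.
fn command_mix<'a>(inputs: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut mix = BTreeMap::new();
    for &input in inputs {
        *mix.entry(input.trim()).or_insert(0) += 1;
    }
    mix
}

fn main() {
    let commands = ["move", "jump", "move", "attack", "jump"];
    let mix = command_mix(&commands);
    let total = commands.len() as f64;
    for (cmd, count) in &mix {
        println!("{cmd}: {:.0}%", *count as f64 / total * 100.0);
    }
}
```

If the percentages look nothing like production traffic, fix the workload before you collect the profile.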

Keep your hot path tight. PGO helps you do that automatically.

Pitfalls and gotchas

PGO is powerful, but it has risks. Using it incorrectly can make your code slower or break the build.

Profile mismatch is the biggest danger. If you change your code significantly after generating the profile, the recorded data no longer lines up with the functions being compiled. The compiler generally discards counts for functions that no longer match, often with a warning, and optimizes them as if no profile existed. For code that changed only slightly, it may apply counts that no longer describe the real behavior. Either way you lose the benefit, and you may not notice. Always regenerate the profile when you make substantial changes to the hot paths.

Unrepresentative data is another risk. If you profile with a small dataset, the profile might capture noise. If you profile with a dataset that differs from production, the compiler will optimize for the wrong patterns. For example, if you profile with mostly error inputs, the compiler will optimize the error path and slow down the success path. Use data that matches your real workload as closely as possible.

Build time increases. Instrumentation adds overhead to the build. The instrumented binary runs slower than the release binary. Merging profiles takes time. Rebuilding with the profile can take longer due to codegen-units = 1. PGO is a cost. You pay in build time and workflow complexity for runtime speed.

Compiler errors can occur if the profile is missing or corrupted. rustc stops the build and reports that the file passed to -C profile-use does not exist or could not be loaded. This usually means the RUSTFLAGS variable is not set correctly or the profile file path is wrong. Check your environment variables and file paths.
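
A quick pre-flight check can catch a bad path before a long build starts. The sketch below pulls the profile path out of a RUSTFLAGS-style string and checks that the file exists. The profile_use_path helper is hypothetical, and it assumes the path contains no spaces.

```rust
use std::path::Path;

/// Extract the path given to `-C profile-use` from a RUSTFLAGS-style string.
/// Handles both `-Cprofile-use=PATH` and `-C profile-use=PATH`.
/// Assumes the path itself contains no whitespace.
fn profile_use_path(rustflags: &str) -> Option<&str> {
    let mut tokens = rustflags.split_whitespace();
    while let Some(token) = tokens.next() {
        let arg = if token == "-C" {
            // Two-token form: the option is the next token.
            tokens.next()?
        } else if let Some(rest) = token.strip_prefix("-C") {
            rest
        } else {
            continue;
        };
        if let Some(path) = arg.strip_prefix("profile-use=") {
            return Some(path);
        }
    }
    None
}

fn main() {
    let flags = std::env::var("RUSTFLAGS").unwrap_or_default();
    match profile_use_path(&flags) {
        Some(path) if Path::new(path).is_file() => println!("profile found: {path}"),
        Some(path) => println!("profile missing: {path}"),
        None => println!("RUSTFLAGS has no -C profile-use flag"),
    }
}
```

Running this in the same shell as the build confirms that the variable is exported and the path resolves.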

Garbage in, garbage out. Validate your workload.

Decision: When to use PGO

PGO is not for every project. It adds complexity and build time. Use it when the benefits outweigh the costs.

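Use PGO when you have a hot loop or critical path that is already optimized but still bottlenecked by branch mispredictions or cache misses. PGO can squeeze out gains that manual tuning struggles to reach. How large they are depends heavily on the workload, so treat any expected number as something to measure.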

Use PGO when your application has predictable runtime behavior, like a server handling a steady stream of requests or a game with a consistent loop. The profile will be stable and representative.

Use PGO when you are shipping a performance-critical binary and have measured that PGO provides a significant speedup for your workload. Do not use it blindly. Measure first.

Reach for standard --release optimization when you are in early development and build speed matters more than peak performance. PGO slows down iteration.

Reach for --release when your workload is highly variable and unpredictable, making a single profile unrepresentative. If the behavior changes constantly, the profile will be stale quickly.

Use Link Time Optimization (LTO) alongside PGO when you want to combine cross-crate inlining with profile data for maximum speed. LTO and PGO complement each other. LTO allows the compiler to optimize across crate boundaries. PGO guides those optimizations with data.

PGO is the final polish. Fix the algorithm first.

Where to go next