How to use cargo bench

When `cargo test` lies to you about speed

You just refactored your string parsing function. The new version uses a state machine instead of regex. It feels cleaner. You suspect it's faster. You run cargo test. The tests pass. The output says test result: ok. It tells you nothing about speed.

To measure performance, you need cargo bench. But benchmarks in Rust are dangerous. The compiler is smarter than you. It can look at your benchmark loop, realize the result is constant, and delete the entire loop. Your benchmark then reports a time close to zero. You think you have a supercomputer. You actually have a compiler that cheated.

Benchmarks require a different mindset than tests. Tests verify correctness. Benchmarks measure time under optimized conditions. If you treat a benchmark like a test, the compiler will optimize your measurement away. You need to trick the compiler into thinking the result matters, or your data is worthless.

The anatomy of a benchmark

A benchmark runs optimized code. Optimizations inline functions, reorder instructions, and drop the overflow checks and debug assertions that debug builds carry. Debug builds are slow by comparison. Benchmarks must run on optimized code to reflect reality. cargo bench builds with the bench profile, which inherits the release profile's optimizations by default. You do not need to pass --release.

The core problem is constant propagation. If you benchmark a pure function on an input the compiler can see, the compiler may compute the value once at compile time and replace the call with the constant. If nothing uses the result, it may drop the call altogether. The runtime loop does nothing. The timing is close to zero.

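You need to break the compiler's ability to predict the result. The tool for this is black_box. Criterion exports it as criterion::black_box, and the standard library has an equivalent in std::hint::black_box. It is an identity function that asks the compiler to treat its argument as opaque: assume nothing about the value going in, and assume the value coming out is used. The standard library documents its version as best-effort, not a guarantee, but in practice it makes the compiler generate code that actually runs the computation.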

Without black_box, your benchmark measures the speed of the compiler's optimizer, not your code. With black_box, you measure the speed of your code.
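As a rough illustration outside any benchmarking framework, the sketch below times the same function twice with std::time::Instant: once with a visible constant input and a discarded result, and once with both routed through std::hint::black_box. The helper names time_naive and time_opaque and the iteration count are invented for this example. Whether the first loop really gets deleted depends on your compiler version and optimization level.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums all integers in 0..n.
fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

/// Times `iters` calls with a constant input and a discarded result.
/// An optimizing build is free to remove the whole loop body.
fn time_naive(iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let _ = sum_to(1000);
    }
    start.elapsed()
}

/// Times `iters` calls with the input hidden and the result kept alive.
fn time_opaque(iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(sum_to(black_box(1000)));
    }
    start.elapsed()
}

fn main() {
    let iters = 1_000_000;
    println!("naive:  {:?}", time_naive(iters));
    println!("opaque: {:?}", time_opaque(iters));
}
```

Run it with cargo run --release. In a debug build both loops do real work and the difference disappears. Note that black_box only hides the input and keeps the result alive. The optimizer may still simplify the body of sum_to itself.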

Setting up Criterion

The built-in #[bench] attribute is unstable and only works on nightly Rust. It reports a single ns/iter figure and lacks statistical analysis. The community standard is the criterion crate. Criterion runs your benchmark many times, calculates statistical distributions, detects regressions, and generates an HTML report. It is the tool to reach for first.

Add Criterion to your Cargo.toml. It belongs in [dev-dependencies] because benchmarks are development tools. You also need to configure a [[bench]] section. Criterion provides its own main function, so you must tell Cargo not to generate the default test harness.

# Cargo.toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "my_bench"
harness = false

The harness = false line is critical. If you omit it, Cargo builds the target with the default libtest harness, which supplies its own entry point and ignores the main function Criterion generates. The build succeeds, but cargo bench prints running 0 tests and your benchmarks never run. Always set harness = false when using Criterion.

Create the benchmark file in benches/my_bench.rs. The file structure uses macros to generate the entry point.

// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Sums all integers from 0 to n.
fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

fn bench_sum(c: &mut Criterion) {
    c.bench_function("sum_to_1000", |b| {
        // black_box prevents the compiler from optimizing the input away.
        // The compiler cannot assume the value of black_box(1000).
        b.iter(|| sum_to(black_box(1000)))
    });
}

// Generates the main function and registers the benchmark group.
criterion_group!(benches, bench_sum);
criterion_main!(benches);

The criterion_group! macro collects your benchmark functions. The criterion_main! macro generates the main function that runs the group. This separation allows you to organize benchmarks into modules without cluttering your code with boilerplate.

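Run the benchmark with cargo bench. Criterion warms up first, then collects a set of samples, each covering many iterations, and analyzes them statistically. For each benchmark the output shows a time estimate bracketed by the lower and upper bounds of a confidence interval, a count of outliers, and, from the second run onward, the change relative to the previous run.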

cargo bench

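Criterion also generates an HTML report. The summary page is target/criterion/report/index.html, and each benchmark gets its own page under a directory named after its ID, such as target/criterion/sum_to_1000/report/index.html. Open it in your browser. The report contains plots and a comparison against the previous run. If no report appears, check whether your Criterion version needs its html_reports feature enabled.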
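To make the warm-up and sampling steps concrete, here is a deliberately simplified sketch using only the standard library. It is not Criterion's algorithm: Criterion varies the iteration count per sample and bootstraps its confidence intervals, while this version just takes fixed-size samples and computes a mean, median, and sample standard deviation. The names Summary, summarize, and measure are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Summary statistics over per-call times, in nanoseconds.
#[derive(Debug)]
struct Summary {
    mean: f64,
    median: f64,
    std_dev: f64,
}

/// Computes mean, median, and sample standard deviation.
/// Panics if `samples` has fewer than two elements.
fn summarize(samples: &[f64]) -> Summary {
    assert!(samples.len() >= 2, "need at least two samples");
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;

    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    };

    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Summary { mean, median, std_dev: variance.sqrt() }
}

/// Runs `routine` for `warm_up`, then records `sample_count` samples of
/// `iters_per_sample` calls each. Returns nanoseconds per call for each sample.
fn measure<F: FnMut() -> u64>(
    mut routine: F,
    warm_up: Duration,
    sample_count: usize,
    iters_per_sample: u32,
) -> Vec<f64> {
    // Warm-up: run untimed so caches and CPU frequency can settle.
    let warm_up_start = Instant::now();
    while warm_up_start.elapsed() < warm_up {
        black_box(routine());
    }
    // Sampling: time batches of calls and divide by the batch size.
    (0..sample_count)
        .map(|_| {
            let start = Instant::now();
            for _ in 0..iters_per_sample {
                black_box(routine());
            }
            start.elapsed().as_nanos() as f64 / f64::from(iters_per_sample)
        })
        .collect()
}

fn main() {
    let samples = measure(
        || (0..black_box(1000u64)).sum::<u64>(),
        Duration::from_millis(100),
        50,
        10_000,
    );
    println!("{:?}", summarize(&samples));
}
```

A large std_dev relative to the mean is the same warning sign Criterion gives you through wide confidence bounds and outlier counts.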

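The compiler will optimize your benchmark to nothing unless you hide the inputs behind black_box. Criterion's iter already passes the return value through black_box, so the inputs are the part you have to protect. Trust nothing without it.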

Measuring real code with groups and scaling

Simple functions are easy to benchmark. Real code often involves setup, teardown, and varying inputs. Criterion supports benchmark groups and multiple inputs. Groups let you compare related benchmarks side by side. Multiple inputs reveal how your code scales.

Consider a Vec operation. You want to compare pushing items one by one versus reserving capacity upfront. You also want to see how performance changes as the vector grows.

// benches/vec_ops.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Pushes items one by one without pre-allocation.
fn push_loop(n: usize) -> Vec<i32> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(i as i32);
    }
    v
}

/// Reserves capacity upfront to avoid reallocations.
fn reserve_loop(n: usize) -> Vec<i32> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n {
        v.push(i as i32);
    }
    v
}

fn bench_vec_ops(c: &mut Criterion) {
    let mut group = c.benchmark_group("vec_ops");
    
    // Measure multiple sizes to observe scaling behavior.
    // Small sizes may hide allocation overhead.
    // Large sizes expose reallocation costs.
    for n in [100, 1000, 10000] {
        group.bench_function(format!("push_{}", n), |b| {
            b.iter(|| push_loop(black_box(n)))
        });
        
        group.bench_function(format!("reserve_{}", n), |b| {
            b.iter(|| reserve_loop(black_box(n)))
        });
    }
    
    // Finalizes the group and adds it to the report.
    group.finish();
}

criterion_group!(benches, bench_vec_ops);
criterion_main!(benches);

benchmark_group creates a container for related benchmarks. The HTML report groups these together, making comparison easier. The loop over sizes generates separate benchmarks for each input. This reveals how each function scales. If push_loop scales worse than reserve_loop as n grows, the report will show the divergence.

The black_box wraps the input n. This prevents the compiler from unrolling the loop based on a known constant size. It forces the code to handle the size dynamically.

Run the benchmark. Open the HTML report. Compare the push and reserve results at each size. If reserve_loop is significantly faster for large n, the data confirms that pre-allocation matters. If the results are identical, the compiler may have optimized the difference away, or the overhead is negligible for your workload.
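If you want to see why pre-allocation could matter before reading any timings, you can count how often a vector's capacity changes while it fills. The sketch below does that with a hypothetical helper, count_capacity_changes. A capacity change is used here as a proxy for a reallocation. The exact growth pattern of Vec is an implementation detail, so the counts it prints may differ between Rust versions.

```rust
/// Pushes `n` items into `v` and counts how many times its capacity changed.
fn count_capacity_changes(mut v: Vec<i32>, n: usize) -> usize {
    let mut changes = 0;
    let mut last_capacity = v.capacity();
    for i in 0..n {
        v.push(i as i32);
        if v.capacity() != last_capacity {
            changes += 1;
            last_capacity = v.capacity();
        }
    }
    changes
}

fn main() {
    for n in [100, 1_000, 10_000] {
        let grown = count_capacity_changes(Vec::new(), n);
        let reserved = count_capacity_changes(Vec::with_capacity(n), n);
        println!("n = {n}: push-only grew {grown} times, reserved grew {reserved} times");
    }
}
```

The count tells you how much work was avoided, not how much time was saved. Only the benchmark answers that.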

If the results look the same, your optimization didn't matter at these sizes.

Handling setup and teardown

Some benchmarks require state that must be created before measurement. For example, measuring the time to serialize a struct requires creating the struct first. If you create the struct inside the iteration, you measure both creation and serialization. You only want to measure serialization.

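Criterion provides iter_batched for this scenario. It separates setup from measurement. You provide a setup closure and a measurement closure. Criterion calls the setup closure once for every iteration, outside the timed region, and passes each result to the measurement closure. The BatchSize argument controls how many inputs are prepared before a timed batch runs.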

// benches/serialize.rs
use criterion::{black_box, criterion_group, criterion_main, BatchSize, Criterion};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone)]
struct User {
    id: u64,
    name: String,
    email: String,
}

fn create_user() -> User {
    User {
        id: 1,
        name: "Alice".to_string(),
        email: "alice@example.com".to_string(),
    }
}

fn bench_serialize(c: &mut Criterion) {
    c.bench_function("serialize_user", |b| {
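        // iter_batched calls create_user() once per iteration, outside the timed region.
        // The measurement closure receives the user by value and serializes it.
        // The user is dropped inside the closure, so that drop is timed too.
        // BatchSize::PerIteration times a single iteration per batch.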
        b.iter_batched(
            || create_user(),
            |user| serde_json::to_string(&black_box(user)),
            BatchSize::PerIteration
        );
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);

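BatchSize controls batching, not freshness: iter_batched always hands each iteration its own input. BatchSize::PerIteration makes every batch a single iteration, which keeps only one input alive at a time but adds measurement overhead. Criterion's documentation recommends BatchSize::SmallInput for small inputs like this one and reserves PerIteration for inputs too large to hold several of in memory. With plain iter there is no setup step. You would build the user once outside the closure and serialize the same object on every iteration. That is fine for read-only work like serialization, but it breaks down when the routine consumes or mutates its input.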

Use iter_batched when each iteration needs fresh state. iter reuses whatever state its closure captures.

Pitfalls and compiler tricks

Benchmarks are fragile. Small mistakes lead to misleading data.

Forgetting harness = false in Cargo.toml does not break the build. Cargo compiles the target with the default libtest harness, which ignores the main function Criterion generates. cargo bench reports running 0 tests and measures nothing. Always check your Cargo.toml configuration.

Forgetting black_box on the inputs leads to implausibly small results, often well under a nanosecond. The compiler folded the computation into a constant. You see a number, but it measures nothing.

Running benchmarks on a debug build produces meaningless numbers. Debug builds include assertions and lack optimizations. The timing reflects unoptimized code, not algorithmic speed. cargo bench uses an optimized profile by default. If you script benchmarks, ensure you are not overriding the profile.

Noisy environments corrupt statistics. Background processes, thermal throttling, and power saving modes introduce variance. Criterion runs many iterations to smooth noise, but extreme variance breaks statistical confidence. Close other applications. Run on a stable power source. If the standard deviation is huge, your environment is unstable.

The legacy #[bench] attribute is unstable and requires a nightly compiler. It lacks statistical analysis. It does not generate reports. It is just as prone to optimization artifacts if you forget black_box. Avoid it. Use Criterion.

If your benchmark reports zero time, the compiler won. Use black_box.

Choosing the right tool

Benchmarks measure speed. Profilers measure hotspots. You need both to optimize effectively.

Use Criterion for almost all benchmarking. It provides statistical rigor, regression detection, and HTML reports. It is the community standard. It handles setup, teardown, and scaling. It can run in CI to flag performance regressions, though shared CI machines are noisy, so treat small changes there with suspicion.

Use #[bench] only when you are maintaining legacy code that cannot upgrade dependencies. It lacks statistical analysis and is prone to optimization artifacts. It requires nightly Rust. Prefer Criterion for everything else.

Use perf or flamegraph when you need to find where time is spent, not just how much time is spent. Use a benchmark to measure the impact of a change. Use a profiler to identify the bottleneck to fix.

Criterion tells you if you got faster. Profilers tell you why.

Where to go next

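cargo bench is a stopwatch for your code. Start with one function you suspect is slow. Write a Criterion benchmark for it, wrap its inputs in black_box, and record a run before you change anything. Then make your change, run cargo bench again, and read the reported difference. When the number does not move, reach for a profiler to find where the time actually goes.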