How to benchmark with criterion

The stopwatch lie

You write a function to parse a configuration file. You wrap the call between two std::time::Instant readings, run it, and see 0.4 microseconds. You celebrate. Then you run it again. 12 microseconds. You run it a third time. 0.4 microseconds. The number jumps around wildly. Worse, you realize the compiler optimized the function call away entirely because the result wasn't used. You didn't measure performance. You measured the cost of doing nothing.

Stopwatches fail in Rust for two reasons. The optimizer deletes code that has no observable side effects. Background processes, CPU frequency scaling, and cache effects introduce noise that a single measurement cannot filter out. You need a tool that fights the optimizer and gives you statistics, not just a number.
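
To make the failure concrete, here is a minimal sketch of the stopwatch approach using only the standard library. The parse_config function is a hypothetical stand-in for the real parser, and time_once is an illustrative helper, not part of any library.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the config parser: counts `key=value` lines.
fn parse_config(text: &str) -> usize {
    text.lines().filter(|line| line.contains('=')).count()
}

/// Times a single call with a stopwatch, the approach this section warns against.
fn time_once(text: &str) -> (usize, Duration) {
    let start = Instant::now();
    let entries = parse_config(text);
    let elapsed = start.elapsed();
    (entries, elapsed)
}

fn main() {
    let text = "name=demo\nthreads=4\n# comment\n";
    for run in 1..=3 {
        let (entries, elapsed) = time_once(text);
        // Each run is one sample; the printed durations will generally differ.
        println!("run {run}: {entries} entries in {elapsed:?}");
    }
}
```

Run it a few times and compare the durations. A single reading gives you no way to tell signal from noise. This version at least prints the parsed result. If the result were discarded, an optimized build could skip the parsing entirely.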

Criterion fights back

Criterion is a statistical benchmarking framework. It runs your code many times across a set of samples (100 by default), analyzes the timings statistically, and reports a point estimate with a confidence interval. It also holds the return value hostage so the compiler cannot optimize your work away.

Think of criterion as a lab experiment. You don't measure a chemical reaction once and declare the result. You run the reaction multiple times to account for temperature fluctuations and measurement error. Criterion does this for code. It accounts for system noise and tells you how confident you can be in the result.
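
The idea of repeating the experiment can be sketched with the standard library alone. This is not Criterion's algorithm: Criterion batches many iterations per sample and derives its interval by bootstrap resampling, while this sketch only sorts raw samples and reads off percentiles. The summarize helper and the workload are assumptions made for illustration.

```rust
use std::time::Instant;

/// Returns (low, median, high): the 2.5th, 50th and 97.5th percentile
/// samples, picked by nearest rank after sorting.
fn summarize(samples: &mut [f64]) -> (f64, f64, f64) {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_by(|a, b| a.total_cmp(b));
    let pick = |q: f64| samples[((samples.len() - 1) as f64 * q).round() as usize];
    (pick(0.025), pick(0.5), pick(0.975))
}

fn main() {
    // Hypothetical workload: sort a small vector, as a stand-in for real code.
    let mut samples: Vec<f64> = (0..200)
        .map(|_| {
            let start = Instant::now();
            let mut v = vec![3, 1, 4, 1, 5, 9, 2, 6];
            v.sort();
            std::hint::black_box(v); // keep the sort from being optimized away
            start.elapsed().as_nanos() as f64
        })
        .collect();
    let (low, median, high) = summarize(&mut samples);
    println!("spread: [{low} ns {median} ns {high} ns]");
}
```

The three numbers describe the spread of individual samples, which is cruder than a confidence interval on the estimate. It still shows why one reading is not enough.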

The output looks like this:

time:   [100 ns 101 ns 102 ns]

The middle value is the point estimate. The outer two values are the lower and upper bounds of the confidence interval, which is 95% by default. If the interval is narrow, the benchmark is stable. If the interval is wide, the benchmark is noisy. Criterion also detects outliers and reports them.

Trust the brackets. The single number is a guess. The interval is the truth.

Setup and the walltime convention

Add criterion to your dev-dependencies. Create a benches directory. Configure Cargo.toml to disable the default harness.

[dev-dependencies]
criterion = { version = "0.5", default-features = false, features = ["cargo_bench_support"] }

[[bench]]
name = "my_bench"
harness = false
required-features = ["walltime"]

[features]
# Empty marker feature. Dev-dependencies cannot be optional,
# so there is no "dep:criterion" to enable.
walltime = []

The harness = false setting is mandatory. Cargo builds bench targets with the built-in libtest harness by default. Criterion provides its own main function through the criterion_main macro. If you leave the default harness enabled, libtest's entry point wins and criterion's main never runs. The build usually succeeds, but cargo bench prints "running 0 tests" and measures nothing.

The walltime feature is a naming convention, not something criterion requires. Cargo already keeps dev-dependencies out of normal builds and out of your users' dependency trees, so the feature does not make production builds smaller. What the gate buys you is an opt-in bench target: because of required-features, a plain cargo bench skips my_bench, and you run it explicitly with cargo bench --features walltime. That keeps CI pipelines that don't need benchmarks from running them by accident.

Gate your benchmarks behind a feature. Run them on purpose.

Minimal benchmark

Create benches/my_bench.rs. Define a function to benchmark. Define a benchmark function that takes &mut Criterion. Use criterion_group and criterion_main to wire everything together.

// benches/my_bench.rs
use criterion::{criterion_group, criterion_main, Criterion};

/// Sorts a small vector of integers.
fn sort_vec() -> Vec<i32> {
    let mut v = vec![3, 1, 4, 1, 5];
    v.sort();
    v
}

/// Benchmarks the sort function.
fn bench_sort(c: &mut Criterion) {
    c.bench_function("sort_vec", |b| {
        // b.iter runs the closure many times.
        // The return value prevents the optimizer from deleting the work.
        b.iter(sort_vec)
    });
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);

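Run the benchmark with cargo bench --features walltime. Criterion prints its results to the terminal and stores the raw data under target/criterion. On later runs it compares against the previous run and flags regressions. HTML reports are optional: in recent releases they require the html_reports feature plus a plotting backend (gnuplot or the plotters feature), which the minimal dependency line above leaves off. With them enabled, open target/criterion/report/index.html to see charts.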

Return a value from your benchmark closure. If you return () from side-effect-free code, the optimizer may be eating your work for breakfast.

How b.iter works

The b.iter method takes a closure. It runs the closure repeatedly until it has enough samples to fit a distribution. It captures the return value of the closure and uses it to prevent optimization.

If your function returns Vec<i32>, b.iter passes that vector through black_box. The compiler has to assume the vector is used, so it cannot delete the sort. If your function returns () and has no side effects the compiler can observe, nothing forces the work to happen. The compiler may delete the call, and the benchmark reports an implausibly small time, often well under a nanosecond.

This is the most common pitfall. If you benchmark a function that returns (), use black_box on its inputs and on whatever state it changes. Wrapping the () itself does nothing, because () carries no information.

use criterion::{black_box, Criterion};

/// Fills a buffer in place and returns ().
fn fill(buf: &mut [u8], byte: u8) {
    buf.fill(byte);
}

fn bench_side_effect(c: &mut Criterion) {
    let mut buf = vec![0u8; 1024];
    c.bench_function("side_effect", |b| {
        b.iter(|| {
            // black_box makes the inputs opaque, so the call
            // cannot be specialized or folded away.
            fill(black_box(buf.as_mut_slice()), black_box(0xAB));
            // Treat the mutated buffer as used, so the writes must happen.
            black_box(buf.as_slice());
        })
    });
}

black_box is a hint to the compiler. It says "pretend this value is used somewhere you cannot see." b.iter applies it to the return value for you, which is why returning the actual value is the simplest way to keep the work alive. Call it yourself when there is no meaningful value to return. The standard library documents its own black_box as best effort, so do not treat the hint as a hard guarantee.

Don't reach for black_box first. Return the value. Use black_box only when the function signature forces ().
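
One case where black_box can still matter, even when you return a value, is a constant input that the compiler could fold at compile time. The sketch below uses std::hint::black_box from the standard library in a hand-rolled loop. The sum_of_squares function is a hypothetical workload, and the loop is for illustration only, not a replacement for b.iter.

```rust
use std::hint::black_box;

/// Sums the squares of 0..n. Pure, so easy prey for constant folding.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

fn main() {
    let mut total = 0u64;
    for _ in 0..1_000 {
        // Hiding the input discourages precomputing the call at compile time.
        // Hiding the output discourages discarding the call.
        total = total.wrapping_add(black_box(sum_of_squares(black_box(100))));
    }
    println!("{total}");
}
```

black_box returns its argument unchanged, so wrapping a value never alters the result. It only limits what the optimizer is allowed to assume about it.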

Realistic scenario: parameters and comparison

Real benchmarks often need to measure different input sizes or compare implementations. Use bench_with_input to pass parameters. Use BenchmarkId to label the results. Use benchmark_group to compare multiple functions.

use criterion::{BenchmarkId, Criterion};
use std::collections::HashMap;

/// Benchmarks HashMap insertion for different sizes.
fn bench_hashmap_sizes(c: &mut Criterion) {
    let sizes = [10, 100, 1000, 10000];

    let mut group = c.benchmark_group("hashmap_insert");

    for size in sizes {
        group.bench_with_input(
            BenchmarkId::new("insert", size),
            &size,
            |b, &size| {
                // Clone the empty map inside iter so every iteration starts fresh.
                // If you inserted into `map` directly, later iterations would
                // measure overwriting keys in an already full map.
                let map: HashMap<i32, i32> = HashMap::new();
                b.iter(|| {
                    let mut m = map.clone();
                    for i in 0..size {
                        m.insert(i, i * 2);
                    }
                    m
                })
            },
        );
    }

    group.finish();
}

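The clone inside the iteration is critical. If you build the map outside the iter and insert into it directly, only the first iteration measures insertion into an empty map. Every later iteration measures overwriting keys in a map that is already full. The benchmark measures the wrong thing. Always reset the input inside the iteration. The reset is part of the measured time, so keep it cheap. When it is not cheap, Criterion's iter_batched runs the setup outside the timed region.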
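
The idea of keeping setup out of the timed region can be sketched with the standard library alone. This is a hand-rolled illustration of the principle behind Criterion's iter_batched, not its implementation. The insert_n and time_with_fresh_input helpers are hypothetical names.

```rust
use std::collections::HashMap;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Routine under test: inserts `n` keys into the map it is given.
fn insert_n(mut map: HashMap<u32, u32>, n: u32) -> HashMap<u32, u32> {
    for i in 0..n {
        map.insert(i, i * 2);
    }
    map
}

/// Average time of the routine alone over `rounds` runs (rounds must be > 0).
/// The fresh map is built before the clock starts.
fn time_with_fresh_input(n: u32, rounds: u32) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..rounds {
        let input: HashMap<u32, u32> = HashMap::new(); // setup, untimed
        let start = Instant::now();
        let out = insert_n(black_box(input), black_box(n));
        total += start.elapsed();
        black_box(out); // result is used and dropped after the clock stops
    }
    total / rounds
}

fn main() {
    for n in [10, 100, 1_000] {
        println!("n = {n}: {:?} per run", time_with_fresh_input(n, 50));
    }
}
```

Every round starts from an empty map, and neither building the input nor dropping the result is counted. With Criterion you would express the same split by passing a setup closure and a routine closure to iter_batched, and you would get the statistics as well.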

You can add multiple functions to the same group to compare them. Criterion's report plots them on the same chart, so relative speed is easy to read off.

group.bench_function("std_sort", |b| b.iter(std_sort));
group.bench_function("my_sort", |b| b.iter(my_sort));

Compare implementations in the same group. Relative speed matters more than absolute speed.

Pitfalls and compiler errors

Optimized away code. If your benchmark returns () from side-effect-free code and you don't use black_box, the compiler can delete the work. The result is an implausibly small time, close to 0 nanoseconds. Check your return types. If the benchmark is instant, the optimizer won.

State leakage. If you mutate shared state outside the iteration, you measure the cost of mutated state, not the fresh operation. Clone or reset inside the iter.

Missing harness. If you forget harness = false in Cargo.toml, cargo builds the target with the default libtest harness. The build usually succeeds, but criterion's main never runs: the output shows "running 0 tests" with no measurements, and criterion's command-line options are rejected as unrecognized. Add harness = false.

Debug builds. cargo bench uses the bench profile, which inherits from release, so optimizations are on by default. This is correct. Debug builds include checks and lack optimization. Benchmarking debug measures the wrong thing. If cargo bench seems slow, look for a [profile.bench] or [profile.release] override in Cargo.toml or in .cargo/config.toml.

Noise. CPU frequency scaling causes jitter. Background processes cause spikes. Criterion filters some noise, but not all. Run benchmarks on a quiet machine. Disable turbo boost if you need stability. The confidence interval tells you if the noise is significant.

Check your build profile. Debug benchmarks lie.
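
One cheap guard against the debug-build pitfall is to check debug_assertions at startup. This is a proxy, not proof: the flag is on by default in dev builds and off by default in release and bench builds, but a custom profile can set it independently of the optimization level. The function name is an illustrative choice.

```rust
/// True when the binary was compiled with debug assertions, which is the
/// default for dev builds and off by default for release and bench builds.
fn looks_like_debug_build() -> bool {
    cfg!(debug_assertions)
}

fn main() {
    if looks_like_debug_build() {
        eprintln!("warning: debug assertions are on; timings will not reflect optimized code");
    } else {
        println!("debug assertions are off; this looks like an optimized build");
    }
}
```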

Decision matrix

Use criterion when you need statistical rigor, confidence intervals, and protection against the optimizer. Use criterion when you are comparing implementations or measuring regression over time. Use criterion for any performance-critical code where accuracy matters.

Use std::time::Instant when you need a quick wall-clock check for a whole program flow and don't care about noise. Use std::time for logging execution time in production or for rough estimates during early prototyping.

Use perf or flamegraph when you need to find where the time is spent, not just how much time. Use profiling tools to identify hotspots before you benchmark them. Criterion tells you if a change is faster. Profiling tells you what to change.

Use divan when you want a simpler API with less boilerplate. It is a lighter-weight alternative built around attribute macros, with fewer configuration options. (tinytemplate is a templating crate, not a benchmarking tool.)

Criterion is the gold standard for Rust benchmarks. If you are serious about performance, this is the tool.

Where to go next

Criterion measures how fast your Rust code runs by executing it many times and reporting an estimate with a confidence interval, not a single stopwatch reading. Use it whenever you need to know whether a change made your code faster or slower. From here, pick one function you care about and write a benchmark for it, add a group that compares two implementations, and reach for a profiler when the numbers say something is slow but not why.