How to Use criterion for Statistical Benchmarking

Why wall-clock time lies to you

You write a string parsing function. You wrap it in std::time::Instant::now() and elapsed(). You run it once. It reports 42 microseconds. You refactor the loop, run it again, and it reports 38 microseconds. You celebrate. You run it a third time and it reports 51 microseconds. The number jumped back up. Did your refactor break something, or did the operating system just decide to schedule a background disk sync at the exact wrong moment?

Naive timing measures the system's mood, not your algorithm. CPUs scale their clock frequency, the operating system preempts and migrates threads, cache lines are evicted and refilled, and background services fire unpredictably. A single measurement captures one slice of that chaos. It tells you almost nothing about typical performance.
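To see the jitter for yourself, the sketch below times the same call repeatedly with Instant and returns the fastest and slowest observation. The parse_tokens workload and the helper names are stand-ins chosen for illustration, and the spread you see depends entirely on your machine.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload for illustration: split a string on whitespace.
fn parse_tokens(input: &str) -> Vec<&str> {
    input.split_whitespace().collect()
}

/// Times `runs` individual calls with `Instant` and returns each duration.
fn time_single_calls(input: &str, runs: usize) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            // black_box keeps the optimizer from discarding the call.
            black_box(parse_tokens(black_box(input)));
            start.elapsed()
        })
        .collect()
}

/// Returns the fastest and slowest observation, or None for an empty slice.
fn spread(samples: &[Duration]) -> Option<(Duration, Duration)> {
    let min = samples.iter().min()?;
    let max = samples.iter().max()?;
    Some((*min, *max))
}
```

Calling time_single_calls with a few dozen runs and printing the result of spread usually shows the slowest call taking several times as long as the fastest, even though the code never changed.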

criterion solves this by treating your code like a scientific experiment. It runs your function thousands of times (millions, for fast code), groups the runs into samples, flags outliers, and fits a statistical model to the data. You get a point estimate of the time per iteration with confidence bounds, not a single lucky or unlucky number. The tool separates signal from system jitter so you can actually trust the results.

How statistical benchmarking actually works

Think of measuring a sprinter's time. One lap is useless. Wind resistance, track temperature, a false start, or a heavy shoe all skew the result. You run twenty laps. You discard the worst and best. You calculate the median. You check how tightly the times cluster. That cluster tells you the sprinter's true speed and how much environmental noise affects them.

criterion applies the same idea to code. It executes your benchmark function in tight loops, records the total time of each sample (a batch of many iterations), and fits a linear regression of total time against iteration count. The slope of that line is the estimated time per iteration. The tool also calculates a confidence interval, 95% by default, which tells you the range where the true execution time likely falls. Unlike the lap analogy, it does not throw away extreme samples; it flags them as outliers in the report. If the interval is wide, your benchmark is noisy. If it is narrow, your measurement is stable.
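The regression idea can be sketched in a few lines. This is a minimal illustration that assumes a least-squares line through the origin (zero iterations take zero time); it is not criterion's own code, and the function name is hypothetical.

```rust
/// Least-squares slope of a line through the origin: sum(x*y) / sum(x*x).
/// `iters[i]` is the iteration count of sample i and `nanos[i]` is the total
/// time that sample took, so the slope is the estimated time per iteration.
fn slope_through_origin(iters: &[f64], nanos: &[f64]) -> Option<f64> {
    if iters.is_empty() || iters.len() != nanos.len() {
        return None;
    }
    let xy: f64 = iters.iter().zip(nanos).map(|(x, y)| x * y).sum();
    let xx: f64 = iters.iter().map(|x| x * x).sum();
    if xx == 0.0 {
        None
    } else {
        Some(xy / xx)
    }
}
```

With samples of 10, 20, 30, and 40 iterations that took 405, 790, 1210, and 1600 nanoseconds in total, the slope works out to 40.05 nanoseconds per iteration. Fitting a line across many samples keeps any single disturbed sample from dominating the estimate.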

The library also handles warm-up automatically. Cold caches, untrained branch predictors, and a CPU that is still ramping up its clock skew early runs. criterion runs your routine for a warm-up period (three seconds by default) before it starts recording data. You never have to write manual warm-up loops.

Stop guessing whether a number is real. Let the statistics do the heavy lifting.

Setting up the harness

Rust's cargo bench command defaults to the built-in libtest harness, which expects #[bench] functions. That attribute was never stabilized: it only works on the nightly toolchain, and its harness prints a single time per iteration with a rough deviation and no further analysis. criterion runs on stable Rust and replaces the harness for the benchmark targets you hand to it.

You need to tell Cargo to disable its default benchmark runner and hand control to criterion. This requires two changes in Cargo.toml.

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "string_parse"
harness = false

The harness = false line is mandatory. Without it, Cargo links the default libtest harness, which supplies its own main function, finds no #[bench] functions, and finishes without ever running your benchmarks. No feature flag is needed to choose a clock: criterion measures wall-clock time by default. CPU time would ignore time spent waiting on I/O or other threads. Wall-clock time captures the actual elapsed duration, which matters when your code touches the filesystem, network, or sleeps.

Create a file at benches/string_parse.rs. This is where your benchmark lives.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Splits the input on whitespace and returns the borrowed tokens.
fn parse_tokens(input: &str) -> Vec<&str> {
    // Split on whitespace and collect into a vector.
    // This mimics a real lexical analysis step.
    input.split_whitespace().collect()
}

fn bench_string_parse(c: &mut Criterion) {
    // Create a group to organize related benchmarks.
    // Groups share configuration and appear together in reports.
    let mut group = c.benchmark_group("parsing");

    // Run the benchmark with a fixed input string.
    // black_box hides the literal from the optimizer so the call cannot be
    // precomputed; b.iter already keeps the returned vector alive.
    group.bench_function("fixed_input", |b| {
        b.iter(|| parse_tokens(black_box("hello world benchmark test")))
    });

    // Finalize the group and generate the report.
    group.finish();
}

criterion_group!(benches, bench_string_parse);
criterion_main!(benches);

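black_box is a function, exported by criterion and also available as std::hint::black_box, that returns its argument unchanged while telling the optimizer to treat the value as opaque. Here it guards the input: without it, the compiler sees a string literal and is free to precompute part or all of the split at compile time. The return value is already protected, because b.iter passes whatever the closure returns through black_box. Trouble starts when the closure discards its result, for example by ending the call with a semicolon. The optimizer may then delete the work, and your benchmark would measure the time it takes to do nothing. As a rule, pass inputs through black_box and return the result from the closure.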
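A small std-only sketch of the opaque-boundary idea, using std::hint::black_box, the standard library's counterpart to criterion's function. The sum_to and opaque_sum helpers are hypothetical names used only for this illustration.

```rust
use std::hint::black_box;

/// Sums the integers from 1 to `n`. Called with a literal argument and no
/// black_box, the optimizer is free to fold the whole call into a constant.
fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

/// Calls `sum_to` with both the input and the result hidden from the optimizer.
fn opaque_sum(n: u64) -> u64 {
    // black_box returns its argument unchanged; it only blocks optimizations
    // that depend on knowing the value.
    let input = black_box(n);
    black_box(sum_to(input))
}
```

The result is identical to calling sum_to directly (opaque_sum(100) is 5050). The only difference is that the compiler can no longer assume it knows the input or that the output is unused.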

Run the benchmark with cargo bench --bench string_parse. Cargo builds benchmarks with the bench profile, which inherits the release settings. This is intentional. Debug builds skip the optimizations that production code relies on and add integer overflow checks and debug assertions. Benchmarking a debug build measures unoptimized code, not your algorithm.

What happens under the hood

When you execute the command, Cargo compiles the benchmark with the bench profile (optimization level 3 by default; cross-crate link-time optimization only if your profile asks for it) and runs the resulting binary. From there criterion takes over. After the warm-up it enters the sampling phase. The library runs your closure in tight loops and, by default, collects 100 samples over roughly five seconds, increasing the number of iterations from one sample to the next. That spread of iteration counts is what gives the regression a line to fit.
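The warm-up and sampling phases can be tuned through criterion's configuration builder. The sketch below spells out what I understand to be the 0.5 defaults (three seconds of warm-up, five seconds of measurement, 100 samples); the bench_sum function and its workload are placeholders for illustration.

```rust
use std::time::Duration;

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder benchmark: sums 1..=1000 with the bound hidden from the optimizer.
fn bench_sum(c: &mut Criterion) {
    c.bench_function("sum_1_to_1000", |b| {
        b.iter(|| (1..=black_box(1000u64)).sum::<u64>())
    });
}

criterion_group! {
    name = benches;
    // Raise measurement_time or sample_size for noisy or slow benchmarks.
    config = Criterion::default()
        .warm_up_time(Duration::from_secs(3))
        .measurement_time(Duration::from_secs(5))
        .sample_size(100);
    targets = bench_sum
}
criterion_main!(benches);
```

Longer measurement times and more samples tighten the interval at the cost of a slower benchmark run, so it is usually worth changing them only for benchmarks that stay noisy with the defaults.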

The output prints a summary to your terminal. For each benchmark you see its name and three values: the lower bound of the confidence interval, the point estimate, and the upper bound. If you declared a throughput, a matching throughput line follows. criterion saves the results of every run as JSON files under target/criterion/. On the next run it compares the new measurements against the saved ones and reports the change, so you can spot performance regressions across commits.

With the html_reports feature enabled, the tool also generates HTML reports with plots. The plots show the distribution of the samples, the fitted regression line, and the confidence bounds. You can view the report by opening target/criterion/report/index.html in a browser. The visual feedback makes it obvious when a benchmark is noisy or when an optimization actually moved the needle.

Do not trust a single run. Check the confidence interval. As a rule of thumb, if the bounds span more than 10 percent of the estimate, noise is a large part of what you are measuring.
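The 10 percent check is simple arithmetic. Here is a sketch with hypothetical helper names, taking the three values criterion prints for a benchmark:

```rust
/// Width of the confidence interval relative to the point estimate.
fn relative_width(lower: f64, estimate: f64, upper: f64) -> f64 {
    (upper - lower) / estimate
}

/// True when the interval is wider than `threshold` (0.10 means 10 percent).
fn is_noisy(lower: f64, estimate: f64, upper: f64, threshold: f64) -> bool {
    relative_width(lower, estimate, upper) > threshold
}
```

An interval of 118 to 122 nanoseconds around an estimate of 120 is about 3 percent wide and passes. An interval of 80 to 200 nanoseconds around the same estimate is 100 percent wide and does not.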

Benchmarking real workloads

Real code rarely consists of isolated functions. You usually measure pipelines, data structures, or concurrent tasks. criterion handles this through parameterized benchmarks (bench_with_input) and setup closures that run outside the timed section (iter_batched).

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use std::collections::HashMap;

/// Inserts a batch of string keys into a hash map and measures throughput.
fn bench_hashmap_insert(c: &mut Criterion) {
    let mut group = c.benchmark_group("collections");

    // Define input sizes to test scaling behavior.
    let sizes = vec![100, 1_000, 10_000, 100_000];

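    for size in sizes {
        // Report results as elements per second for this input size.
        group.throughput(Throughput::Elements(size as u64));

        // Everything inside b.iter is timed: building the map, formatting
        // the keys, inserting them, and dropping the map at the end.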
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            &size,
            |b, &n| {
                b.iter(|| {
                    let mut map = HashMap::with_capacity(n);
                    for i in 0..n {
                        map.insert(format!("key_{i}"), i);
                    }
                    black_box(map)
                })
            },
        );
    }

    group.finish();
}

criterion_group!(benches, bench_hashmap_insert);
criterion_main!(benches);

The bench_with_input method passes a parameter to your closure. This lets you test how performance scales with data size. The BenchmarkId gives each configuration its own entry in the report. Building the map inside b.iter guarantees that each iteration starts from a clean state, but it also means the measurement includes allocating the map, formatting every key, and dropping the map afterwards. That is fine when the whole operation is what you care about. If you only want the cost of insertion, prepare the keys outside the timed closure.

You can also declare a Throughput on the group, with Throughput::Bytes or Throughput::Elements, if you care about bytes per second or items per second instead of raw time. criterion converts the timing data into throughput automatically. This is useful for parsers, serializers, and compression routines where the business metric is volume, not latency.
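The conversion itself is just volume divided by time. This is a sketch of the arithmetic, not criterion's code, and the function name is hypothetical.

```rust
use std::time::Duration;

/// Converts a processed byte count and an elapsed time into bytes per second.
/// Returns None when no time has elapsed, since the rate is undefined.
fn bytes_per_second(bytes: u64, elapsed: Duration) -> Option<f64> {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        Some(bytes as f64 / secs)
    } else {
        None
    }
}
```

Parsing 1,000,000 bytes in 500 milliseconds works out to 2,000,000 bytes per second. Reporting the rate makes results for different input sizes directly comparable.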

Keep the closure deliberate. Everything inside the b.iter block is timed, so put the operation under test there and keep setup you do not want to measure outside it, or in the setup closure of iter_batched when every iteration needs fresh input.
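When each iteration needs fresh input but preparing it should not count toward the result, criterion's iter_batched takes a separate setup closure. The sketch below assumes criterion 0.5; the sorting workload and the benchmark name are placeholders for illustration.

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

fn bench_sort(c: &mut Criterion) {
    // Placeholder input: 10,000 integers in descending order.
    let unsorted: Vec<u32> = (0..10_000).rev().collect();

    c.bench_function("sort_10k_reversed", |b| {
        b.iter_batched(
            // Setup: runs outside the timed section and yields a fresh copy.
            || unsorted.clone(),
            // Routine: only this closure is timed.
            |mut data| {
                data.sort();
                // Returning the vector keeps the sort from being optimized away.
                data
            },
            BatchSize::SmallInput,
        )
    });
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);
```

Without the setup closure, the second iteration would sort data that is already sorted, which measures a different and much cheaper operation.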

The traps that ruin benchmarks

Benchmarks are notoriously easy to break. The compiler, the runtime, and your own assumptions will conspire to give you false confidence.

The most common trap is dead code elimination. If the closure computes a value and then discards it, the optimizer may remove the computation. You get a time of a nanosecond or less for work that obviously takes longer. Return the result from the closure or wrap it in black_box. If your function takes an input that is a compile-time constant, the optimizer may precompute the result. Pass the input through black_box or generate it at runtime.

The second trap is measuring the wrong thing. Hash map benchmarks often measure string allocation instead of hashing. If you look up existing keys through &str slices, you measure hashing and probing. If you build String keys inside the timed closure, you measure allocation, formatting, hashing, and insertion together. Be explicit about what you are testing. Name your benchmarks accordingly.

The third trap is forgetting harness = false. If you omit it, Cargo builds the target with the default libtest harness, which ignores criterion's entry point. Typically the run prints running 0 tests and finishes instantly, or it rejects criterion's command-line flags with an Unrecognized option error. No benchmark is measured either way. Fix the Cargo.toml and rebuild.

The fourth trap is benchmarking an unoptimized build. cargo bench uses the optimized bench profile, but if you override it with --profile dev, or run the target through cargo test --benches, you get debug code (and in test mode criterion only smoke-tests each benchmark instead of measuring it). Overflow checks, debug assertions, and unoptimized code dominate the timing. Always verify the build profile. The benchmark binary should land under target/release/deps.

The fifth trap is ignoring variance. A point estimate of 120 nanoseconds with bounds of 80 to 200 nanoseconds means the measurement is unstable. Background processes, CPU frequency scaling, or cache misses are interfering. Run the benchmark on a quiet machine. Pin the CPU frequency, for example by selecting the performance governor on Linux. Close background apps. If the variance remains high, your code may be inherently sensitive to memory layout or system state. Document it.
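One way to tell whether a few disturbed samples are responsible for a wide interval is to flag outliers. criterion reports outliers in its terminal output; the sketch below is a simplified Tukey-fence check written for illustration, with hypothetical function names, not a copy of criterion's classifier.

```rust
/// Linear-interpolated percentile of an ascending-sorted, non-empty slice.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Returns the samples outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
fn tukey_outliers(samples: &[f64]) -> Vec<f64> {
    if samples.is_empty() {
        return Vec::new();
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = percentile(&sorted, 0.25);
    let q3 = percentile(&sorted, 0.75);
    let iqr = q3 - q1;
    let (low, high) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    sorted.into_iter().filter(|&x| x < low || x > high).collect()
}
```

For the samples 10, 11, 12, 13, 14, 15, 16, and 100, only 100 falls outside the fences. A handful of such samples points to interference from the system; a broad spread with no clear outliers points to the code or its memory behavior.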

Treat every benchmark result as a hypothesis until the confidence interval proves otherwise.

When to reach for criterion

Use criterion when you need statistically sound measurements for algorithmic changes, data structure tuning, or serialization performance. Use criterion when you want to track performance regressions across commits with automated CI integration. Use criterion when your code involves allocation, hashing, parsing, or mathematical loops where microsecond differences matter.

Use std::time::Instant when you need a quick sanity check during development and do not care about precision. Use std::time::Instant when you are measuring long-running operations like network requests or database queries where system noise is negligible compared to the actual duration.

Use profiling tools like perf, flamegraph, or samply when you need to find hotspots inside a large codebase. Use profiling tools when you want to see call graphs, cache miss rates, or branch prediction failures. Criterion tells you how fast something is. Profilers tell you why it is slow.

Pick the tool that matches the question. Do not use a statistical hammer to drive a profiling nail.

Where to go next

criterion runs your code many times to measure how fast it really is and to separate that figure from random system noise. Reach for it whenever you need to know whether a new version of your code is faster than the old one. A natural next step is to add one benchmark for a function you are about to change, run it before and after the change, and compare the reported intervals rather than single numbers.