How to Use Thread Pools in Rust

The missing thread pool

You have a vector of ten thousand images. You need to resize them. You write a loop. It takes forty seconds. You think, "I'll spawn a thread for each image." You spawn ten thousand threads. The operating system strains under the load. Each thread reserves its own stack, so memory usage spikes. Context-switch overhead eats your CPU time. The program can run slower than the single-threaded loop, or fail outright if thread creation hits a system limit.

Rust's standard library does not include a ThreadPool struct. This catches developers from Go, Java, or C++ off guard. The omission is intentional. Rust separates concerns. Data parallelism and task concurrency are different problems. The ecosystem provides specialized tools for each. You do not reach for a generic thread pool. You reach for the abstraction that matches your workload.

Stop trying to build a thread pool from scratch. The ecosystem has solved this.

Data parallelism with Rayon

Most "thread pool" needs in Rust are actually data parallelism. You have a collection. You want to do work on every element. The rayon crate is the standard solution. It turns your sequential iterators into parallel ones with almost zero code changes.

Add rayon to your Cargo.toml.

[dependencies]
rayon = "1.10"

Rayon provides parallel iterators. You write the logic once. Rayon splits the work across threads.

use rayon::prelude::*;

fn main() {
    // Create a large dataset. The elements are i64 because the total
    // (4_999_950_000) does not fit in an i32 and would overflow.
    let data: Vec<i64> = (0..100_000).collect();

    // par_iter splits the vector across threads automatically.
    // sum() is a reduction operation that combines results from all threads.
    let sum: i64 = data.par_iter().sum();

    println!("Sum: {}", sum);
}

The code looks identical to the sequential version, except that iter becomes par_iter. Rayon handles the splitting, the thread management, and the joining. You get parallelism without managing threads.

Write the loop once. Add par_ and you're done.
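For a sense of what par_iter takes off your hands, here is a rough hand-written equivalent that uses only the standard library. It is a sketch, not Rayon's implementation: the name chunked_sum and the fixed, equal-sized chunks are assumptions made for illustration, whereas Rayon splits work adaptively.

```rust
use std::thread;

// Hypothetical helper: split the slice into at most `workers` chunks,
// sum each chunk on its own scoped thread, then combine the partial sums.
fn chunked_sum(data: &[i64], workers: usize) -> i64 {
    let workers = workers.max(1);
    // chunks() panics on a length of zero, so clamp to at least 1.
    let chunk_len = data.len().div_ceil(workers).max(1);

    thread::scope(|s| {
        // Spawn every thread first, then join, so the chunks run concurrently.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<i64>()
    })
}

fn main() {
    let data: Vec<i64> = (0..100_000).collect();
    println!("Sum: {}", chunked_sum(&data, 4));
}
```

Every line of this bookkeeping is something the parallel iterator does for you, and it does not rebalance when one chunk turns out to be slower than the others.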

Under the hood: work stealing

Rayon does not assign fixed chunks of work to fixed threads. It uses a work-stealing scheduler. This design keeps all cores busy even when work is uneven.

Imagine a kitchen with three chefs. Each chef has a stack of tickets. When a chef finishes their stack, they do not sit idle. They walk over to a busy chef and steal a ticket from the bottom of their stack. The busy chef keeps working on the top of their stack. The stealing happens without locking a central queue. This minimizes contention.

In Rayon, each thread has a deque of tasks. When a thread runs out of work, it steals from another thread's deque. The scheduler balances load dynamically. You do not need to tune chunk sizes. Rayon figures it out.
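The idea can be modelled in a few lines of standard-library Rust. This is a toy sketch under loud assumptions: run_with_stealing is a hypothetical name, each deque sits behind a Mutex for simplicity, and no new jobs are created after start. Rayon's real scheduler uses lock-free deques and handles tasks that spawn further tasks.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::thread;

// Toy model: one deque of jobs per worker. A "job" is a number to add to a total.
fn run_with_stealing(queues: Vec<VecDeque<u64>>) -> u64 {
    let queues: Vec<Mutex<VecDeque<u64>>> = queues.into_iter().map(Mutex::new).collect();
    let total = AtomicU64::new(0);

    thread::scope(|s| {
        for me in 0..queues.len() {
            let queues = &queues;
            let total = &total;
            s.spawn(move || loop {
                // Own work first: take from the back of this worker's deque.
                let own = queues[me].lock().unwrap().pop_back();
                // Out of work: steal from the front of another worker's deque.
                let job = own.or_else(|| {
                    (0..queues.len())
                        .filter(|&other| other != me)
                        .find_map(|other| queues[other].lock().unwrap().pop_front())
                });
                match job {
                    Some(n) => {
                        total.fetch_add(n, Ordering::Relaxed);
                    }
                    // No jobs are added after start, so empty everywhere means done.
                    None => break,
                }
            });
        }
    });

    total.into_inner()
}

fn main() {
    // Uneven load: worker 0 starts with every job, the other two start empty.
    let queues: Vec<VecDeque<u64>> =
        vec![(1..=100).collect(), VecDeque::new(), VecDeque::new()];
    println!("total = {}", run_with_stealing(queues)); // total = 5050
}
```

The owner and the thieves work on opposite ends of each deque, which is the property that keeps them out of each other's way in the real scheduler.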

Trust the work-stealing scheduler. It usually beats manual chunking, especially when the cost per item varies.

The Send trait wall

Parallel iterators require your data to be thread-safe. The Send trait marks types that can be transferred across thread boundaries, and Sync marks types that can be shared by reference between threads. If you try to parallelize a vector of Rc<T>, the compiler rejects you with E0277 (trait bound not satisfied). Rc is not thread-safe. It updates its reference count with plain, non-atomic operations, which is only sound on a single thread.

You must use Arc<T> instead. Arc stands for Atomic Reference Counted. It uses atomic operations to manage the reference count safely across threads.

use rayon::prelude::*;
use std::sync::Arc;

fn main() {
    // Arc<String> is Send and Sync, so it works with par_iter.
    let data: Vec<Arc<String>> = (0..100)
        .map(|i| Arc::new(format!("item-{}", i)))
        .collect();

    // This compiles because par_iter shares &Arc<String> with worker threads,
    // which requires Arc<String> to be Sync.
    let count: usize = data.par_iter().map(|s| s.len()).sum();
    println!("Total length: {}", count);
}

The compiler error E0277 is a feature. It prevents data races before they happen. When you hit it, swap Rc for Arc or rethink the design.

Check your types for Send and Sync. If your struct holds an Rc, par_iter will reject it.
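One way to run that check before any parallel code exists is a pair of empty generic functions whose only job is to carry a trait bound. The sketch below uses only the standard library. The names assert_send, assert_sync, Job, and len_on_other_thread are hypothetical, chosen for illustration.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Compiles only if T can be moved to another thread.
fn assert_send<T: Send>() {}
// Compiles only if &T can be shared between threads.
fn assert_sync<T: Sync>() {}

// Hypothetical struct that holds shared data behind an Arc.
#[allow(dead_code)]
struct Job {
    id: u32,
    payload: Arc<String>,
}

// Moves an Arc into a new thread and reads through it there.
fn len_on_other_thread(s: Arc<String>) -> usize {
    thread::spawn(move || s.len()).join().unwrap()
}

fn main() {
    // These calls do nothing at runtime. They exist to fail the build
    // if Job ever stops being thread-safe.
    assert_send::<Job>();
    assert_sync::<Job>();

    // Uncommenting the next line fails to compile with E0277,
    // because Rc<String> is not Send.
    // assert_send::<Rc<String>>();

    // Rc remains fine as long as it never leaves the current thread.
    let local = Rc::new(String::from("single-threaded"));
    println!("{} / {}", local.len(), len_on_other_thread(Arc::new("item-7".to_string())));
}
```

Keeping such assertions next to the type definition means the error shows up where the Rc was introduced, not deep inside an iterator chain.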

Real-world processing

Data parallelism shines when you process collections. You can chain operations just like sequential iterators. Rayon keeps the chain parallel as long as possible.

use rayon::prelude::*;

struct Task {
    id: u32,
    payload: String,
}

// Simulate a CPU-heavy operation.
fn process(task: &Task) -> u64 {
    // In real code, this might be image resizing, crypto, or parsing.
    // The function must be pure or handle synchronization internally.
    task.payload.len() as u64 * 42
}

fn main() {
    let tasks: Vec<Task> = (0..1000)
        .map(|i| Task { id: i, payload: format!("data-{}", i) })
        .collect();

    // Map each task to a result in parallel.
    // collect() preserves the original order of elements.
    let results: Vec<u64> = tasks.par_iter().map(|t| process(t)).collect();

    assert_eq!(results.len(), tasks.len());
}

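The collect() call at the end waits for all worker threads and assembles the results in the original order. For an indexed iterator such as par_iter().map(), Rayon knows the output length up front and can write each result straight into its final slot, so preserving order costs little. Rayon has no unordered variant of collect(). If you only need an aggregate and order does not matter, skip the intermediate Vec and reduce directly.

// sum() combines per-thread partial results without building a Vec.
// Use a reduction like this when the result is a total, a count, or a maximum.
let total: u64 = tasks.par_iter().map(|t| process(t)).sum();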

Convention aside: keep the parallel chain long. Switching back to iter() forces a synchronization point. If you can chain par_iter().map().filter().collect(), do it. Minimizing synchronization maximizes throughput.

Async tasks and Tokio

You will also hear tokio mentioned alongside thread pools. Tokio is an async runtime. It has a thread pool, but you do not interact with it directly. You spawn tasks. Tokio is for I/O-bound work, not CPU-bound data processing.

If your tasks involve waiting for a database, an HTTP request, or a file read, use Tokio. Tokio's thread pool is optimized for many concurrent tasks that spend most of their time waiting. It multiplexes thousands of tasks onto a small number of threads.

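// Assumes tokio is in Cargo.toml with the "macros", "rt-multi-thread",
// and "time" features enabled.
use tokio::task;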

#[tokio::main]
async fn main() {
    // Spawn a task onto the runtime's thread pool.
    let handle = task::spawn(async {
        // Simulate I/O wait.
        tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        42
    });

    // Await the result.
    let result = handle.await.unwrap();
    println!("Result: {}", result);
}

Rayon has a small, fixed set of worker threads, and they are meant for computation. A worker parked on a network call cannot pick up other work, so a few slow requests can stall the whole pool. Tokio threads are meant for multiplexing I/O.

Don't use Rayon for waiting on network calls. You'll block the worker threads and kill throughput.

Tuning Rayon

Rayon defaults to the number of logical cores. This is usually correct. You can override this with the RAYON_NUM_THREADS environment variable.

In production, you usually want the default. If you are running Rayon inside a container with CPU limits, check that the detected thread count matches the limit, and set the variable if it does not. Otherwise, Rayon may oversubscribe the cores and waste time on context switches.

use rayon::current_num_threads;

fn main() {
    // Check how many threads Rayon is using.
    println!("Rayon threads: {}", current_num_threads());
}

Convention aside: never hardcode thread counts in your code. Use the environment variable or the default. Hardcoding counts breaks portability and ignores the runtime environment.
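To see what the default would be on a given machine or container, you can ask the standard library directly. The sketch below is an approximation of the selection rule described above, not Rayon's code: resolve_thread_count is a hypothetical helper, and treating an unset, zero, or unparsable variable as "use the detected count" is an assumption you should verify against the Rayon documentation for your version.

```rust
use std::env;
use std::thread;

// Hypothetical helper: prefer a positive integer from the environment,
// otherwise fall back to the detected parallelism.
fn resolve_thread_count(env_value: Option<&str>, detected: usize) -> usize {
    match env_value.and_then(|v| v.parse::<usize>().ok()) {
        Some(n) if n > 0 => n,
        _ => detected.max(1),
    }
}

fn main() {
    // available_parallelism reports how many threads the process can
    // usefully run. It returns an error on platforms where this is unknown.
    let detected = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    let env_value = env::var("RAYON_NUM_THREADS").ok();
    let threads = resolve_thread_count(env_value.as_deref(), detected);

    println!("detected: {detected}, would use: {threads}");
}
```

Running this inside the container is a quick way to compare the detected count with the CPU limit you configured before deciding whether the environment variable is needed.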

Decision: choosing the right tool

Rust gives you choices. Pick the tool that matches your workload.

Use rayon when you have a collection of data and want to apply a CPU-intensive operation to every element. Use rayon when you can express your work as map, filter, or reduce operations over iterators. Use tokio when your tasks involve waiting for I/O, such as database queries, HTTP requests, or file reads. Use std::thread with a channel when you need a long-running worker loop with custom backpressure or priority logic that a data-parallel iterator cannot express.

Build a custom pool only when the abstraction leak hurts. Otherwise, use the crate.
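For the std::thread case, a minimal sketch of a worker loop with backpressure might look like the following. It uses only the standard library. The name run_workers, the squaring "work", and the parameters are hypothetical, and the sketch assumes at least one worker. The bounded sync_channel is what provides backpressure: the producer blocks once capacity jobs are waiting.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical worker pool: squares each job and returns the sum of the results.
// Assumes workers >= 1. With zero workers the producer would block forever.
fn run_workers(jobs: Vec<u64>, workers: usize, capacity: usize) -> u64 {
    // Bounded queue: send() blocks while `capacity` jobs are already waiting.
    let (job_tx, job_rx) = sync_channel::<u64>(capacity);
    // A Receiver cannot be cloned, so the workers share it behind a Mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));
    // Unbounded result queue, so workers never block while reporting.
    let (result_tx, result_rx) = channel::<u64>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(n) => result_tx.send(n * n).unwrap(),
                    // The sender was dropped: no more jobs will arrive.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the result stream ends when the workers exit.
    drop(result_tx);

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    // Closing the job channel tells the workers to shut down.
    drop(job_tx);

    let total: u64 = result_rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}

fn main() {
    println!("total = {}", run_workers(vec![1, 2, 3, 4], 2, 1)); // total = 30
}
```

Priority logic or custom scheduling would live in the worker loop, which is exactly the kind of control a data-parallel iterator does not expose.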

Where to go next

A thread pool is a group of worker threads ready to run tasks, which avoids the overhead of creating a new thread for every small job. Think of it like a restaurant kitchen with a fixed number of chefs; instead of hiring a new chef for every order, existing chefs grab the next ticket from the queue. You use this when you need to process large amounts of data quickly without exhausting memory by spawning a thread per job.