The first time you reach for parallelism
You wrote a function that processes a giant Vec<Sample>. Maybe it parses log lines, maybe it transforms images, maybe it computes hashes. On a single thread it takes 8 seconds. You glance at your CPU monitor and notice 11 of your 12 cores napping. There has to be a better way.
In most languages "better" means thread pools, work queues, and a chapter on data races you skim guiltily. In Rust, you add one dependency, change one method call, and the work spreads across every core you own. That dependency is Rayon.
What Rayon actually is
Rayon is a data-parallelism library. The pitch: take a sequential iterator, change .iter() to .par_iter(), and let Rayon figure out how to split the work across threads. It uses a technique called work-stealing under the hood. Imagine a small office where every worker has their own to-do list. When a worker finishes their list, they walk over to a busy colleague and grab a task off the bottom of theirs. No central manager, very little coordination overhead, and the load balances itself.
The magical part for a Rust beginner: you don't have to think about threads, locks, channels, or mutexes for the common case. Rayon's parallel iterators are wired into the same trait machinery as normal iterators (map, filter, fold, collect), so the code reads almost identically.
Adding the dependency
First, add Rayon to Cargo.toml:
[dependencies]
# Rayon's API is stable. "1" pins to the 1.x line.
rayon = "1"
Then run cargo build once to fetch it. That's the entire setup.
A minimal example
Here's the smallest thing worth running:
// Bringing in the prelude unlocks `.par_iter()` on standard collections
// like Vec, slices, and HashMap.
use rayon::prelude::*;

fn main() {
    let nums: Vec<i32> = (1..=10_000).collect();

    // par_iter splits the slice into chunks, runs the closure on every element
    // of each chunk across multiple threads, and collects the results back in order.
    let squared: Vec<i32> = nums.par_iter().map(|n| n * n).collect();

    println!("first three: {:?}", &squared[..3]);
    println!("total items: {}", squared.len());
}
For 10,000 trivial multiplications you won't see a speedup. The work per item is too small compared to the cost of dispatching it to other threads. Rayon really shines when each item takes meaningful CPU time. Let's bump the workload.
Walking through what happens
When par_iter() returns, you get a ParallelIterator. It hasn't done any work yet. It's a recipe. The actual computation kicks off when you call a consumer like collect, sum, count, for_each, or reduce.
At that point Rayon hands the recipe to its global thread pool, which by default has one worker thread per logical CPU core. The input slice gets split roughly in half, then each half gets split again, and so on, until the chunks are small enough that splitting further would cost more than just doing the work. Each thread processes its chunk and the results are reassembled. If one thread finishes early, it steals pending work from a busier thread's queue.
You never wrote a thread::spawn. You never picked a chunk size. You never wrote a join. Rayon decided.
A more realistic example: hashing files
Here's something closer to a real workload. Imagine you have a directory of files and you want a SHA-256 hash of each one. That's CPU-bound (well, a mix of disk and CPU), and the per-item cost is high enough that parallelism actually helps.
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::fs;
use std::path::PathBuf;

// Read a file fully into memory and return (path, hex-digest) on success.
fn hash_file(path: PathBuf) -> Option<(PathBuf, String)> {
    // fs::read returns Err if the path is unreadable; we skip those rather than panic.
    let bytes = fs::read(&path).ok()?;
    let mut hasher = Sha256::new();
    hasher.update(&bytes);
    Some((path, format!("{:x}", hasher.finalize())))
}

fn main() -> std::io::Result<()> {
    // Collect the directory entries into a Vec so we can iterate in parallel.
    // read_dir yields a plain sequential iterator; a Vec gives Rayon an indexed
    // collection with a known length that it can split evenly.
    let paths: Vec<PathBuf> = fs::read_dir(".")?
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .filter(|p| p.is_file())
        .collect();

    // .into_par_iter() consumes the Vec and yields owned PathBufs to each worker.
    // filter_map drops None results (files we couldn't read) on the way through.
    let hashes: Vec<(PathBuf, String)> = paths
        .into_par_iter()
        .filter_map(hash_file)
        .collect();

    for (path, digest) in &hashes {
        println!("{} {}", digest, path.display());
    }
    Ok(())
}
Cargo.toml for the above would also list sha2 = "0.10". The interesting line is into_par_iter(). There are three flavors to know:
par_iter() borrows the collection, yielding &T to each worker. Use this when the input lives on after the parallel block.
par_iter_mut() borrows mutably, yielding &mut T. Useful for in-place updates.
into_par_iter() consumes the collection, yielding owned T. Use this when each worker needs ownership, like sending each item to a function that takes T by value.
When par_iter won't compile
Because Rayon spreads work across threads, the closure body has to play nicely with multi-threading. Two errors come up early.
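First, the closure must be safe to share across threads: Rayon requires it to implement both Send and Sync, and that depends on everything it captures. If you capture a value that isn't thread-safe, like an Rc<T>, the compiler refuses: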
error[E0277]: `Rc<...>` cannot be sent between threads safely
--> src/main.rs:14:9
|
14 | .par_iter()
| ^^^^^^^^ `Rc<...>` cannot be sent between threads safely
Fix: switch to Arc<T> (atomically reference-counted), which is Send and Sync as long as T itself is.
Second, if you try to mutate shared state from inside the closure, you'll hit a borrow-checker error or a Sync error. The Rayon-flavored solution is to avoid shared mutation: use map to produce values, then collect, reduce, or sum them. If you genuinely need shared state, wrap it in Mutex<T> or use a thread-safe accumulator like AtomicU64.
Common pitfalls
You parallelized something tiny and it got slower. The dispatch cost ate the gains. Rule of thumb: per-item work should be at least a few microseconds before parallelism wins. Profile before assuming.
You ran out of file handles or memory. Rayon will happily kick off work on every core, so if your task opens files or allocates buffers, expect peak resource use to multiply by core count.
You expected ordering. par_iter().map().collect() preserves order in the output Vec, but for_each does not run in order. If order matters for side effects, collect the results first, then iterate over them sequentially.
You forgot use rayon::prelude::*. Without the prelude in scope, .par_iter() doesn't exist on your types. The compiler will report that no method named par_iter was found.
When to reach for Rayon vs alternatives
Use Rayon when your work is CPU-bound and naturally data-parallel: transforming a collection, computing aggregates, processing files, running per-row pipelines. It's the easiest way in Rust to spread compute across cores.
Reach for tokio instead when your work is I/O-bound: HTTP requests, database queries, network servers. Tokio's async tasks are about waiting efficiently; Rayon's threads are about computing efficiently. They solve different problems.
For anything in between, you can mix them. A common pattern is a Tokio app that occasionally hands a CPU-heavy chunk off to Rayon via rayon::spawn or tokio::task::spawn_blocking.
If you have very few items but each takes a long time, raw threads (std::thread::spawn) might be simpler than pulling in Rayon. The crate earns its keep when you're slicing up a large collection.
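For comparison, the raw-threads version of "a handful of long jobs" might look like the sketch below. It uses only the standard library; slow_sum is a hypothetical placeholder for whatever takes a long time in your program.

```rust
use std::thread;

// Hypothetical long-running job: sum 1..=n the slow way.
fn slow_sum(n: u64) -> u64 {
    (1..=n).fold(0, |acc, x| acc + x)
}

// Spawn one OS thread per job, then join them in the order they were started.
fn run_jobs(inputs: Vec<u64>) -> Vec<u64> {
    let handles: Vec<thread::JoinHandle<u64>> = inputs
        .into_iter()
        .map(|n| thread::spawn(move || slow_sum(n)))
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let results = run_jobs(vec![10, 100, 1_000]);
    println!("{:?}", results);
}
```

One thread per job is fine for three or four jobs. With thousands of items it stops being fine, and that is where Rayon's fixed-size pool takes over.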
Where to go next
Parallelism is a rabbit hole, and Rayon is the gentlest entrance.