When a mutex is too slow
You are building a concurrent web scraper. Ten threads are racing to fetch URLs. Every time one finishes, it needs to bump a global success counter. You slap a Mutex on it. The profiler screams. The threads spend more time waiting for the lock than fetching data. The bottleneck isn't the network; it's the mutex. You need a way to update shared state without making threads queue up.
That is where lock-free programming comes in. Lock-free doesn't mean "no synchronization." It means "no blocking." Instead of grabbing a key and waiting for the previous person to finish, you use atomic operations that the CPU guarantees happen indivisibly. You can update shared data across threads without any thread ever getting stuck waiting for another.
The lock-free idea
The core primitive is Compare-and-Swap. Think of it like updating a shared whiteboard. You read the current number, calculate what it should be, and ask the board: "If the number is still what I read, update it to my new value." If someone else changed it while you were calculating, the board says "No, the number changed. Read again and try." You loop until you win.
This is optimistic concurrency. You assume no one else will touch the value. If you are wrong, you retry. If contention is low, you succeed on the first try with almost zero overhead. If contention is high, you spin, but you never block the scheduler.
Rust provides these operations in std::sync::atomic. The types are AtomicUsize, AtomicBool, AtomicPtr, and others. They are Send and Sync, so you can share them across threads safely. The compiler enforces this. If you try to share a plain usize, you get E0277 (trait bound not satisfied). The language forces you to pick the right tool.
Minimal lock-free counter
Here is a counter that increments without a mutex. It uses AtomicUsize and a compare-and-swap loop.
use std::sync::atomic::{AtomicUsize, Ordering};
/// A lock-free counter that increments without blocking other threads.
struct LockFreeCounter {
/// The atomic value holding the current count.
value: AtomicUsize,
}
impl LockFreeCounter {
/// Creates a new counter initialized to zero.
fn new() -> Self {
LockFreeCounter {
// AtomicUsize starts at zero, but explicit initialization is clear.
value: AtomicUsize::new(0),
}
}
/// Increments the counter using a compare-and-swap loop.
fn increment(&self) {
// Load the current value. Relaxed is sufficient for the initial read.
let mut current = self.value.load(Ordering::Relaxed);
loop {
// Calculate the desired next value based on what we read.
let next = current + 1;
// Try to update the value only if it hasn't changed since we read it.
// compare_exchange_weak is the convention for loops; it may fail spuriously.
match self.value.compare_exchange_weak(
current,
next,
Ordering::SeqCst,
Ordering::Relaxed,
) {
// Success: the value was updated. We are done.
Ok(_) => return,
// Failure: someone else changed the value.
// The error contains the actual current value.
Err(actual) => current = actual,
}
}
}
}
The increment method reads the value, computes current + 1, and attempts to swap. If another thread modified the counter between the load and the swap, compare_exchange_weak returns Err. The error payload gives you the real value, so you update current and try again. The loop runs until the swap succeeds.
How compare-and-swap works
The compare_exchange_weak function takes four arguments: the expected value, the new value, the success ordering, and the failure ordering. If the memory holds the expected value, it writes the new value and returns Ok. If the memory holds something else, it returns Err with the actual value.
Why weak? On some architectures, the weak variant is significantly cheaper than the strong variant. The weak version might fail even if the value matches, a behavior called spurious failure. In a loop, a spurious failure just means one extra iteration. It does not break correctness. The community convention is to use compare_exchange_weak inside loops and compare_exchange only when you cannot retry.
The orderings control memory visibility. Ordering::SeqCst is sequential consistency. It acts like a mutex in terms of ordering: all threads see operations in the same order. It is the safest choice and often the fastest on x86. Ordering::Relaxed provides no ordering guarantees. It only ensures the atomic operation itself is indivisible.
In the counter, we use SeqCst for success and Relaxed for failure. The success path needs to be visible to other threads in order. The failure path just needs to retrieve the correct value to retry. This is a common pattern. You pay for ordering only when the update succeeds.
Realistic bounded counter
Simple increments can often use fetch_add, which is a single-instruction atomic operation. The loop is necessary when the new value depends on the current value. For example, a counter that stops at a limit.
use std::sync::atomic::{AtomicUsize, Ordering};
/// A counter that caps at a maximum value without locks.
struct BoundedCounter {
/// The atomic count.
count: AtomicUsize,
/// The hard limit.
limit: usize,
}
impl BoundedCounter {
/// Creates a new bounded counter.
fn new(limit: usize) -> Self {
Self {
count: AtomicUsize::new(0),
limit,
}
}
/// Increments only if the count is below the limit.
/// This requires compare_exchange because the update depends on the current value.
fn try_increment(&self) -> bool {
let mut current = self.count.load(Ordering::Relaxed);
loop {
// If we hit the limit, stop incrementing.
if current >= self.limit {
return false;
}
let next = current + 1;
// Attempt to swap. Use SeqCst for success to ensure visibility.
match self.count.compare_exchange_weak(
current,
next,
Ordering::SeqCst,
Ordering::Relaxed,
) {
Ok(_) => return true,
Err(actual) => current = actual,
}
}
}
/// Returns the current count.
fn get(&self) -> usize {
self.count.load(Ordering::SeqCst)
}
}
The try_increment method checks the limit before attempting the swap. If the limit is reached, it returns false immediately. Otherwise, it tries to increment. The dependency on the current value forces the loop. You cannot use fetch_add here because fetch_add would overshoot the limit.
Pitfalls and errors
Atomics are fast, but they do not fix logic errors. The compiler helps with types, but it cannot verify your memory model reasoning.
If you try to share a non-atomic type across threads, the compiler rejects you with E0277. usize is not Sync. You must wrap it in AtomicUsize or a Mutex. This error saves you from data races.
A subtle trap is the ABA problem. Imagine you read value A. Another thread changes it to B, then back to A. You compare and see A, so you swap. The swap succeeds, but the context changed. This breaks lock-free linked lists and queues. Rust's standard library does not expose raw ABA-prone primitives. If you build complex structures, you need hazard pointers or epoch-based reclamation. That requires unsafe and deep knowledge. Stick to simple counters and flags unless you have a specific need.
Another pitfall is misusing orderings. Using Relaxed everywhere is tempting, but it can lead to stale reads. If thread A writes X and then sets a flag, thread B might see the flag before seeing X if you use Relaxed. Use SeqCst by default. Optimize to Acquire/Release only after profiling proves SeqCst is the bottleneck.
Convention aside: fetch_add returns the previous value. If you need the new value, add one to the result. This trips up many beginners. The return value is the old state, not the new state.
Decision matrix
Use a Mutex when the update logic is complex, involves multiple fields, or holds references. The overhead is worth the simplicity and the borrow checker guarantees.
Use fetch_add or fetch_or when you need a simple arithmetic or bitwise update that does not depend on the current value. These are single-instruction atomics and faster than loops.
Use a compare_exchange loop when the new value depends on the old value, like a bounded counter or a state machine transition.
Use Arc<T> to share ownership of the atomic structure across threads. Atomics handle the data; Arc handles the lifetime.
Use UnsafeCell and raw pointers only when building a custom lock-free queue or stack. This requires deep knowledge of memory models and the ABA problem.
Lock-free means no blocking, not no thinking. Reach for fetch_add first. The loop is the escape hatch, not the default. Trust the borrow checker for safety, but verify your Orderings for correctness.