When duplicates clog the pipeline
You're building a log analyzer. The backend streams a vector of error codes: vec![404, 500, 404, 503, 500, 404]. You need to generate a report of unique errors. Printing the raw vector is useless because the same error repeats dozens of times. You need to clean the list.
Rust's standard library doesn't provide a single .unique() method on Vec. This is a deliberate design choice. Removing duplicates involves trade-offs that depend on your data. Do you care about the original order? Do you have enough memory to build a lookup table? Is the data already sorted? The standard library forces you to pick the strategy that matches your constraints instead of hiding the cost behind a magic method.
Two paths to uniqueness
Removing duplicates boils down to two fundamental approaches. You can sort the data first, which groups identical values together so you can sweep them away in a single pass. Or you can walk through the data once, remembering what you've seen, and only keep the first occurrence of each item.
Sorting is fast and uses almost no extra memory. It modifies the vector in place. The downside is that it destroys the original order. If the sequence of events matters, sorting breaks that information.
Tracking what you've seen preserves the order. You iterate through the vector, check a set of seen items, and build a new vector with only the first occurrences. This keeps the original sequence intact. The downside is memory. You need to store a copy of every unique element in a hash set, which doubles the memory usage for large datasets.
The sort-and-sweep approach
When order doesn't matter, sorting is the most efficient path. You sort the vector to bring duplicates next to each other, then call dedup to remove consecutive duplicates.
fn main() {
// Start with a messy list of error codes
let mut codes = vec![404, 500, 404, 503, 500, 404];
// Sort brings identical values together
// sort_unstable is faster than sort because it skips stability checks
codes.sort_unstable();
// dedup removes consecutive duplicates in place
// It returns the new length, but modifies the vector directly
codes.dedup();
println!("{:?}", codes); // [404, 500, 503]
}
The dedup method only removes consecutive duplicates. If you call dedup on [1, 2, 1], nothing happens because the 1s are not neighbors. Sorting is the prerequisite that makes dedup effective for general deduplication.
Under the hood, dedup uses a two-pointer algorithm. It maintains a read pointer that scans the vector and a write pointer that tracks where the next unique element should go. When the read pointer finds an element different from the previous one, it copies that element to the write pointer and advances the write pointer. This avoids shifting elements unnecessarily and runs in linear time after the sort.
Convention aside: always use sort_unstable before dedup. Stable sorting preserves the relative order of equal elements. dedup throws away all but one equal element, so that stability is wasted work. sort_unstable skips the stability checks and runs faster. The community treats sort_unstable().dedup() as the idiomatic pattern for in-place deduplication.
Use sort_unstable before dedup. You get the speed of an unstable sort without paying for stability you don't need.
The HashSet approach
When you must preserve the original order, a HashSet is the tool. You iterate through the vector, insert each item into the set, and only keep items that the set hasn't seen before.
use std::collections::HashSet;
fn main() {
let codes = vec![404, 500, 404, 503, 500, 404];
// HashSet tracks seen items for O(1) lookups
let mut seen = HashSet::new();
let mut unique = Vec::new();
for code in codes {
// insert returns true only if the value was new
// This checks and adds in a single operation
if seen.insert(code) {
unique.push(code);
}
}
println!("{:?}", unique); // [404, 500, 503]
}
The key here is that HashSet::insert returns a boolean. It returns true if the value was inserted because it wasn't present. It returns false if the value was already in the set. This lets you check for duplicates and add the item in one step.
Convention aside: write if seen.insert(item) instead of checking contains first. Calling contains performs a hash lookup. Calling insert performs another hash lookup. Doing both wastes work. The insert pattern is idiomatic and faster because it does the lookup once.
Hash sets require the elements to implement Hash and Eq. The Hash trait provides the hash function that maps values to buckets. The Eq trait resolves collisions. Different values can map to the same bucket, and Eq checks actual equality to distinguish a collision from a true duplicate. If your type doesn't implement these traits, the compiler will reject the code.
Trust insert to do the work. It checks and adds in one operation. Calling contains first wastes a lookup.
Real-world structs and custom logic
Real code rarely deduplicates plain integers. You usually have structs, and you might want to deduplicate based on a specific field rather than the whole struct.
If you need uniqueness based on the entire struct, derive Hash and Eq.
use std::collections::HashSet;
#[derive(Debug, Clone, Hash, Eq, PartialEq)]
struct Event {
id: u32,
name: String,
}
fn main() {
let events = vec![
Event { id: 1, name: "Login".into() },
Event { id: 2, name: "Click".into() },
Event { id: 1, name: "Login".into() }, // Duplicate
];
let mut seen = HashSet::new();
let mut unique_events = Vec::new();
for event in events {
if seen.insert(event.clone()) {
unique_events.push(event);
}
}
println!("{:?}", unique_events);
}
If you only care about uniqueness by ID, dedup_by is often better. It lets you define a custom equality predicate. You still need to sort first, but you can sort by the key you care about.
fn main() {
let mut events = vec![
Event { id: 2, name: "Click".into() },
Event { id: 1, name: "Login".into() },
Event { id: 1, name: "LoginRetry".into() }, // Same ID, different name
];
// Sort by the field that defines uniqueness
events.sort_by_key(|e| e.id);
// dedup_by removes duplicates based on the closure
// It keeps the first element of each group
events.dedup_by(|a, b| a.id == b.id);
println!("{:?}", events);
// Output: [Event { id: 1, name: "Login" }, Event { id: 2, name: "Click" }]
}
dedup_by compares adjacent elements using your closure. If the closure returns true, the second element is considered a duplicate and gets removed. This is useful when two objects are "the same" for your purposes even if they aren't byte-for-byte identical.
Derive the traits. Don't implement Hash and Eq by hand unless you have a specific reason. The derive macros generate correct implementations that stay in sync with your struct fields.
Pitfalls and compiler errors
The most common mistake with dedup is trying to assign its result back to the vector. dedup modifies the vector in place and returns the new length as a usize.
let mut numbers = vec![1, 2, 2, 3];
// This fails: dedup returns usize, not Vec
// let numbers = numbers.dedup();
If you try this assignment, the compiler rejects you with E0308 (mismatched types). The error message points out that you expected a Vec but found a usize. Just call dedup() as a statement. The vector is already updated.
Another trap is forgetting that dedup only removes consecutive duplicates. If you skip the sort step, dedup does nothing for scattered duplicates.
let mut numbers = vec![1, 2, 1, 3];
numbers.dedup(); // No change: [1, 2, 1, 3]
dedup is not magic. It only removes neighbors. If duplicates aren't neighbors, dedup does nothing.
When using HashSet, you'll hit E0277 (trait bound not satisfied) if your type doesn't implement Hash and Eq. This happens often with custom structs. The compiler tells you exactly which trait is missing. Add #[derive(Hash, Eq, PartialEq)] to the struct definition. Note that Eq requires PartialEq, so derive both.
Read the return type. dedup modifies in place and returns the count. Assigning the result back to the vector is a type error.
Decision matrix
Use sort_unstable and dedup when order doesn't matter and you want the fastest in-place solution with minimal memory overhead.
Use HashSet with insert when you must preserve the original order of first occurrences and have O(n) memory available.
Use dedup_by when uniqueness depends on a specific field or custom logic rather than full equality, and the data can be sorted by that field.
Use BTreeSet when you need sorted unique output and the data is already partially ordered, or you want to avoid the O(n log n) sort step by inserting into a tree structure directly.
Match the tool to the constraint. Memory constraints favor sorting. Order constraints favor hashing.