How to Use ICU Collation in Rust

Use the `icu_collator` crate to sort strings according to Unicode Collation Algorithm rules. Add the dependency to your `Cargo.toml` and initialize a collator with your target locale.

When ASCII sort breaks your users

You build a contact list. You sort the names. Your German user complains that "Müller" appears after "Myers" instead of next to "Muller". Your French user points out that "Zürich" is at the very end of the list, far from "Zurich". Your application works fine on your machine, but the sort order feels wrong to anyone who speaks a language with accents, special characters, or complex alphabet rules.

The problem is that you sorted by bytes. Rust's default String::cmp compares UTF-8 bytes. That gives you a deterministic, fast order. It also gives you garbage for human-readable text in most languages. Text sorting requires rules. Those rules depend on the user's locale. The Unicode Collation Algorithm (UCA) defines how characters compare across scripts and languages. The icu_collator crate brings those rules to Rust.
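
To see the byte-order problem without any extra crates, here is a minimal sketch using only the standard library. The names and the helper function byte_sorted are illustrative, not part of any API.

```rust
/// Sorts names with the default ordering for &str, which compares
/// the UTF-8 bytes lexicographically.
fn byte_sorted(mut names: Vec<&str>) -> Vec<&str> {
    names.sort();
    names
}

fn main() {
    let names = byte_sorted(vec!["Müller", "Myers", "Muller", "Mack"]);

    // 'ü' is encoded as the bytes 0xC3 0xBC, both larger than any ASCII
    // letter, so "Müller" lands after "Myers" instead of beside "Muller".
    println!("{:?}", names);
}
```

Every name starting with "Mü" ends up after every name starting with "My" or "Mz", which is exactly the behavior users notice.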

Collation rules live in the ICU

Collation is the process of defining rules for how strings compare. A collator is an object that knows those rules for a specific locale. When you ask a collator to compare "apple" and "Banana", it doesn't just look at the bytes. It checks the locale data. Letter identity comes first, so "apple" sorts before "Banana" despite the capital "B". Accent differences are considered next, and case differences after that. Some characters expand for sorting purposes: German "ß" sorts like "ss".

The icu_collator crate is part of the ICU4X project. It provides a safe, Rust-native interface to the ICU data. The crate ships with data files that contain the collation rules for hundreds of locales. You don't need to install system libraries. The rules are embedded in your binary or loaded from a data directory.

Convention aside: The Rust community prefers icu_collator over the older unicode-collation crate. The newer crate aligns with the ICU4X architecture, receives active maintenance, and integrates better with the rest of the icu_* ecosystem. If you start a new project, use icu_collator.

Minimal example

Add the dependencies to your Cargo.toml. You need icu_collator for the logic and icu_locid for locale handling.

[dependencies]
icu_collator = "1.5"
icu_locid = "1.5"

Create a collator for a locale and use it to sort a vector.

use icu_collator::{Collator, CollatorError, CollatorOptions};
use icu_locid::Locale;

fn main() -> Result<(), CollatorError> {
    // Parse the locale string. This validates the BCP 47 format.
    // It returns an error if the string is malformed.
    let locale: Locale = "de-DE".parse().expect("de-DE is a valid locale");

    // Create a collator for German rules with default options.
    // This loads the collation data for the locale.
    let collator = Collator::try_new(&locale.into(), CollatorOptions::new())?;

    let mut cities = vec!["Zürich", "Berlin", "München", "Aachen"];

    // Collator has no sort method of its own.
    // Pass its compare method to the standard sort_by.
    cities.sort_by(|a, b| collator.compare(a, b));

    println!("{:?}", cities);
    Ok(())
}

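The output respects German sorting rules: "München" sorts under "M" and "Zürich" under "Z", with "ü" treated as a variant of "u". This short list happens to come out the same under byte order. The difference shows once the list also contains "Zurich" and "Zz": the collator keeps "Zürich" right after "Zurich", while byte order pushes it past "Zz". The collator handles the nuances that byte comparison misses.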

How the collator works

The Collator struct holds the collation data for a locale. Creating a collator is the expensive step. It resolves the locale, loads the data, and sets up internal tables. Once created, the collator is fast. You can reuse it for thousands of comparisons.

Sorting goes through the standard library. Pass the collator's compare method to sort_by on a mutable slice, which sorts it in place. The compare method takes two &str values and returns an Ordering (Less, Equal, or Greater). You can also call compare directly if you need custom sorting logic.

use icu_collator::{Collator, CollatorError, CollatorOptions};
use icu_locid::Locale;
use std::cmp::Ordering;

fn main() -> Result<(), CollatorError> {
    let locale: Locale = "en-US".parse().expect("en-US is a valid locale");
    let collator = Collator::try_new(&locale.into(), CollatorOptions::new())?;

    let a = "apple";
    let b = "Banana";

    // compare returns an Ordering.
    // Letter identity outranks case, so "apple" comes before "Banana".
    let result = collator.compare(a, b);

    match result {
        Ordering::Less => println!("{} comes before {}", a, b),
        Ordering::Greater => println!("{} comes after {}", a, b),
        Ordering::Equal => println!("{} equals {}", a, b),
    }

    Ok(())
}

Don't create a new collator for every comparison. Cache the collator. Construction loads data. Reuse the object.
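
If your application serves several locales, one way to follow this advice is a small per-locale cache. The sketch below is an assumption about how you might structure it, not something the crate provides. PerLocaleCache and get_or_build are hypothetical names, and the cached type is a generic parameter so the example runs with the standard library alone. In real code, C would be the collator type and the build closure would construct it.

```rust
use std::collections::HashMap;

/// Caches one expensive-to-build value per locale tag.
/// `C` stands in for a collator type (hypothetical helper).
struct PerLocaleCache<C> {
    entries: HashMap<String, C>,
}

impl<C> PerLocaleCache<C> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached value for `tag`, calling `build` only on first use.
    fn get_or_build(&mut self, tag: &str, build: impl FnOnce() -> C) -> &C {
        self.entries.entry(tag.to_string()).or_insert_with(build)
    }
}

fn main() {
    // A String stands in for the collator so the sketch has no dependencies.
    let mut cache: PerLocaleCache<String> = PerLocaleCache::new();
    let mut builds = 0;

    for tag in ["de-DE", "fr-FR", "de-DE"] {
        cache.get_or_build(tag, || {
            builds += 1;
            format!("collator for {tag}")
        });
    }

    // "de-DE" was requested twice but built only once.
    println!("built {builds} values for 3 requests");
}
```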

Sorting structs and reusing the collator

Real code rarely sorts raw strings. You sort structs. Use sort_by with the collator's compare method. This lets you sort by a specific field while respecting locale rules.

use icu_collator::{Collator, CollatorError, CollatorOptions};
use icu_locid::Locale;

#[derive(Debug)]
struct Person {
    name: String,
    city: String,
}

fn sort_contacts(contacts: &mut [Person], locale: &Locale) -> Result<(), CollatorError> {
    // Create the collator once, outside the comparison closure.
    // Passing the locale reference is cheap; the collator holds the data.
    let collator = Collator::try_new(&locale.into(), CollatorOptions::new())?;

    // Sort by name using the collator.
    contacts.sort_by(|a, b| {
        // compare takes &str. The String fields coerce automatically.
        collator.compare(&a.name, &b.name)
    });

    Ok(())
}

fn main() -> Result<(), CollatorError> {
    let locale: Locale = "fr-FR".parse().expect("fr-FR is a valid locale");

    let mut people = vec![
        Person { name: "Émilie".to_string(), city: "Paris".to_string() },
        Person { name: "Alice".to_string(), city: "Lyon".to_string() },
        Person { name: "Zach".to_string(), city: "Marseille".to_string() },
    ];

    sort_contacts(&mut people, &locale)?;

    for p in &people {
        println!("{:?}", p);
    }

    Ok(())
}

If you need to sort by multiple fields, chain comparisons. Return the first non-equal result.

contacts.sort_by(|a, b| {
    // Compare by city first.
    let city_cmp = collator.compare(&a.city, &b.city);
    if city_cmp != std::cmp::Ordering::Equal {
        return city_cmp;
    }
    // If cities are equal, compare by name.
    collator.compare(&a.name, &b.name)
});
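
The same chaining can be written more compactly with Ordering::then_with from the standard library. The sketch below takes the string comparison as a parameter so it runs without external crates. The helper sort_by_city_then_name is a hypothetical name, and the byte-order closure in main is only a stand-in for a closure that calls the collator's compare.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, PartialEq)]
struct Person {
    name: String,
    city: String,
}

/// Sorts by city, then by name, using whatever string comparison
/// the caller supplies.
fn sort_by_city_then_name<F>(contacts: &mut [Person], cmp: F)
where
    F: Fn(&str, &str) -> Ordering,
{
    contacts.sort_by(|a, b| {
        // then_with runs the name comparison only when the cities tie.
        cmp(a.city.as_str(), b.city.as_str())
            .then_with(|| cmp(a.name.as_str(), b.name.as_str()))
    });
}

fn main() {
    let mut people = vec![
        Person { name: "Zach".to_string(), city: "Lyon".to_string() },
        Person { name: "Alice".to_string(), city: "Paris".to_string() },
        Person { name: "Alice".to_string(), city: "Lyon".to_string() },
    ];

    // Stand-in comparator (byte order) so the sketch has no dependencies.
    sort_by_city_then_name(&mut people, |a, b| a.cmp(b));

    for p in &people {
        println!("{} ({})", p.name, p.city);
    }
}
```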

If you try to call .sort() on a Vec<Person> without implementing Ord, the compiler rejects you with E0277 (trait bound not satisfied). The collator approach bypasses this by providing a custom comparison function.

Tuning strength and options

Collation has "strength" levels. These control how strictly the collator compares characters. The default strength depends on the locale. You can override it with CollatorOptions.

  • Primary: Ignores case and accents. "Apple", "apple", and "Äpple" are equal. Useful for search indexing or fuzzy matching.
  • Secondary: Considers accents but ignores case. "Apple" equals "apple", but differs from "Äpple".
  • Tertiary: Considers case and accents. This is the default for most display sorting.
  • Quaternary: Includes punctuation and other minor differences.
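
One way to picture the first three levels is as successive tie-breakers over the whole string: base letters first, accents only if the base letters tie, case only if the accents tie too. The sketch below is a toy model of that idea using the standard library. It is not how the collator is implemented, the weights function covers only a handful of hand-picked characters, and all names in it are hypothetical.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Primary,
    Secondary,
    Tertiary,
}

/// Toy decomposition of a character into (base letter, accent, case).
/// Only a few accented letters are handled; everything else has no accent.
fn weights(c: char) -> (char, u8, u8) {
    let case = if c.is_uppercase() { 1 } else { 0 };
    let lower = c.to_lowercase().next().unwrap_or(c);
    let (base, accent) = match lower {
        'ä' | 'á' | 'à' => ('a', 1),
        'é' | 'è' => ('e', 1),
        'ü' => ('u', 1),
        other => (other, 0),
    };
    (base, accent, case)
}

/// Compares two strings level by level, stopping at `level`.
fn toy_compare(a: &str, b: &str, level: Level) -> Ordering {
    let wa: Vec<(char, u8, u8)> = a.chars().map(weights).collect();
    let wb: Vec<(char, u8, u8)> = b.chars().map(weights).collect();

    // Level 1: base letters across the whole string.
    let primary = wa.iter().map(|w| w.0).cmp(wb.iter().map(|w| w.0));
    if primary != Ordering::Equal || level == Level::Primary {
        return primary;
    }
    // Level 2: accents, consulted only when the base letters tie.
    let secondary = wa.iter().map(|w| w.1).cmp(wb.iter().map(|w| w.1));
    if secondary != Ordering::Equal || level == Level::Secondary {
        return secondary;
    }
    // Level 3: case, consulted only when accents tie as well.
    wa.iter().map(|w| w.2).cmp(wb.iter().map(|w| w.2))
}

fn main() {
    for level in [Level::Primary, Level::Secondary, Level::Tertiary] {
        println!(
            "{:?}: cafe vs café = {:?}, apple vs Apple = {:?}",
            level,
            toy_compare("cafe", "café", level),
            toy_compare("apple", "Apple", level),
        );
    }
}
```

In this model "cafe" and "café" are equal at the primary level and differ from the secondary level on, while "apple" and "Apple" stay equal until the tertiary level.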

Use options when you need specific behavior. For example, a search feature might want primary strength to match "cafe" and "café".

use icu_collator::{Collator, CollatorError, CollatorOptions, Strength};
use icu_locid::Locale;

fn main() -> Result<(), CollatorError> {
    let locale: Locale = "en-US".parse().expect("en-US is a valid locale");

    // Configure the collator to ignore case and accents.
    // CollatorOptions is non-exhaustive, so start from new() and set fields.
    let mut options = CollatorOptions::new();
    options.strength = Some(Strength::Primary);

    let collator = Collator::try_new(&locale.into(), options)?;

    let mut words = vec!["Apple", "apple", "APPLE", "Äpple"];
    words.sort_by(|a, b| collator.compare(a, b));

    // With Primary strength, all four words compare equal.
    // sort_by is stable, so they keep their original relative order.
    println!("{:?}", words);
    Ok(())
}

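Convention aside: CollatorOptions is marked non-exhaustive, so you cannot build it with a struct literal, not even with ..Default::default(). Start from CollatorOptions::new() and assign the fields you care about. New options may be added in future versions, and this pattern keeps your code compiling against updates.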

Pitfalls and performance

Collation is slower than byte comparison. The collator must look up rules for each character. If you sort millions of strings, measure the cost. Profile before optimizing. For most applications, the difference is negligible. User-facing sort order matters more than micro-optimizations.

Memory usage depends on the data. The ICU data files are large. If you embed all locales, your binary grows. You can reduce size by embedding only the locales you need. Check the crate documentation for feature flags to control data inclusion.

Locale parsing can fail. Parsing a string into a Locale returns a Result. Always handle the error. If you accept locale strings from users, validate them. A malformed locale string will panic your application if you unwrap blindly.

let locale_str = user_input;
let locale: Locale = match locale_str.parse() {
    Ok(l) => l,
    Err(_) => {
        // Fall back to a default locale.
        "en-US".parse().expect("Default locale must be valid")
    }
};

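Thread safety takes one extra step. Whether Collator is Send and Sync depends on how the crate is built: in the ICU4X 1.x releases this is controlled by the sync Cargo feature of icu_provider, which is off by default, so check the documentation for your version. Arc alone does not help, because Arc<T> can cross threads only when T is itself Send + Sync. Once the feature is enabled, wrap the collator in Arc to share it. The compare method takes &self and the collator is immutable after construction, so sharing needs no lock.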

use std::sync::Arc;
use icu_collator::{Collator, CollatorError, CollatorOptions};
use icu_locid::Locale;

fn main() -> Result<(), CollatorError> {
    let locale: Locale = "en-US".parse().expect("en-US is a valid locale");
    let collator = Collator::try_new(&locale.into(), CollatorOptions::new())?;

    // Wrap in Arc so several owners can hold the same collator.
    let shared_collator = Arc::new(collator);

    // Clone the Arc for each worker. Moving the clone into
    // std::thread::spawn compiles only if Collator is Send + Sync.
    let worker_collator = Arc::clone(&shared_collator);
    let _ = worker_collator.compare("a", "b");
    Ok(())
}

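Don't fight the compiler here. If it reports that Collator cannot be sent between threads, check the crate's feature flags rather than working around the error. Once the collator is behind an Arc, cloning the Arc only bumps a reference count; the collation data is not copied.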
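
To find out early whether a type can be shared across threads under your feature set, a compile-time check is handy. The sketch below uses only the standard library. StandInCollator is a hypothetical placeholder with the same compare shape as a collator, and assert_send_sync and sort_in_workers are illustrative helpers. With the real crate, you would apply assert_send_sync to Arc<Collator>; the call fails to compile if the type cannot cross threads.

```rust
use std::cmp::Ordering;
use std::sync::Arc;
use std::thread;

/// Compiles only for types that can be shared across threads.
fn assert_send_sync<T: Send + Sync>() {}

/// Hypothetical stand-in for a collator: compares case-insensitively.
struct StandInCollator;

impl StandInCollator {
    fn compare(&self, a: &str, b: &str) -> Ordering {
        a.to_lowercase().cmp(&b.to_lowercase())
    }
}

/// Sorts each chunk on its own thread, sharing one comparator via Arc.
fn sort_in_workers(
    shared: Arc<StandInCollator>,
    chunks: Vec<Vec<String>>,
) -> Vec<Vec<String>> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|mut chunk| {
            // Each worker gets its own Arc handle to the same comparator.
            let collator = Arc::clone(&shared);
            thread::spawn(move || {
                chunk.sort_by(|a, b| collator.compare(a, b));
                chunk
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    // Fails at compile time if the type is not Send + Sync.
    assert_send_sync::<Arc<StandInCollator>>();

    let chunks = vec![
        vec!["banana".to_string(), "Apple".to_string()],
        vec!["date".to_string(), "Cherry".to_string()],
    ];
    let sorted = sort_in_workers(Arc::new(StandInCollator), chunks);
    println!("{:?}", sorted);
}
```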

Decision matrix

  • Use icu_collator when you display sorted text to users and the order must match their cultural expectations.
  • Use String::cmp when you sort internal identifiers, file paths, or keys where byte-order stability matters more than human readability.
  • Use to_lowercase combined with standard sorting when you need a quick case-insensitive sort for ASCII-only data and can ignore locale-specific case mappings.
  • Use unicode-segmentation when you need to split text into graphemes or words but don't need full collation rules.

Where to go next