How to Use the unicode-segmentation Crate for Grapheme Clusters

When one character isn't one thing

You're building a text editor. The user types "Hello 👋🏽". You want backspace to remove one character, so you grab the last char and delete it. The skin tone disappears, but a yellow hand is still on screen. The user thinks your app is broken. You check the length. len() says 14. You count on your fingers: H-e-l-l-o-space-emoji. That's seven things. Where did the extra seven come from?

Rust isn't lying. Your definition of "character" is wrong.

Grapheme clusters vs code points

Rust's char type is a Unicode scalar value. It represents a single code point (any code point except the surrogates reserved for UTF-16). A code point is a number in the Unicode standard. Most of the time, one code point equals one thing you see on screen. The letter A is one code point. The digit 7 is one code point.

Unicode is not that simple. Many things you see on screen are built from multiple code points glued together. A flag emoji is two regional indicator symbols. An emoji with a skin tone is a base emoji plus a modifier. A letter with an accent mark might be the letter plus a combining mark.

The term for what a user sees as one character is a grapheme cluster. A grapheme cluster can be one code point, or it can be many.

Think of a char as a single Lego brick. A grapheme cluster is the finished model you build from those bricks. You might need one brick for a simple block, or five bricks for a complex figure. When you look at the shelf, you see models, not loose bricks. Your text editor needs to handle models, not bricks.
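The gap between bytes, code points, and what appears on screen can be checked with the standard library alone. The following is a minimal sketch; the helper name sizes is made up for illustration, and each sample string is written with escapes so the code points are explicit.

```rust
/// Returns (UTF-8 byte length, number of code points) for a string.
/// `sizes` is an illustrative helper, not part of any library.
fn sizes(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // Each sample renders as one thing on screen.
    let samples = [
        ("letter A", "A"),
        ("e + combining acute", "e\u{0301}"),
        ("hand + skin tone", "\u{1F44B}\u{1F3FD}"),
        ("flag (two regional indicators)", "\u{1F1FA}\u{1F1F8}"),
    ];

    for (label, text) in samples.iter() {
        let (bytes, code_points) = sizes(text);
        println!("{}: {} bytes, {} code points", label, bytes, code_points);
    }
}
```

Only the first sample has one byte and one code point. The standard library can tell you about bricks, but it has no notion of the finished model.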

Minimal example

The unicode-segmentation crate provides the UnicodeSegmentation trait. This trait adds methods to &str for splitting text into grapheme clusters.

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // The string contains a base emoji and a skin tone modifier.
    // Visually, this is one character. Technically, it's two code points.
    let text = "👋🏽";

    // chars() splits on code points.
    // This loop runs twice.
    for c in text.chars() {
        println!("Char: {:?}", c);
    }

    // graphemes() splits on user-perceived characters.
    // This loop runs once.
    for g in text.graphemes(true) {
        println!("Grapheme: {:?}", g);
    }
}

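The chars() iterator yields char values. A char value always occupies four bytes in memory, although the same scalar value takes one to four bytes when encoded as UTF-8 inside a string. Each char is one Unicode scalar value. When you iterate 👋🏽 with chars(), you get the hand emoji, then the skin tone modifier. They are separate.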

The graphemes() iterator yields &str slices. It returns a slice because a grapheme cluster can span multiple code points. You can't put multiple code points into a single char. The iterator walks the string, checks the Unicode rules, and yields slices that correspond to what a human sees as one character.

The true argument tells the iterator to use extended grapheme clusters. This is the standard for modern text processing. It handles things like flags, ZWJ sequences, and combining marks correctly.

chars() hands you bricks. graphemes() hands you models.

How the iterator works

The graphemes() method returns an iterator that yields &str slices. These slices borrow from the input string. No new String allocations happen. This is zero-copy iteration. You can process a massive file without blowing up memory.

The lifetime of each slice is tied to the input. If the input is a &str, the graphemes are &str with the same lifetime. If you collect graphemes into a Vec<&str>, you must ensure the input lives long enough. The compiler enforces this. If you try to return a grapheme slice that borrows from a String owned by the function, the compiler rejects you with E0515 (cannot return value referencing local variable). If you keep a slice in an outer scope after the owning String has been dropped, you get E0597 (borrowed value does not live long enough).

The UnicodeSegmentation trait is not in the standard library. You must import it to use the methods. This is a common Rust pattern. The crate extends &str with new methods via a trait. You don't wrap the string in a new type. You just import the trait and call the method. This keeps the API ergonomic.

The iterator has overhead. It must look ahead to determine cluster boundaries. For ASCII text, the overhead is small but non-zero. For complex text, the overhead is higher. If you are processing terabytes of logs and only need to split by newline, graphemes() is overkill. Use split('\n'). Use graphemes() only when you care about user perception. Profile your code. If graphemes() is the bottleneck, consider whether you actually need graphemes. Maybe chars() is sufficient for your algorithm.

The iterator gives you slices, not copies. You get allocation-free iteration over user-perceived characters, at the cost of some boundary analysis.

Combining marks and normalization

Unicode allows multiple representations for the same visual character. The letter é can be a single code point U+00E9. Or it can be e followed by U+0301 (combining acute accent). Both look identical on screen.

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Precomposed form: single code point for é.
    let cafe1 = "café";

    // Decomposed form: e plus combining accent.
    let cafe2 = "cafe\u{0301}";

    // chars() sees different sequences.
    println!("cafe1 chars: {:?}", cafe1.chars().count()); // 4
    println!("cafe2 chars: {:?}", cafe2.chars().count()); // 5

    // graphemes() sees the same clusters.
    println!("cafe1 graphemes: {:?}", cafe1.graphemes(true).count()); // 4
    println!("cafe2 graphemes: {:?}", cafe2.graphemes(true).count()); // 4
}

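chars() sees two different sequences. graphemes() sees four clusters in both cases, and the final é is a single cluster either way. Matching counts do not make the strings equal, though. Each grapheme is still a &str slice of the original bytes, so the precomposed é (two bytes) and the decomposed e plus U+0301 (three bytes) compare as different, whether you compare by chars(), by graphemes(), or with ==. For user input, normalize both strings to the same form first (NFC, for example, via a crate such as unicode-normalization), then compare.

Graphemes agree on the boundaries, not on the bytes. Segmentation tells you where a cluster starts and ends. Normalization decides whether two clusters are the same.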

Realistic example: counting and reversing

You have a text input that allows 10 characters. You can't use .len(). That gives bytes. You can't use .chars().count(). That gives code points. You need graphemes.

use unicode_segmentation::UnicodeSegmentation;

/// Counts user-perceived characters in a string.
/// This is the correct way to enforce a character limit for UI display.
fn count_graphemes(text: &str) -> usize {
    text.graphemes(true).count()
}

/// Reverses a string while keeping grapheme clusters intact.
/// Reversing by chars() breaks emojis and combining marks.
fn reverse_graphemes(text: &str) -> String {
    text.graphemes(true).rev().collect()
}

fn main() {
    // A flag is two regional indicator symbols.
    // chars().count() returns 2.
    // graphemes().count() returns 1.
    let flag = "🇺🇸";
    println!("Flag graphemes: {}", count_graphemes(flag));

    // Reversing by chars() splits the flag.
    // Reversing by graphemes() keeps the flag whole.
    let text = "Hi🇺🇸";
    println!("Reversed: {}", reverse_graphemes(text));
}

The count_graphemes function is safe for UI limits. The reverse_graphemes function preserves visual integrity. If you reverse by chars(), the two regional indicators swap places and no longer form the flag. The user sees garbage.

Reverse graphemes, not chars. Your users expect the emoji to stay whole.

Pitfalls and compiler errors

You can't use graphemes to fix indexing. Rust strings are byte-indexed. text[0..5] panics if either end of the range falls inside a multi-byte UTF-8 sequence. Graphemes don't change this. You still have to iterate to find byte boundaries, and grapheme_indices() helps by pairing each cluster with its starting byte offset. If you need random access, you need a different data structure.

If you try to store graphemes in a Vec<char>, the compiler rejects you. graphemes() yields &str, not char. You get E0277, because a Vec<char> cannot be built from an iterator over &str. You need Vec<&str> or Vec<String>.

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "Hello";

    // This fails if uncommented. graphemes() yields &str, not char.
    // error[E0277]: a value of type `Vec<char>` cannot be built
    // from an iterator over elements of type `&str`
    // let _bad: Vec<char> = text.graphemes(true).collect();

    // This works. Collect into Vec<&str>.
    let good: Vec<&str> = text.graphemes(true).collect();
    println!("{:?}", good);
}

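The graphemes() method requires a boolean argument. You can't call text.graphemes(). You must pass true or false. The convention is to always pass true unless you have a specific legacy reason not to. Extended grapheme clusters are the standard. Passing false selects legacy grapheme clusters, an older definition kept for compatibility. Per UAX #29, the legacy definition leaves out the rules that attach spacing marks and prepended characters to their base, so it can split clusters in some scripts, such as Tamil and Thai, that extended mode keeps together.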

Graphemes give you slices, not indices. Iterate to find your boundaries.

When to use graphemes

Use graphemes() when you need to count characters for a UI limit, a tweet length, or a text field constraint. Use graphemes() when reversing a string for display, so emojis and accented characters stay intact. Use graphemes() when implementing cursor movement in a text editor, where "one character" means one visual unit. Don't rely on graphemes() alone when comparing user input for equality. It finds the same boundaries in precomposed and decomposed text, but the bytes still differ, so normalize both sides first.

Use chars() when you need to process individual Unicode scalar values, such as checking for specific code points or performing normalization. Use chars() when you are building a search index that operates on code points. Use chars() when you are parsing a protocol that defines tokens by code points.

Use bytes() when you are parsing binary protocols or checking for ASCII-only content. Use bytes() when you need maximum performance and the text is guaranteed to be ASCII.

Use split_whitespace() when you need to tokenize text by words, not characters. Use split() when you need to split by a delimiter string.

Pick the iterator that matches your mental model of the text. If the user sees it as one thing, treat it as one thing.

Where to go next

Grapheme clusters are the units of text a user perceives as single characters, and they can differ from both bytes and chars. graphemes(true) lets your code treat emojis, flags, and accented letters as single items instead of breaking them apart. From here, the same trait offers grapheme_indices() for byte offsets, and unicode_words() and split_word_bounds() for word segmentation. The rules behind all of them are defined in Unicode Standard Annex #29.