How to Handle Unicode in Rust

When bytes betray your length check

You build a username validator. It accepts "alice". It accepts "bob". Then a user signs up as "café". Your byte-based length check rejects them because the string is "too long". Or worse, you slice the string at a byte index that lands inside "é" to extract a prefix, and the program panics because the index is not a char boundary. You aren't dealing with raw bytes. You are dealing with human text, and humans use accents, emojis, and scripts that span multiple bytes. Rust forces you to confront this reality immediately.

In languages like JavaScript, "café".length returns 4 because the engine counts UTF-16 code units, yet "😀".length returns 2 because that emoji needs a surrogate pair. In Python 3, len("café") returns 4 because it counts Unicode code points. Rust takes a different path. Rust stores text as UTF-8 bytes. When you call .len() on a string, you get the byte count, which is 5 for "café". When you iterate with .chars(), you get Unicode scalar values. The type system keeps you from silently treating a byte offset as a character position, which is the assumption that leads to corrupted data. You must choose the right granularity for your task.
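To see the three ways of counting side by side, the sketch below measures the same text in UTF-8 bytes, UTF-16 code units, and Unicode scalar values. The helper name length_views is made up for illustration; the methods it calls are from the standard library.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `s`.
/// `length_views` is a hypothetical helper, not a standard library function.
fn length_views(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // "caf\u{e9}" is "café" with a precomposed é (U+00E9).
    // "\u{1F600}" is a single emoji outside the Basic Multilingual Plane.
    for s in ["caf\u{e9}", "\u{1F600}"] {
        let (bytes, utf16, scalars) = length_views(s);
        println!(
            "{}: {} bytes, {} UTF-16 units, {} scalars",
            s, bytes, utf16, scalars
        );
    }
}
```

The UTF-16 column is what a JavaScript engine reports, the scalar column matches Python 3, and the byte column is what Rust's .len() returns.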

UTF-8 is the law, `char` is a scalar

Rust's standard library types String and str are always valid UTF-8. A String is a growable, owned buffer of bytes. A &str is a borrowed slice of bytes. Both guarantee that the underlying bytes decode to valid Unicode. This guarantee is baked into the type system. You cannot construct a String or str containing invalid UTF-8 without using unsafe code.
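A small sketch of where that guarantee is enforced: raw bytes have to pass a validation step before they become text. The helper names decode_strict and decode_lossy are invented for this example; String::from_utf8 and String::from_utf8_lossy are the standard library entry points.

```rust
/// Takes ownership of `bytes` as text, or returns None if they are not valid UTF-8.
/// (Hypothetical helper wrapping String::from_utf8.)
fn decode_strict(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Replaces each invalid sequence with U+FFFD instead of failing.
/// (Hypothetical helper wrapping String::from_utf8_lossy.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // 0xC3 0xA9 is the UTF-8 encoding of 'é'.
    let good = vec![b'c', b'a', b'f', 0xC3, 0xA9];
    // 0xC3 announces a two-byte sequence, but the second byte is missing.
    let bad = vec![b'c', b'a', b'f', 0xC3];

    println!("{:?}", decode_strict(good)); // Some("café")
    println!("{:?}", decode_strict(bad.clone())); // None
    println!("{:?}", decode_lossy(&bad)); // "caf" followed by U+FFFD
}
```

The strict form suits input you want to reject when malformed. The lossy form suits logs and diagnostics, where showing something beats failing.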

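UTF-8 is a variable-width encoding. ASCII characters (U+0000 to U+007F) take one byte. Code points from U+0080 to U+07FF, which cover the Latin-1 Supplement as well as Greek, Cyrillic, Hebrew, and Arabic, take two bytes. The rest of the Basic Multilingual Plane (up to U+FFFF), including most Chinese, Japanese, and Korean characters, takes three bytes. Code points above U+FFFF, including most emojis and rarer characters, take four bytes. This design keeps ASCII text byte-for-byte identical to plain ASCII while supporting the full Unicode range. Rust strings are not NUL-terminated, though, so handing one to C still requires a conversion such as std::ffi::CString.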
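You can ask a char how many bytes its UTF-8 encoding needs. The sketch below picks one sample code point from each width class; utf8_width is a hypothetical wrapper around the standard char::len_utf8 method.

```rust
/// Number of bytes `c` occupies when encoded as UTF-8.
/// (Hypothetical wrapper around char::len_utf8.)
fn utf8_width(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // One sample per width class:
    // 'a' (ASCII), U+00E9 (é), U+4E2D (a CJK ideograph), U+1F600 (an emoji).
    for c in ['a', '\u{e9}', '\u{4e2d}', '\u{1F600}'] {
        println!("U+{:04X} takes {} byte(s)", c as u32, utf8_width(c));
    }
}
```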

A char in Rust is a Unicode Scalar Value. It is a 32-bit integer representing a code point in the range U+0000 to U+10FFFF. It excludes surrogate halves, which are an implementation detail of UTF-16. When you iterate over a string with .chars(), Rust decodes the UTF-8 bytes and yields char values. Each char corresponds to one code point.
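The scalar-value rule is easy to probe with char::from_u32, which returns None for any number that is not a valid char. This is a minimal sketch; is_scalar_value is a name chosen for the example.

```rust
/// True if `n` is a Unicode scalar value, i.e. representable as a Rust char.
/// (Hypothetical helper built on char::from_u32.)
fn is_scalar_value(n: u32) -> bool {
    char::from_u32(n).is_some()
}

fn main() {
    println!("{}", is_scalar_value(0x00E9)); // true: é
    println!("{}", is_scalar_value(0xD800)); // false: first surrogate
    println!("{}", is_scalar_value(0xDFFF)); // false: last surrogate
    println!("{}", is_scalar_value(0x10FFFF)); // true: highest code point
    println!("{}", is_scalar_value(0x110000)); // false: beyond Unicode
}
```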

Think of bytes as individual tiles on a floor. A char is a specific pattern you can form with those tiles. Some patterns use one tile. Some use two, three, or four. The String is the floor. The .chars() iterator walks the floor and tells you about the patterns, not the individual tiles.

Minimal example: Bytes versus Scalars

The distinction between byte length and character count shows up immediately. Use .len() for byte length. Use .chars().count() for scalar count.

fn main() {
    let text = "café";
    
    // .len() returns the number of bytes.
    // 'c', 'a', 'f' are 1 byte each. 'é' is 2 bytes in UTF-8.
    println!("Byte length: {}", text.len()); // 5
    
    // .chars() iterates over Unicode scalar values.
    // This counts the logical characters, not the storage size.
    println!("Char count: {}", text.chars().count()); // 4
    
    // Iterate to see the scalar values.
    for c in text.chars() {
        println!("Char: {} (U+{:04X})", c, c as u32);
    }
}

The output shows four characters. The byte length is five. If you need to allocate a buffer for a C API that expects a null-terminated UTF-8 string, you need the byte length plus one. If you need to limit a username to four characters, you need the char count. Mixing these up causes bugs.
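Applied to the username case, a limit expressed in characters might be sketched like this. The function name and the rule that a name needs at least one character are assumptions for illustration, not a recommended validation policy.

```rust
/// Hypothetical validator: accepts names of 1 to `max_chars` Unicode scalar values.
fn is_valid_username(name: &str, max_chars: usize) -> bool {
    let count = name.chars().count();
    (1..=max_chars).contains(&count)
}

fn main() {
    let name = "caf\u{e9}"; // "café" with a precomposed é

    // A byte-based check wrongly rejects the name: 5 bytes > 4.
    println!("byte check passes: {}", name.len() <= 4); // false

    // A scalar-based check accepts it: 4 scalars <= 4.
    println!("char check passes: {}", is_valid_username(name, 4)); // true
}
```

Counting scalars still over-counts text that uses combining marks, so treat this as the middle rung of the ladder rather than the final answer.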

What the compiler enforces

The compiler prevents you from indexing a string with an integer. Writing s[5] is a hard error. The type str does not implement Index<usize>. An integer index could only mean a byte offset, and byte offset 5 may fall in the middle of a multi-byte character. Returning that single byte would hand you a fragment of a character, and returning the sixth char instead would hide an O(N) scan behind syntax that looks O(1). Rust declines to pick either meaning.

The error message tells you that the type str cannot be indexed by {integer}. You must use methods that respect UTF-8 boundaries. Use .get(5..10) to attempt a slice, which returns Option<&str>. It returns None if the range is out of bounds or if either end falls inside a multi-byte character. Use .chars().nth(5) to get the sixth scalar value. Use .is_char_boundary(5) to check if index 5 is a valid split point.
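Here is a short sketch of those three methods on a string with one two-byte character. The wrapper names byte_prefix and nth_scalar are hypothetical; they only exist to make the calls easy to read.

```rust
/// Byte-offset prefix, or None if `end` is out of bounds or splits a character.
fn byte_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

/// The scalar value at position `n`, counting from zero. This is an O(N) scan.
fn nth_scalar(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // Bytes: c=0, a=1, f=2, é=3..5, !=5. Total length is 6 bytes.
    let s = "caf\u{e9}!";

    println!("{:?}", byte_prefix(s, 3)); // Some("caf")
    println!("{:?}", byte_prefix(s, 4)); // None: offset 4 is inside é
    println!("{:?}", byte_prefix(s, 99)); // None: out of bounds
    println!("{:?}", nth_scalar(s, 3)); // Some('é')
    println!("{}", s.is_char_boundary(3)); // true
    println!("{}", s.is_char_boundary(4)); // false
}
```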

The compiler also ensures that String operations maintain validity. When you push a char to a String, the method encodes it to UTF-8 bytes and appends them. You cannot push a raw byte that would break the encoding. This safety comes at a small cost. Encoding a char requires checking its value and writing one to four bytes. The operation is fast, but it is not free.
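To make the encoding step visible, the sketch below pushes a char onto a String and then encodes chars by hand with char::encode_utf8. The helper utf8_bytes_of is a made-up name for this example.

```rust
/// The UTF-8 bytes that represent `c`.
/// (Hypothetical helper; encode_utf8 writes into a caller-supplied buffer.)
fn utf8_bytes_of(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // four bytes is the maximum for any char
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    let mut s = String::from("caf");
    s.push('\u{e9}'); // appends the two bytes 0xC3 0xA9
    println!("{} is {} bytes", s, s.len()); // café is 5 bytes

    println!("{:02X?}", utf8_bytes_of('a')); // [61]
    println!("{:02X?}", utf8_bytes_of('\u{e9}')); // [C3, A9]
    println!("{:02X?}", utf8_bytes_of('\u{1F600}')); // [F0, 9F, 98, 80]
}
```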

The grapheme cluster trap

A char is a code point, not necessarily a visible character. Users perceive text as grapheme clusters. A grapheme cluster is a sequence of one or more code points that form a single user-perceived character. Combining marks and emoji sequences create this gap.

Consider the letter "é". It can be represented as a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE). It can also be represented as two code points: U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). Both render identically on screen. Rust's .chars() iterator treats them differently. The first yields one char. The second yields two chars.

fn main() {
    // Precomposed form: one code point, U+00E9.
    let composed = "\u{00E9}";
    println!("Composed chars: {}", composed.chars().count()); // 1

    // Decomposed form: base letter plus combining mark.
    let decomposed = "e\u{0301}";
    println!("Decomposed chars: {}", decomposed.chars().count()); // 2

    // They look the same but compare unequal because the bytes differ.
    assert_ne!(composed, decomposed);
}

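If you truncate a string by taking the first N characters with .chars().take(n), you might cut a combining mark off its base letter. The truncated text silently loses its accent, and if you keep the remainder, it begins with an orphaned combining mark that attaches to whatever precedes it. The same cut can split an emoji ZWJ sequence into its component emojis. This breaks the user experience.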
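The effect is easy to reproduce with the standard library alone. In this sketch, take_scalars is a hypothetical helper that truncates by scalar count, and the input spells "café" in decomposed form.

```rust
/// Truncates to the first `n` Unicode scalar values.
/// (Hypothetical helper; this is the approach that can split a grapheme cluster.)
fn take_scalars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // Decomposed "café": c, a, f, e, then U+0301 COMBINING ACUTE ACCENT.
    let name = "cafe\u{301}";
    println!("{}", name.chars().count()); // 5 scalars, 4 visible characters

    // Keeping 4 scalars drops the combining mark, so the accent disappears.
    let cut = take_scalars(name, 4);
    println!("{}", cut); // cafe
}
```

The user typed four visible characters and asked for four, yet the result no longer matches what they typed.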

Convention dictates that for user-facing text manipulation, you should work with grapheme clusters, not scalars. The standard library does not include grapheme segmentation because the rules are complex and depend on Unicode data tables that change between Unicode versions. Use the unicode-segmentation crate. Its UnicodeSegmentation trait adds a graphemes(true) method to str. The true argument selects extended grapheme clusters, which keep combining marks, emoji modifiers, and ZWJ sequences together.

use unicode_segmentation::UnicodeSegmentation;

fn truncate_graphemes(s: &str, max_graphemes: usize) -> String {
    // graphemes(true) iterates over extended grapheme clusters,
    // so combining marks and emoji sequences are never split.
    s.graphemes(true)
     .take(max_graphemes)
     .collect()
}

Trust the grapheme iterator for display logic. Use .chars() only when you need code point level access, such as validating that a string contains only alphanumeric code points.

Slicing safely with boundaries

Slicing a string with s[start..end] panics at runtime if start or end is out of bounds or is not a valid character boundary. This panic is a safety mechanism. It prevents you from creating a &str that points to invalid UTF-8. In production code, a panic triggered by user input is usually unacceptable, so validate boundaries before slicing.

The method is_char_boundary(index) checks whether a byte index falls at the start of a character or at the end of the string. It runs in O(1) time. It does not decode the string. Index 0 and index s.len() are always boundaries, and any index past s.len() is not. For every other index, it inspects the byte at that position. If the byte is less than 128, it is an ASCII character and a boundary. If the byte is greater than 191, it is the start of a multi-byte sequence and a boundary. Otherwise, from 128 to 191, it is a continuation byte and not a boundary.
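Those rules are small enough to write out by hand. The function below is a sketch that mirrors the described behavior for illustration; it is not the standard library's source, and in real code you would call is_char_boundary directly.

```rust
/// Hand-written boundary test that mirrors the rules described above.
/// (Illustrative sketch only; prefer str::is_char_boundary in real code.)
fn is_boundary_sketch(s: &str, index: usize) -> bool {
    // The start and the end of the string are always boundaries.
    if index == 0 || index == s.len() {
        return true;
    }
    match s.as_bytes().get(index) {
        // Past the end of the string: never a boundary.
        None => false,
        // Continuation bytes are 0x80..=0xBF (128 to 191). Anything else starts a character.
        Some(&b) => !(0x80..0xC0).contains(&b),
    }
}

fn main() {
    let s = "caf\u{e9}"; // é occupies bytes 3 and 4
    for i in 0..=s.len() + 1 {
        println!("{}: sketch={} std={}", i, is_boundary_sketch(s, i), s.is_char_boundary(i));
    }
}
```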

fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // Check bounds and character alignment.
    // is_char_boundary is O(1) and does not allocate.
    if start <= end 
        && end <= s.len() 
        && s.is_char_boundary(start) 
        && s.is_char_boundary(end) 
    {
        Some(&s[start..end])
    } else {
        None
    }
}

fn main() {
    let text = "café";

    // 'é' occupies bytes 3 and 4. Index 3 is its start. Valid boundary.
    println!("{:?}", safe_slice(text, 0, 3)); // Some("caf")

    // Index 4 is inside 'é'. Invalid boundary.
    println!("{:?}", safe_slice(text, 0, 4)); // None
}

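Use is_char_boundary() whenever you compute indices from external input or complex logic. Never assume an index is safe. The standard library's s.get(start..end) performs the same checks in one call. When a check fails, returning None lets the caller decide what to do, and even an unguarded slice panics rather than letting corrupted data flow through your system.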
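A common use of the boundary check is fitting text into a byte budget, for example a fixed-width database column. The sketch below backs up from the budget until it finds a boundary; truncate_to_bytes is a hypothetical helper name.

```rust
/// Longest prefix of `s` that fits in `max_bytes` without splitting a character.
/// (Hypothetical helper. It respects scalar boundaries, not grapheme clusters.)
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop cannot underflow.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let text = "caf\u{e9}"; // 5 bytes
    println!("{:?}", truncate_to_bytes(text, 4)); // "caf": é does not fit
    println!("{:?}", truncate_to_bytes(text, 5)); // "café"
}
```

The loop runs at most three times, because a UTF-8 character has at most three continuation bytes.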

Performance and memory considerations

A char occupies 4 bytes in memory. It is a fixed-size type. When you collect characters into a Vec<char>, you use four times the memory of the original UTF-8 string for ASCII text. For text with heavy non-ASCII usage, the ratio shrinks, but Vec<char> is still rarely the right choice.
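The arithmetic can be checked directly. In this sketch, footprint is a made-up helper that compares the UTF-8 size of a string with the element storage a Vec<char> of the same text would need. It ignores the Vec's own header and any spare capacity.

```rust
use std::mem::size_of;

/// Returns (UTF-8 byte length, bytes of element storage for an equivalent Vec<char>).
/// (Hypothetical helper; ignores Vec header and unused capacity.)
fn footprint(s: &str) -> (usize, usize) {
    let as_chars: Vec<char> = s.chars().collect();
    (s.len(), as_chars.len() * size_of::<char>())
}

fn main() {
    println!("{:?}", footprint("hello world")); // (11, 44): four times larger
    println!("{:?}", footprint("caf\u{e9}")); // (5, 16)
    println!("{:?}", footprint("\u{1F600}")); // (4, 4): no savings either way
}
```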

Iterating with .chars() decodes UTF-8 on the fly. The iterator yields char values without allocation. The cost is the decoding logic. For ASCII-only data, the decoder is extremely fast. For mixed data, the cost is proportional to the number of characters. If you only need byte length, use .len(). It returns the length of the underlying buffer in O(1) time. Calling .chars().count() iterates the entire string and decodes every character. It is O(N).

Convention favors .len() for size checks and .chars() for logical processing. If you need to count characters frequently, cache the count. Do not call .chars().count() in a hot loop.
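When all you need is "does this fit within N characters", a full count is more work than necessary. One possible approach, sketched here with the hypothetical helper has_at_most_chars, is to stop decoding as soon as the limit is exceeded.

```rust
/// True if `s` contains at most `max_chars` Unicode scalar values.
/// (Hypothetical helper. nth(max_chars) stops after max_chars + 1 items,
/// so a very long string is not decoded to the end.)
fn has_at_most_chars(s: &str, max_chars: usize) -> bool {
    s.chars().nth(max_chars).is_none()
}

fn main() {
    let name = "caf\u{e9}";
    println!("{}", has_at_most_chars(name, 4)); // true
    println!("{}", has_at_most_chars(name, 3)); // false

    // Only the first 11 characters of the long string are decoded.
    let long = "x".repeat(1_000_000);
    println!("{}", has_at_most_chars(&long, 10)); // false
}
```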

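Converting a String to a &str allocates nothing. The &str is a view into the String's buffer. Use .as_str() or implicit deref coercion. Going the other way, from &str to String with .to_owned() or String::from, allocates a new buffer and copies the bytes. The zero-cost view is essential for passing text to functions without copying.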
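The sketch below shows both directions. Comparing pointers is only a way to make the sharing visible in an example; char_count is a hypothetical function that stands in for any API taking &str.

```rust
/// Any function that only reads text can take &str.
/// (Hypothetical example function.)
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let owned = String::from("caf\u{e9}");

    // Borrowing: the view points at the String's own buffer.
    let view: &str = owned.as_str();
    println!("same buffer: {}", view.as_ptr() == owned.as_ptr()); // true

    // Deref coercion: &String is accepted where &str is expected.
    println!("{}", char_count(&owned)); // 4

    // Owning: to_owned allocates a new buffer and copies the bytes.
    let copy: String = view.to_owned();
    println!("same buffer: {}", copy.as_ptr() == owned.as_ptr()); // false
}
```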

Decision matrix

Use String when you need to own the text data and modify it, such as building a response or accumulating input.
Use &str when you only need to read text and want to avoid allocation, such as passing arguments to functions.
Use .chars() when you need to iterate over Unicode scalar values, such as validating code points or transforming letters.
Use .bytes() when you are parsing an ASCII-based text format or processing ASCII-only data where performance is critical. Truly binary data belongs in &[u8], not in a string.
Use is_char_boundary() when you must slice a string at a specific byte offset and want to avoid a runtime panic.
Use the unicode-segmentation crate when you need to handle grapheme clusters, such as counting visible characters, truncating text for display, or implementing text editors.
Use .len() when you need the byte size for buffer allocation or network transmission.

Pick the tool that matches your granularity. Bytes for speed, chars for logic, graphemes for humans.

Where to go next

Rust treats text as UTF-8 bytes that can represent any language, not just English letters. You don't need to validate the encoding by hand because String and str guarantee valid UTF-8, but the choice of unit (bytes, scalar values, or grapheme clusters) stays with you. From here, the standard library documentation for str and char and the documentation of the unicode-segmentation crate are the natural next stops.