How to Work with Non-ASCII Text in Rust

Use Rust's built-in UTF-8 String type and the .chars() iterator to safely handle non-ASCII text.

The string indexing trap

You write a function to reverse a string. It works for "hello". You test it with "café" and it works. You test it with "🦀" and the output looks like broken pixels. Or you try to get the first character of "世界" by writing text[0] and the compiler screams.

Rust treats strings differently than Python or JavaScript. The difference is the source of most confusion, but it also prevents a whole class of bugs. Rust stores text as UTF-8, and it refuses to lie to you about what that means.

UTF-8 is the law

Rust's String and &str types are always UTF-8. UTF-8 is a variable-width encoding. ASCII characters take one byte. Non-ASCII characters take two, three, or four bytes. A String is a sequence of bytes that is guaranteed to be valid UTF-8, but it keeps no table of where characters start or end. Finding the nth character means decoding from the front.

Think of a book where some letters take one page and others take three. You cannot jump to the "third letter" by counting pages. You have to walk from the start, counting pages until you find the boundaries. This is efficient for storage and network transmission, but it means random access by character index is not free.

Rust makes this cost explicit. You cannot index a string by integer because the compiler cannot guarantee the index points to a valid character boundary. If it allowed text[0], it would have to walk the string every time, hiding the cost behind a convenient syntax. Rust prefers to make the cost visible.
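A minimal sketch of what explicit access looks like in practice. The helper names nth_char and is_boundary are hypothetical wrappers introduced here for illustration; the calls inside them (chars().nth() and is_char_boundary()) are standard library methods on str.

```rust
/// Returns the character at zero-based position `n`, if there is one.
/// `nth_char` is a hypothetical helper name, not a standard library function.
fn nth_char(text: &str, n: usize) -> Option<char> {
    // chars() decodes UTF-8 from the start of the string,
    // so reaching position n costs O(n), not O(1).
    text.chars().nth(n)
}

/// Reports whether byte offset `index` sits at the start or end of a character.
/// `is_boundary` is a hypothetical helper name wrapping str::is_char_boundary.
fn is_boundary(text: &str, index: usize) -> bool {
    text.is_char_boundary(index)
}
```

With "h\u{e9}llo" (the word "héllo"), nth_char(text, 1) yields the two-byte character é, and is_boundary(text, 2) is false because byte 2 is the second half of that character.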

Bytes, chars, and the heap

A String lives on the heap. It holds three pieces of data: a pointer to the bytes, the current length in bytes, and the allocated capacity in bytes. Under the hood, a String is essentially a Vec<u8> with a guarantee that the bytes are valid UTF-8.

fn main() {
    // The String allocates memory on the heap.
    let text = String::from("Hello, 世界!");

    // .len() returns the byte length, not the character count.
    // "Hello, " is 7 bytes. "世" is 3 bytes. "界" is 3 bytes. "!" is 1 byte.
    // Total is 14 bytes.
    println!("Byte length: {}", text.len());

    // .chars() iterates over Unicode scalar values.
    // This walks the UTF-8 bytes and decodes them into chars.
    for c in text.chars() {
        println!("{c}");
    }
}

The method .len() always returns bytes. This is a frequent tripwire. If you need the number of characters, you must count them.

fn main() {
    let text = String::from("Hello, 世界!");

    // Counting characters requires walking the string.
    // This is O(n) where n is the number of bytes.
    let char_count = text.chars().count();
    println!("Character count: {char_count}");
}

Treat .len() as a byte counter. If you need character count, use .chars().count().
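The length, capacity, and UTF-8 guarantee described earlier can be observed directly. The sketch below uses two hypothetical helpers, describe and bytes_to_string, that only wrap standard library calls (len, chars().count(), capacity, and String::from_utf8).

```rust
/// Returns (byte length, character count, allocated capacity in bytes).
/// `describe` is a hypothetical helper; it takes &String because
/// capacity() exists on String, not on &str.
fn describe(text: &String) -> (usize, usize, usize) {
    (text.len(), text.chars().count(), text.capacity())
}

/// Tries to turn raw bytes into a String.
/// Returns None when the bytes are not valid UTF-8, which is how the
/// "always valid UTF-8" guarantee is enforced at construction time.
fn bytes_to_string(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}
```

The bytes 0xE4 0xB8 0x96 decode to "世", while the truncated pair 0xE4 0xB8 is rejected because it stops partway through a three-byte sequence.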

Iterating correctly

When you iterate over a string, you have two main choices: .chars() and .bytes().

Use .chars() when you need to treat text as a sequence of Unicode scalar values. This is the safe default for text processing. It yields char values, which are 32-bit Unicode scalar values.

Use .bytes() when you are processing raw data, checking for ASCII-only content, or need maximum iteration speed. It yields u8 values. It does not decode UTF-8, so it is faster, but you lose character semantics.

/// Checks if a string contains only ASCII characters.
fn is_ascii_only(text: &str) -> bool {
    // Iterating bytes is faster than chars for this check.
    // We can stop early if we find a byte >= 128.
    text.bytes().all(|b| b < 128)
}

Convention aside: char in Rust is a Unicode scalar value, not a grapheme cluster. It occupies exactly four bytes. This makes char a fixed-size type, unlike String. A char can represent most letters and symbols, but it cannot represent everything a human considers a "character".
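To make the fixed-size point concrete, here is a small sketch comparing the in-memory size of a char with the length of its UTF-8 encoding. The helper name char_sizes is hypothetical; std::mem::size_of and char::len_utf8 are standard library items.

```rust
/// Returns (size of the `char` type in memory, UTF-8 encoded length of `c`).
/// `char_sizes` is a hypothetical helper name used for illustration.
fn char_sizes(c: char) -> (usize, usize) {
    // The first value is the same for every char; the second varies from 1 to 4.
    (std::mem::size_of::<char>(), c.len_utf8())
}
```

An ASCII letter reports (4, 1), while a character outside the Basic Multilingual Plane such as '\u{1F980}' (the crab emoji) reports (4, 4).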

The grapheme cluster problem

A Unicode scalar value is not always a single visible character. Some characters are composed of multiple scalar values combined together. Emojis with skin tones, flags, and families are common examples.

A flag like 🇺🇸 is not one scalar value. It is two regional indicator symbols combined. If you iterate with .chars(), you get two items, not one.

fn main() {
    let flag = "🇺🇸";

    // This loop runs twice, not once.
    // It prints the code points for the two regional indicators.
    for c in flag.chars() {
        println!("{:04x}", c as u32);
    }

    // .len() returns 8 bytes because each indicator is 4 bytes in UTF-8.
    println!("Byte length: {}", flag.len());
}

If you reverse a string using .chars().rev(), you reorder the scalar values inside each combined character: a flag's two indicators swap places, and a combining accent ends up attached to the wrong letter. This is how emoji reversal breaks.
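A short sketch of the naive reversal, using only the standard library. The function name reverse_scalars is hypothetical, and the inputs are written with \u{...} escapes so the individual scalar values are visible.

```rust
/// Reverses a string one Unicode scalar value at a time.
/// This is the naive approach: it does not keep grapheme clusters together.
/// `reverse_scalars` is a hypothetical helper name.
fn reverse_scalars(text: &str) -> String {
    text.chars().rev().collect()
}
```

Plain ASCII and single-scalar emoji survive this. The flag "\u{1F1FA}\u{1F1F8}" comes back with its two regional indicators swapped, and "e\u{301}x" (an e with a combining acute accent, then x) comes back as "x\u{301}e", with the accent now attached to the x.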

Rust's standard library does not include grapheme cluster segmentation. Segmentation relies on Unicode data tables that change with each Unicode release, and the rules can be tailored by locale. The standard library stops at scalar values to keep the core small and fast.

Reach for the unicode-segmentation crate when you need to handle grapheme clusters. It provides .graphemes(true) to iterate over user-perceived characters.

Slicing without panicking

You can slice a string, but only if the indices point to valid UTF-8 boundaries. If you slice in the middle of a multi-byte character, the program panics at runtime.

fn main() {
    let text = "Hello, 世界!";

    // "Hello" is all ASCII, so indices 0 to 5 are safe.
    let ascii_part = &text[0..5];
    println!("{ascii_part}");

    // "世" occupies bytes 7, 8, and 9, so index 8 falls inside it.
    // This panics at runtime with "byte index 8 is not a char boundary".
    // let bad_slice = &text[0..8];
}
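When a panic is not acceptable, str::get takes the same byte range and returns None where indexing would panic. The sketch below wraps it in a hypothetical helper named safe_slice.

```rust
/// Slices `text` by byte range, returning None instead of panicking
/// when the range is out of bounds or cuts through a character.
/// `safe_slice` is a hypothetical helper name wrapping str::get.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

For "Hello, \u{4e16}\u{754c}!" this returns Some("Hello") for 0..5 and None for 0..8, because byte 8 lies inside the three-byte character that starts at byte 7.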

The compiler cannot check slice bounds at compile time because the bounds are often variables. You must ensure the indices are valid. The safest way to slice by character count is to find the byte index first.

/// Returns a slice containing the first `n` characters.
/// Returns None if the string has fewer than `n` characters.
fn first_n_chars(text: &str, n: usize) -> Option<&str> {
    let mut count = 0;
    let mut end = 0;

    // Walk the string, tracking the byte offset just past each character.
    for (i, c) in text.char_indices() {
        if count == n {
            break;
        }
        // The slice must end after this character, not at its start.
        end = i + c.len_utf8();
        count += 1;
    }

    if count < n {
        None
    } else {
        Some(&text[..end])
    }
}

The method char_indices() yields pairs of (byte_index, char). This lets you map character positions to byte positions safely.
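A quick way to see those pairs is to collect them. The helper name char_offsets is hypothetical; it does nothing beyond calling char_indices().

```rust
/// Collects each character together with the byte offset where it starts.
/// `char_offsets` is a hypothetical helper name used for illustration.
fn char_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}
```

For "a\u{4e16}\u{754c}!" the offsets are 0, 1, 4, and 7: the gaps of three between them are the three-byte characters.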

Don't guess byte indices. Walk the string to find boundaries.

Pitfalls and compiler errors

Indexing a string with an integer is the most common error. The compiler rejects text[0] with E0277 (trait bound not satisfied). String and &str do not implement Index<usize>. This is by design.

If you need the first character, use .chars().next(). This returns an Option<char>.

fn main() {
    let text = "Hello, 世界!";

    // Safe way to get the first character.
    if let Some(first) = text.chars().next() {
        println!("First char: {first}");
    }
}

Another pitfall is assuming a char is a byte. A char always occupies four bytes in memory, but its UTF-8 encoding takes between one and four bytes. To get the bytes a char contributes to a string, encode it as UTF-8.

fn main() {
    let c: char = '世';

    // A char holds a Unicode scalar value; cast it to u32 to see the code point.
    println!("Code point: U+{:04X}", c as u32);

    // To get the UTF-8 bytes, encode into a buffer.
    let mut buf = [0u8; 4];
    let bytes = c.encode_utf8(&mut buf);
    println!("UTF-8 bytes: {:?}", bytes.as_bytes());
}

Convention aside: str is the unsized slice type. You almost always see it as &str. String is the owned, growable wrapper. Think of &str as a view into text and String as the text itself. Functions should accept &str to allow callers to pass either String or string literals without cloning.
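A small sketch of that convention. The function name char_count is hypothetical; the point is only the &str parameter, which accepts both a string literal and a borrowed String through deref coercion.

```rust
/// Counts the characters in any borrowed text.
/// Taking &str lets callers pass a literal directly or borrow a String
/// with `&owned`, without cloning either one.
/// `char_count` is a hypothetical helper name.
fn char_count(text: &str) -> usize {
    text.chars().count()
}
```

Both char_count("café") and char_count(&owned), where owned is a String holding the same text, compile against the same signature and return 4.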

Decision matrix

Use String when you need to own and modify text data.

Use &str when you are reading text or passing it to a function without taking ownership.

Use .chars() when you need to iterate over Unicode scalar values and performance is not the bottleneck.

Use .bytes() when you are processing raw data, checking for ASCII-only content, or need maximum iteration speed.

Use the unicode-segmentation crate when you need to handle grapheme clusters, such as emojis with modifiers or combined accents.

Reach for char when you need a single Unicode scalar value as a type.

Treat the string as bytes until you need characters. Decode only when necessary.

Where to go next