The RTL Trap
You paste Arabic text into your Rust program. You print it. The terminal shows "مرحبا" perfectly. You feel good. Then you try to extract the first word. You write &text[0..5]. The program panics at runtime. Or you reverse the string for a mirror effect and the diacritics land on the wrong letters. Rust didn't corrupt your data. You treated the text as an array of fixed-width characters when it's a sequence of code points, each encoded in one to four bytes.
Right-to-left (RTL) text works in Rust because Rust stores text as UTF-8 bytes. UTF-8 can encode every Unicode character, including right-to-left scripts such as Arabic and Hebrew. The storage is transparent. The complexity lives in how you process the text and how the renderer displays it. Rust gives you the logical order. The renderer handles the visual order. Confusing those two orders is where bugs hide.
Logical order versus visual order
Text has two distinct orders. Logical order is the sequence of characters as stored in memory. Visual order is the sequence humans read on screen. For English, logical and visual order match. For Arabic, they often differ. A sentence might start with an English word, switch to Arabic, and end with a number. The renderer applies the Bidirectional algorithm to rearrange characters for display while preserving the logical structure.
Rust's String and &str types store logical order. They do not implement the Bidirectional algorithm. When you iterate over a string, you get characters in logical order. When you print a string, you send logical order to the output stream. The terminal, browser, or GUI library receives those bytes and decides how to paint them. If your terminal doesn't support RTL, you'll see the text in logical order, which might look backwards. That's a rendering issue, not a Rust issue.
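A minimal sketch of what logical order means in practice: iterating a mixed Hebrew and English string yields the characters in storage order, whatever a bidi-aware terminal paints. The sample string and the helper name logical_order are illustrative assumptions, and the Hebrew is written with escapes so the source reads the same in any editor.

```rust
/// Hypothetical helper: collects characters in the order they are stored.
fn logical_order(text: &str) -> Vec<char> {
    text.chars().collect()
}

fn main() {
    // Hebrew "shalom" (U+05E9 U+05DC U+05D5 U+05DD), a space, then "abc".
    let text = "\u{05E9}\u{05DC}\u{05D5}\u{05DD} abc";
    let chars = logical_order(text);

    // The first stored character is the Hebrew letter shin, even though a
    // bidi-aware renderer paints it at the right-hand end of the Hebrew run.
    println!("first: U+{:04X}", chars[0] as u32);
    println!("last:  {}", chars[chars.len() - 1]);
}
```

Nothing in this code knows or cares about direction; that is the point.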
Rust stores logical order. Your renderer handles the rest.
Storing and counting text
A String is a vector of bytes that is guaranteed to be valid UTF-8. You can store RTL text exactly like LTR text. The only difference is the byte width of the characters. The letters of the main Arabic block (U+0600 to U+06FF) sit below U+0800, so UTF-8 encodes them as 2 bytes each. ASCII characters are 1 byte. The rest of the Basic Multilingual Plane takes 3 bytes, and characters outside it, including most emoji, take 4.
fn main() {
    let text = "مرحبا"; // Arabic for "Hello"

    // UTF-8 encodes Arabic characters as 2 bytes each.
    // The String holds 10 bytes, not 5.
    println!("Bytes: {}", text.len());

    // chars() iterates over Unicode scalar values.
    // This respects character boundaries and returns 5 items.
    println!("Chars: {}", text.chars().count());
}
The len() method returns the byte length, which is 10 for this string. Never use len() to count characters. The chars() method returns an iterator over char values. Each char is a Unicode scalar value. Iterating with chars() handles the variable-width encoding automatically. You get one item per code point.
Convention aside: text.chars().count() is the idiomatic way to get a character count when you don't need grapheme precision. It runs in O(n) because it must walk the UTF-8, whereas len() is O(1), but it gives you the number of code points. If you need the number of user-perceived characters, you need grapheme clusters, which the standard library does not provide; that requires an external crate.
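To make the byte widths concrete, here is a small sketch that uses only the standard library. The sample characters are arbitrary picks, one per UTF-8 width, and sizes is a hypothetical helper, not a standard API.

```rust
/// Hypothetical helper: (byte length, code point count) for a string.
fn sizes(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // 'a', Arabic meem, the euro sign, and an emoji: one sample per width.
    for c in "a\u{0645}\u{20AC}\u{1F600}".chars() {
        println!("U+{:04X} takes {} byte(s)", c as u32, c.len_utf8());
    }

    // An Arabic word with two vowel marks (U+064E and U+064B).
    let text = "\u{0645}\u{0631}\u{062D}\u{064E}\u{0628}\u{064B}\u{0627}";
    let (bytes, code_points) = sizes(text);
    println!("{} bytes, {} code points", bytes, code_points);
}
```

The last string is 14 bytes and 7 code points, yet a reader sees five letters. Neither len() nor chars().count() reports that third number.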
The reversal problem
Reversing text is the most common trap. A naive reversal using chars().rev() reverses the sequence of code points. This breaks text visually and logically when the text contains combining marks or mixed scripts.
fn reverse_naive(text: &str) -> String {
    // Reverses code points.
    // This splits combining marks and breaks grapheme clusters.
    text.chars().rev().collect()
}
Arabic uses diacritics. A word like "مرحَبًا" carries vowel marks on some of its letters, and each mark is stored as its own code point right after its base letter. chars() sees the base and the mark as separate items. Reversing them puts the mark before its base, so it now follows, and attaches to, a different letter. The result may still render as plausible-looking text, but the logical structure is damaged: any later processing sees the mark on the wrong base.
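The damage can be observed with the standard library alone. The sketch below assumes a three code point sample (beh, a fatha mark, teh) and a hypothetical helper, base_before, that reports which character a mark is stored after.

```rust
/// Hypothetical helper: returns the character stored immediately before
/// the first occurrence of `mark`, i.e. the base the mark attaches to.
fn base_before(text: &str, mark: char) -> Option<char> {
    let mut previous = None;
    for c in text.chars() {
        if c == mark {
            return previous;
        }
        previous = Some(c);
    }
    None
}

fn main() {
    const FATHA: char = '\u{064E}'; // Arabic combining vowel mark

    // beh (U+0628) + fatha, followed by teh (U+062A).
    let original = "\u{0628}\u{064E}\u{062A}";
    let reversed: String = original.chars().rev().collect();

    // In the original the mark follows beh; after a code point reversal
    // it follows teh instead.
    println!("{:?}", base_before(original, FATHA));
    println!("{:?}", base_before(&reversed, FATHA));
}
```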
Grapheme clusters solve this. A grapheme cluster is the smallest unit of text that the user perceives as a single character. It includes the base letter plus any combining marks. Reversing by grapheme clusters preserves the integrity of each visible unit.
// Requires the unicode-segmentation crate as a dependency.
use unicode_segmentation::UnicodeSegmentation;

/// Reverses text by grapheme clusters to preserve user-perceived characters.
fn reverse_display(text: &str) -> String {
    // graphemes(true) selects extended grapheme clusters, which keep
    // combining marks and ZWJ sequences attached to their base.
    // The rev() iterator reverses the clusters, not the code points.
    text.graphemes(true).rev().collect()
}
The unicode_segmentation crate is the community standard for grapheme processing. It implements the Unicode Standard Annex #29 algorithm. Don't roll your own grapheme logic. The rules for what constitutes a grapheme are complex and change with Unicode versions.
Reversing by code points breaks text. Reverse by graphemes.
Indexing and panics
You cannot index a string by character position. str does not implement Index<usize>. If you try text[0], the compiler rejects it with E0277 (the type str cannot be indexed by {integer}). This is a feature. Indexing by character would require scanning the string to find the byte boundary, which is O(n). Rust forces you to be explicit about what you're doing.
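Being explicit usually looks like the sketch below. nth_char is a hypothetical wrapper around the standard chars().nth() call, shown only to make the linear scan visible.

```rust
/// Hypothetical helper: the n-th code point (zero-based), found by scanning.
fn nth_char(text: &str, n: usize) -> Option<char> {
    // chars() decodes from the start of the string, so this costs O(n).
    text.chars().nth(n)
}

fn main() {
    // The five letters of the Arabic greeting used in this article.
    let text = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";
    println!("{:?}", nth_char(text, 0));
    println!("{:?}", nth_char(text, 9)); // None: there are only 5 code points
}
```

The Option return type makes the out-of-range case part of the signature instead of a panic.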
You can slice a string by byte range using text[start..end]. The slice must start and end on character boundaries. If you cut a multi-byte character in half, the program panics at runtime.
fn main() {
    let text = "مرحبا";

    // Arabic characters are 2 bytes.
    // Index 1 is in the middle of the first character.
    // This panics: byte index 1 is not a char boundary.
    let bad_slice = &text[0..1];
}
The panic message tells you exactly what went wrong. To slice safely, you must find valid byte boundaries. The char_indices() method gives you both the byte offset and the character.
fn first_word(text: &str) -> &str {
    // Find the byte index of the first space.
    // char_indices yields (byte_index, char).
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            // i is a valid char boundary.
            return &text[..i];
        }
    }
    // No space found. Return the whole string.
    text
}
Using char_indices() ensures you always slice on valid boundaries. You can also use is_char_boundary() to check a byte index before slicing. This method returns true if the index is at the start of a character or at the end of the string.
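As a rough sketch of the is_char_boundary() check, the hypothetical helper below trims a string to a byte budget without splitting a character. It also shows str::get, the standard non-panicking form of slicing, which returns None where the indexing syntax would panic.

```rust
/// Hypothetical helper: the longest prefix of `text` that fits in
/// `max_bytes` without splitting a character.
fn truncate_on_boundary(text: &str, max_bytes: usize) -> &str {
    if max_bytes >= text.len() {
        return text;
    }
    let mut end = max_bytes;
    // Step back until `end` sits on a boundary; index 0 always qualifies.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

fn main() {
    // Five Arabic letters, 2 bytes each, 10 bytes in total.
    let text = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";

    // Byte 5 falls inside the third character, so the cut moves back to 4.
    println!("{} bytes kept", truncate_on_boundary(text, 5).len());

    // get() yields None for a range that splits a character.
    println!("{}", text.get(0..5).is_some());
    println!("{}", text.get(0..4).is_some());
}
```

Prefer get() when the offset comes from outside your code, such as a length limit or a stored index, and the indexing syntax only when you computed the offset from the string itself.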
Never slice a string at a byte offset unless you know it falls on a character boundary. Iterate instead.
Bidirectional layout
Rust's standard library does not implement the Unicode Bidirectional Algorithm. The algorithm determines visual order from logical order. It handles embedding levels, directional overrides, and mixed-script paragraphs. Implementing Bidi correctly is hard. The Unicode Standard defines the algorithm in Annex #9.
If you are building a terminal user interface or a custom renderer, you need to compute visual order. Use the unicode-bidi crate. It provides an implementation of the algorithm. You pass it a string; it resolves the embedding level of each character, and from those levels you can obtain the visual runs or the reordered text for each line.
Most applications don't need unicode-bidi. Browsers, terminals, and GUI frameworks handle Bidi automatically. You store text in logical order and let the environment render it. Only reach for unicode-bidi when you are responsible for the final pixel placement.
Choosing the right tool
Use String when you need owned text storage; it handles UTF-8 and RTL characters transparently.
Use &str when you need a borrowed view of text; it preserves logical order and avoids allocation.
Use text.chars() when you need to iterate over Unicode scalar values and performance matters; be aware this splits grapheme clusters.
Use unicode_segmentation when you need user-perceived characters; grapheme clusters handle combining marks and complex scripts correctly.
Use unicode_bidi when you must compute visual layout in Rust; the standard library leaves rendering to the environment.
Reach for text.len() only when you care about byte size for storage or network transmission; never use it for character counts.
Treat text as bytes until the user sees it. Then treat it as graphemes.