How to Compare Strings with Different Encodings in Rust

The encoding trap

You are building a tool to merge user profiles. One profile arrives from a modern web API. The other comes from a legacy database dump that screams "Latin-1" in every byte. You load both names into memory. You check if they match. The check fails. The names look identical on screen. The bytes are completely different. You just hit the encoding wall.

This happens because computers do not store text. They store numbers. Encodings are the maps that translate those numbers into characters. If you compare the numbers directly without consulting the map, you are comparing ink patterns, not meaning.

Rust forces you to consult the map. It refuses to let you treat arbitrary bytes as text. This design eliminates a massive class of bugs found in other languages, where strings can silently contain garbage or mixed encodings. In Rust, text is always valid UTF-8. If you have bytes in another encoding, they stay bytes until you explicitly decode them.

Rust's UTF-8 guarantee

Rust's String type is UTF-8. The &str slice is UTF-8. There is no configuration flag. There is no String<Latin1>. This constraint is the foundation of Rust's text safety.

When you hold a &str, the type guarantees that the underlying bytes form a valid UTF-8 sequence. Safe code cannot construct an invalid one. This guarantee allows Rust to provide efficient operations. You can iterate over characters safely. You can search for substrings without worrying about multi-byte boundaries. You can compare strings knowing that equal bytes always mean equal sequences of Unicode scalar values.
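A minimal sketch of what that guarantee buys you, using only the standard library. The helper name char_widths is made up for illustration.

```rust
/// Hypothetical helper: pairs each character with its UTF-8 length in bytes.
fn char_widths(text: &str) -> Vec<(char, usize)> {
    text.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    let text = "Café €5";

    // chars() yields whole Unicode scalar values, never half of a
    // multi-byte sequence.
    for (c, width) in char_widths(text) {
        println!("{:?} takes {} byte(s)", c, width);
    }

    // find() returns a byte offset, and that offset always lands on a
    // character boundary.
    if let Some(offset) = text.find('€') {
        assert!(text.is_char_boundary(offset));
        println!("'€' starts at byte {}", offset);
    }
}
```

Slicing a &str at an offset that is not a character boundary panics, which is why offsets returned by find and char_indices are the safe ones to use.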

If you have a Vec<u8> containing Latin-1 data, Rust treats it as raw binary. You cannot pass it to a function expecting &str. The compiler rejects the code. This rejection is a feature. It forces you to handle encoding at the boundaries of your system. You decode the data as soon as it enters your program. Once decoded, the data becomes a String. Inside your code, you only deal with String and &str. The encoding problem is solved at the door.

Think of encodings as different alphabets for the same language. Latin-1 is a limited alphabet of 256 characters. Unicode is the universal set, and UTF-8 is the way Rust writes it down. If you want to compare two texts, you must translate both into the universal set first. Comparing raw bytes is like comparing the shapes of letters in two different alphabets. The shapes differ even if the words are the same.
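To make the analogy concrete, here is a small sketch that writes the same word in both alphabets and prints the bytes. The function encode_latin1 is a hypothetical helper written for this example. It relies on the fact that Latin-1 byte values equal the code points U+0000 through U+00FF.

```rust
/// Hypothetical helper: encodes text as Latin-1, or returns None if any
/// character does not fit in a single byte.
fn encode_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let name = "Café";
    let utf8 = name.as_bytes();
    let latin1 = encode_latin1(name).expect("all characters fit in Latin-1");

    // Same word, different byte sequences.
    println!("UTF-8:   {:02x?}", utf8);
    println!("Latin-1: {:02x?}", latin1);
    println!("Same bytes: {}", utf8 == latin1.as_slice());
}
```

The é is one byte in Latin-1 and two bytes in UTF-8, so the raw byte comparison fails even though the word is the same.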

Minimal comparison

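When both strings are UTF-8, comparison is straightforward. Rust compares the encoded bytes, and because every Unicode scalar value has exactly one valid UTF-8 encoding, equal bytes mean equal scalar values. You never have to handle the variable-width encoding yourself, even though some characters take one byte and others take two, three, or four.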

fn main() {
    // String literals are UTF-8 by default.
    let name1 = "Café";
    let name2 = "Café";

    // Comparison checks the UTF-8 bytes, which match exactly when
    // the Unicode scalar values match.
    println!("Match: {}", name1 == name2);
}

The == operator on &str compares the underlying bytes. For valid UTF-8, that gives the same answer as comparing Unicode scalar values, because each scalar value has exactly one UTF-8 encoding. So "Café" equals "Café" whenever both strings hold the same sequence of scalar values. The operator does not apply Unicode normalization: two strings that render identically but use different code point sequences compare as unequal.
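If you want to check how == relates to bytes and scalar values, a short experiment helps. This sketch compares two strings three ways. The function name compare_three_ways is invented for the example.

```rust
/// Hypothetical helper: compares two strings with ==, byte by byte, and
/// scalar value by scalar value.
fn compare_three_ways(a: &str, b: &str) -> (bool, bool, bool) {
    (
        a == b,
        a.as_bytes() == b.as_bytes(),
        a.chars().eq(b.chars()),
    )
}

fn main() {
    let pairs = [("Café", "Café"), ("Café", "Cafe"), ("", "")];

    for &(a, b) in &pairs {
        let (by_eq, by_bytes, by_chars) = compare_three_ways(a, b);

        // For valid UTF-8 the three answers always agree.
        assert!(by_eq == by_bytes && by_bytes == by_chars);
        println!("{:?} vs {:?}: {}", a, b, by_eq);
    }
}
```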

If you try to compare a Vec<u8> with a &str, the compiler rejects you with E0277 (trait bound not satisfied). The types are incompatible. You must decode the bytes first.

Decoding legacy data

Real-world data is rarely clean. You will encounter files in Latin-1, Shift-JIS, Windows-1252, and other encodings. Rust's standard library provides std::str::from_utf8, but that function only accepts valid UTF-8. If you pass it Latin-1 bytes that contain any non-ASCII character, it almost always returns an Err, because a lone high byte such as 0xE9 is not a complete UTF-8 sequence.
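Here is a minimal sketch of that failure using only the standard library. The wrapper utf8_status is a made-up name. It returns the text on success and the length of the valid prefix on failure.

```rust
use std::str;

/// Hypothetical helper: returns the text if `bytes` is valid UTF-8,
/// otherwise the number of bytes that were valid before the first error.
fn utf8_status(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    let utf8_bytes = "Café".as_bytes();
    let latin1_bytes = [b'C', b'a', b'f', 0xe9];

    // Valid UTF-8 passes through without copying.
    println!("{:?}", utf8_status(utf8_bytes));

    // The Latin-1 bytes fail at the 0xe9 byte.
    println!("{:?}", utf8_status(&latin1_bytes));
}
```

The error position is useful in logs: it tells you where the first non-UTF-8 byte sits in the input.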

To handle other encodings, you need a decoding library. The community standard is the encoding_rs crate. It implements the encodings defined by the WHATWG Encoding Standard, which covers the legacy encodings found on the web, and provides fast, safe decoding.

Add encoding_rs to your Cargo.toml. Then use it to decode bytes into a String.

use encoding_rs::Encoding;

fn main() {
    // Simulate a Latin-1 byte sequence.
    // The byte 0xe9 represents 'é' in Latin-1.
    let latin1_bytes: Vec<u8> = vec![b'C', b'a', b'f', 0xe9];

    // Look up the encoding by its label.
    // encoding_rs follows the WHATWG Encoding Standard, which maps the
    // "iso-8859-1" label to windows-1252.
    let encoding = Encoding::for_label(b"iso-8859-1").unwrap();

    // Decode the bytes into text.
    // The result is a tuple: (Cow<str>, &'static Encoding, bool).
    let (decoded, _encoding_used, had_errors) = encoding.decode(&latin1_bytes);

    // Check the loss flag.
    // A true value means some bytes were invalid and replaced.
    if had_errors {
        eprintln!("Warning: Lossy conversion occurred.");
    }

    // A string literal is already UTF-8.
    let modern_string = "Café";

    // Compare the decoded text with a UTF-8 string.
    println!("Match: {}", decoded == modern_string);
}

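The decode method returns a tuple of three elements. The first is the decoded text as a Cow<str>, which borrows the input when no conversion is needed and owns a new String otherwise. The second is the encoding that was actually used, which can differ from the one you asked for if the input starts with a byte order mark. The third is a boolean indicating whether any bytes were malformed. When encoding_rs encounters a sequence that cannot be decoded, it inserts the Unicode replacement character � (U+FFFD). This keeps the output valid UTF-8. The loss flag tells you whether this substitution happened. Note that windows-1252 assigns a character to every byte, so the flag never fires for it. It matters for encodings such as Shift-JIS or UTF-8, where some byte sequences are invalid.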

Convention aside: Always check the loss flag when processing external data. Ignoring it is a common source of silent data corruption. If you see the replacement character in your output, you have lost information. Log the error or reject the record.
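One way to turn that convention into code is a small gate between the decoder and the rest of the program. The sketch below is an assumption about how you might structure it, not part of any library. The function accept_record is hypothetical, and the standard library's lossy UTF-8 conversion stands in for a lossy decoder so the example has no dependencies.

```rust
/// Hypothetical policy: accept a decoded record only if the decoder
/// reported no loss.
fn accept_record(decoded: &str, had_errors: bool) -> Result<&str, String> {
    if had_errors {
        // Point at the first replacement character so the log entry is
        // actionable.
        let at = decoded.find('\u{FFFD}');
        return Err(format!("lossy decode, first U+FFFD at byte {:?}", at));
    }
    Ok(decoded)
}

fn main() {
    let raw: &[u8] = &[b'C', b'a', b'f', 0xe9];

    // Stand-in for a lossy decoder: std's UTF-8 lossy conversion, with the
    // loss flag computed from a strict validity check.
    let had_errors = std::str::from_utf8(raw).is_err();
    let decoded = String::from_utf8_lossy(raw);

    match accept_record(&decoded, had_errors) {
        Ok(text) => println!("accepted: {}", text),
        Err(reason) => eprintln!("rejected: {}", reason),
    }
}
```

With a real decoder, you would pass its text and its loss flag straight into the gate.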

The cost of decoding

Decoding is not free. It requires iterating over the bytes, validating the sequence, and usually allocating a new String. If you are comparing millions of strings, decoding every time adds overhead.

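If you know that two byte sequences share the same encoding, you can compare them as bytes. This is faster. However, this optimization is dangerous. It only works if the encoding is identical and the data is clean. If one stream is Latin-1 and the other is Windows-1252, byte comparison can report a match between texts that differ. ISO-8859-1 reserves the 0x80 to 0x9F range for control codes. Windows-1252 reassigns most of that range to printable characters such as the euro sign and curly quotes. The same byte in that range means a different character in each encoding. If one stream is Latin-1 and the other is UTF-8, the opposite happens: the same é is 0xE9 in one and 0xC3 0xA9 in the other, so byte comparison reports a mismatch between identical texts.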
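The following sketch shows the same byte decoding to different characters. It is illustrative only: windows_1252_sample is a hypothetical function that covers just three bytes from the 0x80 to 0x9F range, and a real decoder needs the full table.

```rust
/// Decodes one byte as ISO-8859-1, where byte N is always code point U+00NN.
fn latin1_char(byte: u8) -> char {
    char::from(byte)
}

/// Hypothetical partial Windows-1252 decoder. Only three bytes from the
/// 0x80..=0x9F range are covered; the rest of that range returns None.
fn windows_1252_sample(byte: u8) -> Option<char> {
    match byte {
        0x80 => Some('\u{20AC}'), // euro sign
        0x93 => Some('\u{201C}'), // left double quotation mark
        0x94 => Some('\u{201D}'), // right double quotation mark
        // Outside 0x80..=0x9F the two encodings agree.
        0x00..=0x7F | 0xA0..=0xFF => Some(char::from(byte)),
        _ => None, // not covered by this sketch
    }
}

fn main() {
    for &byte in &[0x80u8, 0x93, 0xE9] {
        println!(
            "byte {:#04x}: Latin-1 {:?}, Windows-1252 {:?}",
            byte,
            latin1_char(byte),
            windows_1252_sample(byte)
        );
    }
}
```

Two records containing byte 0x93 are byte-equal, but one holds a control code and the other a curly quote.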

Decode at the boundary. Keep the interior pure. Decode the data as soon as it enters your system. Store it as String. Compare String values. This approach is safer and easier to reason about. The performance cost is usually negligible compared to I/O and network latency.
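As a sketch of that rule, the code below separates a boundary function from an interior function. Both names, decode_latin1 and same_name, are made up for illustration. The decoder handles strict ISO-8859-1 only, which works with the standard library alone because each Latin-1 byte equals its Unicode code point. Other encodings need a real decoding library.

```rust
/// Boundary function: turns ISO-8859-1 bytes into an owned UTF-8 String.
/// Every Latin-1 byte maps to the code point with the same number, so this
/// conversion cannot fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Interior function: only ever sees text, never encodings.
fn same_name(a: &str, b: &str) -> bool {
    a == b
}

fn main() {
    // Decode once, at the point where the legacy data enters the program.
    let legacy = decode_latin1(&[b'C', b'a', b'f', 0xe9]);

    // Data from the modern source is already UTF-8.
    let modern = String::from("Café");

    println!("Match: {}", same_name(&legacy, &modern));
}
```

The interior function takes &str, so the type system stops undecoded bytes from reaching it.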

Pitfalls and gotchas

String comparison in Rust is robust, but subtle issues remain.

Normalization

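Unicode allows multiple ways to represent the same character. The character é can be a single code point U+00E9 or a base e (U+0065) followed by a combining acute accent U+0301. The single code point is the precomposed form, which NFC normalization produces. The two-code-point sequence is the decomposed form, which NFD normalization produces.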

Rust's == compares code points. It does not normalize. If one string uses the single code point and the other uses the combining sequence, the comparison fails.

fn main() {
    // Single code point for 'é'.
    let s1 = "Café";

    // Base 'e' plus combining acute accent.
    // This looks identical on screen but has different bytes.
    let s2 = "Caf\u{0065}\u{0301}";

    // Comparison fails because code points differ.
    println!("Match: {}", s1 == s2); // false
}
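When two strings look the same but refuse to match, printing their code points usually settles the question. This is a small diagnostic sketch using only the standard library. The helper name code_points is invented here.

```rust
/// Hypothetical helper: lists the code points of `text` in U+XXXX notation.
fn code_points(text: &str) -> Vec<String> {
    text.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "Caf\u{00E9}";
    let decomposed = "Cafe\u{0301}";

    // The two lists differ in their last entries.
    println!("{:?}", code_points(precomposed));
    println!("{:?}", code_points(decomposed));

    // The byte lengths differ as well.
    println!("bytes: {} vs {}", precomposed.len(), decomposed.len());
}
```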

If your data comes from mixed sources, you might need normalization. Use the unicode-normalization crate to convert strings to a canonical form before comparison. This is a common requirement when dealing with user input from web forms and mobile apps.

Normalization is the silent killer of string equality. Check it before you blame the encoding.

Binary data

Not all byte sequences are text. Cryptographic hashes, image headers, and protocol buffers are binary. Do not attempt to decode binary data as UTF-8. The result will be garbage or errors.

If you need to compare binary data, use Vec<u8> or &[u8]. The == operator works on slices. It compares bytes directly. This is correct for binary data.

fn main() {
    // Binary data, not text.
    let hash1 = vec![0xde, 0xad, 0xbe, 0xef];
    let hash2 = vec![0xde, 0xad, 0xbe, 0xef];

    // Byte-level comparison is correct here.
    println!("Match: {}", hash1 == hash2);
}

If you try to treat binary data as a string, you risk decode errors, panics from calling unwrap on those errors, and mojibake. Keep binary and text separate. Use Vec<u8> for binary. Use String for text.

Decision matrix

Use String and &str when your data is UTF-8 and you need semantic comparison. Use encoding_rs when you encounter legacy encodings like Latin-1, Shift-JIS, or Windows-1252 and must decode to text. Use Vec<u8> comparison when you are dealing with binary protocols, cryptographic hashes, or raw data where encoding is irrelevant. Use unicode-normalization when you need to compare strings that might use different Unicode representations for the same character.

Decode at the boundary. Keep the interior pure. Trust the loss flag. It tells you when your data is lying.

Where to go next

Rust strings are strictly UTF-8, so you cannot directly compare a UTF-8 string with bytes in a different encoding such as Latin-1 or Shift-JIS. Plain ASCII bytes are already valid UTF-8, so std::str::from_utf8 accepts them unchanged. Everything else must first be translated into UTF-8 so both sides of the comparison speak the same language. Think of it like translating two people speaking different languages into English before asking if they said the same thing.