How to Use SIMD Intrinsics in Rust

Use SIMD intrinsics in Rust through the stable `std::arch` module, combining `#[target_feature]` with runtime CPU detection for high-performance parallel data processing.

When loops aren't fast enough

You are processing a high-resolution image. A nested loop over pixels is grinding the CPU to a halt. You know the processor can handle multiple pixels in a single instruction, but your scalar loop touches only one at a time. You want to unlock SIMD.

SIMD stands for Single Instruction, Multiple Data. It lets you perform the same operation on a block of data simultaneously. Instead of adding two numbers, you add eight numbers in one instruction. Instead of comparing one value, you compare sixteen. The speedup is real, but the API is raw. Rust exposes SIMD through std::arch, a module full of hardware-specific intrinsics that bypass the safety net.

The old advice to use the stdsimd crate is obsolete. std::arch has been stable since Rust 1.27. You do not need external crates to access intrinsics. You need unsafe, discipline, and a clear understanding of what the hardware expects.

What SIMD actually is

Think of a chef chopping vegetables. A normal loop is one chef, one knife, one carrot at a time. SIMD is a specialized machine that takes eight carrots, aligns them perfectly, and slices all eight in one motion. The instruction is the same ("slice"), but the data moves in parallel.

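The machine is picky. You must feed it exactly eight carrots, not seven or nine, and some of its loading instructions require the carrots to sit at an aligned address in memory. If you hand an aligned-load instruction a misaligned address, the machine jams. In Rust terms, that is undefined behavior, and in practice it usually shows up as a crash (a segmentation fault on Linux). The compiler cannot check alignment or CPU support for you in general, which is why these intrinsics live behind unsafe.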

SIMD registers are wide. On x86_64, you have 128-bit registers (SSE), 256-bit registers (AVX/AVX2), and 512-bit registers (AVX-512). A 256-bit register can hold eight f32 values or sixteen i16 values. The intrinsics are named after the instruction set. _mm prefixes usually mean 128-bit. _mm256 means 256-bit. _mm512 means 512-bit.
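The lane counts above are just the register width divided by the element size. A minimal sketch of that arithmetic follows; the helper name `lanes` is hypothetical and not part of std::arch.

```rust
use std::mem::size_of;

/// Number of elements of type `T` that fit in a register `bits` wide.
/// `lanes` is an illustrative helper, not a std::arch API.
pub const fn lanes<T>(bits: usize) -> usize {
    // Convert the width to bytes, then divide by the element size.
    (bits / 8) / size_of::<T>()
}
```

Here `lanes::<f32>(256)` evaluates to 8 and `lanes::<i16>(256)` to 16, matching the figures for a 256-bit register.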

The minimal intrinsic

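Here is the smallest working example. It adds eight f32 values from two pointers using 256-bit instructions on x86_64. The floating-point intrinsics used here belong to AVX; the function enables AVX2, which implies AVX.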

/// Adds eight f32 values using 256-bit intrinsics.
///
/// # Safety
/// The caller must ensure that the CPU supports AVX2, and that `a`, `b`,
/// and `out` are aligned to 32 bytes and valid for at least 8 elements.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn simd_add_aligned(a: *const f32, b: *const f32, out: *mut f32) {
    use std::arch::x86_64::*;

    // Load 8 floats from aligned memory into a 256-bit register.
    // _mm256_load_ps requires 32-byte alignment.
    let va = _mm256_load_ps(a);
    let vb = _mm256_load_ps(b);

    // Add the vectors in parallel.
    let vsum = _mm256_add_ps(va, vb);

    // Store the result back to aligned memory.
    _mm256_store_ps(out, vsum);
}

Three pieces control this function. #[cfg(target_arch = "x86_64")] tells the compiler to compile this code only for x86_64 targets. If you build for ARM, the function does not exist. #[target_feature(enable = "avx2")] tells the compiler it may emit AVX2 (and therefore AVX) instructions inside this function, and only this function. The rest of the crate is still compiled for the baseline CPU. The unsafe keyword marks the function as unsafe for two reasons: it performs raw memory loads and stores without bounds checking, and calling it on a CPU without AVX2 is undefined behavior.

The intrinsics _mm256_load_ps and _mm256_store_ps assume the pointers are aligned to 32 bytes. If you pass a pointer that is not aligned, the behavior is undefined, and the program typically crashes. The u variants, _mm256_loadu_ps and _mm256_storeu_ps, handle unaligned memory. They are slightly slower on some older CPUs but much safer for beginners.

Convention dictates using the unaligned variants unless profiling proves alignment is the bottleneck. The performance difference is often negligible compared to the risk of a segfault.
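To make the unaligned-by-default convention concrete, here is a minimal sketch that takes fixed-size arrays, so the "at least eight elements" requirement is enforced by the type system. The names `add8` and `add8_avx` are hypothetical, and the scalar fallback is an assumption about how you would want non-AVX machines handled.

```rust
/// Adds eight f32 pairs. Uses unaligned AVX loads when the CPU has AVX,
/// otherwise a scalar loop. `add8` is a hypothetical helper name.
pub fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX was detected at runtime, and both arrays hold
            // exactly eight f32 values.
            return unsafe { add8_avx(a, b) };
        }
    }

    // Scalar fallback for CPUs without AVX and for other architectures.
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add8_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;

    let mut out = [0.0f32; 8];
    // SAFETY: the loadu/storeu variants accept any address that is valid
    // for 32 bytes, which the array references guarantee. No 32-byte
    // alignment is required.
    unsafe {
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb));
    }
    out
}
```

An [f32; 8] is only guaranteed 4-byte alignment, so this function would be unsound with the aligned variants. With the u variants it is fine wherever the arrays happen to live.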

How the compiler handles this

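The #[target_feature] attribute does more than enable instructions. It makes the function's correctness depend on the CPU it runs on. If the function executes on a processor that lacks the feature, the behavior is undefined; in practice the CPU usually raises an illegal-instruction fault. The attribute also limits inlining: the compiler will not inline a #[target_feature] function into a caller that lacks the same features, so a kernel called from ordinary code costs a real function call.

The same hazard follows the function wherever it goes. If you take a function pointer to a SIMD function and call it without first checking CPU support, you risk undefined behavior. Vector types such as __m256 add an ABI wrinkle of their own when they cross extern "C" boundaries between code compiled with different features, so it is simplest to keep them inside the SIMD functions. This is why you should avoid exposing #[target_feature] functions in your public API. Wrap them in safe abstractions.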

The compiler also rejects code with mismatched types. If you pass an i32 pointer to an f32 intrinsic, you get E0308 (mismatched types). If you call an unsafe intrinsic outside an unsafe block or function, you get E0133 (unsafe code used outside an unsafe block). These errors are helpful. They catch mistakes early.

Runtime detection and fallback

Not every CPU supports AVX2. Some older machines only have SSE4.2. Some ARM devices have NEON. You cannot rely on cfg alone. You need runtime detection.

Rust provides macros like is_x86_feature_detected! to check CPU features at runtime. You combine this with a fallback path for CPUs that lack the feature.

/// Processes data with SIMD if available, falls back otherwise.
pub fn process_safe(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && b.len() == out.len());

    // The detection macro and the SIMD helper exist only on x86_64,
    // so the fast path is compiled out on every other target.
    #[cfg(target_arch = "x86_64")]
    {
        // Check if the CPU supports AVX2 at runtime.
        if is_x86_feature_detected!("avx2") {
            // SAFETY: We verified the CPU supports avx2.
            // The slices are valid for the length of the operation.
            // We use unaligned loads to avoid alignment crashes.
            unsafe { simd_add_unaligned(a.as_ptr(), b.as_ptr(), out.as_mut_ptr(), a.len()); }
            return;
        }
    }

    // Fallback to scalar loop.
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}

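/// Unaligned SIMD add helper.
///
/// # Safety
/// The CPU must support AVX2, and `a`, `b`, and `out` must each be valid
/// for `len` elements.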
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn simd_add_unaligned(a: *const f32, b: *const f32, out: *mut f32, len: usize) {
    use std::arch::x86_64::*;

    // Process 8 elements at a time.
    let count = len / 8;
    for i in 0..count {
        let offset = i * 8;
        let va = _mm256_loadu_ps(a.add(offset));
        let vb = _mm256_loadu_ps(b.add(offset));
        let vsum = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(out.add(offset), vsum);
    }

    // Handle remaining elements.
    for i in (count * 8)..len {
        *out.add(i) = *a.add(i) + *b.add(i);
    }
}

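The is_x86_feature_detected! macro queries CPUID once and caches the result, so later calls are a cheap load and branch. You can put it in hot paths without fear, though hoisting the check out of the innermost loop is still good practice. The fallback loop ensures the code runs on any x86_64 CPU. The cfg attribute ensures the SIMD code only compiles on x86_64. The macro itself exists only on x86 and x86_64 targets, so the detection branch must sit behind the same cfg guard. On ARM the whole fast path is compiled out and only the scalar loop remains.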

Convention aside: always check feature detection at runtime, not just at compile time. A user might run your binary on a server with AVX-512 and then on a laptop with only SSE4.2. Your code must adapt.
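The difference between the two kinds of check can be sketched as follows. `cfg!(target_feature = ...)` reports what the binary was compiled to assume, while the detection macro reports what the running CPU offers. The function names here are hypothetical.

```rust
/// True if the whole binary was compiled with AVX2 enabled, for example
/// via `-C target-feature=+avx2` or `-C target-cpu=native`.
pub fn avx2_at_compile_time() -> bool {
    cfg!(target_feature = "avx2")
}

/// True if the CPU running the program reports AVX2 support.
pub fn avx2_at_runtime() -> bool {
    // The macro only exists on x86 and x86_64, so guard it with cfg.
    #[cfg(target_arch = "x86_64")]
    {
        is_x86_feature_detected!("avx2")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        false
    }
}

/// Names the code path a dispatcher would pick on this machine.
pub fn chosen_path() -> &'static str {
    if avx2_at_runtime() {
        "avx2"
    } else {
        "scalar"
    }
}
```

A default release build for x86_64 does not enable AVX2 at compile time, so the first function typically returns false even on a CPU where the second returns true. That gap is the reason the dispatch belongs at runtime.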

The safe wrapper pattern

Never expose unsafe intrinsics in your public API. The community calls this the "safe wrapper" pattern. You write the dirty work in a private unsafe function, then wrap it in a public safe function that enforces invariants.

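The safe function checks lengths and, when the helper uses aligned loads, alignment. Slices already guarantee valid, non-null pointers, so there is nothing to check there. It calls the unsafe function only when the invariants hold. The unsafe block is small and isolated. This keeps the blast radius of unsafe minimal.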

/// Adds two slices element-wise.
///
/// Uses SIMD if available on x86_64.
pub fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Enforce length invariant.
    if a.len() != b.len() || a.len() != out.len() {
        panic!("Slices must have equal length");
    }

    // SAFETY: Lengths are equal. Pointers are valid for the slice length.
    // The unsafe helper handles architecture dispatch internally.
    unsafe {
        add_slices_impl(a.as_ptr(), b.as_ptr(), out.as_mut_ptr(), a.len());
    }
}

/// Internal implementation with architecture dispatch.
#[inline(always)]
unsafe fn add_slices_impl(a: *const f32, b: *const f32, out: *mut f32, len: usize) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        simd_add_unaligned(a, b, out, len);
        return;
    }

    // Scalar fallback for all architectures.
    for i in 0..len {
        *out.add(i) = *a.add(i) + *b.add(i);
    }
}

The #[inline(always)] attribute encourages the compiler to inline the dispatch logic. This avoids function call overhead. The safe wrapper is the only thing users see. They get performance without touching unsafe.

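Treat the SAFETY comment as a proof. If you cannot write the invariants, you do not have a proof. The # Safety section on an unsafe function lists exactly what the caller must guarantee, and the SAFETY comment at each call site explains why those guarantees hold there. If the caller violates the proof, the behavior is undefined.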

Pitfalls and traps

SIMD code is a minefield of subtle bugs. Alignment is the biggest trap. Intrinsics like _mm256_load_ps require 32-byte alignment. A Vec<f32> only guarantees the 4-byte alignment of f32, so passing its pointer to an aligned load is undefined behavior and usually crashes. If you need guaranteed alignment, store the data in a type marked #[repr(align(32))], or use slice::align_to to split a slice into an unaligned prefix, an aligned middle, and a suffix. Or just use the u variants.
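When aligned loads are worth it, one way to guarantee 32-byte alignment is a wrapper type with an alignment attribute. This is a minimal sketch, assuming fixed blocks of eight floats are acceptable. `Aligned8`, `is_aligned_32`, and `add_aligned` are hypothetical names.

```rust
/// Eight f32 values forced onto a 32-byte boundary.
/// `Aligned8` is a hypothetical wrapper, not a std type.
#[repr(C, align(32))]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Aligned8(pub [f32; 8]);

/// Returns true if `ptr` sits on a 32-byte boundary.
pub fn is_aligned_32(ptr: *const f32) -> bool {
    (ptr as usize) % 32 == 0
}

/// Adds two aligned blocks, using aligned AVX loads when available.
pub fn add_aligned(a: &Aligned8, b: &Aligned8) -> Aligned8 {
    let mut out = Aligned8([0.0; 8]);

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX was detected, and the Aligned8 type guarantees
            // 32-byte alignment and exactly eight elements.
            unsafe { add_aligned_avx(a, b, &mut out) };
            return out;
        }
    }

    // Scalar fallback.
    for i in 0..8 {
        out.0[i] = a.0[i] + b.0[i];
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_aligned_avx(a: &Aligned8, b: &Aligned8, out: &mut Aligned8) {
    use std::arch::x86_64::*;

    // SAFETY: the array is the first field of a repr(C, align(32)) struct,
    // so its address satisfies the aligned load/store requirement.
    unsafe {
        let va = _mm256_load_ps(a.0.as_ptr());
        let vb = _mm256_load_ps(b.0.as_ptr());
        _mm256_store_ps(out.0.as_mut_ptr(), _mm256_add_ps(va, vb));
    }
}
```

The alignment now lives in the type, so the safe wrapper has nothing to check at runtime. The cost is that callers must store their data in `Aligned8` blocks instead of a plain Vec<f32>.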

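Another trap is SIMD code that compiles but does not run as SIMD. Intrinsics are ordinary functions marked with #[target_feature], and the compiler will not inline them into a caller that lacks the same feature. If you call _mm256_add_ps from a function without #[target_feature(enable = "avx")], each intrinsic can become a real function call and the speedup disappears. Keep the whole kernel inside one #[target_feature] function. On stable Rust, #[inline(always)] is rejected on a #[target_feature] function, so put it on the dispatch wrapper instead. Check the assembly output to verify the vector instructions are present.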

Function values can also bite you. A #[target_feature] function does not implement the Fn traits, so you cannot pass it directly to standard library functions like Iterator::map; the compiler rejects it. Coercing it to an unsafe fn pointer works, but the pointer carries no record of the CPU requirement, and calling it on a machine without the feature is undefined behavior. Wrap SIMD functions in safe abstractions, or in a closure that runs only after feature detection, to avoid this.
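One way to use a SIMD kernel with iterator adapters is to wrap the call in a closure after checking the CPU. A minimal sketch, with hypothetical names `scale_chunks`, `scale8_avx`, and `scale8_scalar`:

```rust
/// Multiplies every block of eight floats by `factor`.
pub fn scale_chunks(data: &[[f32; 8]], factor: f32) -> Vec<[f32; 8]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // `scale8_avx` is an unsafe fn, so it does not implement `Fn`
            // and cannot be handed to `map` directly. A closure can.
            return data
                .iter()
                // SAFETY: AVX support was verified just above.
                .map(|chunk| unsafe { scale8_avx(chunk, factor) })
                .collect();
        }
    }

    // Scalar fallback.
    data.iter().map(|chunk| scale8_scalar(chunk, factor)).collect()
}

fn scale8_scalar(chunk: &[f32; 8], factor: f32) -> [f32; 8] {
    let mut out = *chunk;
    for x in out.iter_mut() {
        *x *= factor;
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn scale8_avx(chunk: &[f32; 8], factor: f32) -> [f32; 8] {
    use std::arch::x86_64::*;

    let mut out = [0.0f32; 8];
    // SAFETY: unaligned load/store on references valid for eight f32.
    unsafe {
        let v = _mm256_loadu_ps(chunk.as_ptr());
        let f = _mm256_set1_ps(factor);
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_mul_ps(v, f));
    }
    out
}
```

The closure itself is compiled without AVX, so the kernel is likely to stay a real function call per chunk. For hot loops it is usually better to move the whole iteration inside the #[target_feature] function.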

Compiler errors help here. If you dereference a raw pointer or call an unsafe function without unsafe, you get E0133. If you mix types, you get E0308. If you use a value after it has been moved, you get E0382. These errors are your friends. They stop you from shipping broken code.

Alignment isn't a suggestion. It's a contract with the hardware. Break it and you segfault.

Decision: intrinsics vs alternatives

Use std::arch intrinsics when profiling proves the compiler's auto-vectorization is insufficient and you need explicit control over the instruction stream. Use intrinsics when you are implementing a library that requires deterministic performance across different compiler versions. Use intrinsics when you are writing a hot loop that processes large arrays of uniform data.

Use portable SIMD (std::simd, developed in the rust-lang/portable-simd repository) when you need SIMD performance but want to avoid maintaining separate code paths for x86, ARM, and RISC-V. It provides a safe, cross-platform API that compiles to vector instructions on each architecture. It is still experimental and requires a nightly compiler with the portable_simd feature gate.

Reach for plain loops and #[inline] first. The Rust compiler is aggressive at auto-vectorization. Simple loops over slices often compile to SIMD automatically. Check the assembly with perf or Godbolt before writing intrinsics. Intrinsics add maintenance cost and complexity. They are worth it only when the numbers demand it.
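As a baseline to measure against, here is a sketch of the same element-wise add written with no intrinsics at all. The function name `add_plain` is hypothetical. Whether the compiler vectorizes it depends on the compiler version, optimization level, and target features, so treat that as something to verify and not as a guarantee.

```rust
/// Element-wise add written as a plain iterator chain.
/// Zipping the slices removes per-element index checks, which often
/// gives the optimizer room to vectorize the loop in release builds.
pub fn add_plain(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());

    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}
```

Build with optimizations and inspect the output, for example with `cargo rustc --release -- --emit asm` or by pasting the function into Godbolt. If packed instructions such as addps or vaddps already appear, hand-written intrinsics may not be worth the maintenance cost.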

Counter-intuitive but true: the more you use unsafe, the harder the rest of your code becomes to reason about. Keep intrinsics isolated. Wrap them tightly. Let the safe code do the heavy lifting.

Where to go next

SIMD intrinsics let your code run math operations on multiple numbers at once, speeding up tasks like image processing or physics calculations. Think of it as having a team of workers instead of one person doing the same task repeatedly. You use them when you need maximum performance for number-crunching tasks and are willing to write code specific to your computer's processor.