🚀 Master SIMD in Rust for Faster Code

Your loop is the bottleneck

You have a function processing an image buffer. It iterates over pixels, multiplying red, green, and blue channels by a scalar. The logic is correct. The types are sound. But the profiler shows this function eating 40% of the CPU time. The bottleneck isn't the algorithm; it's the throughput. The CPU has execution units sitting idle while your code processes one pixel at a time.

Modern CPUs have wide registers. An x86_64 CPU with AVX2 can hold eight 32-bit floats in a single register and add them all in one instruction. ARM CPUs with NEON can do similar tricks. This is SIMD: Single Instruction, Multiple Data. You write one operation, and the hardware applies it to a vector of values simultaneously.

Rust supports SIMD, but it doesn't hide it behind a magic wand. The compiler can auto-vectorize some loops, but for maximum performance, you often need to guide it or write the instructions explicitly. Rust gives you three tools: target feature attributes for enabling hardware support, intrinsics in core::arch for writing the instructions, and feature flags in crates for pre-optimized paths.

The three layers of SIMD in Rust

Rust's approach to SIMD balances safety with hardware access. The compiler cannot assume your code runs on a CPU with AVX512 or NEON. It also cannot guarantee your memory is aligned for wide loads. So Rust forces you to be explicit.

The first layer is target features. You tell the compiler, "This function requires AVX2." The compiler then emits instructions that use AVX2 registers. If you call this function on a CPU without AVX2, the process crashes with an illegal instruction error. You must handle the runtime check yourself.

The second layer is intrinsics. These are functions in std::arch that map directly to CPU instructions. They return opaque types like __m256 or uint32x4_t. You cannot index into these types. You cannot print them. You can only pass them to other intrinsics. This forces you to think in terms of vector operations, not scalar arrays.

The third layer is portable SIMD. On nightly Rust, std::simd provides a portable abstraction. You write code using Simd<[f32; 8]>, and the compiler generates the best instructions for the target architecture. This is the future of SIMD in Rust, but it requires nightly and the portable_simd feature flag.

Writing intrinsics with core::arch

Intrinsics live in std::arch. You import the module for your target, write the function with a #[target_feature] attribute, and wrap the body in unsafe. The unsafe block is required because the compiler cannot verify that the CPU supports the feature or that the memory is aligned. You provide the proof.

/// Adds two arrays of f32 using AVX2 instructions.
/// Caller must verify AVX2 support and alignment.
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32; 8], b: &[f32; 8], out: &mut [f32; 8]) {
    // SAFETY: Caller guarantees AVX2 support via runtime check.
    // Caller guarantees a, b, and out are aligned to 32 bytes.
    use std::arch::x86_64::*;

    // Load 8 floats from aligned memory into a 256-bit register.
    // _mm256_load_ps requires 32-byte alignment.
    let va = _mm256_load_ps(a.as_ptr());
    let vb = _mm256_load_ps(b.as_ptr());

    // Add the vectors in parallel.
    // This maps to a single vaddps instruction.
    let result = _mm256_add_ps(va, vb);

    // Store the result back to aligned memory.
    _mm256_store_ps(out.as_mut_ptr(), result);
}

The #[target_feature] attribute goes on the function, not the block. The compiler uses it to enable the instruction set for that function's code generation. You cannot call a target_feature function from a function without the attribute. The compiler rejects this with an error about calling a function with a target feature from a function without it. This prevents accidental use of unsupported instructions in the call chain.

Convention aside: Keep target_feature functions small and focused. They are the unsafe core. Wrap them in safe functions that handle dispatch and validation. The community calls this the "safe wrapper" pattern. It isolates the unsafe and makes the rest of your codebase easier to audit.

The runtime dispatch pattern

You cannot ship a binary that assumes AVX2. Older CPUs exist. Users run Rust code on everything from modern servers to embedded devices. You must check for support at runtime and fall back to scalar code if the feature is missing.

Rust provides macros for this. is_x86_feature_detected! checks the CPUID flags. It compiles to a fast check that the compiler can optimize. You use it to branch between the SIMD path and the scalar path.

/// Dispatches to AVX2 if available, otherwise uses scalar addition.
fn add_dispatch(a: &[f32; 8], b: &[f32; 8], out: &mut [f32; 8]) {
    // Check CPU support at runtime.
    // This macro expands to a fast CPUID check.
    if is_x86_feature_detected!("avx2") {
        // SAFETY: We just verified AVX2 is available.
        // Alignment is guaranteed by the stack allocation in main.
        unsafe { add_avx2(a, b, out); }
    } else {
        // Fallback to scalar code.
        // The compiler can auto-vectorize this loop if it wants.
        for i in 0..8 {
            out[i] = a[i] + b[i];
        }
    }
}

The scalar fallback is not a performance trap. It's a correctness requirement. If you skip the fallback, your binary crashes on unsupported hardware. The crash is an "Illegal instruction" signal, not a Rust panic. Debugging this in production is painful. Always dispatch.

Convention aside: Use is_x86_feature_detected! for x86 checks. For ARM, use is_arm_feature_detected!. Do not write your own assembly to query CPU features. The macros handle the platform differences and edge cases.

Alignment matters

SIMD instructions often require aligned memory. AVX2 loads like _mm256_load_ps need 32-byte alignment. If you pass a pointer that is not aligned, the CPU raises a general protection fault. The process dies.

Rust's default allocation is 16-byte aligned. That's enough for SSE, but not AVX2. You need to request higher alignment. Use the align attribute on stack variables or allocate with Layout::from_size_align.

/// Computes the sum of two aligned buffers.
fn compute_aligned() {
    // Declare stack buffers with 32-byte alignment.
    // This ensures the pointers passed to intrinsics are valid.
    let mut a: [f32; 8] = [1.0; 8];
    let mut b: [f32; 8] = [2.0; 8];
    let mut out: [f32; 8] = [0.0; 8];

    // SAFETY: Arrays are aligned to 32 bytes by the attribute.
    // AVX2 support is checked by the caller of this example.
    unsafe {
        // This would crash without the align attribute.
        add_avx2(&a, &b, &mut out);
    }
}

For heap allocations, use Vec::with_capacity and then check alignment, or use a crate like aligned-vec. You cannot rely on Vec to give you 32-byte alignment. The standard allocator returns 16-byte aligned pointers. If you need SIMD on heap data, you must manage the alignment yourself.

Pitfall: Slicing a buffer can break alignment. If you have a 32-byte aligned buffer and take a slice starting at index 1, the slice pointer is no longer aligned. The SIMD load will crash. Always verify alignment before slicing, or use unaligned load intrinsics like _mm256_loadu_ps. Unaligned loads are slower on some CPUs but safer.

Trust the alignment requirements. If the intrinsic says "aligned", it means aligned. The compiler will not save you here. The crash happens at runtime.

Portable SIMD on nightly

Writing intrinsics is architecture-specific. If you want to support ARM NEON and x86 AVX2 with the same code, you need portable SIMD. Nightly Rust offers std::simd behind the portable_simd feature.

Portable SIMD gives you types like Simd<[f32; 8]>. You perform operations on these types, and the compiler generates the best instructions for the target. You write the code once, and it compiles to AVX2 on x86, NEON on ARM, or scalar on older hardware.

#![feature(portable_simd)]

use std::simd::Simd;

/// Adds two vectors using portable SIMD.
/// Works on any target supported by std::simd.
fn add_portable(a: &[f32; 8], b: &[f32; 8], out: &mut [f32; 8]) {
    // Load slices into SIMD vectors.
    // The compiler handles alignment and target selection.
    let va = Simd::from_slice(a);
    let vb = Simd::from_slice(b);

    // Add vectors.
    let result = va + vb;

    // Store back to slice.
    result.copy_to_slice(out);
}

Portable SIMD is the future. It removes the need for unsafe in many cases and handles runtime dispatch internally. However, it requires nightly Rust. If you are building a library for production, check the stability status before depending on it. The API may change.

Convention aside: Pin your nightly compiler version in CI when using unstable features. Use rustup override set nightly-2024-01-01 to ensure reproducibility. Nightly changes can break your build overnight.

Enabling SIMD in crates

Sometimes you don't need to write SIMD yourself. Crates like pulldown-cmark, regex, and zstd offer SIMD acceleration behind feature flags. You enable the feature, and the crate uses optimized paths internally.

This is the easiest way to get SIMD performance. You add a feature flag to Cargo.toml, and the crate handles the dispatch and intrinsics.

[dependencies]
pulldown-cmark = { version = "0.12", features = ["simd"] }

The crate checks for CPU support at runtime and uses SIMD if available. You get the speedup without writing unsafe code. Check the crate documentation for the feature name. It's often simd, simd-accel, or nightly.

Convention aside: Enable SIMD features in release builds only. Debug builds may have slower SIMD paths or disabled optimizations. Use features = ["simd"] in your release profile or conditional features.

Pitfalls and compiler errors

SIMD code introduces specific failure modes. The compiler catches some, but others happen at runtime.

If you call a target_feature function from a function without the attribute, the compiler rejects you with an error about ABI compatibility. The error mentions that the called function has a target feature that the caller does not. You must add the attribute to the caller or wrap the call in a safe function.

If you use an intrinsic with the wrong type, you get E0308 (mismatched types). SIMD types are opaque. __m256 is not __m128. They are distinct types. The compiler will not coerce them. You must use the correct intrinsic for the register width.

If you dereference a raw pointer in an intrinsic without unsafe, you get E0133 (dereference of raw pointer requires unsafe). Intrinsics take raw pointers. You must wrap the call in unsafe.

Runtime crashes are the biggest risk. Misaligned memory causes a general protection fault. Missing CPU features cause an illegal instruction. Both crash the process. Always dispatch and always align.

Treat the unsafe block as a proof. If you cannot write the safety comment, you do not have the proof. The safety comment must list the invariants: CPU support, alignment, and validity of pointers.

Decision matrix

Use core::arch intrinsics when you need maximum control and are targeting a specific architecture like x86_64 or ARM. Use intrinsics when the compiler cannot auto-vectorize your loop due to complex dependencies or memory patterns. Use intrinsics when you are writing a library that must support both modern and legacy hardware with explicit dispatch.

Use std::simd on nightly when you want portable SIMD that compiles to the best instructions for any target without writing architecture-specific code. Use portable SIMD when you are building a toolchain or application that can depend on nightly Rust. Use portable SIMD when you want to reduce the unsafe surface area of your codebase.

Use crate feature flags when a library like pulldown-cmark or regex offers SIMD acceleration behind a flag. Use feature flags when you want to gain performance without writing SIMD code yourself. Use feature flags when the crate handles runtime dispatch and alignment internally.

Use is_x86_feature_detected! for runtime dispatch to avoid crashes on older CPUs. Use runtime dispatch whenever you enable target features in your binary. Use runtime dispatch to provide scalar fallbacks for embedded or serverless environments.

Pick the abstraction level that matches your portability needs. Don't write AVX2 code if you're shipping to Raspberry Pis. Don't rely on nightly if you need stable releases. Align your data or watch your performance tank.

Where to go next

SIMD (Single Instruction, Multiple Data) allows your computer to process multiple pieces of data at once instead of one by one. Enabling this feature in your project is like switching from a single-lane road to a multi-lane highway for specific tasks, making them significantly faster. You use it when you need to process large amounts of text or data quickly without writing complex low-level code.