How to Use GPU Computing in Rust (wgpu, CUDA bindings)

When the CPU hits the wall

You are rendering a particle system. Your CPU is screaming at 100% usage, and your frame rate is stuck at 12 FPS. You have thousands of independent calculations to do every frame. The CPU is a single-lane highway; the GPU is a thousand-lane superhighway sitting idle. You need to offload the work.

Rust gives you two main paths for GPU computing. The wgpu crate provides a portable abstraction over Vulkan, Metal, DX12, OpenGL ES, and, in the browser, WebGPU and WebGL. It runs on most desktop, mobile, and web targets and keeps your application code in safe Rust. The cuda-sys crate gives you raw FFI bindings to NVIDIA's CUDA driver and runtime APIs. It unlocks maximum performance and CUDA-specific features on NVIDIA hardware, but it requires unsafe code and locks you to a single vendor.

The GPU is not a magic CPU

The GPU isn't just a graphics card. It is a massively parallel processor. Think of the CPU as a master chef who can chop, sauté, and plate perfectly but only one thing at a time. The GPU is an army of interns. Each intern can only chop onions, but you have ten thousand of them. If you need to chop ten thousand onions, the interns finish in seconds. If you need to plate a complex dish with dependencies, the interns are useless.

GPU computing shines when you have data-parallel work: every element in an array gets the same operation applied. Image processing, physics simulation, matrix multiplication, and particle updates all fit this pattern. GPU threads execute in lockstep groups, so if your logic has heavy branching, the whole group waits for the slowest path, and if it has step-to-step dependencies, most of the interns stand idle.
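
To make "the same operation on every element" concrete, here is a minimal CPU-side sketch of a particle update. The Particle type and the function names are hypothetical, chosen only for illustration. The point is that step touches nothing but its own element, which is the property that lets a GPU run one invocation per particle.

```rust
/// A hypothetical particle with a 2D position and velocity.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Particle {
    pos: [f32; 2],
    vel: [f32; 2],
}

/// Advances one particle by `dt` seconds. It reads and writes only its own
/// element, which is what makes the whole update data-parallel.
fn step(p: &mut Particle, dt: f32) {
    p.pos[0] += p.vel[0] * dt;
    p.pos[1] += p.vel[1] * dt;
}

/// CPU reference loop: on a GPU, each iteration would become one shader invocation.
fn step_all(particles: &mut [Particle], dt: f32) {
    for p in particles.iter_mut() {
        step(p, dt);
    }
}
```

Calling step_all on a slice of particles with dt = 0.5 moves each one by half its velocity. A CPU version like this also makes a handy reference for checking GPU results later.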

wgpu abstracts the complexity of talking to different graphics APIs. You write one code path, and wgpu translates it to Vulkan on Linux, Metal on macOS, or DX12 on Windows. cuda-sys skips the abstraction. You talk directly to the NVIDIA driver. You get raw power, but you lose portability and safety guarantees.

Pick the tool that matches your audience. Portability usually wins until you hit a wall.

Minimal setup with wgpu

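Setup in wgpu is partly async. Requesting an adapter and a device returns futures, so you need an executor to drive them; a small one such as pollster is enough. Creating resources and submitting work are ordinary synchronous calls. The mental model follows a strict hierarchy. You create an Instance to talk to the graphics API. You request an Adapter to find a physical GPU. You request a Device and Queue from the adapter to create resources and submit work.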

// Written with wgpu 24 in mind; several of these signatures differ in other releases.
use wgpu::{BufferUsages, Instance};

/// Initializes a GPU device and creates a storage buffer.
async fn setup_gpu() -> Result<(), Box<dyn std::error::Error>> {
    // Create the instance to interface with the graphics API
    let instance = Instance::new(&wgpu::InstanceDescriptor::default());

    // Request an adapter representing the physical GPU
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await
        .expect("Adapter not found");

    // Get the device for resource creation and queue for submitting work
    let (device, queue) = adapter
        .request_device(&wgpu::DeviceDescriptor::default(), None)
        .await
        .expect("Device request failed");

    // Create a buffer on the GPU for data storage
    let buffer = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("Data Buffer"),
        size: 1024,
        // STORAGE allows shaders to read/write; COPY_DST allows CPU uploads
        usage: BufferUsages::STORAGE | BufferUsages::COPY_DST,
        mapped_at_creation: false,
    });

    // Upload data from CPU memory to the GPU buffer
    queue.write_buffer(&buffer, 0, &[0u8; 1024]);

    Ok(())
}

The community convention is to always add label: Some("...") to every resource. Labels show up in GPU profilers like RenderDoc or Nsight. When your frame time spikes, you need to know which buffer or pipeline caused it. A label costs nothing and saves hours of debugging.
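
The upload in the setup code sends 1024 zero bytes, but real data is usually typed. Since queue.write_buffer takes a byte slice, you need a conversion step. Many wgpu projects use the bytemuck crate for this. The sketch below does the same job with the standard library only, and its helper names are hypothetical.

```rust
/// Flattens a slice of f32 values into the raw bytes a GPU upload expects.
/// Native endianness matches how the values already sit in CPU memory.
fn f32s_to_bytes(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_ne_bytes()).collect()
}

/// Size in bytes of a buffer that holds `count` f32 values.
fn buffer_size_for(count: usize) -> u64 {
    (count * std::mem::size_of::<f32>()) as u64
}
```

With these helpers, a 1024-byte buffer holds 256 f32 values. Deriving the size field from the element count, rather than hard-coding it, keeps the buffer and the upload in agreement.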

What happens under the hood

When you call queue.write_buffer, the GPU-side copy does not happen immediately. wgpu stages your bytes and schedules the copy to run at the start of the next queue.submit call. Submitted work is handed to the GPU driver in batches and executes asynchronously. Your CPU code continues running while the GPU works in the background.

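This async nature is a trap for beginners. Commands on the same queue run in order, so a shader dispatched after an upload sees the new data. The trap is on the CPU side: you cannot read results the moment you submit, because the GPU hasn't finished yet. To read data back, copy it into a buffer created with MAP_READ, call map_async on that buffer, and wait for the callback. device.poll() drives those callbacks, and queue.on_submitted_work_done registers a callback for when previously submitted work has finished. wgpu does not expose raw fences or events; these callbacks are the synchronization tools you get.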
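
The submit-then-wait shape can be sketched without a GPU at all. This is an analogy under stated assumptions, not wgpu code: a worker thread stands in for the GPU, a channel stands in for the completion signal (in wgpu, a map_async callback driven by device.poll), and submit_double is a hypothetical name.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical stand-in for "submit work to the GPU": the work runs on
/// another thread, and the returned receiver is the completion signal.
fn submit_double(input: Vec<f32>) -> Receiver<Vec<f32>> {
    let (done_tx, done_rx) = channel();
    thread::spawn(move || {
        let output: Vec<f32> = input.iter().map(|x| x * 2.0).collect();
        // Signal completion by handing the result back to the submitter.
        // A send error only means the receiver was dropped, so ignore it.
        let _ = done_tx.send(output);
    });
    done_rx
}

/// Submits work, then blocks until it has finished before reading the result.
fn run_and_wait(input: Vec<f32>) -> Vec<f32> {
    let pending = submit_double(input);
    // The caller is free to do unrelated CPU work here while the worker runs.
    // `recv` blocks until the worker sends, playing the role of a blocking poll.
    pending.recv().expect("worker thread dropped the sender")
}
```

The design point carries over directly: submission returns at once, and the result is only safe to touch after the completion signal arrives.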

The BufferUsages flags define what a buffer can do. STORAGE means shaders can bind it as a storage buffer. COPY_DST means it can be the destination of a copy, including queue.write_buffer. COPY_SRC means it can be the source of a copy, for example into a readback buffer. If you forget a flag, the operation is rejected at runtime. The compiler won't catch this; wgpu's validation will.

The queue is your only way to talk to the GPU. If you don't submit to the queue, nothing happens.

Realistic compute pipeline

Real workloads require a compute pipeline. You define a shader, bind data to it, and dispatch work groups. wgpu supports WGSL (WebGPU Shading Language) and SPIR-V. You can write shaders in WGSL directly, or use a crate like rust-gpu to compile Rust code to SPIR-V.

use wgpu::{ComputePipeline, Device};

/// Sets up a compute pipeline to run a shader.
fn create_compute_pipeline(device: &Device) -> ComputePipeline {
    // Define the layout for shader inputs
    let bind_group_layout = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: Some("Compute Layout"),
        entries: &[],
    });

    // Create the pipeline layout linking bind groups
    let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: Some("Compute Pipeline Layout"),
        bind_group_layouts: &[&bind_group_layout],
        push_constant_ranges: &[],
    });

    // Load the shader module from WGSL source
    let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: Some("My Shader"),
        source: wgpu::ShaderSource::Wgsl(std::borrow::Cow::Borrowed(
            "@compute @workgroup_size(64)\nfn main() { }",
        )),
    });

    // Create the compute pipeline
    device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: Some("Compute Pipeline"),
        layout: Some(&pipeline_layout),
        module: &shader,
        entry_point: Some("main"), // an Option<&str> in wgpu 23 and later
        compilation_options: Default::default(),
        cache: None,
    })
}
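
Once the pipeline exists, you still have to decide how many workgroups to dispatch. A minimal sketch of that arithmetic follows, together with a CPU emulation of a 1D dispatch. The function names are hypothetical, and the constant mirrors the @workgroup_size(64) in the shader source above. The emulated kernel doubles each element, which is an assumption for illustration since the shader above has an empty body.

```rust
/// Invocations per workgroup; mirrors `@workgroup_size(64)` in the shader.
const WORKGROUP_SIZE: u32 = 64;

/// Number of workgroups needed to cover `element_count` items, rounding up.
fn workgroup_count(element_count: u32) -> u32 {
    (element_count + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE
}

/// CPU emulation of a 1D dispatch in which each invocation doubles one element.
fn emulate_dispatch(data: &mut [f32]) {
    let groups = workgroup_count(data.len() as u32);
    for group_id in 0..groups {
        for local_id in 0..WORKGROUP_SIZE {
            // Equivalent of the shader's global invocation index.
            let i = (group_id * WORKGROUP_SIZE + local_id) as usize;
            // The last group usually overshoots the data, so guard the access.
            if i < data.len() {
                data[i] *= 2.0;
            }
        }
    }
}
```

For 1000 elements this dispatches 16 workgroups, or 1024 invocations, so the last 24 do nothing. A real shader needs the same bounds guard, or the extra invocations will touch memory past the end of your data.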

If you want to write compute shaders in Rust syntax, the rust-gpu project compiles Rust functions to SPIR-V. You annotate entry points with an attribute such as #[spirv(compute(threads(64)))], and rust-gpu handles the translation. This lets you use Rust's type system and macros in your shaders. The trade-off is an extra compilation step and a dependency on the rust-gpu toolchain.

The community generally prefers WGSL for wgpu projects unless you have a strong reason to use rust-gpu. WGSL is the standard for WebGPU, and it keeps your shader code portable across tools.

Pitfalls and compiler errors

GPU programming introduces new failure modes. The borrow checker protects you from memory errors in CPU code, but the GPU lives in a different memory space with its own rules. wgpu tracks resource lifetimes and keeps a buffer alive until the GPU has finished using it, so dropping a handle early does not produce a classic use-after-free. What wgpu cannot do is order your CPU reads for you. You must manage that synchronization yourself.

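If you use a buffer after calling destroy() on it, or break any other API rule, wgpu reports a validation error at runtime, and by default an uncaptured validation error panics. wgpu's own validation always runs, and the backend's validation layers are on by default in debug builds. For more detail, initialize a logger such as env_logger and set RUST_LOG=wgpu=debug. The messages identify resources by their labels, so you can see which one is invalid.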

The compiler will catch some mistakes. Device is not Copy, so passing it by value moves it. If you then try to use it again, you get E0382 (use of moved value). Pass &Device to functions that only need to create resources, or share the handle through an Arc when several owners need it.

// This fails with E0382: use of moved value: `device`
let device = get_device();
use_device(device); // `device` is moved into the call here
use_device(device); // error: value used here after move
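
One way out of E0382 is to borrow instead of move, or to share ownership explicitly. The sketch below uses a hypothetical FakeDevice type as a stand-in for a GPU device handle so that it compiles without any GPU crate. The same two patterns apply to a real handle.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a GPU device handle. It is deliberately not `Copy`.
struct FakeDevice {
    name: String,
}

/// Borrows the device, so the caller keeps ownership and can call this repeatedly.
fn describe(device: &FakeDevice) -> String {
    format!("using {}", device.name)
}

/// Takes shared ownership, which suits worker threads or long-lived subsystems.
fn describe_shared(device: Arc<FakeDevice>) -> String {
    format!("sharing {}", device.name)
}
```

Calling describe(&device) twice compiles because nothing is moved. When a subsystem must own its handle, wrap the device in an Arc and pass Arc::clone(&shared); cloning an Arc copies a pointer, not the device.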

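Another common error is a mismatch between the shader and the bind group layout. If your shader declares a uniform buffer but the layout entry says storage, wgpu rejects the pipeline when you create it. If the layout and shader agree but the buffer you bind lacks the matching usage flag, the error points to the bind group creation instead. Check your BufferUsages, layout entries, and shader declarations together.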

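Synchronization is the hardest part. If you read back data without waiting, the read fails or, with lower-level APIs, returns garbage. In wgpu, mapped data is valid only after the map_async callback reports success. Use device.poll(wgpu::Maintain::Wait) to block until the GPU finishes (later releases rename this type, so check the docs for your version), or structure your code so the CPU never needs to read GPU data synchronously.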

The GPU doesn't care about your deadlines. Synchronize or live with stale data.

When to use wgpu, cuda-sys, or rust-gpu

Use wgpu when you need cross-platform support for games, UI, or compute workloads. It abstracts Vulkan, Metal, DX12, OpenGL ES, and the browser's WebGPU and WebGL behind one API. You get safe Rust, validation that catches misuse at runtime, and access to a growing ecosystem of tools.

Use cuda-sys when you are targeting NVIDIA hardware exclusively and need maximum performance or access to CUDA-specific libraries like cuBLAS. You are building a scientific application or a data processing pipeline where every millisecond counts and portability is irrelevant.

Use rust-gpu when you want to write compute shaders in Rust syntax instead of WGSL, and you are okay with an extra compilation step to SPIR-V. You value type safety and macro support in your shader code more than raw simplicity.

Reach for glow when you need OpenGL compatibility on legacy systems, though the ecosystem is shrinking and wgpu is the future.

Where to go next

GPU computing in Rust lets you offload heavy math or graphics tasks to your graphics card instead of your main processor. You use the wgpu library to talk to the GPU in a way that works on Windows, Mac, and Linux without writing separate code for each. Think of it like hiring a specialized team to handle the heavy lifting while you manage the overall project.