The missing piece in Rust ML
You spend three hours trying to get a neural network running inside a Rust CLI tool. Every tutorial points to Python. Every Python wrapper drags in a half-gigabyte virtual environment and a C++ runtime you did not ask for. You just want a tensor that multiplies, adds, and stays out of your way. That is exactly where candle lives.
candle is not a monolithic machine learning framework. It is a lightweight, composable tensor library built for Rust's error handling and ownership model. The name fits: think of a small, focused light source rather than a stadium floodlight. You get tensors, a small set of neural network layers, and a clean way to move data between CPU and GPU. You build the training loop, the data pipeline, and the application logic. candle handles the math and the memory.
How candle actually works
Machine learning in Rust usually falls into two camps. You either bind to Python libraries through FFI, or you write everything from scratch using linear algebra crates. candle sits in the middle. Think of it like a set of precision calipers instead of a full workshop. It gives you the exact tools you need without forcing you into a specific architecture.
The core abstraction is the Tensor. As in Python frameworks, a tensor's shape, dtype, and device are runtime properties rather than part of its Rust type. What changes is how mistakes surface. You pick the shape, dtype, and device explicitly when you create a tensor, and every fallible operation returns a Result. If you accidentally mix a CPU tensor with a GPU tensor, you get an explicit error value at the call site instead of an exception deep in a call stack. The trade-off is upfront explicitness in exchange for predictable error handling.
Under the hood, candle uses a single Device enum to abstract hardware. The same forward call works on CPU, CUDA, or Metal. The library routes the operation to the correct backend at runtime. This means you can write your model once and compile it for multiple targets without changing a single line of inference code.
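To make the runtime routing concrete, here is a minimal sketch in plain Rust. It does not use candle. The Backend enum and the add function are hypothetical stand-ins that only illustrate the idea of one call site dispatching on an enum value.

```rust
/// Hypothetical stand-in for a hardware enum. This is not candle's `Device` type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

/// Adds two slices element-wise, picking the kernel by matching on the backend.
fn add(backend: Backend, a: &[f32], b: &[f32]) -> Result<Vec<f32>, String> {
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    match backend {
        // Plain loop on the host.
        Backend::Cpu => Ok(a.iter().zip(b).map(|(x, y)| x + y).collect()),
        // A real library would launch a GPU kernel here. This sketch has none,
        // so it reports the missing backend as an error value.
        Backend::Gpu => Err("GPU backend not available in this sketch".to_string()),
    }
}
```

The caller writes add(backend, &a, &b) once. Which branch runs depends only on the enum value, which is the property the paragraph above describes.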
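The Module trait defines how layers behave. It requires a forward method that takes &self and a tensor and returns a new tensor wrapped in a Result. This design choice is intentional. Inference is read-only. The weights do not change during a forward pass. Because forward only needs a shared reference, you can share one model instance across threads without locks. Training does not use a separate mutable trait. Trainable weights are stored as Var values, typically collected in a VarMap, and an optimizer from candle-nn updates them after a backward pass. Layers that behave differently during training, such as dropout, implement ModuleT, whose forward_t method takes an extra train flag.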
Keep your models behind &self during inference. The borrow checker will thank you when you scale to concurrent requests.
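The sketch below shows why a &self forward method scales to concurrent requests. It is plain Rust with no candle dependency. Scale is a hypothetical one-parameter layer, and run_concurrently is an illustrative helper, not a candle API.

```rust
use std::thread;

/// Hypothetical read-only layer that computes y = w * x + b on scalars.
struct Scale {
    w: f32,
    b: f32,
}

impl Scale {
    /// Takes &self, so any number of threads may call it at the same time.
    fn forward(&self, x: f32) -> f32 {
        self.w * x + self.b
    }
}

/// Runs one forward pass per input, each on its own scoped thread.
/// The model is shared by reference. No Mutex or clone is needed.
fn run_concurrently(model: &Scale, inputs: &[f32]) -> Vec<f32> {
    thread::scope(|s| {
        // Spawn every thread first, then join them in input order.
        let handles: Vec<_> = inputs
            .iter()
            .map(|&x| s.spawn(move || model.forward(x)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

If forward took &mut self, the spawn calls would not compile, because the borrow checker rejects several simultaneous mutable borrows of the same model.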
Minimal working example
Add the two core crates to your Cargo.toml. candle-core provides tensors, device management, and the Module trait. candle-nn provides layers and variable-management helpers such as VarBuilder, and re-exports Module.
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
Wrap the logic in a function so you can see the error handling and documentation patterns in action.
use candle_core as candle;
use candle_nn::{Module, VarBuilder, VarMap};

/// Runs a single forward pass through a linear layer.
/// Prints the output shape and returns Ok(()) on success.
fn run_linear_pass() -> candle::Result<()> {
    // Pick the CPU device. Candle abstracts hardware behind this enum.
    let device = candle::Device::Cpu;
    // A VarMap stores the layer's variables. The VarBuilder creates them
    // on the chosen device with f32 precision.
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, candle::DType::F32, &device);
    // Create a linear layer: 10 input features, 5 output features.
    let model = candle_nn::linear(10, 5, vb.pp("linear"))?;
    // Generate a random input tensor with shape (1, 10), mean 0.0, std 1.0.
    // The f32 suffixes matter: bare float literals would produce an f64 tensor.
    let input = candle::Tensor::randn(0f32, 1f32, (1, 10), &device)?;
    // Run the forward pass. The ? operator propagates shape or device errors.
    let output = model.forward(&input)?;
    // Print the output shape to verify the transformation worked.
    println!("Output shape: {:?}", output.shape());
    Ok(())
}

fn main() -> candle::Result<()> {
    run_linear_pass()
}
The convention in the candle community is to use candle::Result<T> as the return type for any function that touches tensors, with T set to () when there is nothing to return. It catches shape mismatches, device errors, and allocation failures in one unified type. You avoid mixing std::io::Result and candle::Error across your codebase.
Walking through the execution
When run_linear_pass starts, candle::Device::Cpu selects the CPU backend. It is a plain enum variant, so constructing it costs nothing. Every tensor created with it lives in host memory, and operations on those tensors are routed to the CPU kernels. You could swap it for Device::cuda_if_available(0)?, which returns the CUDA device with ordinal 0 when candle is compiled with the cuda feature and falls back to the CPU otherwise. That pattern is common in candle examples.
candle_nn::linear sets up two tensors under the hood: a weight matrix of shape (5, 10) and a bias vector of shape (5,). Both are initialized with small random values. The function returns a Linear struct that holds those tensors. A Tensor is a cheap reference-counted handle, so cloning the layer does not copy the weights. The ? operator is necessary here because creating the variables can fail, for example on a shape conflict or when the device cannot allocate memory.
Tensor::randn creates the input batch. The shape (1, 10) means one sample with ten features. candle stores tensors in row-major order by default, which matches C and Python conventions. This layout choice matters for cache performance during matrix multiplication.
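Row-major order means the last index varies fastest in memory. The following plain-Rust sketch computes the flat offset of an element in a contiguous row-major buffer. The function name row_major_offset is hypothetical and the code does not touch candle's internal layout types.

```rust
/// Flat offset of `index` in a contiguous row-major tensor of the given `shape`.
/// Returns None when the ranks differ or an index is out of bounds.
fn row_major_offset(index: &[usize], shape: &[usize]) -> Option<usize> {
    if index.len() != shape.len() {
        return None;
    }
    let mut offset = 0;
    for (&i, &dim) in index.iter().zip(shape) {
        if i >= dim {
            return None;
        }
        // One step along an earlier axis skips a whole block of the later axes.
        offset = offset * dim + i;
    }
    Some(offset)
}
```

For a (4, 10) batch, element (1, 2) sits at offset 1 * 10 + 2 = 12. All ten features of a sample are adjacent in memory, which is what makes a per-sample dot product cache friendly.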
model.forward(&input) performs the actual math. It computes input @ weights^T + bias. The operation returns a new tensor of shape (1, 5). Notice that forward takes &self. The layer does not mutate its weights. This matches the mathematical definition of inference and allows the compiler to prove that concurrent calls to forward cannot race.
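To see the arithmetic without any library in the way, here is a minimal plain-Rust sketch of input @ weights^T + bias on flat row-major buffers. The function linear_forward and its parameter layout are assumptions made for illustration. candle's real implementation uses optimized matrix multiplication kernels.

```rust
/// Computes `input @ weight^T + bias` on flat row-major buffers.
/// input: (batch, in_dim), weight: (out_dim, in_dim), bias: (out_dim,).
/// Returns a flat (batch, out_dim) buffer.
fn linear_forward(
    input: &[f32],
    weight: &[f32],
    bias: &[f32],
    in_dim: usize,
) -> Result<Vec<f32>, String> {
    let out_dim = bias.len();
    if in_dim == 0 || input.len() % in_dim != 0 {
        return Err(format!(
            "input length {} is not a multiple of in_dim {}",
            input.len(),
            in_dim
        ));
    }
    if weight.len() != out_dim * in_dim {
        return Err(format!(
            "weight length {} does not match ({}, {})",
            weight.len(),
            out_dim,
            in_dim
        ));
    }
    let batch = input.len() / in_dim;
    let mut output = Vec::with_capacity(batch * out_dim);
    for sample in input.chunks_exact(in_dim) {
        // Each weight row produces one output feature for this sample.
        for (row, b) in weight.chunks_exact(in_dim).zip(bias) {
            let dot: f32 = sample.iter().zip(row).map(|(x, w)| x * w).sum();
            output.push(dot + b);
        }
    }
    Ok(output)
}
```

Note that the batch size never appears in the weight or bias shapes. Only the feature dimension has to agree, which is why the same layer accepts one sample or many.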
If any step fails, candle returns a descriptive error. Shape mismatches, device conflicts, and type errors all bubble up through the ? operator. You never get a silent segfault.
Trust the ? operator. It turns runtime panics into recoverable control flow.
A realistic inference pipeline
Real applications rarely run a single random tensor. You usually load pre-trained weights, process a batch of inputs, and extract the results as native Rust types. Here is how that looks in practice.
use candle_core as candle;
use candle_nn::{Module, VarBuilder, VarMap};

/// Builds a model, runs a batch, and extracts f32 results.
fn run_batch_inference() -> candle::Result<Vec<f32>> {
    let device = candle::Device::Cpu;
    // Build the same linear layer as before, with freshly initialized weights.
    // A real application would load pre-trained weights here instead.
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, candle::DType::F32, &device);
    let model = candle_nn::linear(10, 5, vb.pp("linear"))?;
    // Create a batch of 4 samples, each with 10 features.
    // The f32 literals make the tensor match the model's precision.
    let batch = candle::Tensor::randn(0f32, 1f32, (4, 10), &device)?;
    // Forward pass through the network. The result has shape (4, 5).
    let logits = model.forward(&batch)?;
    // To get raw data back, move the tensor to CPU and flatten it to one dimension.
    let cpu_logits = logits.to_device(&candle::Device::Cpu)?;
    let flat = cpu_logits.flatten_all()?;
    // Extract the contents as a Rust vector of 20 values.
    // This copies data from the tensor's memory into a new Vec.
    let results: Vec<f32> = flat.to_vec1()?;
    Ok(results)
}
The to_device call is a common convention when you mix CPU and GPU tensors. If your model lives on CUDA but you need to return data to a Rust struct, you must explicitly move the tensor back to CPU first. candle does not do this automatically because implicit memory transfers hide latency.
The to_vec1 method extracts the data. It requires the tensor to be one-dimensional, which is why flatten_all is called first. This pattern appears in every candle project that bridges tensors and native Rust collections.
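Once the data is a flat Vec<f32>, the shape lives only in your head, so downstream code has to reapply it. A minimal sketch, assuming the flat buffer holds rows of cols logits each: the helper argmax_per_row is hypothetical and uses only the standard library.

```rust
/// Returns the index of the largest value in each row of a flat row-major buffer.
/// Returns None when `cols` is zero or does not divide the buffer length.
fn argmax_per_row(flat: &[f32], cols: usize) -> Option<Vec<usize>> {
    if cols == 0 || flat.len() % cols != 0 {
        return None;
    }
    let winners = flat
        .chunks_exact(cols)
        .map(|row| {
            // Keep the first index whose value is strictly larger than the best so far.
            let mut best = 0;
            for (i, v) in row.iter().enumerate() {
                if *v > row[best] {
                    best = i;
                }
            }
            best
        })
        .collect();
    Some(winners)
}
```

With the pipeline above you would call it with cols set to 5, which yields one predicted class index per sample in the batch.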
Always make memory transfers explicit. Hidden copies are the fastest way to kill throughput.
Where things break
candle favors runtime checks over compile-time magic for shapes. This design keeps the API flexible but introduces a specific class of errors you will encounter.
Shape mismatches are the most common failure. The batch dimension is free, so a linear layer built for 10 input features accepts (1, 10) and (4, 10) alike. If you pass a (4, 8) tensor instead, the matrix multiplication fails at runtime with a shape mismatch error (the ShapeMismatchBinaryOp variant of candle::Error). The compiler cannot verify tensor dimensions because they are often determined by data loading pipelines. You will see this error when feature extraction is off or when a model was trained with a different input size.
Device mismatches happen when you accidentally create a tensor on CPU and pass it to a GPU model. candle catches this at the first operation that combines the two, with a device mismatch error (DeviceMismatchBinaryOp). The usual fix is to call tensor.to_device(&model_device)? before the forward pass.
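Method resolution errors appear when you reach for a mutable forward method or forget to import the trait. Module exposes only forward, which takes &self. If you try to call model.forward_mut, or call model.forward without use candle_nn::Module in scope, the compiler rejects you with E0599 (no method found). The solution is to import Module and stick to the immutable forward method. Weight updates go through an optimizer, not through the layer.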
Type mismatches surface when you mix f32 and f64 tensors. candle enforces strict dtype alignment. You will get a dtype mismatch error (DTypeMismatchBinaryOp) if you try to add a float32 tensor to a float64 tensor. A common cause is creating a tensor from unsuffixed float literals, which Rust infers as f64. Cast explicitly with tensor.to_dtype(candle::DType::F32)? before the operation.
Treat every candle::Error as a contract violation. Fix the shape, align the device, and move on.
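The checks described in this section can be modeled in a few lines of plain Rust. This is a sketch under stated assumptions, not candle's code: Device, DType, CheckError, Meta, and LayerMeta are hypothetical types that carry metadata only, and candle's own Error enum has more variants and richer context.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Device { Cpu, Gpu }

#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { F32, F64 }

/// Hypothetical error type with one variant per failure class.
#[derive(Debug, PartialEq)]
enum CheckError {
    ShapeMismatch { expected: usize, got: usize },
    DeviceMismatch { lhs: Device, rhs: Device },
    DTypeMismatch { lhs: DType, rhs: DType },
}

/// Metadata of an input tensor. No data is stored in this sketch.
struct Meta { shape: Vec<usize>, device: Device, dtype: DType }

/// Metadata of a linear layer with weight shape (out_features, in_features).
struct LayerMeta { in_features: usize, out_features: usize, device: Device, dtype: DType }

/// Validates an input against a layer and returns the output shape.
fn linear_output_shape(input: &Meta, layer: &LayerMeta) -> Result<Vec<usize>, CheckError> {
    if input.device != layer.device {
        return Err(CheckError::DeviceMismatch { lhs: input.device, rhs: layer.device });
    }
    if input.dtype != layer.dtype {
        return Err(CheckError::DTypeMismatch { lhs: input.dtype, rhs: layer.dtype });
    }
    // Only the last dimension must match. Leading (batch) dimensions are free.
    let got = input.shape.last().copied().unwrap_or(0);
    if input.shape.is_empty() || got != layer.in_features {
        return Err(CheckError::ShapeMismatch { expected: layer.in_features, got });
    }
    let mut out = input.shape.clone();
    *out.last_mut().expect("shape is non-empty") = layer.out_features;
    Ok(out)
}

/// Number of output elements. The ? operator forwards any check failure unchanged.
fn output_len(input: &Meta, layer: &LayerMeta) -> Result<usize, CheckError> {
    let shape = linear_output_shape(input, layer)?;
    Ok(shape.iter().product())
}
```

Because every failure class is a variant of one error type, a caller can match on the variant to decide whether to reshape, move, or cast, and a single ? is enough to pass all three up the stack.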
Choosing your stack
The Rust machine learning ecosystem is young and fragmented. Picking the right tool depends on what you are actually building.
Use candle when you need lightweight inference in a CLI tool, a web backend, or an embedded system. It compiles fast, links cleanly, and gives you full control over the execution loop.
Use burn when you want a complete training pipeline with automatic differentiation, dataset abstractions, and built-in optimizers. It is heavier but saves you from assembling the training scaffolding by hand.
Use Python with PyTorch when you are prototyping, experimenting with new architectures, or need immediate access to a massive ecosystem of pre-trained models and research implementations.
Use ndarray when you only need linear algebra, statistical operations, or custom numerical kernels without neural network layers or GPU acceleration.
Match the tool to the workload. Do not drag a training framework into an inference binary.