When Python gets in the way
You spend an afternoon fine-tuning a language model in Python. It works beautifully in a notebook. Then you try to deploy it. The Python process hogs memory, the startup time stretches into minutes, and every concurrent request adds another gigabyte of RAM. You want the model to run fast, stay lean, and handle multiple users without crashing. That is where Rust steps in.
The engine and the chassis
Large language model inference is just math with a lot of data. The model is a massive set of weights. Your prompt is a sequence of numbers. The job is to multiply matrices, apply activation functions, and spit out the next most likely token. Python handles this by calling into C or CUDA libraries behind the scenes. Rust does the same math, but it owns the memory layout from start to finish. Think of Python as a rental car with a powerful engine but a heavy, clunky transmission. Rust is a custom-built race car where you bolt the engine directly to the chassis. You lose some convenience, but you gain precise control over how memory moves, how threads split work, and how the CPU cache stays warm.
The core idea is straightforward. You load a pre-trained model file, convert your text into tokens, run the forward pass through the neural network, and decode the output probabilities back into text. Rust forces you to handle the tensor shapes, the device placement, and the memory allocation explicitly. That explicitness is what keeps the runtime lean. You do not pay for hidden garbage collection pauses. You do not fight a global interpreter lock. You get raw throughput with predictable latency.
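The last step of that pipeline, turning raw scores into probabilities and picking a token, can be sketched with the standard library alone. This is a minimal illustration on plain slices, not a framework API; the names `softmax` and `argmax` are chosen here for the example.

```rust
/// Convert raw logits into probabilities with a numerically stable softmax.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtracting the maximum keeps exp() from overflowing.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Greedy decoding: index of the most probable token, or None for empty input.
fn argmax(probs: &[f32]) -> Option<usize> {
    probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Greedy selection is the simplest decoding rule. Sampling from the distribution instead trades determinism for variety.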
Loading a model and generating one token
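candle is a common starting point. It is a minimal machine learning framework from Hugging Face, written in Rust and split across several crates: candle-core for tensors and devices, candle-nn for layers, and candle-transformers for ready-made model architectures. Operations run eagerly, so you decide explicitly where each tensor lives and when it is computed.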
use candle_core::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::models::llama;

/// Load a LLaMA-style model and generate a single token.
fn generate_first_token(prompt: &str, device: &Device) -> Result<(), Box<dyn std::error::Error>> {
    // Memory-map the model weights from a safetensors file.
    // DType::F16 halves memory use relative to F32.
    let vb = unsafe {
        // SAFETY: The mapping is sound only if nothing modifies or truncates
        // the file while it is mapped. We never write to the mapped region.
        VarBuilder::from_mmaped_safetensors(&["llama-weights.safetensors"], DType::F16, device)?
    };

    // Initialize the model architecture with the loaded weights.
    let config = llama::Config::config_7b_v2(false); // false: no flash attention
    let model = llama::Llama::load(vb, &config)?;
    // The cache carries attention state between forward calls.
    let mut cache = llama::Cache::new(true, DType::F16, &config, device)?;

    // Tokenize the prompt. The string becomes a sequence of integer IDs.
    // The tokenizers error type does not convert with `?`, so map it to a String.
    let tokenizer = tokenizers::Tokenizer::from_file("tokenizer.json").map_err(|e| e.to_string())?;
    let encoding = tokenizer.encode(prompt, true).map_err(|e| e.to_string())?;

    // Build a (batch = 1, seq_len) tensor of token IDs on the target device.
    let tokens_tensor = Tensor::new(encoding.get_ids(), device)?.unsqueeze(0)?;

    // Run the forward pass to get logits for the next token.
    let logits = model.forward(&tokens_tensor, 0, &mut cache)?.squeeze(0)?;

    // Sample the next token (seed 42, temperature 1.0, no top-p cutoff).
    let mut sampler = LogitsProcessor::new(42, Some(1.0), None);
    let next_token = sampler.sample(&logits)?;
    println!("Next token ID: {next_token}");
    Ok(())
}
What happens under the hood
The code above does three distinct jobs. First, it maps the model weights from disk using a memory-mapped file. The operating system pages the data in on demand, so startup does not begin with an explicit copy of gigabytes into a heap buffer (on a GPU the weights are still copied into device memory). Second, it converts your text into a tensor. Tensors are multi-dimensional arrays with strict shape rules. In candle, shapes are runtime values, so a mismatch surfaces as an Err from the offending operation rather than as a compile error. Third, it runs the forward pass. The neural network processes the token sequence and outputs a score (a logit) for every possible next token. The sampling step turns those scores into a probability distribution and picks one token from it.
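To make the final sampling step concrete, here is a minimal sketch of drawing from a probability distribution by walking its cumulative sum. It is an illustration on plain slices, not the framework's sampler; the function name `sample_from_probs` and the caller-supplied uniform draw `u` are assumptions made for this example.

```rust
/// Pick an index from `probs` given a uniform draw `u` in [0, 1).
/// The caller supplies `u`, so any random number generator can be plugged in.
fn sample_from_probs(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Rounding can leave the running sum slightly below 1.0,
    // so fall back to the last index.
    probs.len().saturating_sub(1)
}
```

Passing the random draw in as a parameter keeps the function deterministic, which makes it easy to test and to reproduce a generation from a fixed seed.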
Notice the unsafe block around the memory-mapped VarBuilder call. Memory mapping is unsafe because the compiler cannot prove that the file stays unchanged while it is mapped. If another process truncates or rewrites the file, reads through the mapping become undefined behavior. The community convention here is to wrap the unsafe call in a small helper function and document exactly why it is safe. You are not bypassing safety for speed. You are crossing a boundary where Rust's guarantees end and the OS begins. Keep that block isolated. The rest of your code stays fully safe.
The tokenizer runs on the CPU. It splits your string into subword units using the algorithm stored in tokenizer.json, which is a byte-pair encoding variant for LLaMA-style models. The resulting integer IDs feed directly into the tensor constructor. The borrow checker guarantees that the slice returned by get_ids() cannot outlive the Encoding that owns it. The tensor copies the IDs into its own storage and keeps its own handle to the device, so it borrows from neither. You get a compile-time guarantee that your data pipeline cannot read token IDs through a dangling reference mid-generation.
Treat the tokenizer as a pure function. Feed it strings, get back IDs, and never mutate the original text.
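The "strings in, IDs out" contract can be sketched with a toy tokenizer. This one uses greedy longest-match against a tiny hand-written vocabulary, which is a deliberate simplification and not the byte-pair encoding algorithm a real tokenizer runs. The function name `encode` and the vocabulary layout are assumptions for the example.

```rust
use std::collections::HashMap;

/// Encode `text` by repeatedly taking the longest vocabulary entry that
/// matches at the current position. Returns None if some part of the
/// text is not covered by the vocabulary. The input is never modified.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Try candidate prefixes from longest to shortest, on char boundaries.
        let hit = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .rev()
            .find_map(|end| vocab.get(&rest[..end]).map(|&id| (end, id)));
        let (end, id) = hit?;
        ids.push(id);
        rest = &rest[end..];
    }
    Some(ids)
}
```

Because the function only borrows its inputs and returns a fresh vector, calling it twice with the same text always yields the same IDs.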
Streaming tokens and sharing state
Real applications need streaming output, context management, and thread safety. You cannot just call forward once and return. You need to keep the model in memory, feed it prompts from multiple users, and stream tokens as they generate.
use std::sync::Arc;
use candle_core::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::models::llama;

/// A shareable wrapper around the model and tokenizer.
struct InferenceEngine {
    model: Arc<llama::Llama>,
    config: llama::Config,
    tokenizer: tokenizers::Tokenizer,
    device: Device,
}

impl InferenceEngine {
    /// Initialize the engine with a shared model reference.
    fn new(weights_path: &str, tokenizer_path: &str) -> Result<Self, Box<dyn std::error::Error>> {
        // Fall back to the CPU if the first CUDA device cannot be opened.
        let device = Device::new_cuda(0).unwrap_or(Device::Cpu);
        let vb = unsafe {
            // SAFETY: Sound only if the file is not modified while it is mapped.
            // Memory is mapped read-only. No concurrent writes occur.
            VarBuilder::from_mmaped_safetensors(&[weights_path], DType::F16, &device)?
        };
        let config = llama::Config::config_7b_v2(false); // false: no flash attention
        let model = Arc::new(llama::Llama::load(vb, &config)?);
        let tokenizer = tokenizers::Tokenizer::from_file(tokenizer_path).map_err(|e| e.to_string())?;
        Ok(Self { model, config, tokenizer, device })
    }

    /// Process a prompt and return the prompt IDs followed by the generated IDs.
    fn run(&self, prompt: &str, max_tokens: usize) -> Result<Vec<u32>, Box<dyn std::error::Error>> {
        let encoding = self.tokenizer.encode(prompt, true).map_err(|e| e.to_string())?;
        let mut token_ids: Vec<u32> = encoding.get_ids().to_vec();
        // Each request gets its own cache and sampler. Only the weights are shared.
        let mut cache = llama::Cache::new(true, DType::F16, &self.config, &self.device)?;
        let mut sampler = LogitsProcessor::new(42, Some(1.0), None);

        // Pre-fill the cache with the whole prompt, starting at position 0.
        let input = Tensor::new(token_ids.as_slice(), &self.device)?.unsqueeze(0)?;
        let mut logits = self.model.forward(&input, 0, &mut cache)?;
        for _ in 0..max_tokens {
            // Sample the next token from the current logits.
            let next_id = sampler.sample(&logits.squeeze(0)?)?;
            token_ids.push(next_id);
            // Feed only the new token. Its position is the number of tokens already cached.
            let input = Tensor::new(&[next_id], &self.device)?.unsqueeze(0)?;
            logits = self.model.forward(&input, token_ids.len() - 1, &mut cache)?;
        }
        Ok(token_ids)
    }
}
The Arc wrapper lets multiple threads share the model weights without copying them. Neural network weights are read-only during inference, so sharing them is safe. The tokenizer lives alongside the model because every request needs both. The loop handles autoregressive generation: each new token becomes part of the input for the next step. Feeding only the newest token together with its position, as the loop does, depends on the model keeping the attention states of earlier tokens. Without that stored state, the model must reprocess the entire sequence for every token it generates, which is a large part of why naive inference feels slow on CPUs.
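The sharing pattern, and the streaming this section started with, can be sketched using only the standard library. The sketch below assumes a stand-in `ToyModel` whose "forward pass" is a table lookup; the names `ToyModel`, `next_token`, and `stream` are hypothetical and exist only for this example. Each request runs on its own thread, shares the weights through an Arc, and sends tokens through a channel as soon as they are produced.

```rust
use std::sync::mpsc::{self, Receiver};
use std::sync::Arc;
use std::thread;

/// Stand-in for a model: read-only weights shared by every request.
/// The weights vector must not be empty.
struct ToyModel {
    weights: Vec<u32>,
}

impl ToyModel {
    /// Fake "forward pass": derive the next token from the previous one.
    fn next_token(&self, prev: u32) -> u32 {
        self.weights[prev as usize % self.weights.len()]
    }
}

/// Spawn a worker that generates up to `max_tokens` tokens and streams
/// each one through a channel as soon as it is produced.
fn stream(model: Arc<ToyModel>, first: u32, max_tokens: usize) -> Receiver<u32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut token = first;
        for _ in 0..max_tokens {
            token = model.next_token(token);
            // Stop early if the receiver has gone away.
            if tx.send(token).is_err() {
                break;
            }
        }
    });
    rx
}
```

Cloning the Arc for each request only bumps a reference count. The receiver can forward each token to the client immediately instead of waiting for the whole sequence.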
Production systems solve this with key-value caching. Instead of reprocessing the whole prompt, you store the intermediate attention states in a cache. The next forward pass only computes the new token. Rust's ownership model makes KV caches easy to reason about. You allocate a fixed-size buffer, track the current position with a simple index, and overwrite old states when the context window fills up. No hidden references. No dangling pointers.
Never mutate shared weights during inference. Keep the model read-only and let the cache handle state.
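The fixed-size buffer with a position index described above might look like the following sketch. It is a simplified illustration that stores one key vector and one value vector per token; real implementations usually keep contiguous tensors per layer and per attention head. The type `KvCache` and its fields are assumptions made for the example.

```rust
/// Fixed-capacity cache holding one key vector and one value vector per position.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
    capacity: usize,
    written: usize, // total number of tokens pushed so far
}

impl KvCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            keys: Vec::with_capacity(capacity),
            values: Vec::with_capacity(capacity),
            capacity,
            written: 0,
        }
    }

    /// Store the states for the newest token, overwriting the oldest slot once full.
    fn push(&mut self, key: Vec<f32>, value: Vec<f32>) {
        if self.keys.len() < self.capacity {
            self.keys.push(key);
            self.values.push(value);
        } else {
            let slot = self.written % self.capacity;
            self.keys[slot] = key;
            self.values[slot] = value;
        }
        self.written += 1;
    }

    /// Number of positions currently held.
    fn len(&self) -> usize {
        self.keys.len()
    }
}
```

The cache is the only mutable state in this design, and `push` takes `&mut self`, so the compiler rejects any attempt to update it from two places at once.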
Memory layout and quantization
You will usually hit memory limits before you hit CPU limits. A 7-billion parameter model in 16-bit floating point takes about 14 gigabytes for the weights alone (7 billion parameters at 2 bytes each), before the KV cache and activations are counted. If you try to load it on a machine with 16 gigabytes total, the operating system will start swapping to disk and generation will slow to a crawl. Rust will not save you from running out of physical memory. Depending on the platform, the process may thrash, abort on a failed allocation, or be killed by the OS.
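The arithmetic behind that estimate is easy to put into code. The sketch below counts the weights only and ignores the KV cache and activations; the function names and the `headroom` parameter are assumptions chosen for illustration.

```rust
/// Approximate number of bytes needed to hold the weights alone.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// True if the weights fit in `available_bytes` while leaving
/// `headroom` (a fraction between 0.0 and 1.0) free for everything else.
fn fits(params: u64, bits_per_weight: u64, available_bytes: u64, headroom: f64) -> bool {
    weight_bytes(params, bits_per_weight) as f64 <= available_bytes as f64 * (1.0 - headroom)
}
```

For 7 billion parameters at 16 bits, `weight_bytes` returns 14,000,000,000. With 16 gigabytes available and 20 percent headroom, `fits` reports that the 16-bit weights do not fit while 4-bit weights do.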
Quantization solves this by compressing the weights. Instead of storing each weight as a 16-bit float, you store it as an 8-bit or even a 4-bit integer, plus a shared scale factor for each block of weights. The results change slightly, but accuracy stays within acceptable bounds for many applications. The framework, not the language, does the work: you load the quantized weights, the kernels dequantize on the fly during matrix multiplication, and memory use drops by roughly 2x at 8 bits or 4x at 4 bits relative to 16-bit floats. The effect on speed depends on the hardware and the kernels.
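A minimal sketch of the idea, assuming symmetric 8-bit quantization with a single scale for the whole block. Production formats use smaller blocks and more elaborate schemes, so treat the type `QuantizedBlock` and the functions below as an illustration only.

```rust
/// Weights stored as 8-bit integers plus one scale for the whole block.
struct QuantizedBlock {
    scale: f32,
    values: Vec<i8>,
}

/// Symmetric quantization: the largest magnitude maps to 127.
fn quantize(weights: &[f32]) -> QuantizedBlock {
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when every weight is zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let values = weights.iter().map(|w| (w / scale).round() as i8).collect();
    QuantizedBlock { scale, values }
}

/// Dot product that dequantizes each weight as it is used.
fn dot(block: &QuantizedBlock, input: &[f32]) -> f32 {
    block
        .values
        .iter()
        .zip(input)
        .map(|(&q, &x)| q as f32 * block.scale * x)
        .sum()
}
```

Each weight now occupies one byte instead of two, and the result of `dot` differs from the full-precision dot product only by a small rounding error.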
The community convention is to name quantized files with their quantization scheme as a suffix. model-q4_k_m.gguf tells you what you are loading. Do not guess. Match the file format to the loading function. candle loads safetensors through VarBuilder and reads GGUF through its separate quantized module. llama-cpp-rs, a set of bindings to llama.cpp, loads GGUF. Handing a file to the wrong loader typically fails at load time with a parse error rather than producing a subtly broken model.
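One way to enforce that match is to decide on the loader from the file extension before touching the file. The enum `WeightFormat` and the function `detect_format` below are hypothetical helpers for illustration, not part of any crate mentioned here.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum WeightFormat {
    Safetensors,
    Gguf,
}

/// Decide which loader to use from the file extension alone.
fn detect_format(path: &str) -> Result<WeightFormat, String> {
    match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some("safetensors") => Ok(WeightFormat::Safetensors),
        Some("gguf") => Ok(WeightFormat::Gguf),
        other => Err(format!("unsupported weight file extension: {:?}", other)),
    }
}
```

Matching on the returned enum forces every call site to handle both formats and gives unknown extensions a clear error message.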
Check your available RAM before loading. Load quantized weights if you are close to the limit. Speed and accuracy may shift a little, but the program will actually start.
Where the compiler stops you
Type mismatches are among the most common compiler errors. If you pass a Tensor where a &Tensor is expected, or a usize where a u32 is expected, the compiler rejects you with E0308 (mismatched types) and points to the exact expression. Shape mismatches are different. candle tracks shapes at runtime, so passing a 2D tensor where a 1D tensor is expected compiles fine and returns an Err when the operation runs. Fix it by reshaping the tensor with .reshape() or dropping a size-1 dimension with .squeeze().
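To illustrate what a reshape has to validate, here is a toy tensor that stores its shape next to its data. It is a sketch for this article, not candle's implementation; the type `ToyTensor` is hypothetical.

```rust
/// A toy tensor: flat data plus a shape stored alongside it.
#[derive(Debug)]
struct ToyTensor {
    data: Vec<f32>,
    shape: Vec<usize>,
}

impl ToyTensor {
    /// Reinterpret the data under a new shape.
    /// Fails if the new shape implies a different element count.
    fn reshape(self, shape: &[usize]) -> Result<ToyTensor, String> {
        let wanted: usize = shape.iter().product();
        if wanted != self.data.len() {
            return Err(format!("cannot reshape {:?} into {:?}", self.shape, shape));
        }
        Ok(ToyTensor { data: self.data, shape: shape.to_vec() })
    }
}
```

A tensor of 6 elements reshapes to [2, 3] without copying any data, while a request for [4, 2] returns an error that names both shapes.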
Trait bounds trip up beginners who try to store models in generic collections. If you write a function that accepts impl Model, the compiler will complain with E0277 (trait bound not satisfied) when the concrete type does not implement the required trait. Rust does not do duck typing. You must explicitly implement the traits your types support. Use a trait object such as Arc<dyn Model> only when you need runtime polymorphism, and prefer concrete types or Arc<ModelType> for inference engines.
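The two styles of dispatch look like this in practice. The trait `Model` and the type `TinyLlama` are hypothetical names made up for the example; candle does not define a trait with this shape.

```rust
use std::sync::Arc;

/// Hypothetical trait for this example.
trait Model {
    fn name(&self) -> String;
}

struct TinyLlama;

impl Model for TinyLlama {
    fn name(&self) -> String {
        "tiny-llama".to_string()
    }
}

/// Static dispatch: compiles only for types that implement `Model`.
/// Passing any other type is rejected with E0277.
fn describe(model: &impl Model) -> String {
    format!("running {}", model.name())
}

/// Dynamic dispatch: one collection can hold different model types.
fn describe_all(models: &[Arc<dyn Model>]) -> Vec<String> {
    models.iter().map(|m| m.name()).collect()
}
```

Static dispatch lets the compiler inline the call, while the trait-object version pays for an indirect call in exchange for a mixed collection.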
Tokenizers and models must match. If the model was trained with one vocabulary and you feed it IDs produced by a different tokenizer, the model will output gibberish. Rust will not catch this at compile time. The math will run perfectly. The results will just be wrong. Always verify the tokenizer matches the model card. The community convention is to download the tokenizer file directly from the same repository as the weights and verify the vocab_size matches the model configuration.
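Since the compiler cannot help here, a small runtime check at startup is a reasonable safeguard. The function `check_tokens` below is a hypothetical helper. It assumes the two vocabulary sizes should be exactly equal, which may be too strict for models that pad their embedding table.

```rust
/// Check that the tokenizer and the model agree on vocabulary size,
/// and that every token ID indexes a real row of the embedding table.
fn check_tokens(tokenizer_vocab: usize, model_vocab: usize, ids: &[u32]) -> Result<(), String> {
    if tokenizer_vocab != model_vocab {
        return Err(format!(
            "tokenizer has {} entries, model expects {}",
            tokenizer_vocab, model_vocab
        ));
    }
    match ids.iter().find(|&&id| id as usize >= model_vocab) {
        Some(id) => Err(format!("token ID {} is outside the vocabulary", id)),
        None => Ok(()),
    }
}
```

A matching size does not prove the vocabularies are identical, so this check complements downloading both files from the same repository rather than replacing it.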
If you try to move a tensor out of a reference, the compiler stops you with E0507 (cannot move out of borrowed content). A close relative is E0382 (use of moved value), which appears when you consume a tensor inside a loop while still needing it for the next iteration. Pass the tensor by reference where you can, or clone it explicitly if you need an owned copy. In candle a clone is cheap because tensors share their underlying storage.
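The loop case is easiest to see with a stand-in type. In this sketch `Buffer` plays the role of a tensor; the names `consume`, `inspect`, and `run_steps` are made up for the example.

```rust
/// Stand-in for a tensor that should not be copied casually.
struct Buffer(Vec<f32>);

/// Takes ownership: the caller cannot use the buffer afterwards.
fn consume(buf: Buffer) -> f32 {
    buf.0.iter().sum()
}

/// Borrows: the caller keeps the buffer for the next iteration.
fn inspect(buf: &Buffer) -> f32 {
    buf.0.iter().sum()
}

fn run_steps(buf: Buffer, steps: usize) -> f32 {
    let mut total = 0.0;
    for _ in 0..steps {
        // Calling `consume(buf)` here would be rejected by the compiler,
        // because `buf` would already be moved on the second iteration.
        total += inspect(&buf);
    }
    // Ownership can still be handed over once, after the loop.
    total + consume(buf)
}
```

Borrowing inside the loop and moving only at the end keeps a single owner for the data without any clones.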
Let the borrow checker enforce your data flow. If it complains, work out who owns the data before reaching for clone.
Picking the right crate
Use candle when you want a pure Rust stack with minimal dependencies and full control over the computation graph. Use llama-cpp-rs when you need maximum compatibility with GGUF models and want to leverage highly optimized C++ kernels without writing your own. Use tch-rs when you are porting existing PyTorch code and need direct access to the C++ PyTorch library. Reach for pyo3, which provides Rust bindings for the Python interpreter, when you want to keep a Python prototype and move the hot path into a Rust extension module later. Stick to candle or llama-cpp-rs for production inference. They handle memory mapping, threading, and device placement without hidden overhead.
Match the crate to your deployment target. Do not chase microbenchmarks. Pick the tool that ships with the model format you already have.