How to Write a Container Runtime in Rust
You have a script that calculates pi, but it also tries to delete your home directory. You want to run it without holding your breath. You don't need a full virtual machine with its own kernel and hardware emulation. You need a box that shares the kernel but hides the rest. That's a container. Rust is a strong tool for building the box because it enforces memory safety while you wrestle with low-level OS primitives. Safe Rust rejects use-after-free bugs at compile time and turns out-of-bounds accesses into panics, so they don't quietly become container escapes.
A container is just a process with some tricks applied. The tricks come from Linux. Namespaces isolate the view of the system. The first process in a new PID namespace sees itself as PID 1. A process in a new mount namespace sees its own filesystem tree. Cgroups limit resources. They stop the process from eating all your RAM. Seccomp restricts system calls. It's like a bouncer that only lets specific requests through. Rust doesn't invent these. Rust gives you safe handles to turn the knobs.
Namespaces hide the world. Cgroups limit the power. Seccomp guards the door.
The primitives under the hood
Linux provides three main mechanisms for containerization. Namespaces partition kernel resources. Each namespace type isolates a specific aspect of the system. PID namespaces give processes their own process ID space. Mount namespaces isolate the filesystem hierarchy. Network namespaces isolate network stacks, interfaces, and ports. User namespaces map UIDs and GIDs, allowing a process to be root inside the container while being unprivileged on the host.
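You can see which namespaces a process belongs to under /proc/<pid>/ns. Each entry is a symlink whose target looks like pid:[4026531836], and two processes share a namespace exactly when those inode numbers match. The sketch below uses only the standard library to parse that format and read the links for the current process. The function names are made up for illustration.

```rust
use std::fs;
use std::io;

/// Parses a namespace link target such as "pid:[4026531836]" into
/// (namespace kind, inode number). Returns None for any other shape.
fn parse_ns_link(target: &str) -> Option<(&str, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode = rest.strip_suffix(']')?.parse::<u64>().ok()?;
    Some((kind, inode))
}

/// Reads /proc/self/ns/<kind> and returns the namespace inode number.
/// `kind` is one of the kernel's names, for example "pid", "mnt", "net", "user".
fn current_ns_inode(kind: &str) -> io::Result<u64> {
    let target = fs::read_link(format!("/proc/self/ns/{kind}"))?;
    let text = target.to_string_lossy();
    parse_ns_link(&text)
        .map(|(_, inode)| inode)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "unexpected ns link format"))
}
```

Comparing current_ns_inode("pid") on the host with the same call inside a container is a quick way to confirm that the isolation took effect.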
Cgroups control resource usage. They attach limits to a hierarchy of processes. You can cap CPU time, memory consumption, and I/O bandwidth. Cgroups v2 uses a unified hierarchy, which simplifies management compared to the legacy v1 subsystems.
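Cgroups v2 is driven entirely through files. You create a directory, write limits into memory.max and cpu.max, and write a PID into cgroup.procs to move that process in. Here is a minimal sketch of that flow with the standard library. The Limits struct and apply_limits function are hypothetical names, and the directory is a parameter. On a real host it would sit under /sys/fs/cgroup, need root, and need the memory and cpu controllers enabled in the parent's cgroup.subtree_control.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Limits for one cgroup. Field names are illustrative, not from any crate.
struct Limits {
    memory_max_bytes: u64,
    /// CPU quota and period in microseconds, in the order cpu.max expects.
    cpu_quota_us: u64,
    cpu_period_us: u64,
}

/// Creates the cgroup directory, writes the limits, then moves `pid` into it.
fn apply_limits(cgroup_dir: &Path, pid: u32, limits: &Limits) -> io::Result<()> {
    fs::create_dir_all(cgroup_dir)?;
    fs::write(
        cgroup_dir.join("memory.max"),
        limits.memory_max_bytes.to_string(),
    )?;
    fs::write(
        cgroup_dir.join("cpu.max"),
        format!("{} {}", limits.cpu_quota_us, limits.cpu_period_us),
    )?;
    // Writing a PID to cgroup.procs moves that process into the cgroup.
    fs::write(cgroup_dir.join("cgroup.procs"), pid.to_string())?;
    Ok(())
}
```

A quota of 50000 with a period of 100000 caps the group at half of one CPU. Set the limits before moving the process in, so it never runs unconstrained.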
Seccomp filters restrict the system calls a process can make. A filter can allow, deny, or trap specific calls. This reduces the attack surface. If a vulnerability exists in a syscall handler, the filter can prevent the process from invoking it.
Rust reaches these primitives through a small set of crates. rustix provides safe, idiomatic wrappers around many Linux system calls, including mount and unshare. It returns Result types instead of raw error codes and uses bitflags types for flags. It does not wrap everything: clone, for example, is available through nix (nix::sched::clone) or raw libc. The libc crate offers the lowest-level bindings. You'll reach for libc only when the safe wrappers lack a specific feature. The community convention is to keep libc usage minimal and isolated. The wrappers are thin, so you give up little or no performance in exchange for type safety and documentation.
Minimal isolation example
Start with a single process that enters new namespaces and executes a command. The clone system call creates a child process with specified flags. You can pass flags to create new namespaces during the fork.
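use nix::sched::{clone, CloneFlags};
use nix::sys::signal::Signal;
use std::os::unix::process::CommandExt;
use std::process::Command;
/// Spawns a child process in new PID and mount namespaces.
/// Written against the nix crate, which wraps clone. Needs CAP_SYS_ADMIN,
/// which in practice means running as root.
fn run_in_namespace() -> Result<(), Box<dyn std::error::Error>> {
    // CLONE_NEWPID isolates process IDs. CLONE_NEWNS isolates mounts.
    let flags = CloneFlags::CLONE_NEWPID | CloneFlags::CLONE_NEWNS;
    // The child runs on its own stack. 1 MiB is an arbitrary but ample size.
    let mut stack = vec![0u8; 1024 * 1024];
    // clone is unsafe (since nix 0.27) because the callback runs in a new
    // process context and must avoid operations that are not safe there.
    let child_pid = unsafe {
        clone(
            Box::new(|| {
                // Inside the child process.
                // In a real runtime, you would drop privileges here.
                // exec replaces the process image and returns only on failure.
                let err = Command::new("/bin/sh")
                    .args(["-c", "echo 'I am PID 1 in my world'; exit 0"])
                    .exec();
                eprintln!("exec failed: {err}");
                127
            }),
            &mut stack,
            flags,
            // Deliver SIGCHLD to the parent when the child exits.
            Some(Signal::SIGCHLD as i32),
        )
    }?;
    println!("Spawned container with PID {child_pid}");
    Ok(())
}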
The clone call creates a new process, much as fork does. The kernel attaches fresh namespace structs to it based on the flags. The child wakes up inside the closure. The parent receives the child's PID as seen from the parent's namespace. The child calls exec, which replaces the process image with /bin/sh. The namespaces persist across the exec. The shell runs in the isolated environment.
The child does not share the parent's memory. Without CLONE_VM it gets a copy-on-write copy of the address space, so whatever the closure captures is a snapshot taken at clone time. Ownership rules still apply on the parent's side: if you move a value into the closure and then use it in the parent, the compiler rejects the code with E0382.
The unsafe block is the boundary. Keep it small. Trust the kernel, verify the flags.
The supervisor pattern
Production runtimes use a two-process architecture. A supervisor process sets up the environment. It creates cgroups, configures mounts, and applies seccomp filters. Then it spawns the init process, which becomes PID 1 inside the container. The supervisor monitors the init process and cleans up resources when it exits.
This separation allows the supervisor to hold privileges needed for setup while the init process runs with dropped privileges. The supervisor also handles signal propagation. If you send a signal to the container, the supervisor forwards it to the init process.
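use nix::mount::{mount, MsFlags};
use nix::sched::{clone, CloneFlags};
use nix::sys::{signal::Signal, wait::waitpid};
use std::os::unix::process::CommandExt;
use std::process::Command;
/// Supervisor sets up a bind mount and spawns a shell in new mount and PID namespaces.
/// Written against the nix crate. Must run as root.
fn setup_and_run() -> Result<(), Box<dyn std::error::Error>> {
    // Create a temporary directory to act as the future root.
    let root_dir = std::env::temp_dir().join("container_root");
    std::fs::create_dir_all(&root_dir)?;
    let flags = CloneFlags::CLONE_NEWNS | CloneFlags::CLONE_NEWPID;
    let mut stack = vec![0u8; 1024 * 1024];
    // SAFETY: the callback only mounts and execs, and the parent is
    // single-threaded, so the child inherits no locks held by other threads.
    let child = unsafe {
        clone(
            Box::new(|| {
                let prepare = || -> nix::Result<()> {
                    // Keep mount changes from propagating back to the host.
                    mount(None::<&str>, "/", None::<&str>, MsFlags::MS_REC | MsFlags::MS_PRIVATE, None::<&str>)?;
                    // Bind the directory onto itself so it becomes a mount point.
                    mount(Some(root_dir.as_path()), root_dir.as_path(), None::<&str>, MsFlags::MS_BIND, None::<&str>)
                };
                if let Err(e) = prepare() {
                    eprintln!("mount setup failed: {e}");
                    return 1;
                }
                // pivot_root would go here in a full runtime.
                // To keep the example small, we just exec now.
                let err = Command::new("/bin/sh")
                    .args(["-c", "mount | grep container_root; exit 0"])
                    .exec();
                eprintln!("exec failed: {err}");
                127
            }),
            &mut stack,
            flags,
            Some(Signal::SIGCHLD as i32),
        )
    }?;
    // The supervisor reaps the child so it does not linger as a zombie.
    waitpid(child, None)?;
    Ok(())
}
The mount calls require CAP_SYS_ADMIN, which in practice means root on the host unless you also create a user namespace. Marking / as private keeps mount changes from propagating back to the host. The bind mount then turns the temporary directory into a mount point inside the namespace, which pivot_root requires of the new root. Other processes on the host never see it. The pivot_root syscall is the standard way to finalize the root filesystem. It makes the new directory the root and moves the old root to a mount point beneath it, where you can unmount it. nix exposes the call as nix::unistd::pivot_root.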
Use let _ = cleanup(); to signal you considered the result but chose to drop it. This pattern appears when removing temporary directories or closing file descriptors where the error is non-fatal.
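As a small illustration of that pattern, here is a sketch using only the standard library. Both cleanup and run_with_scratch_root are hypothetical helpers, and the file written inside the directory stands in for real container setup.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical cleanup step: removes the container's scratch directory.
fn cleanup(root: &Path) -> io::Result<()> {
    fs::remove_dir_all(root)
}

/// Creates a scratch root, does some work in it, and tears it down.
fn run_with_scratch_root(root: &Path) -> io::Result<()> {
    // Setup errors are fatal, so they propagate with `?`.
    fs::create_dir_all(root)?;
    fs::write(root.join("hostname"), "sandbox\n")?;
    // A cleanup failure is non-fatal: the work already succeeded,
    // so the result is dropped on purpose.
    let _ = cleanup(root);
    Ok(())
}
```

Without the let _ binding, the compiler warns about the unused Result. With it, a reader can tell the omission is deliberate.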
Seccomp and capability dropping
Isolation is incomplete without restricting system calls and privileges. Seccomp filters run in the kernel and check every syscall before execution. You can build a filter that allows only a whitelist of calls. This prevents the container from performing dangerous operations like loading kernel modules or accessing raw devices.
The seccompiler crate provides a high-level API for building seccomp filters. It is a pure-Rust crate from the rust-vmm project that compiles filters to BPF itself and does not link against libseccomp. You map syscall numbers to rules, compile the filter into a BPF program, and apply it to the current thread.
/// Applies a seccomp filter that allows only a few syscalls.
/// Syscall numbers come from the libc crate.
fn apply_seccomp() -> Result<(), Box<dyn std::error::Error>> {
    use seccompiler::{BpfProgram, SeccompAction, SeccompFilter, SeccompRule};
    use std::collections::BTreeMap;
    use std::convert::TryInto;
    // Allow read, write, exit, exit_group, and rt_sigreturn.
    // Deny everything else.
    let allowed = [
        libc::SYS_read,
        libc::SYS_write,
        libc::SYS_exit,
        libc::SYS_exit_group,
        libc::SYS_rt_sigreturn,
    ];
    // An empty rule list matches the syscall regardless of its arguments.
    let rules: BTreeMap<i64, Vec<SeccompRule>> =
        allowed.iter().map(|&nr| (nr as i64, Vec::new())).collect();
    let filter = SeccompFilter::new(
        rules,
        // Mismatch action: any syscall not listed fails with EPERM.
        SeccompAction::Errno(libc::EPERM as u32),
        // Match action: listed syscalls pass through.
        SeccompAction::Allow,
        std::env::consts::ARCH.try_into()?,
    )?;
    let program: BpfProgram = filter.try_into()?;
    // Once applied, the filter cannot be removed from this thread.
    seccompiler::apply_filter(&program)?;
    Ok(())
}
Capability dropping is equally important. Linux capabilities break down root privileges into distinct units. Most containers don't need CAP_SYS_ADMIN or CAP_NET_RAW. Drop all capabilities and add back only what's required. The prctl syscall with PR_CAPBSET_DROP removes a capability from the bounding set, and capset changes the permitted, effective, and inheritable sets. The setuid and setgid syscalls change the user and group IDs. Running as a non-root user inside the container narrows the paths to privilege escalation.
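One way to check that the drop worked is to read the CapEff line from /proc/self/status, which holds the effective capability set as a hexadecimal bitmask. The sketch below parses that line and tests individual bits. The function names are hypothetical. The capability numbers are the values from linux/capability.h.

```rust
/// Capability numbers from linux/capability.h.
const CAP_NET_RAW: u32 = 13;
const CAP_SYS_ADMIN: u32 = 21;

/// Extracts the effective capability mask from the text of /proc/<pid>/status.
fn parse_cap_eff(status: &str) -> Option<u64> {
    let line = status.lines().find(|l| l.starts_with("CapEff:"))?;
    let hex = line.trim_start_matches("CapEff:").trim();
    u64::from_str_radix(hex, 16).ok()
}

/// Returns true if capability number `cap` is set in `mask`.
fn has_cap(mask: u64, cap: u32) -> bool {
    mask & (1u64 << cap) != 0
}

/// Reads the effective capability mask of the current process.
fn current_cap_eff() -> Option<u64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    parse_cap_eff(&status)
}
```

After dropping privileges, a runtime could refuse to exec the workload when has_cap(mask, CAP_SYS_ADMIN) is still true.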
If you forget to drop privileges, the container runs as root. The compiler won't stop you. The OS will let you. You need to call setuid. A missing setuid won't trigger a warning. It triggers a breach.
Pitfalls and compiler signals
Container runtimes run with elevated privileges. Any memory safety bug can lead to a container escape. Rust's type system helps, but you must still handle OS-level invariants.
Zombie processes are a common issue. If the supervisor doesn't wait for the child, the child becomes a zombie once it exits. The process table entry remains until the parent reaps it. Use rustix::process::wait or waitpid to collect exit status. If you ignore signals, the supervisor might not respond to termination requests. Install signal handlers to forward signals to the init process.
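The reaping half of that advice can be sketched with the standard library alone. The example leaves out namespaces and signal forwarding and shows only the wait step: spawn a child, block until it exits, and turn its status into an exit code. The names exit_code_for and supervise are made up, and reporting a signal death as 128 plus the signal number is the shell convention, not something the kernel requires.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

/// Maps a child's exit status to the supervisor's own exit code.
/// A child killed by signal N is reported as 128 + N.
fn exit_code_for(status: ExitStatus) -> i32 {
    match (status.code(), status.signal()) {
        (Some(code), _) => code,
        (None, Some(signal)) => 128 + signal,
        (None, None) => 1,
    }
}

/// Spawns `program`, blocks until it exits, and returns the mapped exit code.
/// Waiting reaps the child, so it never lingers as a zombie.
fn supervise(program: &str, args: &[&str]) -> io::Result<i32> {
    let mut child = Command::new(program).args(args).spawn()?;
    let status = child.wait()?;
    Ok(exit_code_for(status))
}
```

A real supervisor would pass the result to std::process::exit so that callers see the container's outcome as if they had run the workload directly.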
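Signal handling in Rust requires care. A signal handler may only call async-signal-safe functions. You can't use println! or allocate memory inside one. Use a raw write to file descriptor 2 for error messages. The same caution applies to the closure passed to clone when the parent is multi-threaded, because the child inherits locks that no thread will ever release. The compiler checks none of this. These are contracts with the kernel and libc that you uphold yourself.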
If you drop down to libc and dereference a raw pointer, the compiler demands unsafe and throws E0133 if you forget the block. This error code reminds you that raw pointer access is unchecked. Keep libc usage in small helper functions. Document the safety contract with // SAFETY: comments. List the invariants that make the code safe.
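/// Drops every capability from the bounding set using prctl.
///
/// Needs CAP_SETPCAP. The permitted and effective sets are separate and are
/// cleared with capset, which this sketch leaves out.
pub fn drop_bounding_caps() -> Result<(), std::io::Error> {
    // 40 (CAP_CHECKPOINT_RESTORE) is the highest capability on Linux 5.9 and later.
    // Older kernels return EINVAL for numbers they do not know.
    let last_cap: libc::c_ulong = 40;
    for cap in 0..=last_cap {
        // SAFETY: PR_CAPBSET_DROP takes integer arguments only. No pointers
        // cross the FFI boundary, so the call cannot violate memory safety.
        let rc = unsafe { libc::prctl(libc::PR_CAPBSET_DROP, cap) };
        if rc != 0 {
            let err = std::io::Error::last_os_error();
            if err.raw_os_error() == Some(libc::EINVAL) {
                break; // Past the last capability this kernel supports.
            }
            return Err(err);
        }
    }
    Ok(())
}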
Treat the SAFETY comment as a proof. If you can't write it, you don't have one.
Decision matrix
Use rustix when you need safe, idiomatic access to Linux syscalls like unshare, mount, and setns. It wraps the kernel interface in Rust types and returns Result instead of raw error codes. For clone itself, use nix or libc.
Use libc when rustix lacks a specific syscall or flag you need. You'll write unsafe blocks and manage raw pointers, so keep the surface area tiny.
Use an existing runtime like runc when you're building an application that needs containers, not the container technology itself. Reinventing the isolation logic introduces security risks.
Use cgroups-rs when you need to manage resource limits via the cgroup filesystem. It handles the path construction and file I/O for CPU and memory controllers.
Use seccompiler when you need to build seccomp filters with a high-level API. It generates BPF programs and applies them safely.
Build the runtime only if you need to control the isolation. Otherwise, wrap runc.