When crate boundaries kill performance
You wrote a high-performance math function in a library crate. You call it from your main binary. You run a profiler and see the function taking a disproportionate amount of time. You check the source code. It is only three lines long. The logic is trivial. The slowdown comes from the function call itself. The function is neither generic nor marked #[inline], so the compiler could not inline it into a caller that lives in a different crate. Optimization stopped at the crate boundary.
Link Time Optimization (LTO) removes that boundary. It gives the optimizer a view of your entire program, including all dependencies, and allows it to inline functions, eliminate dead code, and reorder operations across crate boundaries. The result is often a smaller binary and faster execution, paid for with longer build times.
How Rust compiles in silos
Rust compiles code in units called crates. Each crate is processed independently: the compiler runs its optimization passes on one crate without seeing the code of the crates that will use it. This modular approach keeps build times predictable and lets independent crates compile in parallel. It also creates a blind spot.
When rustc optimizes a library crate, it does not know how the library will be used. It assumes any public function might be called. It cannot inline a function into a caller it cannot see. It cannot delete a function that might be used by a binary it has not processed yet. The optimizer works with partial information.
LTO changes the game. Instead of finishing optimization crate by crate, the compiler keeps each crate's LLVM intermediate representation (IR, stored as bitcode). When rustc builds the final binary, it gathers the IR from every crate and hands it to the LLVM optimizer as one program, before the system linker runs. LLVM sees the whole program. It can inline functions across crates. It can delete code that is never called. It can merge constants and simplify control flow across the entire dependency graph.
Think of it like a team of writers. Each writer produces a chapter. Without LTO, an editor polishes each chapter in isolation. Sentences might be repetitive across chapters. Transitions might be clunky. With LTO, a final editor reviews the complete manuscript. They rewrite transitions, delete redundant paragraphs, and tighten the prose globally. The story flows better.
Minimal example: enabling LTO
LTO is configured in your Cargo.toml. You enable it per profile. It is almost always used in release builds. Debug builds rarely benefit because the overhead of LTO slows iteration, and debuggers struggle with heavily optimized code.
```toml
[profile.release]
# Enable LTO for the release profile. true is the same as "fat":
# whole-program LTO over the final binary and every Rust crate it links.
lto = true
```
The lto = true setting tells Cargo to pass the matching -C lto flag to rustc when it builds the final artifact. Cargo handles the details. You do not need to invoke the linker with special flags yourself.
Here is a library function that benefits from LTO.
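```rust
// lib.rs (crate math_lib)
/// Computes the square root using Newton's method, a fixed-point iteration.
/// The function is small enough to inline, but because it is neither generic
/// nor marked #[inline], the compiler cannot inline it across crates without LTO.
pub fn fast_sqrt(x: f64) -> f64 {
    // 0.0 would produce 0.0 / 0.0 = NaN below; negative inputs have no real root.
    if x <= 0.0 {
        return if x == 0.0 { 0.0 } else { f64::NAN };
    }
    // Ten iterations suit moderate inputs; very large x needs more to converge.
    let mut guess = x / 2.0;
    for _ in 0..10 {
        guess = (guess + x / guess) / 2.0;
    }
    guess
}
```

```rust
// main.rs
use math_lib::fast_sqrt;

fn main() {
    // Without LTO, this is a function call.
    // With LTO, the compiler may inline fast_sqrt directly here.
    let result = fast_sqrt(4.0);
    println!("Result: {}", result);
}
```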
When you build with cargo build --release, rustc has LLVM IR for both crates and merges it before the final link. LLVM sees that fast_sqrt is called exactly once, so it can inline the loop into main. The call instruction disappears, and so does the function's stack frame setup. The CPU runs the loop in place instead of jumping to a separate function and back. With a constant argument like 4.0, LLVM may even compute the result at compile time.
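If you control the library, there is also a per-function alternative to whole-program LTO. Marking a non-generic function #[inline] makes its body available to downstream crates, so they can inline it without LTO; generic functions already behave this way because they are instantiated in the calling crate. The sketch below is a single-file stand-in in which a module plays the role of the math_lib crate (an assumption made only so the example runs on its own).

```rust
// Single-file sketch: the math_lib module stands in for a separate
// library crate (hypothetical layout chosen to keep the example runnable).
mod math_lib {
    /// Square root by Newton's method (a fixed-point iteration).
    /// #[inline] makes the body available to other crates, so callers
    /// can inline it even without LTO.
    #[inline]
    pub fn fast_sqrt(x: f64) -> f64 {
        // 0.0 / 0.0 would be NaN; negative inputs have no real root.
        if x <= 0.0 {
            return if x == 0.0 { 0.0 } else { f64::NAN };
        }
        let mut guess = x / 2.0;
        for _ in 0..10 {
            guess = (guess + x / guess) / 2.0;
        }
        guess
    }
}

use math_lib::fast_sqrt;

fn main() {
    let result = fast_sqrt(4.0);
    println!("Result: {}", result);
}
```

The trade-off is granularity: #[inline] is a hint you place on individual functions, while LTO lets the optimizer make the call for every function in the program.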
LTO turns the final link step into a whole-program optimization pass. Use it when the cost of building is less than the cost of running.
Dead code elimination: the hidden win
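Inlining is the headline feature of LTO. Dead code elimination (DCE) is often the bigger win. Rust crates export many functions, and you might use only a fraction of them. When the compiler optimizes a library crate on its own, it cannot delete a public function, because it does not know which callers will exist. The linker's section garbage collection, which rustc enables by default, does drop functions that nothing references at all. It works on whole functions, though, and cannot see through a call: code that only becomes unreachable after cross-crate inlining or constant propagation survives.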
LTO sees the final binary. It knows exactly which functions are called, including after inlining has removed call sites. If a function is never used, LTO deletes it. This applies recursively: if function A calls function B, A is deleted, and nothing else calls B, then B is deleted too.
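The recursive rule above is a reachability question over the call graph. Here is a toy model, not LLVM's implementation: the function names and the graph are hypothetical, and the model simply keeps whatever is reachable from main and reports the rest as dead.

```rust
use std::collections::{HashMap, HashSet};

/// Returns every function reachable from `root` in a call graph.
/// Anything outside this set is dead and could be deleted.
/// The graph is a hypothetical stand-in for real compiler data.
fn reachable<'a>(calls: &HashMap<&'a str, Vec<&'a str>>, root: &'a str) -> HashSet<&'a str> {
    let mut live = HashSet::new();
    let mut stack = vec![root];
    while let Some(f) = stack.pop() {
        // insert returns false if f was already visited.
        if live.insert(f) {
            if let Some(callees) = calls.get(f) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    live
}

fn main() {
    let mut calls: HashMap<&str, Vec<&str>> = HashMap::new();
    calls.insert("main", vec!["parse_json"]);
    calls.insert("parse_json", vec!["read_token"]);
    calls.insert("parse_yaml", vec!["read_token", "yaml_anchor"]);
    calls.insert("read_token", vec![]);
    calls.insert("yaml_anchor", vec![]);

    let live = reachable(&calls, "main");
    let mut dead: Vec<&str> = calls.keys().copied().filter(|f| !live.contains(f)).collect();
    dead.sort();
    println!("dead: {:?}", dead);
}
```

This prints dead: ["parse_yaml", "yaml_anchor"]. Note that read_token survives even though parse_yaml is deleted, because parse_json still calls it.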
This matters for large dependencies. Suppose you depend on a crate that supports multiple serialization formats through a single entry point, and you only ever ask for JSON. Because that entry point references the YAML, TOML, and binary code paths, the linker has to keep all of them. With LTO, the optimizer can inline the entry point, see that only the JSON path is reachable, and strip the rest. In large dependency graphs the size savings can be substantial.
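A hypothetical sketch of that code shape follows. The Format enum, the serialize function, and the output strings are illustrative assumptions, not any real serialization crate's API, and whether LTO actually removes the unused branches depends on LLVM's inlining decisions.

```rust
// Hypothetical library API: one entry point for several formats.
// Because serialize references every branch, whole-function garbage
// collection in the linker must keep all of them.
#[derive(Clone, Copy)]
pub enum Format {
    Json,
    Yaml,
    Toml,
}

pub fn serialize(format: Format, key: &str, value: i64) -> String {
    match format {
        Format::Json => format!("{{\"{}\": {}}}", key, value),
        Format::Yaml => format!("{}: {}", key, value),
        Format::Toml => format!("{} = {}", key, value),
    }
}

fn main() {
    // The application only ever asks for JSON. With LTO, the optimizer
    // can inline serialize here, see that format is always Format::Json,
    // and drop the YAML and TOML branches.
    let out = serialize(Format::Json, "answer", 42);
    println!("{}", out);
}
```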
DCE also removes unused trait implementations. If you implement a trait for a type but never use that implementation, LTO can discard it. This keeps the binary lean.
Convention aside: pair LTO with strip = true in your release profile. LTO shrinks the code, but it leaves symbol names and any debug information in the file. Stripping removes symbols you do not need in production. Together they produce a much smaller binary.
```toml
[profile.release]
lto = true
strip = true
```
Thin LTO versus Fat LTO
LLVM supports two modes of LTO. Fat LTO merges all IR into a single module and runs optimization passes over the entire program at once. This maximizes optimization potential. It also maximizes memory usage and build time, because most of that work runs on a single thread. For large projects with hundreds of crates, Fat LTO can exhaust system memory or add minutes to the build.
Thin LTO partitions the work. It first builds a compact summary of every module: which functions it defines, how large they are, and what they call. Using only those summaries, it decides which functions to import into which modules so they can be inlined across crate boundaries. It then optimizes the modules in parallel, each with just the imported functions it needs. Thin LTO uses less memory and builds faster, and for many workloads its results come close to Fat LTO.
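The summary-then-parallel structure can be illustrated with a toy model of the first phase. Everything here (the FnSummary struct, the size numbers, the single size threshold, the parse_config and config_lib names) is an assumption for illustration; LLVM's real function importing uses its own summaries and more elaborate heuristics.

```rust
use std::collections::BTreeMap;

/// One entry in a simplified, hypothetical summary index: which module
/// defines a function, roughly how large it is, and what it calls.
struct FnSummary {
    module: &'static str,
    size: u32,
    calls: Vec<&'static str>,
}

/// Phase 1, heavily simplified: looking only at summaries, decide which
/// small cross-module callees each module should import for inlining.
/// threshold is an assumed size cutoff, not LLVM's actual rule.
fn plan_imports(
    index: &BTreeMap<&'static str, FnSummary>,
    threshold: u32,
) -> BTreeMap<&'static str, Vec<&'static str>> {
    let mut imports: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for summary in index.values() {
        for &callee in &summary.calls {
            if let Some(target) = index.get(callee) {
                let cross_module = target.module != summary.module;
                if cross_module && target.size <= threshold {
                    let list = imports.entry(summary.module).or_default();
                    if !list.contains(&callee) {
                        list.push(callee);
                    }
                }
            }
        }
    }
    imports
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert("main", FnSummary { module: "app", size: 40, calls: vec!["fast_sqrt", "parse_config"] });
    index.insert("fast_sqrt", FnSummary { module: "math_lib", size: 12, calls: vec![] });
    index.insert("parse_config", FnSummary { module: "config_lib", size: 900, calls: vec![] });

    // Phase 2 (not modeled): each module is optimized in parallel,
    // with only its imported functions pulled in.
    let plan = plan_imports(&index, 100);
    for (module, fns) in &plan {
        println!("{} imports {:?}", module, fns);
    }
}
```

Here only the small fast_sqrt is imported into the app module; the large parse_config stays a plain call. Because each module's plan depends only on the summaries, the expensive optimization work can then proceed module by module in parallel.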
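In Cargo, lto = true does not select Thin LTO: it is an alias for lto = "fat", whole-program Fat LTO. Thin LTO across all crates must be requested by name. Leaving lto unset (equivalent to false) still gives you "thin local" LTO, but it only works across the codegen units of each crate, never across crates; lto = "off" disables even that.
```toml
[profile.release]
# Thin LTO across every crate in the program.
# lto = true or lto = "fat" would select Fat LTO instead.
lto = "thin"
```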
Fat LTO is rarely worth it. Thin LTO usually captures most of the inlining and DCE wins, and its parallelism keeps build times manageable. Reach for Fat LTO only when profiling shows a specific bottleneck that Thin LTO missed, and you have verified that Fat LTO resolves it.
Use Thin LTO when build time and memory are constraints. Use Fat LTO when you are optimizing a final release binary for a performance-critical application and have resources to spare. Pick Thin LTO for CI pipelines to keep feedback loops fast. Do not bother setting LTO in a library crate's own profile: Cargo only reads [profile] settings from the workspace root, so consumers decide when they build their binaries.
Pitfalls and trade-offs
LTO is not free. The primary cost is build time. Optimization passes run over more code at once, and the final compilation step does far more work. Build times can increase by 50% to 200%, depending on the project size. If your build already takes ten minutes, LTO might push it to twenty.
Memory usage also increases. The final compilation step must hold the program's IR in memory, and Fat LTO holds all of it in a single module. Large projects can trigger out-of-memory kills. If you see the build process terminated by the OS, reduce parallelism or switch to Thin LTO.
LTO interacts with debugging. Inlined functions make stack traces harder to read. Source-level debugging may jump around unpredictably. If you need to debug a release build, consider the split-debuginfo profile setting. It stores debug information separately from the binary, keeping the binary small while preserving debuggability.
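LTO does not work well with dynamic linking. If you build a shared library, LTO is limited. The optimizer cannot work across shared library boundaries, because the code that calls the library is compiled separately and only meets it when the library is loaded at runtime. LTO applies to the shared library itself, but not to code that will be linked against it later. Every exported function must also remain as a callable entry point, so it cannot be deleted even if the library never calls it internally. If you distribute a .so or .dll, LTO benefits are reduced.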
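Convention aside: keep LTO configuration in Cargo.toml, not in environment variables. Explicit configuration is reproducible. If you need builds without LTO, define a custom profile in Cargo.toml with inherits = "release" and lto = false, then select it with cargo build --profile followed by the profile name. For a one-off override, cargo build --release --config profile.release.lto=false changes the setting on the command line. Do not rely on implicit defaults.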
Building the compiler itself
You may also come across LTO settings in bootstrap.toml and the ./configure script. These are for building the Rust compiler toolchain, not for application code. If you are compiling rustc from source, you configure LTO for the compiler binary itself. This reduces the size of the compiler and can improve its performance.
In the bootstrap configuration, you set rust.lto = "thin" or rust.lto = "fat". The ./configure script accepts --set rust.lto=thin. This applies LTO to the compiler binary. It does not affect crates you compile with that compiler. Application LTO is always controlled by Cargo.toml or by rustc flags passed when compiling your code.
Treat the compiler build configuration as separate from your project configuration. Do not mix bootstrap.toml settings with Cargo.toml profiles.
Decision matrix
Use LTO when binary size is critical, such as for embedded targets, WebAssembly modules, or distribution packages where download size matters.
Use LTO when profiling shows performance bottlenecks caused by cross-crate function call overhead, especially for small, frequently called functions.
Use Thin LTO (lto = "thin") when you want the benefits of LTO without excessive build time or memory usage. It is the recommended starting point for most projects, although it is not Cargo's default.
Use Fat LTO when you have measured that Thin LTO leaves a specific optimization opportunity on the table and you have sufficient RAM to handle the load.
Skip LTO when build time is the primary constraint and performance is already acceptable; rapid iteration during development benefits from faster builds.
Skip LTO for shared libraries, where cross-boundary optimization is limited and the build cost outweighs the gains.
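The matrix can be written down as a small decision function. This is a sketch: the Situation fields are hypothetical yes/no answers to the questions above, and Skip maps to Cargo's default lto = false, which still performs thin local LTO within each crate.

```rust
/// The Cargo lto settings the decision matrix chooses between.
#[derive(Debug, PartialEq)]
enum LtoChoice {
    /// lto = false, Cargo's default: thin local LTO inside each crate only.
    Skip,
    /// lto = "thin": Thin LTO across all crates.
    Thin,
    /// lto = "fat", the same as lto = true.
    Fat,
}

impl LtoChoice {
    /// The value to write after "lto =" in [profile.release].
    fn cargo_value(&self) -> &'static str {
        match self {
            LtoChoice::Skip => "false",
            LtoChoice::Thin => "\"thin\"",
            LtoChoice::Fat => "\"fat\"",
        }
    }
}

/// Hypothetical yes/no answers to the questions the matrix asks.
struct Situation {
    shared_library: bool,
    size_critical: bool,
    cross_crate_bottleneck: bool,
    thin_measured_insufficient: bool,
    ram_to_spare: bool,
}

fn choose_lto(s: &Situation) -> LtoChoice {
    // Shared libraries: limited cross-boundary gains, full build cost.
    if s.shared_library {
        return LtoChoice::Skip;
    }
    // No size pressure and no cross-crate call overhead: keep builds fast.
    if !(s.size_critical || s.cross_crate_bottleneck) {
        return LtoChoice::Skip;
    }
    // Fat only after Thin has been measured and found lacking.
    if s.thin_measured_insufficient && s.ram_to_spare {
        LtoChoice::Fat
    } else {
        LtoChoice::Thin
    }
}

fn main() {
    // Example: an embedded firmware image where size is critical.
    let firmware = Situation {
        shared_library: false,
        size_critical: true,
        cross_crate_bottleneck: false,
        thin_measured_insufficient: false,
        ram_to_spare: false,
    };
    println!("lto = {}", choose_lto(&firmware).cargo_value());
}
```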
Trust the optimizer to delete what you do not use, but pay the build time tax. LTO is a trade-off, not a magic bullet. Measure before and after. If the binary shrinks by 20% and runs 5% faster, but builds 3x slower, decide whether the runtime gain justifies the developer time cost.
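One rough way to frame that decision is a break-even count: how many program runs it takes for the runtime saved to repay the extra build time. The sketch below reads "runs 5% faster" as "takes 5% less time per run" and plugs the ratios above into an assumed 10-minute build and 60-second run; all inputs are hypothetical measurements you would replace with your own.

```rust
/// How many program runs it takes for LTO's runtime savings to repay
/// its extra build time. Every input is a hypothetical measurement;
/// nothing here is a benchmark result.
///
/// * build_secs: release build time without LTO.
/// * build_slowdown: build time multiplier with LTO (3.0 = "3x slower").
/// * run_secs: time for one program run without LTO.
/// * runtime_saved: fraction of run time LTO saves (0.05 = 5%).
fn break_even_runs(build_secs: f64, build_slowdown: f64, run_secs: f64, runtime_saved: f64) -> f64 {
    let extra_build = build_secs * (build_slowdown - 1.0);
    let saved_per_run = run_secs * runtime_saved;
    if saved_per_run <= 0.0 {
        // LTO never pays for itself if it saves no run time.
        return f64::INFINITY;
    }
    extra_build / saved_per_run
}

fn main() {
    // 3x slower builds and 5% faster runs, applied to an assumed
    // 10-minute build and a 60-second program run.
    let runs = break_even_runs(600.0, 3.0, 60.0, 0.05);
    println!("LTO pays for itself after about {:.0} runs", runs);
}
```

With these assumed numbers the answer is 400 runs: 1,200 extra seconds of build time divided by 3 seconds saved per run. The calculation ignores the developer time spent waiting on slower builds, which is the cost the paragraph above asks you to weigh.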