How to Process Large Files Efficiently in Rust

The 10-gigabyte trap

You have a 10-gigabyte log file. You write let content = fs::read_to_string(path)?; and run it. Your laptop fan hits jet engine mode. The process dies with an out-of-memory error. You didn't just waste RAM. You made your program fragile. Loading a file entirely into memory works for a config file. It breaks the moment the file outgrows the memory you have available.

The fix isn't more RAM. The fix is streaming. You read the file in small chunks, process each chunk, and discard it before moving to the next. Your memory usage stays flat. You can process a terabyte file on a Raspberry Pi if the logic per chunk is light.

Streaming vs loading

Think of a file like a long conveyor belt of data. read_to_string tries to stop the belt and pile every item onto one table. If the pile gets too big, the table collapses. Streaming keeps the belt moving. You grab an item, process it, put it aside, and grab the next one. The table never fills up.

Rust gives you the tools to build this conveyor belt. The core pattern is BufReader. It wraps a file handle and manages a small buffer. You pull data from the buffer. When the buffer empties, Rust refills it from the disk. You never see the whole file. You only see the current chunk.
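Before adding line handling, the chunk-at-a-time idea can be sketched with nothing but the Read trait. This is a minimal illustration, not the only way to do it: count_bytes is a hypothetical helper name, and the 8KB chunk size is an arbitrary choice. Memory use is one fixed array regardless of how large the source is.

```rust
use std::io::{self, Read};

/// Counts the bytes in a source while holding only one fixed-size chunk in memory.
/// (Hypothetical helper for illustration.)
fn count_bytes<R: Read>(mut source: R) -> io::Result<u64> {
    // One reusable chunk. Its size does not depend on the size of the source.
    let mut chunk = [0u8; 8 * 1024];
    let mut total = 0u64;

    loop {
        let n = match source.read(&mut chunk) {
            Ok(0) => break, // EOF
            Ok(n) => n,
            // An interrupted read is safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };

        // Only chunk[..n] holds valid data for this iteration.
        // Real processing of that slice would go here.
        total += n as u64;
    }

    Ok(total)
}
```

Any type that implements Read works as the source, so the same function accepts a File, a socket, or an in-memory byte slice.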

The cost of system calls

Buffering isn't just about memory. It's about speed. Every unbuffered read on a File is a system call. A system call switches the CPU from user mode to kernel mode. That switch costs time. If you read one byte at a time from a large file, you make millions of system calls. The CPU spends more time switching modes than reading data.

BufReader solves this. It reads a large block, like 8 kilobytes, in one system call. It stores that block in memory. Your code reads from the memory buffer. No more system calls until the buffer empties. You reduce the number of system calls by a factor of thousands. The throughput jumps.
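One way to see the effect without tracing real system calls is to count how often the underlying source is asked for data. The sketch below assumes a hypothetical CountingReader wrapper and uses an in-memory byte slice as a stand-in for a file, so it counts calls to read rather than actual kernel transitions.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical wrapper that counts how often `read` is called on the inner source.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consumes `data` one byte at a time, first unbuffered and then through a BufReader.
/// Returns (unbuffered read calls, buffered read calls).
fn compare_read_calls(data: &[u8]) -> io::Result<(usize, usize)> {
    // Unbuffered: every requested byte goes straight to the source.
    let mut raw = CountingReader { inner: data, calls: 0 };
    for byte in (&mut raw).bytes() {
        byte?;
    }

    // Buffered: the source is only touched when the buffer runs dry.
    let mut buffered = BufReader::new(CountingReader { inner: data, calls: 0 });
    for byte in (&mut buffered).bytes() {
        byte?;
    }

    Ok((raw.calls, buffered.get_ref().calls))
}
```

With a real File in place of the slice, each counted call would correspond to a read system call, which is where the savings come from.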

Minimal example

Here is the standard pattern for reading a large text file line by line.

use std::fs::File;
use std::io::{BufRead, BufReader};

/// Reads a large file line by line without loading it all into memory.
fn process_log(path: &str) -> std::io::Result<()> {
    // Open the file. This creates a handle (file descriptor).
    // No data is read yet.
    let file = File::open(path)?;

    // Wrap the file in a BufReader.
    // This allocates a buffer (default 8KB) to minimize system calls.
    let reader = BufReader::new(file);

    // Iterate over lines.
    // `lines()` returns an iterator that yields Results.
    // Each line is allocated, processed, and dropped.
    for line_result in reader.lines() {
        // Unwrap the Result. If reading fails, propagate the error.
        let line = line_result?;

        // Process the line.
        // The String is dropped at the end of this loop iteration.
        println!("{line}");
    }

    Ok(())
}

Trust the iterator. It manages the buffer for you. You focus on the logic per line.

How the buffer works

When you call File::open, Rust talks to the OS and gets a file descriptor. The OS tracks the current position in the file. BufReader::new allocates a buffer, usually 8 kilobytes.

When you call lines() and the loop asks for the next item, Rust checks the buffer. If the buffer has data, it scans for a newline. It copies the bytes up to that newline into a new String, validates them as UTF-8, and strips the line ending. It yields that String. The String is dropped at the end of the loop iteration. The buffer still holds the rest of the data.

If the buffer runs out of data, Rust performs a read system call. It fills the buffer with the next chunk from the disk. It continues scanning. This cycle repeats until the file ends. The memory footprint is the buffer size plus one line. It never grows with the file size.
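The refill-and-scan cycle described above can be driven by hand with fill_buf and consume, the two methods that BufRead is built on. This is a sketch for illustration only; count_newlines is a hypothetical helper, and in normal code lines() or read_line does this work for you.

```rust
use std::io::{self, BufRead};

/// Counts newline bytes by inspecting the reader's internal buffer directly.
/// (Hypothetical helper for illustration.)
fn count_newlines<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;

    loop {
        // Borrow whatever is currently buffered.
        // If the buffer is empty, this refills it from the source.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // EOF
        }

        // Scan the buffered bytes without copying them.
        count += chunk.iter().filter(|&&b| b == b'\n').count();

        // Tell the reader these bytes are done so the next call fetches new data.
        let used = chunk.len();
        reader.consume(used);
    }

    Ok(count)
}
```

Because nothing is copied out of the buffer, the only memory in play is the buffer itself.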

Realistic processing

Real code does more than print. You might filter, transform, or aggregate. Here is a pattern for counting errors in a log file.

use std::fs::File;
use std::io::{BufRead, BufReader};

/// Counts lines containing "ERROR" in a large log file.
/// Returns the count or an IO error if the file cannot be read.
fn count_errors(path: &str) -> std::io::Result<usize> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);

    let mut count = 0;

    // Iterate lines.
    // `lines()` handles UTF-8 decoding and newline detection.
    for line_result in reader.lines() {
        let line = line_result?;

        // Check condition.
        // `contains` is efficient for short substrings.
        if line.contains("ERROR") {
            count += 1;
        }
    }

    Ok(count)
}

Keep the buffer small. Keep the logic fast. The bottleneck is often disk I/O rather than CPU, so don't over-optimize the line processing unless profiling shows otherwise.
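If profiling does point at the per-line allocation, one common adjustment is to reuse a single String with read_line instead of letting lines() allocate a fresh one each time. The sketch below is a variant of the counter above under that assumption; count_errors_reusing is a hypothetical name, and the function is generic over BufRead so it accepts a BufReader<File> or an in-memory slice.

```rust
use std::io::{self, BufRead};

/// Counts lines containing "ERROR" while reusing one String for every line.
/// (Hypothetical variant for illustration.)
fn count_errors_reusing<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut line = String::new();
    let mut count = 0;

    loop {
        // read_line appends, so clear the previous line first.
        // The String keeps its capacity, so most iterations do not allocate.
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // EOF
        }

        // Unlike lines(), read_line keeps the trailing newline in `line`.
        if line.contains("ERROR") {
            count += 1;
        }
    }

    Ok(count)
}
```

Whether this is measurably faster depends on the workload, so treat it as something to benchmark rather than a default.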

Convention: buffer size

The default buffer size is 8 kilobytes. That works for most cases. You can tune it with BufReader::with_capacity.

use std::fs::File;
use std::io::BufReader;

// Use a larger buffer for network streams or very large lines.
let file = File::open("log.txt")?;
let reader = BufReader::with_capacity(64 * 1024, file);

The community convention is to stick with the default unless you have a reason to change it. 8KB is a multiple of the 4KB page and block sizes common on current systems, and it is small enough to stay in CPU cache. If you are reading from a network socket or a slow disk, a larger buffer means fewer read calls. If your lines are consistently larger than 8KB, each line needs several refills, and a larger buffer helps. Don't guess. Measure.
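A rough way to measure is to time the same pass with different capacities. The helper below is a hypothetical sketch, not a benchmark harness: time_line_count is an invented name, and a single wall-clock run is noisy, so repeat it several times and be aware that the OS file cache makes later runs faster.

```rust
use std::io::{self, BufRead, BufReader, Read};
use std::time::{Duration, Instant};

/// Counts lines using the given buffer capacity and reports the elapsed time.
/// (Hypothetical helper for illustration.)
fn time_line_count<R: Read>(source: R, capacity: usize) -> io::Result<(usize, Duration)> {
    let start = Instant::now();
    let reader = BufReader::with_capacity(capacity, source);

    let mut lines = 0;
    for line in reader.lines() {
        line?; // Propagate read or decoding errors.
        lines += 1;
    }

    Ok((lines, start.elapsed()))
}
```

Calling it with File::open(path)? and capacities such as 8 * 1024 and 64 * 1024 gives a first impression of whether tuning is worth pursuing for your data.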

Pitfalls and errors

Forgetting the buffer

You might try to call lines() directly on a File. That won't compile. File doesn't implement BufRead.

The compiler rejects this with E0599, with a message along the lines of "the method lines exists for struct File, but its trait bounds were not satisfied", pointing at the missing File: BufRead bound. The fix is to wrap the file in BufReader.

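use std::fs::File;
use std::io::{BufRead, BufReader};

// This fails to compile: File does not implement BufRead.
// let reader = File::open("log.txt")?;
// for line in reader.lines() { ... }

// This works: BufReader implements BufRead.
let file = File::open("log.txt")?;
let reader = BufReader::new(file);
for line in reader.lines() {
    let line = line?;
    println!("{line}");
}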

The memory bomb

lines() allocates a String for each line. If your file has no newlines, lines() keeps growing the string until it hits the end of the file. A 10GB file with no newlines becomes a 10GB string. You get an out-of-memory error.

Validate your assumptions. If the file format might lack newlines, check the line length or use a chunked reader.

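use std::fs::File;
use std::io::{BufRead, BufReader, Error, ErrorKind, Read};

// Longest line (including its newline) that we are willing to hold in memory.
const MAX_LINE_BYTES: u64 = 1_000_000;

fn safe_read(reader: &mut BufReader<File>) -> std::io::Result<()> {
    let mut buffer = String::new();

    loop {
        // read_line on its own keeps reading until a newline or EOF,
        // however long that takes. take() caps the bytes one call may pull in.
        let bytes_read = reader
            .by_ref()
            .take(MAX_LINE_BYTES + 1)
            .read_line(&mut buffer)?;

        if bytes_read == 0 {
            break; // EOF
        }

        // More than the limit was read, so the line is oversized.
        // Error out here; truncating and skipping ahead is another option.
        if buffer.len() as u64 > MAX_LINE_BYTES {
            return Err(Error::new(ErrorKind::InvalidData, "line exceeds size limit"));
        }

        // Process buffer here.
        // Clear for next iteration.
        buffer.clear();
    }

    Ok(())
}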

Encoding issues

lines() assumes UTF-8. If a line contains invalid UTF-8, lines() yields an error of kind InvalidData for that line. With ?, the function returns at the first bad line. Binary files or files in other encodings will stop this loop early.

If you expect non-UTF-8 data, use read_until with a byte delimiter, or use a crate like encoding_rs to handle decoding. Don't assume text files are always UTF-8. Old logs might be Latin-1.
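As a sketch of the read_until route, the function below splits on the newline byte and decodes each line lossily, so invalid sequences become the replacement character U+FFFD instead of an error. count_matching_lossy is a hypothetical name, and lossy decoding is an assumption that only suits cases where the text you search for is plain ASCII; it does not convert Latin-1 correctly, which is where a crate like encoding_rs would come in.

```rust
use std::io::{self, BufRead};

/// Counts lines containing `needle`, tolerating invalid UTF-8 in the input.
/// (Hypothetical helper for illustration.)
fn count_matching_lossy<R: BufRead>(mut reader: R, needle: &str) -> io::Result<usize> {
    let mut raw = Vec::new();
    let mut count = 0;

    loop {
        // read_until appends raw bytes up to and including the delimiter.
        raw.clear();
        if reader.read_until(b'\n', &mut raw)? == 0 {
            break; // EOF
        }

        // Invalid byte sequences are replaced with U+FFFD rather than failing.
        let line = String::from_utf8_lossy(&raw);
        if line.contains(needle) {
            count += 1;
        }
    }

    Ok(count)
}
```

The same memory caveat applies as with lines(): read_until also reads until it finds the delimiter, so input without newlines still needs a size cap.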

Result handling in loops

lines() yields Result<String>. The ? operator propagates errors. If a read fails halfway through a 10GB file, the function returns immediately. You lose the count.

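If you need to skip bad lines, handle the error explicitly. Skip only decoding errors. A genuine I/O error can repeat on every call, so propagate it instead of looping on it.

for line_result in reader.lines() {
    match line_result {
        Ok(line) => {
            // Process line
        }
        // Invalid UTF-8: the reader has already moved past this line.
        Err(e) if e.kind() == std::io::ErrorKind::InvalidData => {
            eprintln!("Skipping bad line: {e}");
        }
        // Any other I/O error may never clear, so stop here.
        Err(e) => return Err(e),
    }
}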

Validate your assumptions. A file without newlines is a memory bomb. A file with bad encoding is a crash waiting to happen. Handle the edge cases.

Decision matrix

Use BufReader when processing text files line by line or in fixed chunks. Use fs::read or read_to_string when the file is small and you need random access or the whole content at once. Use BufWriter when writing large amounts of data to avoid the cost of frequent system calls. Use memory mapping with memmap2 when you need random access to a large file without loading it all, and you are comfortable with unsafe or a safe wrapper. Reach for read or read_exact with a fixed-size buffer when you are processing binary data in a stream and don't care about lines. Avoid read_to_end there: like read_to_string, it loads the whole input into memory.
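The BufWriter case mirrors the reading side. As a minimal sketch, assuming a hypothetical write_numbers helper, many small writes are collected in memory and handed to the underlying sink in larger blocks.

```rust
use std::io::{self, BufWriter, Write};

/// Writes the numbers 0..n, one per line, through a buffered writer.
/// (Hypothetical helper for illustration.)
fn write_numbers<W: Write>(sink: W, n: u32) -> io::Result<()> {
    let mut writer = BufWriter::new(sink);

    for i in 0..n {
        // Each writeln! lands in the in-memory buffer, not directly in the sink.
        writeln!(writer, "{i}")?;
    }

    // Flush explicitly so a write error surfaces here.
    // Dropping a BufWriter also flushes, but it ignores any error.
    writer.flush()
}
```

Passing File::create(path)? as the sink turns this into a file writer; the explicit flush at the end is the part that is easy to forget.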

Pick the tool that matches the data shape. Streaming wins for large files. Loading wins for small files.

Recap

Don't read a massive file all at once. That ties memory use to file size, and the process dies once the file outgrows your RAM. Read it line by line or in small blocks instead, like reading a book one page at a time rather than holding every page open at once. Memory use stays flat however big the file gets.