How to Build a Custom Parser in Rust

Build a custom parser in Rust by defining a lexer to tokenize input, then implementing a recursive descent parser that consumes tokens to construct an Abstract Syntax Tree (AST). Use the `nom` crate for combinator-based parsing or write a manual parser using `match` expressions on token types.

When strings aren't enough

You have a configuration file with max_retries = 3. You need to extract that 3. You reach for split(" = ") and it works. Then someone writes max_retries=3 or max_retries = 3 # comment. Your split breaks. You add more logic. Then you need to parse 1 + 2 * 3 and respect operator precedence. String manipulation turns into a tangled mess of indices and edge cases.

That's when you build a parser. A parser takes raw text and turns it into a structured representation your program can reason about. In Rust, the standard approach is a two-stage process: a lexer chops the text into tokens, and a parser assembles those tokens into a tree.

The factory analogy

Think of parsing as a factory with two assembly lines. The lexer is the cutting machine. It takes a raw log of text and slices it into standardized blocks called tokens. It doesn't care about meaning. It just knows that "123" is a number block, "+" is a plus block, and spaces are scrap to be discarded.

The parser is the assembly line. It grabs tokens from the lexer and snaps them together according to the rules of your grammar. If the grammar says an expression is a number followed by an operator followed by another number, the parser builds a structure that represents that relationship. The output is usually an Abstract Syntax Tree, or AST. The AST is the blueprint your program uses to execute logic, evaluate expressions, or transform data.

Tokenizing with a lexer

The lexer's job is to turn a string into a stream of tokens. The tricky part is handling multi-character tokens like numbers. If you see a 1, you don't know if it's the number 1 or the start of 123 until you look ahead.

Rust's Chars iterator gives you characters one by one, but it doesn't support "putting back" a character once you've consumed it. If you grab a 1 and then see a +, you can't return the + to the stream. The idiomatic solution is Peekable. It wraps an iterator and lets you inspect the next item without consuming it.

Here's the token definition and the lexer structure. We derive Copy on Token so we can pass tokens around without worrying about ownership moves. This avoids E0382 (use of moved value) errors when the parser reads the same token multiple times.

use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq, Clone, Copy)]
// Copy allows passing tokens by value without ownership issues.
enum Token {
    Number(i32),
    Plus,
    Minus,
    Eof,
}

struct Lexer<'a> {
    // Peekable wraps Chars to support look-ahead without consuming.
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars().peekable() }
    }
}

The lexer borrows the input string via the lifetime 'a. This means the lexer doesn't copy the text; it just holds a reference. The AST will also borrow from the lexer or the input, keeping memory usage low.

The next_token method implements the state machine. It skips whitespace, checks the current character, and accumulates digits for numbers. We accumulate mathematically to avoid allocating a String for every number, which keeps the lexer fast.

impl<'a> Lexer<'a> {
    fn next_token(&mut self) -> Token {
        // Skip whitespace by advancing only on space characters.
        while let Some(&ch) = self.chars.peek() {
            if ch.is_whitespace() { self.chars.next(); }
            else { break; }
        }

        match self.chars.peek() {
            Some(&ch) if ch.is_ascii_digit() => {
                // Accumulate digits mathematically to avoid String allocation.
                let mut n = 0;
                while let Some(&ch) = self.chars.peek() {
                    if ch.is_ascii_digit() {
                        n = n * 10 + (ch as i32 - '0' as i32);
                        self.chars.next();
                    } else { break; }
                }
                Token::Number(n)
            }
            Some(&'+') => { self.chars.next(); Token::Plus }
            Some(&'-') => { self.chars.next(); Token::Minus }
            _ => Token::Eof,
        }
    }
}

Trust Peekable. It saves you from buffer headaches and makes the lexer logic clean. The community convention is to keep the lexer simple and dumb. If you find yourself writing complex logic here, you're probably doing the parser's job.

Building the parser

The parser consumes tokens and builds the AST. We'll use a recursive descent parser, which mirrors the grammar rules directly in code. Each grammar rule becomes a function. If the grammar says Expr -> Term { (+|-) Term }, the expr function calls term, then loops while it sees + or -.

The AST uses Box to allocate child nodes on the heap. This keeps the enum size fixed. Without Box, the compiler would need to know the size of Expr recursively, which is impossible. Box adds a pointer, so the enum size is just the size of the largest variant's header plus a pointer.

Here's the parser structure. It holds the lexer and the current token. The advance method fetches the next token, keeping the parser stateful.

#[derive(Debug)]
enum Expr {
    Num(i32),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
}

struct Parser<'a> {
    lexer: Lexer<'a>,
    current_token: Token,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        let mut lexer = Lexer::new(input);
        // Prime the parser by reading the first token.
        let current_token = lexer.next_token();
        Parser { lexer, current_token }
    }

    fn advance(&mut self) {
        // Fetch the next token from the lexer to move forward.
        self.current_token = self.lexer.next_token();
    }
}

The parsing functions follow the grammar. expr handles addition and subtraction. term handles numbers. The expr function is left-recursive in the grammar sense, but we implement it iteratively to avoid stack overflow. We parse the left side, then loop while we see operators, consuming the right side and building the tree.

impl<'a> Parser<'a> {
    fn parse(&mut self) -> Result<Expr, String> {
        let expr = self.expr()?;
        // Ensure the input is fully consumed.
        if self.current_token != Token::Eof {
            return Err("Unexpected token after expression".to_string());
        }
        Ok(expr)
    }

    fn expr(&mut self) -> Result<Expr, String> {
        // Parse the left operand first.
        let mut left = self.term()?;

        // Loop to handle chained operators like 1 + 2 - 3.
        while self.current_token == Token::Plus || self.current_token == Token::Minus {
            let op = self.current_token;
            self.advance();
            let right = self.term()?;

            // Build the tree node based on the operator.
            left = match op {
                Token::Plus => Expr::Add(Box::new(left), Box::new(right)),
                Token::Minus => Expr::Sub(Box::new(left), Box::new(right)),
                _ => unreachable!("Loop condition guarantees Plus or Minus"),
            };
        }
        Ok(left)
    }

    fn term(&mut self) -> Result<Expr, String> {
        match self.current_token {
            Token::Number(n) => {
                self.advance();
                Ok(Expr::Num(n))
            }
            _ => Err(format!("Expected number, found {:?}", self.current_token)),
        }
    }
}

Left recursion kills your stack. If you write expr to call expr at the start of the function, you get infinite recursion and a stack overflow. Structure your grammar to pull, not push. Parse the base case first, then loop for operators. This pattern is essential for recursive descent parsers.

Pitfalls and compiler errors

Parsers expose several Rust-specific traps. The most common is borrowing. If your lexer holds a reference to the input, and your AST holds references to the lexer, you can get tangled lifetimes. The compiler will reject code with E0502 (cannot borrow as mutable because it is also borrowed as immutable) if you try to mutate the parser while holding a reference to the AST.

Another trap is unwrap. The lexer uses parse().unwrap() for numbers. If the input contains a number larger than i32::MAX, the program panics. In production code, use parse::<i32>().map_err(...) to return a proper error. The parser returns Result<Expr, String>, which is good, but the lexer should also propagate errors instead of panicking.

If you forget to derive Copy on Token, you'll hit E0382 when the parser tries to use self.current_token after moving it into a match arm. The compiler enforces ownership strictly. Deriving Copy on small enums like tokens is a convention that saves you from these errors.

Convention aside: cargo fmt formats every file the same way. Don't argue style; argue logic. Parsers are complex enough without fighting indentation.

When to use what

Parsing strategies vary by complexity. Pick the tool that matches your grammar's needs.

Use recursive descent for simple grammars where you want full control and zero dependencies. It's easy to debug, fast, and integrates naturally with Rust's ownership model. Reach for nom when you prefer combinator-based parsing and want rich error recovery without writing boilerplate. nom lets you compose small parsers into larger ones, which scales well for moderate complexity. Pick lalrpop or pest for complex grammars with operator precedence or when you want a generated parser from a grammar file. These tools handle precedence and ambiguity automatically, saving you from manual tree construction.

Counter-intuitive but true: the more you use unsafe in a parser, the harder the rest of your code becomes to reason about. Parsers are pure logic. You rarely need unsafe. If you do, isolate it in a tiny helper and write a // SAFETY: comment that lists the invariants. Treat the SAFETY comment as a proof. If you can't write it, you don't have one.

Where to go next

A custom parser is a program that reads text input and breaks it down into meaningful parts, like turning a math equation into a structure a computer can calculate. You build one by first splitting the text into small pieces (tokens) and then arranging those pieces according to specific rules to create a final result. It's like a translator converting a sentence into a diagram that shows exactly how the words relate to each other.