The spreadsheet that outgrew Excel
You have a two gigabyte file full of sensor readings. Opening it in a spreadsheet crashes your browser. Parsing it as CSV takes forty seconds and eats your RAM. The file is already compressed and split by column. You just need to stream it into memory, process the numbers, and move on.
Parquet exists for exactly this scenario. It is the standard format for columnar data storage in data engineering. Rust does not include a built-in Parquet parser in its standard library. The ecosystem relies on two crates that work together: parquet handles the file format, compression, and page decoding. arrow provides the in-memory representation that receives the decoded data.
Why Parquet and Arrow travel together
Parquet stores data vertically instead of horizontally. Think of a traditional spreadsheet as a stack of index cards. Each card holds one complete row. Parquet flips the stack. It stores every temperature reading in one pile, every timestamp in another, and every sensor ID in a third. This layout lets you read only the columns you actually need. It also compresses numbers dramatically because identical types sit next to each other.
Arrow is the in-memory tray that holds those columns. The Arrow project defines a standard memory layout for tabular data. Rust, Python, Java, and C++ all speak the same Arrow dialect. When you read a Parquet file in Rust, you are not getting back a Vec<Vec<String>>. You are getting back an Arrow RecordBatch. That batch is a zero-copy view into contiguous memory. Different parts of your program can read the same batch without triggering allocation or data movement.
The two crates are version-locked by design. The parquet crate is built on top of the arrow crate. If you mix version 51 of parquet with version 49 of arrow, the compiler will reject you with trait bound errors. The community convention is to always pin both crates to the exact same major version. Check the arrow changelog first, then mirror that number in parquet.
The minimal streaming reader
Add both crates to your manifest. The builder pattern handles the heavy lifting of schema parsing, metadata extraction, and memory allocation.
[dependencies]
parquet = "51"
arrow = "51"
use std::fs::File;
use parquet::arrow::ParquetRecordBatchReaderBuilder;
/// Streams a Parquet file into Arrow RecordBatches and prints row counts.
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Open the file. The parquet crate takes ownership of the file handle
// so it can read metadata pages without seeking back to the start.
let file = File::open("data.parquet")?;
// The builder reads the footer, validates the schema, and prepares
// the decoding pipeline. This step is cheap and happens once.
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
// Build the actual reader. It holds the file cursor and the schema.
// The reader implements Iterator, which enables lazy evaluation.
let mut reader = builder.build()?;
// Pull batches one at a time. The loop stops when the file ends.
// Each batch contains a configurable number of rows (default 1024).
while let Some(batch_result) = reader.next() {
let batch = batch_result?;
println!("Read {} rows", batch.num_rows());
}
Ok(())
}
What happens when you call next
The while let loop drives the entire pipeline. Calling reader.next() does not load the whole file. It advances the internal cursor, decompresses the next page of column data, and decodes it into Arrow arrays. The RecordBatch struct holds references to those arrays. When the batch goes out of scope at the end of the loop iteration, the memory is immediately reclaimed. The next iteration allocates fresh space for the next chunk of rows.
This streaming behavior is intentional. Parquet files can be terabytes in size. Loading everything into a single Vec defeats the purpose of the format. The builder pattern separates configuration from execution. try_new parses the file footer, which lives at the end of the file. The footer contains the schema, row group boundaries, and compression hints. Once the builder has that metadata, build() hands you an iterator that knows exactly how many bytes to read for each column.
Arrow arrays are immutable by design. You cannot push a new value into a RecordBatch after it is created. If you need to mutate data, you convert the batch to a mutable buffer, modify it, and construct a new batch. This restriction eliminates data races at the type level. Multiple threads can read the same batch simultaneously without locks.
Inspecting the schema before you commit
Real data rarely arrives perfectly clean. You often need to verify column names, check data types, or filter out empty groups before processing. The builder exposes the schema before you commit to reading.
use std::fs::File;
use parquet::arrow::ParquetRecordBatchReaderBuilder;
use arrow::datatypes::DataType;
/// Reads a Parquet file only if it contains a specific numeric column.
fn process_if_valid() -> Result<(), Box<dyn std::error::Error>> {
let file = File::open("data.parquet")?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
// Extract the schema to inspect field names and types.
// The schema is borrowed from the builder, so it lives as long as builder.
let schema = builder.schema();
// Find the index of the column we actually care about.
let temperature_idx = schema
.fields()
.iter()
.position(|f| f.name() == "temperature")
.ok_or("Missing temperature column")?;
// Verify the type matches what our math expects.
let field = &schema.fields()[temperature_idx];
if !matches!(field.data_type(), DataType::Float64 | DataType::Float32) {
return Err("Temperature must be a float type".into());
}
// Build the reader and stream only the rows we need.
let mut reader = builder.build()?;
while let Some(Ok(batch)) = reader.next() {
let col = batch.column(temperature_idx);
// Process the array here. Arrow provides typed accessor methods.
}
Ok(())
}
The schema inspection happens before any row data is decoded. This keeps the validation step fast. You can also pass a Projection to the builder if you only want specific columns. The reader will skip decoding the rest entirely. Skipping columns saves CPU cycles and reduces memory pressure.
When the compiler or runtime pushes back
Parquet and Arrow are strictly typed. The compiler will catch mismatches early, but the error messages can feel dense if you are new to the ecosystem.
If you try to print an Arrow array directly, the compiler rejects you with E0277 (trait bound not satisfied). Arrow arrays do not implement Display because formatting a million floating point numbers to stdout is almost never what you want. Convert the array to a Rust type first, or use the arrow::util::pretty::print_batches helper for debugging.
If you mix crate versions, you will see E0277 or E0308 (mismatched types) pointing at trait implementations that simply do not exist. The parquet crate re-exports arrow types. Importing arrow::RecordBatch while using parquet::arrow::RecordBatch from a different version creates two distinct types in the compiler's eyes. Stick to one version number across both dependencies.
Runtime errors usually come from malformed files. A truncated Parquet file triggers a ParquetError during try_new or next. The error type implements std::error::Error, so you can chain it with ? or match on it. Never unwrap file I/O errors in production code. Network mounts and cloud storage backends fail intermittently. Handle the Result explicitly.
Choosing your reading strategy
Use ParquetRecordBatchReaderBuilder when you need a simple, synchronous iterator over a local file. Use ParquetRecordBatchStream when you are reading from a tokio or async-std compatible source like S3 or HTTP. Use arrow::json or csv crates when your data is smaller than fifty megabytes and you do not need columnar compression. Use parquet::arrow::async_reader when you are building a high-throughput pipeline that overlaps I/O with CPU decoding. Use plain std::fs and manual parsing only when you are reading a few kilobytes of configuration data and want to avoid external dependencies.