How to Use Rust with Apache Arrow

Add the arrow crate to Cargo.toml and use its types to define schemas and create arrays for efficient in-memory data processing.

When rows slow you down

You are building a data tool. Maybe a CLI that processes CSV exports, or a backend that serves analytics to a dashboard. You started with JSON. It works for a few hundred records. Then the dataset grows to a million rows. Parsing JSON becomes the bottleneck. Your CPU spends more time deserializing text than doing actual work. You hear about Apache Arrow. It is the standard for columnar data in the data world. Pandas uses it. DuckDB uses it. Polars uses it. You want Rust to play nice with it. The arrow crate is your bridge. It lets you build columnar arrays in Rust that other languages can read without copying data.

The columnar shift

Arrow flips the script on how data lives in memory. Most databases and languages store data row by row. Think of a spreadsheet where each row is a complete record. If you have customers with names, ages, and emails, a row-based layout stores [Name, Age, Email, Name, Age, Email]. Arrow stores data column by column. It keeps all names together in one block, all ages in another, and all emails in a third.

This matters because of how CPUs work. When you calculate the average age, you only need the age column. In a row-based layout, the CPU loads the entire row into cache, including the name and email, just to grab the age. Most of that data is wasted bandwidth. In a columnar layout, the CPU loads only the ages. The data fits in cache. The calculation flies.

Arrow also defines a binary layout that is identical across languages. A Rust process can point to an Arrow buffer, and a Python process can read the exact same memory without translation. That is the zero-copy promise. No serialization. No deserialization. Just shared memory.

Minimal example: building arrays

Start by adding the arrow crate to your project. The crate provides the core types for arrays, buffers, and schemas.

[dependencies]
arrow = "53"

Arrow arrays are immutable. You build them once, then you read them. This immutability allows Arrow to share memory safely across threads and processes. You cannot modify an array in place. If you need to change data, you build a new array.

use arrow::array::{Int32Array, StringArray};

fn main() {
    // Create a column of integers.
    // Arrow arrays are immutable; this allocates memory and locks the data.
    let ids = Int32Array::from(vec![1, 2, 3]);

    // Create a column of strings.
    // Utf8 is the Arrow type for variable-length strings.
    let names = StringArray::from(vec!["Alice", "Bob", "Charlie"]);

    println!("IDs: {:?}", ids);
    println!("Names: {:?}", names);
}

The from method takes a vector and converts it into an Arrow array. Under the hood, Arrow allocates a contiguous block of memory for the values. It also creates a null bitmap. Even if your data has no nulls, the bitmap exists. If a value is missing, Arrow sets the corresponding bit in the bitmap to mark it as null. This keeps the data dense and predictable for the CPU.

Arrow arrays are immutable. Build them once, read them many times.

How Arrow stores data

When you inspect an Arrow array, you are looking at a wrapper around buffers. An Int32Array holds a data buffer containing the raw integers. A StringArray holds two buffers: one for the string data and one for offsets that mark where each string starts and ends. This offset buffer allows variable-length strings to live in a single contiguous block.

The array structure also tracks the length and the null count. If you create an array with from, the null count is zero. If you use from_iter with Option values, Arrow populates the null bitmap and counts the nulls automatically.

use arrow::array::Int32Array;

fn main() {
    // Handling nulls requires Options.
    // Arrow builds a bitmap to track which positions are null.
    let ids_with_nulls = Int32Array::from_iter(vec![Some(1), None, Some(3)]);

    println!("Null count: {}", ids_with_nulls.null_count());
    println!("Is null at index 1: {}", ids_with_nulls.is_null(1));
}

The is_null method checks the bitmap. It is fast. No branching over values. Just a bit check. This design allows vectorized operations to skip nulls efficiently.

Realistic example: schemas and batches

Real code does not just have loose arrays. It has schemas and batches. A Schema defines the columns: their names, types, and nullability. A RecordBatch is like a table slice. It groups columns together with the same number of rows. Most Arrow operations work on RecordBatch objects.

use arrow::array::{Int32Array, RecordBatch, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use std::sync::Arc;

fn main() {
    // Define the structure.
    // The boolean flag controls nullability. False means the column cannot contain nulls.
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]);

    // Build columns matching the schema.
    let id_array = Int32Array::from(vec![1, 2, 3]);
    let name_array = StringArray::from(vec!["Alice", "Bob", "Charlie"]);

    // Combine into a batch.
    // RecordBatch requires Arc wrappers for shared ownership.
    let batch = RecordBatch::try_new(
        Arc::new(schema),
        vec![
            Arc::new(id_array),
            Arc::new(name_array),
        ],
    ).expect("Failed to create batch");

    println!("Rows: {}", batch.num_rows());
    println!("Columns: {}", batch.num_columns());
}

Arrow uses Arc for shared ownership of schemas and arrays. This is a convention. Arrow is designed for multi-threaded data processing. Even if your code is single-threaded, the API expects Arc. Using Arc allows multiple parts of your program to hold references to the same batch without copying the data.

The RecordBatch::try_new function returns a Result. It validates that the arrays match the schema. If you pass an Int32Array where the schema expects Utf8, the function returns an error. The compiler cannot catch this mismatch because the types are generic enough. You must handle the error at runtime.

Treat the schema as the contract. If the schema and arrays disagree, the batch creation fails.

Slicing and zero-copy

One of Arrow's superpowers is slicing. You can create a view into a subset of a batch without allocating new memory. This is how you paginate data or process chunks efficiently.

use arrow::array::{Int32Array, RecordBatch, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use std::sync::Arc;

fn main() {
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]);

    let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let name_array = StringArray::from(vec!["Alice", "Bob", "Charlie", "Diana", "Eve"]);

    let batch = RecordBatch::try_new(
        Arc::new(schema),
        vec![Arc::new(id_array), Arc::new(name_array)],
    ).expect("Failed to create batch");

    // Slice the batch.
    // This creates a new batch pointing to the same memory.
    // No data is copied. Only the offset and length change.
    let slice = batch.slice(1, 3);

    println!("Original rows: {}", batch.num_rows());
    println!("Slice rows: {}", slice.num_rows());
}

The slice method takes an offset and a length. It returns a new RecordBatch that shares the underlying buffers with the original. The slice has its own offset and length, but the data is the same. This operation is constant time. It does not depend on the size of the data.

Slice, don't copy. Let the CPU cache do the work.

Pitfalls and errors

Arrow is powerful, but it has traps. The most common issue is schema mismatch. If you build arrays and a schema separately, you must ensure they align. The compiler will not help you here. You will get an ArrowError::SchemaError at runtime if the types or nullability do not match.

Another pitfall is length mismatch. Every column in a RecordBatch must have the same number of rows. If one column has 100 rows and another has 99, RecordBatch::try_new fails. Check the lengths before creating the batch.

Null handling is another area where beginners stumble. The from method assumes no nulls. If your data contains missing values, from will panic or produce incorrect results. Use from_iter with Option values, or use the builder API.

use arrow::array::Int32Array;

fn main() {
    // This panics if the vector contains nulls, because from expects concrete values.
    // Use from_iter for optional data.
    let ids = Int32Array::from_iter(vec![Some(1), None, Some(3)]);
}

Performance pitfalls exist too. The from_iter method is convenient but allocates memory as it goes. For high-performance code, use the builder API. Builders let you reserve capacity upfront and append values one by one.

use arrow::array::Int32Builder;

fn main() {
    // Reserve capacity to avoid reallocations.
    let mut builder = Int32Builder::with_capacity(100);

    builder.append_value(1);
    builder.append_null();
    builder.append_value(3);

    // Finish returns the array.
    let array = builder.finish();
}

Builders are the tool for streaming data or constructing arrays in loops. They give you control over memory allocation.

Decision: when to use Arrow

Use the arrow crate when you need to process columnar data or interoperate with other data tools like DuckDB, Polars, or Python's Pandas. Use RecordBatch when you are grouping columns together for computation or passing data to an execution engine. Use ArrayRef (which is Arc<dyn Array>) when you need to store heterogeneous columns in a vector or return an array from a function where the concrete type is not known at compile time. Use ArrayBuilder when you are constructing arrays from streaming data or need to control memory allocation. Reach for arrow-csv or arrow-json when you need to parse text formats directly into Arrow arrays without building them manually.

Arrow is heavy. Do not use it for a single integer. Use it when the data volume justifies the columnar layout.

Where to go next