Introducing rustream: A Fast Postgres to Parquet Sync Tool in Rust
Hey there!
If you’ve read my previous post on building a cloud-native ETL with Rust, you’ll know we spent a lot of time wiring up Postgres reads, Parquet writes, and S3 uploads by hand. rustream is the next evolution of that work – a focused, open-source CLI tool that does one thing well: sync Postgres tables to Parquet.
Why rustream?
The original Dracula project taught us a lot, but it was tightly coupled to our specific tables and schemas. Every new table meant writing a new Rust struct, a Diesel schema macro, and a Parquet record writer. It worked, but it didn’t scale in terms of developer time.
rustream takes a different approach: it introspects your Postgres schema at runtime using information_schema, automatically maps column types to Arrow data types, and writes Parquet files without any per-table code. You just point it at your database and go.
The code is on GitHub: kraftaa/rustream
How It Works
The tool is a single Rust binary. You give it a YAML config file that specifies your Postgres connection and output target:
```yaml
postgres:
  host: localhost
  database: mydb
  user: postgres
  password: secret

output:
  type: local
  path: ./output

tables:
  - name: users
    incremental_column: updated_at
  - name: orders
    incremental_column: updated_at
    partition_by: date
  - name: products
```
Then run:
```bash
rustream sync --config config.yaml
```
That’s it. No Rust structs to define, no schema macros to generate, no compile step per table. The schema is figured out on the fly.
How the Schema Works (No More Diesel Boilerplate)
This is the big difference from Dracula. With Diesel, every table needed manual setup before you could even read from it:
- `diesel print-schema` to generate a `table!` macro
- A `#[derive(Queryable)]` struct with fields matching every column
- A separate `ParquetRecordWriter` implementation to map those fields to Parquet columns
- Rebuild and recompile every time you add a table
In rustream, all of this happens at runtime. When you run a sync, the tool connects to Postgres and queries information_schema.columns to discover every column, its type, and whether it’s nullable:
```sql
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = $1 AND table_name = $2
ORDER BY ordinal_position
```
Then it maps each Postgres type to an Arrow data type on the fly – integer becomes Int32, timestamptz becomes Timestamp(Microsecond, UTC), jsonb becomes Utf8 (stored as a JSON string), uuid becomes Utf8, and so on. From that mapping, it builds an Arrow Schema and reads rows directly into Arrow RecordBatches using tokio-postgres, converting each column value into the right Arrow array type.
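To make that concrete, here's a minimal sketch of what such a mapping can look like. The function name `map_pg_type` is my own illustration, not rustream's actual API:

```rust
use arrow::datatypes::{DataType, TimeUnit};

// Illustrative sketch: map a data_type string from information_schema
// to an Arrow DataType. Anything unrecognized falls back to Utf8.
fn map_pg_type(pg_type: &str) -> DataType {
    match pg_type {
        "smallint" => DataType::Int16,
        "integer" => DataType::Int32,
        "bigint" => DataType::Int64,
        "real" => DataType::Float32,
        "double precision" => DataType::Float64,
        "boolean" => DataType::Boolean,
        "date" => DataType::Date32,
        "timestamp with time zone" => {
            DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into()))
        }
        // text, jsonb, uuid, and unknown types are stored as strings
        _ => DataType::Utf8,
    }
}
```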
The result: you add three lines to a YAML file instead of writing hundreds of lines of Rust. And if a table gains new columns, the next sync picks them up automatically – no code changes, no recompile.
Key Features
Schema introspection – The tool queries information_schema.columns and maps Postgres types (int, text, timestamptz, jsonb, uuid, arrays, etc.) to Arrow types automatically. Unknown types fall back to Utf8 so nothing breaks.
Incremental sync – Set an incremental_column (like updated_at) and the tool tracks a high watermark in a local SQLite database. On each run it only reads rows newer than the last sync.
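As a rough sketch of how a watermark store like that can work (the table layout and function names here are my own illustration, not rustream's internals):

```rust
use rusqlite::{Connection, OptionalExtension, params};

// Illustrative watermark store: one row per table in a local SQLite file.
fn load_watermark(conn: &Connection, table: &str) -> rusqlite::Result<Option<String>> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermarks (table_name TEXT PRIMARY KEY, value TEXT)",
        [],
    )?;
    conn.query_row(
        "SELECT value FROM watermarks WHERE table_name = ?1",
        params![table],
        |row| row.get(0),
    )
    .optional()
}

// After a successful sync, record the highest incremental_column value seen.
fn save_watermark(conn: &Connection, table: &str, value: &str) -> rusqlite::Result<()> {
    conn.execute(
        "INSERT INTO watermarks (table_name, value) VALUES (?1, ?2)
         ON CONFLICT(table_name) DO UPDATE SET value = excluded.value",
        params![table, value],
    )?;
    Ok(())
}
```

The read query then just filters on the stored value, along the lines of `SELECT * FROM users WHERE updated_at > $watermark`.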
Partitioned output – Partition Parquet files by date, month, or year. An orders table with partition_by: date produces paths like orders/year=2026/month=02/day=11/...parquet.
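A Hive-style partition path like that is cheap to derive from the partition column's timestamp. A sketch, with a hypothetical helper name:

```rust
use chrono::{DateTime, Datelike, Utc};

// Hypothetical helper: build a Hive-style partition path for one file,
// e.g. orders/year=2026/month=02/day=11/part-00000.parquet
fn partition_path(table: &str, ts: DateTime<Utc>, file_name: &str) -> String {
    format!(
        "{}/year={:04}/month={:02}/day={:02}/{}",
        table,
        ts.year(),
        ts.month(),
        ts.day(),
        file_name
    )
}
```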
S3 support – Switch the output to S3 by changing a few lines in the config:
```yaml
output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1
```
AWS credentials come from the standard chain (env vars, ~/.aws/credentials, or IAM role).
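The upload itself boils down to a single `put_object` call with the async AWS SDK. A minimal sketch (the function name is mine and error handling is simplified):

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

// Illustrative sketch: upload a serialized Parquet buffer to S3.
// Credentials resolve through the SDK's standard provider chain.
async fn upload_parquet(bucket: &str, key: &str, bytes: Vec<u8>) -> Result<(), aws_sdk_s3::Error> {
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(bytes))
        .send()
        .await?;
    Ok(())
}
```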
Auto-discovery – Don’t want to list tables explicitly? Remove the tables key and set schema: public. The tool discovers all tables in the schema and syncs them, with an exclude list for tables you want to skip.
What Changed from the Original Dracula
The biggest shift is moving from Diesel ORM + per-table structs to tokio-postgres + Arrow. In the original project, adding a table meant:
- Run `diesel print-schema` to generate the table macro
- Write a `Queryable` struct matching the schema
- Write a `ParquetRecordWriter` struct for the output
- Wire up the read-transform-write pipeline
In rustream, adding a table means adding three lines to the YAML config. The runtime handles the rest.
We also switched from synchronous Diesel to fully async tokio-postgres, which plays better with the async S3 SDK and lets us handle I/O concurrently.
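For a feel of the async side, here's a minimal tokio-postgres example running the introspection query from earlier (connection string values are placeholders):

```rust
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // Connect; the connection future drives the socket in the background.
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres dbname=mydb", NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // Discover a table's columns, the same query rustream runs at startup.
    let rows = client
        .query(
            "SELECT column_name, data_type, is_nullable \
             FROM information_schema.columns \
             WHERE table_schema = $1 AND table_name = $2 \
             ORDER BY ordinal_position",
            &[&"public", &"users"],
        )
        .await?;
    for row in rows {
        let name: &str = row.get(0);
        let dtype: &str = row.get(1);
        println!("{name}: {dtype}");
    }
    Ok(())
}
```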
Architecture
The data flow is straightforward:
```
Postgres
  -> reader::read_batch()      -- SQL query with watermark filter
  -> Arrow RecordBatch         -- in-memory columnar data
  -> writer::write_parquet()   -- Snappy-compressed Parquet buffer
  -> output::write_output()    -- local filesystem or S3
```
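The writer step in isolation is small. Here's a self-contained sketch of the RecordBatch-to-Snappy-Parquet hop using the arrow and parquet crates; the batch is hand-built here instead of coming from Postgres:

```rust
use std::fs::File;
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A tiny RecordBatch standing in for rows read from Postgres.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("email", DataType::Utf8, true),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2])),
            Arc::new(StringArray::from(vec![Some("a@x.com"), None])),
        ],
    )?;

    // Serialize it as a Snappy-compressed Parquet file.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let file = File::create("users.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```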
The codebase is small – about 8 Rust source files:
- `config.rs` – YAML config parsing with serde
- `schema.rs` – Postgres schema introspection
- `types.rs` – Postgres-to-Arrow type mapping
- `reader.rs` – Batch reading from Postgres
- `writer.rs` – Parquet serialization
- `output.rs` – Local/S3 output abstraction
- `state.rs` – SQLite watermark tracking
- `sync.rs` – Orchestration loop
Coming Soon: Apache Iceberg Support
I’m currently working on Iceberg output support on a feature branch. Instead of writing standalone Parquet files, rustream will be able to write proper Iceberg tables with metadata JSON, manifests, and snapshots. This means query engines like Athena, Trino, and Spark can read the data as a real table with time travel and schema evolution.
The config will look like:
```yaml
format: iceberg
warehouse: s3://my-bucket/warehouse
catalog:
  type: filesystem
```
Under the hood, I’m using the official Apache Iceberg Rust crate (v0.8) with its MemoryCatalog for filesystem-based catalogs, and optional AWS Glue catalog support for Athena users.
Installation
You can install directly from PyPI – it ships as a pre-built Python wheel via maturin, so you get a native Rust binary without needing a Rust toolchain:
```bash
pip install rustream
```
Or if you prefer isolated tool installs:
```bash
pipx install rustream
```
Or build from source:
```bash
git clone https://github.com/kraftaa/rustream.git
cd rustream
cargo build --release
```
Also Check Out: TransformDash
If you’re interested in the transformation side of the pipeline, I also built TransformDash – a lightweight alternative to dbt that runs SQL transformations directly against your Postgres or MongoDB without needing a dedicated data warehouse. It supports dbt-style macros and DAG resolution, and has a built-in dashboard for quick data checks.
It’s also available on PyPI:
```bash
pip install transformdash
```
You can read more about it in my earlier post.
Try It Out
If you’re syncing Postgres tables to a data lake and want something simpler than Airbyte or Fivetran, give rustream a try. It’s lightweight and fast, and since the wheel ships a native binary, nothing at runtime depends on a JVM or the Python interpreter.
```bash
# Preview what will be synced
rustream sync --config config.yaml --dry-run

# Run the sync
rustream sync --config config.yaml
```
Check out the repo: github.com/kraftaa/rustream
Questions or feedback? Open an issue on GitHub.