datafusion-comet-shuffle: Shuffle Writer and Reader

This crate provides the shuffle writer and reader implementation for Apache DataFusion Comet and is maintained as part of the Apache DataFusion Comet subproject.

Shuffle Benchmark Tool

A standalone benchmark binary (shuffle_bench) is included for profiling shuffle write performance outside of Spark. It streams input data directly from Parquet files.

Basic usage

cargo run --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 \
  --codec lz4 \
  --hash-columns 0,3

Options

Option	Default	Description
`--input`	(required)	Path to a Parquet file or directory of Parquet files
`--partitions`	`200`	Number of output shuffle partitions
`--partitioning`	`hash`	Partitioning scheme: `hash`, `single`, `round-robin`
`--hash-columns`	`0`	Comma-separated column indices to hash on (e.g. `0,3`)
`--codec`	`lz4`	Compression codec: `none`, `lz4`, `zstd`, `snappy`
`--zstd-level`	`1`	Zstd compression level (1–22)
`--batch-size`	`8192`	Batch size for reading Parquet data
`--memory-limit`	(none)	Memory limit in bytes; triggers spilling when exceeded
`--write-buffer-size`	`1048576`	Write buffer size in bytes
`--limit`	`0`	Limit rows processed per iteration (0 = no limit)
`--iterations`	`1`	Number of timed iterations
`--warmup`	`0`	Number of warmup iterations before timing
`--output-dir`	`/tmp/comet_shuffle_bench`	Directory for temporary shuffle output files

Profiling with flamegraph

cargo flamegraph --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec lz4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datafusion-comet-shuffle: Shuffle Writer and Reader

Shuffle Benchmark Tool

Basic usage

Options

Profiling with flamegraph

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

datafusion-comet-shuffle: Shuffle Writer and Reader

Shuffle Benchmark Tool

Basic usage

Options

Profiling with flamegraph