# datafusion-comet-shuffle: Shuffle Writer and Reader

This crate provides the shuffle writer and reader implementation for Apache DataFusion Comet and is maintained as part of the Apache DataFusion Comet subproject.

## Shuffle Benchmark Tool

A standalone benchmark binary (shuffle_bench) is included for profiling shuffle write performance outside of Spark. It streams input data directly from Parquet files.
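To illustrate the hash-partitioning step that the benchmark exercises, here is a minimal, self-contained sketch using only the Rust standard library. It assigns each row to a partition by hashing the selected key columns, mirroring the idea behind `--hash-columns` and `--partitions`. Note this is a toy row-oriented model with `DefaultHasher`; the real implementation operates on Arrow batches and uses a Spark-compatible hash, so the partition assignments shown here will not match Comet's.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign each row to a shuffle partition by hashing the values of the
/// selected key columns. `rows` is a toy row-oriented table; key_cols are
/// the column indices to hash on (cf. `--hash-columns 0,3`).
fn hash_partition(rows: &[Vec<i64>], key_cols: &[usize], partitions: u64) -> Vec<u64> {
    rows.iter()
        .map(|row| {
            let mut h = DefaultHasher::new();
            for &c in key_cols {
                row[c].hash(&mut h);
            }
            h.finish() % partitions
        })
        .collect()
}

fn main() {
    let rows = vec![vec![1i64, 10], vec![2, 20], vec![1, 30]];
    let p = hash_partition(&rows, &[0], 200);
    // Rows with equal key values must land in the same partition.
    assert_eq!(p[0], p[2]);
    println!("partition assignments: {:?}", p);
}
```

The key invariant, regardless of the hash function used, is that rows with equal key values always map to the same partition, so downstream readers can process each partition independently.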

### Basic usage

```shell
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 \
  --codec lz4 \
  --hash-columns 0,3
```

### Options

| Option | Default | Description |
| --- | --- | --- |
| `--input` | (required) | Path to a Parquet file or directory of Parquet files |
| `--partitions` | 200 | Number of output shuffle partitions |
| `--partitioning` | hash | Partitioning scheme: `hash`, `single`, `round-robin` |
| `--hash-columns` | 0 | Comma-separated column indices to hash on (e.g. `0,3`) |
| `--codec` | lz4 | Compression codec: `none`, `lz4`, `zstd`, `snappy` |
| `--zstd-level` | 1 | Zstd compression level (1–22) |
| `--batch-size` | 8192 | Batch size for reading Parquet data |
| `--memory-limit` | (none) | Memory limit in bytes; triggers spilling when exceeded |
| `--write-buffer-size` | 1048576 | Write buffer size in bytes |
| `--limit` | 0 | Limit on rows processed per iteration (0 = no limit) |
| `--iterations` | 1 | Number of timed iterations |
| `--warmup` | 0 | Number of warmup iterations before timing |
| `--output-dir` | /tmp/comet_shuffle_bench | Directory for temporary shuffle output files |
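As an illustrative combination of the options above (the input path and memory limit are placeholders, not recommended values), a zstd run that warms up once, times three iterations, and caps memory to exercise spilling might look like:

```shell
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 \
  --partitioning hash --hash-columns 0,3 \
  --codec zstd --zstd-level 3 \
  --memory-limit 1073741824 \
  --warmup 1 --iterations 3
```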

### Profiling with flamegraph

```shell
cargo flamegraph --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec lz4
```