Expand ORC writer capabilities #84

Open
mrxiad wants to merge 2 commits into datafusion-contrib:main from mrxiad:feat/writer-compression-row-index

mrxiad commented May 6, 2026

Summary

  • write file-level and stripe-level column statistics for ORC writer output
  • add writer support for Decimal128, Date32, Timestamp, and Timestamp(UTC)
  • add opt-in writer bloom filter streams for row-indexed primitive columns
  • add AsyncArrowWriter / AsyncArrowWriterBuilder for async sinks
  • update README writer capability and roadmap wording

Tests

  • cargo fmt --check
  • cargo test -q
  • cargo clippy -q --all-targets --all-features -- -D warnings

Known limitations

  • complete row-index seek positions are not implemented yet; row-index entries currently carry statistics but leave their positions empty
  • AsyncArrowWriter currently buffers ORC bytes internally and writes them to the AsyncWrite sink on finish; it provides an async sink API, not streaming writes with backpressure

mrxiad changed the title from "Add writer compression and row indexes" to "Expand ORC writer capabilities" May 6, 2026
WenyXu requested a review from Copilot May 7, 2026 02:38
Contributor

Copilot AI left a comment

Pull request overview

This PR expands the ORC writer to emit richer metadata (statistics, row indexes, bloom filters), adds support for additional Arrow logical types (decimal/date/timestamp), introduces an async writer API for async sinks, and updates compression handling to include writer-side stream compression.

Changes:

  • Add file-level and stripe-level column statistics plus optional row indexes and bloom filter streams in writer output.
  • Add writer support for Decimal128, Date32, and Timestamp (with UTC handling in schema).
  • Introduce AsyncArrowWriter / AsyncArrowWriterBuilder and document updated writer capabilities in the README.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/writer/stripe.rs | Tracks per-stripe stats and optionally writes row-index + bloom-filter streams; applies writer compression to streams and stripe footer. |
| src/writer/mod.rs | Adds stream kinds for row index and bloom filter UTF8 streams; exposes writer index module internally. |
| src/writer/index.rs | New builders for row indexes, bloom filters, and column statistics aggregation. |
| src/writer/column.rs | Adds encoders for Decimal128, Date32, and Timestamp; adds timestamp encoding helpers and varint zigzag write support. |
| src/lib.rs | Exposes the new async writer module and re-exports async writer types behind the async feature. |
| src/encoding/integer/mod.rs | Re-exports write_varint_zigzagged for new decimal encoding. |
| src/compression.rs | Adds writer compression API (WriterCompression) and compress_stream; expands zlib decompressor compatibility; adds tests. |
| src/bloom_filter.rs | Makes bloom filter construction internal and adds protobuf serialization helper for UTF8 bitset format. |
| src/async_arrow_writer.rs | New async writer wrapper that buffers ORC bytes and writes them to an AsyncWrite sink on finish/close. |
| src/arrow_writer.rs | Adds writer options (compression, row index stride, bloom filters), writes metadata+footer with compression, and emits file/stripe statistics. |
| README.md | Updates feature list and roadmap to reflect writer support and remaining limitations. |
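
For readers unfamiliar with the "varint zigzag" encoding that the file summary mentions for decimal values, here is a minimal sketch of the general technique. It is not the crate's `write_varint_zigzagged`; the function names below are hypothetical, and it assumes the usual scheme of zigzag-mapping a signed value and then emitting it as a base-128 varint, least-significant group first.

```rust
/// Zigzag-map a signed integer so that small magnitudes (positive or
/// negative) become small unsigned values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3.
/// (Hypothetical helper, shown for i128 because Decimal128 is 128 bits wide.)
fn zigzag_encode(value: i128) -> u128 {
    // The arithmetic right shift yields all ones for negatives, zero otherwise.
    ((value << 1) ^ (value >> 127)) as u128
}

/// Append `value` as a base-128 varint: 7 payload bits per byte,
/// high bit set on every byte except the last.
fn write_varint(out: &mut Vec<u8>, mut value: u128) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Zigzag, then varint-encode, a signed value (illustrative name only).
fn write_varint_zigzag_sketch(out: &mut Vec<u8>, value: i128) {
    write_varint(out, zigzag_encode(value));
}
```

With this scheme, `-1` occupies a single byte, while a plain two's-complement varint would need the maximum number of bytes for any negative value.
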

Comment thread src/writer/index.rs
Comment on lines +226 to +233
pub(crate) fn update_array(&mut self, data_type: &ArrowDataType, array: &ArrayRef) {
    for row_index in 0..array.len() {
        self.update(data_type, array, row_index);
    }
}

fn update(&mut self, data_type: &ArrowDataType, array: &ArrayRef, row_index: usize) {
    if array.is_null(row_index) {
Comment thread src/writer/index.rs
Comment on lines +564 to +571
self.minimum = Some(self.minimum.as_deref().map_or_else(
    || value.to_string(),
    |minimum| minimum.min(value).to_string(),
));
self.maximum = Some(self.maximum.as_deref().map_or_else(
    || value.to_string(),
    |maximum| maximum.max(value).to_string(),
));
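
The min/max update in the excerpt above calls `to_string()` for every value, including when the stored bound does not change. One possible variation, sketched below with a hypothetical `StringStatsSketch` type that is not part of the PR, allocates only when a new minimum or maximum is actually found. It assumes the same ordering as the excerpt, namely Rust's byte-wise `str` comparison.

```rust
/// Hypothetical stand-in for a string column-statistics accumulator.
#[derive(Debug, Default)]
struct StringStatsSketch {
    minimum: Option<String>,
    maximum: Option<String>,
}

impl StringStatsSketch {
    /// Fold one non-null value into the running bounds.
    fn update(&mut self, value: &str) {
        // Replace the minimum only if none is stored yet or `value` is smaller.
        let new_min = self.minimum.as_deref().map_or(true, |cur| value < cur);
        if new_min {
            self.minimum = Some(value.to_string());
        }
        // Replace the maximum only if none is stored yet or `value` is larger.
        let new_max = self.maximum.as_deref().map_or(true, |cur| value > cur);
        if new_max {
            self.maximum = Some(value.to_string());
        }
    }
}
```

The resulting bounds are the same as in the excerpt; only the number of allocations differs, which matters mostly for long columns whose bounds settle early.
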
Comment thread src/writer/index.rs
Comment on lines +112 to +121
for row_index in 0..array.len() {
    if self.rows_in_current_group == self.rows_per_group {
        self.finish_current_group();
    }

    if !array.is_null(row_index) {
        if let Some(hash) = bloom_hash(&self.data_type, array, row_index) {
            self.current.add_hash(hash);
        }
    }
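
The excerpt above is cut off before the end of the loop, so the following is only a sketch of the row-group splitting pattern it appears to follow. It makes three assumptions: the row counter advances once per row (nulls included), a trailing partial group is flushed at the end, and a plain `Vec<u64>` of hashes stands in for the real bloom filter. The type and method names are hypothetical.

```rust
/// Hypothetical builder that collects value hashes per row group.
struct GroupedHashesSketch {
    rows_per_group: usize,
    rows_in_current_group: usize,
    current: Vec<u64>,
    finished: Vec<Vec<u64>>,
}

impl GroupedHashesSketch {
    fn new(rows_per_group: usize) -> Self {
        assert!(rows_per_group > 0, "row group size must be positive");
        Self {
            rows_per_group,
            rows_in_current_group: 0,
            current: Vec::new(),
            finished: Vec::new(),
        }
    }

    /// Feed a batch of rows; `None` models a null row with no hash.
    fn update(&mut self, rows: &[Option<u64>]) {
        for row in rows {
            // Close the group lazily, just before the first row of the next one.
            if self.rows_in_current_group == self.rows_per_group {
                self.finish_current_group();
            }
            if let Some(hash) = row {
                self.current.push(*hash);
            }
            // Assumption: null rows still count toward the group size.
            self.rows_in_current_group += 1;
        }
    }

    fn finish_current_group(&mut self) {
        self.finished.push(std::mem::take(&mut self.current));
        self.rows_in_current_group = 0;
    }

    /// Flush any partial trailing group and return all groups.
    fn finish(mut self) -> Vec<Vec<u64>> {
        if self.rows_in_current_group > 0 {
            self.finish_current_group();
        }
        self.finished
    }
}
```

Closing a group lazily, rather than immediately after its last row, avoids emitting an empty trailing group when the total row count is an exact multiple of the group size.
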