Expand ORC writer capabilities#84
Open
mrxiad wants to merge 2 commits intodatafusion-contrib:mainfrom
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR expands the ORC writer to emit richer metadata (statistics, row indexes, bloom filters), adds support for additional Arrow logical types (decimal/date/timestamp), introduces an async writer API for async sinks, and updates compression handling to include writer-side stream compression.
Changes:
- Add file-level and stripe-level column statistics plus optional row indexes and bloom filter streams in writer output.
- Add writer support for Decimal128, Date32, and Timestamp (with UTC handling in schema).
- Introduce
AsyncArrowWriter/AsyncArrowWriterBuilderand document updated writer capabilities in the README.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/writer/stripe.rs | Tracks per-stripe stats and optionally writes row-index + bloom-filter streams; applies writer compression to streams and stripe footer. |
| src/writer/mod.rs | Adds stream kinds for row index and bloom filter UTF8 streams; exposes writer index module internally. |
| src/writer/index.rs | New builders for row indexes, bloom filters, and column statistics aggregation. |
| src/writer/column.rs | Adds encoders for Decimal128, Date32, and Timestamp; adds timestamp encoding helpers and varint zigzag write support. |
| src/lib.rs | Exposes the new async writer module and re-exports async writer types behind the async feature. |
| src/encoding/integer/mod.rs | Re-exports write_varint_zigzagged for new decimal encoding. |
| src/compression.rs | Adds writer compression API (WriterCompression) and compress_stream; expands zlib decompressor compatibility; adds tests. |
| src/bloom_filter.rs | Makes bloom filter construction internal and adds protobuf serialization helper for UTF8 bitset format. |
| src/async_arrow_writer.rs | New async writer wrapper that buffers ORC bytes and writes them to an AsyncWrite sink on finish/close. |
| src/arrow_writer.rs | Adds writer options (compression, row index stride, bloom filters), writes metadata+footer with compression, and emits file/stripe statistics. |
| README.md | Updates feature list and roadmap to reflect writer support and remaining limitations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+226
to
+233
| pub(crate) fn update_array(&mut self, data_type: &ArrowDataType, array: &ArrayRef) { | ||
| for row_index in 0..array.len() { | ||
| self.update(data_type, array, row_index); | ||
| } | ||
| } | ||
|
|
||
| fn update(&mut self, data_type: &ArrowDataType, array: &ArrayRef, row_index: usize) { | ||
| if array.is_null(row_index) { |
Comment on lines
+564
to
+571
| self.minimum = Some(self.minimum.as_deref().map_or_else( | ||
| || value.to_string(), | ||
| |minimum| minimum.min(value).to_string(), | ||
| )); | ||
| self.maximum = Some(self.maximum.as_deref().map_or_else( | ||
| || value.to_string(), | ||
| |maximum| maximum.max(value).to_string(), | ||
| )); |
Comment on lines
+112
to
+121
| for row_index in 0..array.len() { | ||
| if self.rows_in_current_group == self.rows_per_group { | ||
| self.finish_current_group(); | ||
| } | ||
|
|
||
| if !array.is_null(row_index) { | ||
| if let Some(hash) = bloom_hash(&self.data_type, array, row_index) { | ||
| self.current.add_hash(hash); | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Tests
Known limitations