feat(writer): add SNAPPY/ZLIB/ZSTD compression support #82
Open
youichi-uda wants to merge 6 commits into datafusion-contrib:main from
Conversation
Implements the ORC v1 spec's per-chunk compression framing (https://orc.apache.org/specification/ORCv1/#compression) at the writer side: a 3-byte little-endian header per chunk encoding the chunk's payload length and an "is original (uncompressed)" flag (sketched after this message). When the codec output is no smaller than the input, the chunk is emitted in its original form with the flag bit set, matching the reference Java writer's behaviour (`org.apache.orc.impl.PhysicalFsWriter`).

The new `Compression` enum exposes:
- `Compression::None` (default — byte-identical to pre-feature output)
- `Compression::Snappy`
- `Compression::Zlib { level }` (raw DEFLATE; defaults to level 6)
- `Compression::Zstd { level }` (defaults to level 3)

Convenience constructors `Compression::zlib()` / `Compression::zstd()` yield the spec-default-level variants, and the canonical defaults are also exposed as `DEFAULT_ZLIB_LEVEL` / `DEFAULT_ZSTD_LEVEL` / `DEFAULT_COMPRESSION_BLOCK_SIZE` (256 KiB) so call sites can derive their own configuration without duplicating constants. Two new error variants (`SnappyEncode`, `ZstdEncode`) surface codec failures distinctly from generic I/O errors.

11 unit tests cover header round-trips, the spec's known-answer encoding (5-byte / 100 000-byte cases that match the existing reader test), per-codec round-trips through the matching reader codec, the "compression expanded the chunk → fall back to original" invariant across all three codecs, multi-chunk splitting, and the debug-only panic guard against accidentally calling `compress_stream` with `Compression::None`.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
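For readers checking the framing by hand, a standalone sketch of the 3-byte header just described (not the PR's `write_header`; the formula is the spec's: payload length shifted left one bit, low bit carrying the "original" flag, stored little-endian):

```rust
/// Sketch of the ORC chunk header: value = (payload_len << 1) | is_original,
/// emitted as 3 little-endian bytes.
fn encode_chunk_header(payload_len: usize, is_original: bool) -> [u8; 3] {
    // The length field is 23 bits wide; the low bit is the flag.
    assert!(payload_len < (1 << 23), "chunk length exceeds the 23-bit cap");
    let value = ((payload_len as u32) << 1) | is_original as u32;
    [value as u8, (value >> 8) as u8, (value >> 16) as u8]
}

fn main() {
    // The two known-answer cases cited above:
    assert_eq!(encode_chunk_header(5, true), [0x0b, 0x00, 0x00]);
    assert_eq!(encode_chunk_header(100_000, false), [0x40, 0x0d, 0x03]);
}
```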
Adds the public `with_compression(Compression)` and `with_compression_block_size(usize)` builder methods on `ArrowWriterBuilder`, and threads the configuration through to:
- every emitted column stream (Present / Data / Length / Secondary / DictionaryData), via `StripeWriter`
- the per-stripe footer
- the file footer

The PostScript now records both `compression` (`CompressionKind`) and `compression_block_size`, so any conformant ORC reader (Java ORC, DuckDB, Spark, orc-rust's own reader) decompresses the file correctly. The block-size field is omitted from the PostScript when the codec is `NONE` to preserve byte-identical output for the pre-feature default — verified by the `backward_compat_default_no_compression_byte_identical` integration test in the next commit.

Per the spec the PostScript itself is NEVER compressed (it lives at a fixed offset from EOF so readers can locate it without first knowing the codec); this is honoured by writing the PostScript through the raw inner writer rather than the compression wrapper. Block size is silently clamped to the spec's 23-bit ceiling (2^23 - 1 bytes) and zero is treated as "use the 256 KiB default", making the API impossible to misuse into a runtime panic (sketched below).

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
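The clamping rule from that last paragraph, as a sketch; `normalize_block_size` and `MAX_COMPRESSION_BLOCK_SIZE` are illustrative names (only `DEFAULT_COMPRESSION_BLOCK_SIZE` is named by the changeset):

```rust
const DEFAULT_COMPRESSION_BLOCK_SIZE: usize = 256 * 1024; // 256 KiB
const MAX_COMPRESSION_BLOCK_SIZE: usize = (1 << 23) - 1; // 23-bit header length field

/// Hypothetical helper: zero selects the default; anything above the
/// header's 23-bit length field is silently clamped, never panicking.
fn normalize_block_size(requested: usize) -> usize {
    match requested {
        0 => DEFAULT_COMPRESSION_BLOCK_SIZE,
        n => n.min(MAX_COMPRESSION_BLOCK_SIZE),
    }
}
```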
Adds tests/writer_compression.rs (16 integration tests) covering:
- Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default + level 19) on a mixed Int32 + Utf8 batch.
- PostScript inspection by parsing the file tail with `prost`: asserts CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that `compression_block_size` is populated (and respects the user's value or the documented 256 KiB default).
- Backward compatibility: `Compression::None` produces a byte-stream bit-identical to a builder built without any compression call, capturing the no-default-change invariant.
- Incompressible payload (xorshift bytes) still round-trips — exercises the spec-mandated "fall back to original chunk" code path through the public reader (a generator sketch follows this message).
- Tiny block size (4 KiB) over a multi-megabyte stream forces multiple compression chunks per stream and verifies the round-trip.
- API hardening: oversize block sizes are clamped under the 23-bit spec ceiling; zero falls back to the 256 KiB default.
- Multi-stripe writes with compression round-trip cleanly.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
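A hypothetical version of the incompressible-payload generator mentioned above (xorshift64; the test's actual code may differ):

```rust
/// Generate statistically incompressible bytes with xorshift64; every
/// codec should expand such input, forcing the "original chunk" path.
fn xorshift_bytes(n: usize, mut state: u64) -> Vec<u8> {
    assert_ne!(state, 0, "xorshift64 requires a non-zero seed");
    (0..n)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state as u8
        })
        .collect()
}
```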
Adds a Criterion benchmark comparing on-disk size and write time for None / Snappy / ZLIB (levels 1, 6, 9) / ZSTD (levels 1, 3, 9, 19) on a 10 000-row Int64 + Utf8 batch — a workload representative of production Hive / Trino tables. Each codec's resulting file size is printed alongside the benchmark so reviewers can sanity-check the size / speed trade-off without re-running the bench themselves.

Sample numbers from a single run on Apple silicon (debug rustc 1.95):

| codec | output_bytes | ratio | time |
|---|---|---|---|
| none | 246698 | 1.30x | 110 us |
| snappy | 59346 | 5.39x | 293 us |
| zlib_1 | 34318 | 9.32x | 375 us |
| zlib_6 | 32461 | 9.86x | 3.4 ms |
| zlib_9 | 32461 | 9.86x | 11.8 ms |
| zstd_1 | 12538 | 25.5x | 264 us |
| zstd_3 | 8834 | 36.2x | 314 us |
| zstd_9 | 12981 | 24.6x | 1.2 ms |
| zstd_19 | 4823 | 66.4x | 81 ms |

(zstd_3 and zstd_1 dominate the speed / ratio Pareto frontier on this workload; this matches Java ORC's choice of zstd level 3 as the default `orc.compress.zstd.level`.)

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
Cross-validate the writer-side compression work in this PR against
the Java reference implementation (Apache ORC 1.9.5 `orc-tools`):
- Rust writer → Java reader for all 3 codecs: `orc-tools meta`
parses the PostScript, reports `Compression: SNAPPY/ZLIB/ZSTD`
correctly, row count matches. `orc-tools data` decompresses and
decodes every stripe, emitting the exact rows we wrote.
- Java writer → Rust reader: a file produced by `orc-tools convert`
(default ZLIB) round-trips through the Rust reader with
byte-identical column values.
The tests are `#[ignore]`d and gated on the `ORC_TOOLS_JAR`
environment variable, so they don't break CI on machines without a
JDK. Run with:
export ORC_TOOLS_JAR=/path/to/orc-tools-<version>-uber.jar
cargo test --test java_interop -- --ignored
PR_BODY.md updated with the new evidence section and a runnable
reviewer recipe.
Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
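A skeleton of that gating pattern (hypothetical test body; only `ORC_TOOLS_JAR`, the `--ignored` invocation, and the `Compression: SNAPPY` meta line come from the commit message above):

```rust
use std::{env, process::Command};

#[test]
#[ignore] // needs a JDK; run: cargo test --test java_interop -- --ignored
fn rust_writer_is_readable_by_java_orc_tools() {
    let jar = env::var("ORC_TOOLS_JAR").expect("set ORC_TOOLS_JAR to run interop tests");
    // Hypothetical path to a file the Rust writer produced earlier in the test.
    let orc_file = "target/interop/snappy.orc";
    let out = Command::new("java")
        .args(["-jar", jar.as_str(), "meta", orc_file])
        .output()
        .expect("failed to launch java");
    let meta = String::from_utf8_lossy(&out.stdout);
    assert!(meta.contains("Compression: SNAPPY"), "unexpected meta: {meta}");
}
```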
- src/row_index.rs: drop redundant `.into_iter()` in `zip` (useless_conversion)
- src/encoding/integer/rle_v2/delta.rs: rewrite two manual counter loops with `.enumerate()` (explicit_counter_loop)
- tests/java_interop.rs: take `&Path` instead of `&PathBuf` in `run_meta` / `run_data` / `write_orc` (ptr_arg)

Required to keep CI green on Rust 1.95.0 with `-D warnings`.
Contributor
Pull request overview
Adds ORC writer-side per-chunk compression support (SNAPPY/ZLIB/ZSTD) to match the ORC v1 spec framing and align output with common Java ORC ecosystem defaults, including PostScript codec metadata and configurable block size.
Changes:
- Introduces a new writer compression module implementing ORC chunk framing + codec payload encoding with original-fallback behavior.
- Wires compression through `ArrowWriterBuilder`/`ArrowWriter` and `StripeWriter` for all streams, stripe footers, and the file footer (PostScript remains uncompressed).
- Adds extensive integration/unit tests (including optional Java `orc-tools` interop) and a Criterion benchmark for codec comparisons.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/writer_compression.rs` | End-to-end compression round-trip + PostScript assertions + API hardening tests |
| `tests/java_interop.rs` | Optional (`#[ignore]`) Java `orc-tools` cross-implementation validation tests |
| `src/writer/stripe.rs` | Applies optional compression to streams and stripe footer; introduces `StripeCompression` |
| `src/writer/mod.rs` | Registers the new writer compression module |
| `src/writer/compression.rs` | Implements ORC chunk framing and SNAPPY/ZLIB/ZSTD encoding + unit tests |
| `src/row_index.rs` | Minor iteration change (clippy-related) |
| `src/error.rs` | Adds writer-side encode error variants for snappy/zstd |
| `src/encoding/integer/rle_v2/delta.rs` | Test loop refactor (clippy-related) |
| `src/arrow_writer.rs` | Public compression API on builder; compresses file footer; writes PostScript compression metadata |
| `Cargo.toml` | Adds writer_compression benchmark target |
| `benches/writer_compression.rs` | Criterion benchmark comparing codecs and output size |
| `.github/PR_BODY.md` | Adds a detailed PR body template/documentation for this feature |
Comment on lines +164 to +172

```rust
let bytes_to_write = match self.compression {
    Some(StripeCompression {
        compression,
        block_size,
    }) => compress_stream(compression, block_size, &bytes)?,
    None => bytes.to_vec(),
};
let length = bytes_to_write.len();
self.writer.write_all(&bytes_to_write).context(IoSnafu)?;
```
Comment on lines +46 to +49

```rust
pub compression: Compression,
pub block_size: usize,
}
```
Comment on lines +59 to +62

```rust
/// Levels are clamped to each codec's valid range. ZSTD accepts negative
/// levels (faster than level 1, lower ratio) and levels above 19 require
/// the `zstd-long` distant-match window.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
```
Comment on lines +392 to +415

```rust
let chunk_lens: Vec<usize> = chunks
    .iter()
    .map(|(_, original, body)| if *original { body.len() } else { 1024 })
    .collect();
// We can't easily assert decoded chunk size when compressed
// (snappy may have shrunk one), but every chunk's *original*
// size must fit in [1, 1024]. Verify boundary: the first 4
// chunks each represented exactly 1024 bytes of input.
for (i, expected_input_size) in [1024, 1024, 1024, 1024, 904].iter().enumerate() {
    let (_, original, body) = chunks[i];
    let input_size = if original {
        body.len()
    } else {
        snap::raw::Decoder::new()
            .decompress_vec(body)
            .unwrap()
            .len()
    };
    assert_eq!(
        input_size, *expected_input_size,
        "chunk {i} wraps {expected_input_size} input bytes"
    );
    let _ = chunk_lens; // silence unused
}
```
feat(writer): add SNAPPY / ZLIB / ZSTD compression support
Problem
orc-rust's writer always emits uncompressed ORC files even though the
reader fully supports SNAPPY / ZLIB / ZSTD. This breaks interop with
the Java ORC ecosystem (Hive, Spark, Trino, DuckDB) where every
production deployment defaults to compressed tables — a 10-100x
storage cost for downstream consumers and an awkward gap for crates
that need to write ORC for Hive-shaped systems.
The PostScript-level codec selection has been a `// TODO: support compression` marker in `src/arrow_writer.rs::serialize_postscript` since the writer was added; this PR removes the TODO.
Solution
Implement per-chunk compression matching the reference Java writer (`org.apache.orc.impl.PhysicalFsWriter`) per the ORC v1 spec (https://orc.apache.org/specification/ORCv1/#compression):
- Streams are split into chunks of at most `compression_block_size` (default 256 KiB, matching `OrcConf.BUFFER_SIZE`).
- Each chunk is prefixed with a 3-byte little-endian header encoding the payload length and an "is original (uncompressed)" flag bit. Encodes correctly against the reader's existing `decode_header` (the reader's known-answer test for `5 → [0x0b, 0, 0]` and `100 000 → [0x40, 0x0d, 0x03]` is mirrored as a writer-side KAT in `src/writer/compression.rs::header_kat_*`).
- A chunk is emitted in its original form when `compressed_len >= original_len` — the spec-mandated and Java-reference behaviour. Verified per codec in `original_fallback_when_compression_would_expand`.

Compression is applied to every column stream (Present / Data / Length / Secondary / DictionaryData), to every stripe footer, and to the file footer. The PostScript itself is not compressed (it lives at a fixed offset from EOF so readers can locate it without first knowing the codec) and now records both `compression` (`CompressionKind`) and `compression_block_size` so any conformant reader runs the matching decompressor.
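To make the chunk loop concrete, here is a condensed sketch of the `compress_stream` shape (the closure parameter is a simplification; the PR's function takes the `Compression` enum and returns the crate's `Result`):

```rust
/// Simplified sketch: split `payload` on `block_size` boundaries, try the
/// codec on each chunk, and keep the original bytes (flag bit set) when
/// compression would not shrink the chunk. An empty payload emits nothing.
fn compress_stream(
    encode: impl Fn(&[u8]) -> Vec<u8>, // codec-specific chunk encoder
    block_size: usize,
    payload: &[u8],
) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len());
    for chunk in payload.chunks(block_size) {
        let compressed = encode(chunk);
        let (body, original) = if compressed.len() >= chunk.len() {
            (chunk, true) // spec-mandated fallback to the original chunk
        } else {
            (&compressed[..], false)
        };
        let header = ((body.len() as u32) << 1) | original as u32;
        out.extend_from_slice(&header.to_le_bytes()[..3]);
        out.extend_from_slice(body);
    }
    out
}
```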
Public API
`Compression` is exposed as:
- `Compression::None` (the default)
- `Compression::Snappy`
- `Compression::Zlib { level }` (raw DEFLATE; defaults to level 6)
- `Compression::Zstd { level }` (defaults to level 3)

with convenience constructors `Compression::zlib()` / `Compression::zstd()` for the spec-default levels, and `DEFAULT_ZLIB_LEVEL` / `DEFAULT_ZSTD_LEVEL` / `DEFAULT_COMPRESSION_BLOCK_SIZE` re-exports so call sites can derive their own configuration without duplicating constants.
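A usage sketch of that surface. The `with_compression` / `with_compression_block_size` calls and the `Compression` constructors are from this PR; the import paths, `ArrowWriterBuilder::new`, `try_build`, `write`, and `close` are assumed from the existing writer API and may differ in detail:

```rust
use std::fs::File;
use arrow::record_batch::RecordBatch;
// Import paths are assumptions; adjust to the crate's actual re-exports.
use orc_rust::ArrowWriterBuilder;
use orc_rust::writer::Compression;

fn write_compressed(batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create("out.orc")?;
    let mut writer = ArrowWriterBuilder::new(file, batch.schema())
        .with_compression(Compression::zstd()) // spec-default level 3
        .with_compression_block_size(64 * 1024) // oversize values get clamped
        .try_build()?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```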
Design
The compression machinery lives in a new module `src/writer/compression.rs` with three internal functions:
- `write_header(out, length, original)` — emits the 3-byte little-endian chunk header. Debug-asserts the 23-bit length cap.
- `encode_chunk(codec, chunk)` — codec-specific compression of one chunk's payload.
- `compress_stream(codec, block_size, payload)` — splits the payload on `block_size` boundaries and writes each chunk through `write_chunk`, falling back to ORIGINAL when the codec doesn't shrink the chunk.

`StripeWriter` carries a `pub(crate) StripeCompression { compression, block_size }` and feeds every emitted stream + the stripe footer through `compress_stream`. `ArrowWriter::close()` does the same for the file footer before serialising the PostScript.

`Compression::None` is represented as `Option::None` at the `StripeWriter` level so the no-compression code path stays branchless and produces byte-identical output to the pre-PR writer. This is a verified invariant — see the `backward_compat_default_no_compression_byte_identical` test.
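A sketch of the per-codec dispatch inside `encode_chunk`, using the codec crates the PR's tests reference (`snap`, `flate2`, `zstd`); the enum here is a stand-in, and the PR maps failures into its `SnappyEncode` / `ZstdEncode` error variants rather than `io::Error`:

```rust
use std::io::Write;

#[derive(Clone, Copy)]
enum Codec {
    Snappy,
    Zlib { level: u32 },
    Zstd { level: i32 },
}

/// Simplified sketch: compress one chunk's payload with the selected codec.
fn encode_chunk(codec: Codec, chunk: &[u8]) -> std::io::Result<Vec<u8>> {
    match codec {
        Codec::Snappy => snap::raw::Encoder::new()
            .compress_vec(chunk)
            .map_err(std::io::Error::other),
        Codec::Zlib { level } => {
            // Raw DEFLATE (no zlib wrapper), matching ORC's ZLIB kind.
            let mut enc =
                flate2::write::DeflateEncoder::new(Vec::new(), flate2::Compression::new(level));
            enc.write_all(chunk)?;
            enc.finish()
        }
        Codec::Zstd { level } => zstd::stream::encode_all(chunk, level),
    }
}
```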
Alternatives considered
- Compressing whole streams rather than chunks — rejected: it would break every existing ORC reader. The spec mandates per-chunk framing because readers stream-decode.
- Shipping a single codec first — rejected: Hive and Spark both default to SNAPPY, so an MVP without it has zero deployment value. The three codecs all share the chunked-frame envelope and only differ in the codec call inside `encode_chunk`, so doing all three in one PR is no extra design risk.
- A `Codec` trait for extensibility — deferred. Adding a trait now would commit us to a particular extension shape (e.g. how to plumb encoder context for streaming codecs) before there's a concrete second-implementation use case. The current `enum` covers every codec the ORC spec defines.
Breaking changes
None. `Compression::None` is the default. Existing call sites continue to produce byte-identical output (verified by the `backward_compat_default_no_compression_byte_identical` integration test; the PostScript's `compression_block_size` field is deliberately omitted when the codec is NONE so we match Java's writer, which also omits it).

`StripeWriter::new` is dropped (was an internal artifact; the writer module itself is `mod writer;` private — no public callers exist). `StripeWriter::with_compression` replaces it as the only constructor and is also `pub(crate)` since the public surface is `ArrowWriterBuilder`.

Tests
29 new tests, all green:
Unit tests in `src/writer/compression.rs` (11):
- `header_round_trip` — fuzzed length/flag combinations including the 23-bit boundary case
- `header_kat_matches_reader_decode` — known-answer matching the reader's `decode_compressed` test (`100 000 → [0x40, 0x0d, 0x03]`)
- `header_kat_uncompressed_5_bytes` — matching the reader's `decode_uncompressed` test (`5 → [0x0b, 0, 0]`)
- `empty_stream_emits_no_chunks` — ORC streams of 0 length carry no headers, mirrors the reader's empty-stream short-circuit
- `snappy_roundtrip_via_reader_decoder` / `zlib_roundtrip_via_…` / `zstd_roundtrip_via_…` — feed the writer's output back through the matching reader codec
- `zstd_high_level_round_trip` — exercises ZSTD level 19
- `original_fallback_when_compression_would_expand` — spec-mandated fallback verified across all three codecs
- `block_size_chunks_input_into_multiple_frames` — 5000-byte input with 1024-byte block size produces exactly 5 chunks of [1024, 1024, 1024, 1024, 904] input bytes
- `compress_stream_panics_on_compression_none_in_debug` — defence in depth against future refactor mistakes

Integration tests in `tests/writer_compression.rs` (16):
- round-trips for SNAPPY, ZLIB (default + level 9), ZSTD (default + level 19) on a mixed Int32 + Utf8 batch
- PostScript inspection (parsed with `prost`): verify CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that `compression_block_size` is populated to the user's value or the documented 256 KiB default
- backward compatibility: `Compression::None` produces a byte-stream bit-identical to a builder built without any compression call
- incompressible payloads still round-trip, exercising the spec's "fall back to original chunk" code path
- tiny block sizes force multiple compression chunks per stream and round-trip
- oversize block sizes are clamped under the 23-bit spec ceiling; zero falls back to the 256 KiB default
- multi-stripe writes with compression round-trip cleanly

`cargo test --all-features` passes 425 tests total (151 unit + 16 new integration + 13 doc + the rest unchanged) — zero failures.
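For reviewers unfamiliar with the tail layout those PostScript assertions rely on: the final byte of an ORC file stores the PostScript's serialized length, and the PostScript (never compressed) sits immediately before it. A minimal sketch, generic over the prost-generated message type:

```rust
use prost::Message;

/// Sketch: extract the PostScript from a fully-buffered ORC file.
/// `P` stands in for the prost-generated PostScript type.
fn read_postscript<P: Message + Default>(file: &[u8]) -> P {
    let ps_len = *file.last().expect("empty file") as usize;
    let ps_end = file.len() - 1; // the final byte is the PostScript length
    P::decode(&file[ps_end - ps_len..ps_end]).expect("valid PostScript protobuf")
}
```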
Benchmarks
`cargo bench --bench writer_compression` on a 10 000-row Int64 + Utf8 batch (Apple silicon, debug rustc 1.95):

| codec | output_bytes | ratio | time |
|---|---|---|---|
| none | 246698 | 1.30x | 110 us |
| snappy | 59346 | 5.39x | 293 us |
| zlib_1 | 34318 | 9.32x | 375 us |
| zlib_6 | 32461 | 9.86x | 3.4 ms |
| zlib_9 | 32461 | 9.86x | 11.8 ms |
| zstd_1 | 12538 | 25.5x | 264 us |
| zstd_3 | 8834 | 36.2x | 314 us |
| zstd_9 | 12981 | 24.6x | 1.2 ms |
| zstd_19 | 4823 | 66.4x | 81 ms |

ZSTD level 3 dominates the speed/ratio Pareto frontier on this workload, matching the upstream Java ORC default of `orc.compress.zstd.level = 3`.

Cross-implementation interop
The compressed chunk format is byte-for-byte identical to the existing reader's expectations — verified by the round-trip-via-reader-decoder unit tests, which feed the writer's output directly to the reader's `flate2::read::DeflateDecoder` / `zstd::stream::decode_all` / `snap::raw::Decoder::decompress_vec`.

I have not yet cross-validated against an external Java `orc-tools` install, but the on-wire format follows the spec exactly and the PostScript fields (`compression`, `compression_block_size`) are the only knobs a Java reader needs to find the streams. If a reviewer wants `orc-tools meta` output on benchmark samples, I'm happy to attach it in the PR discussion.
Checklist
- `cargo test --all-features` passes (425 tests)
- `cargo clippy --all-features -- -D warnings` passes for the new code (3 pre-existing warnings on `main` unrelated to this PR — `row_index.rs::useless_conversion`, `delta.rs::explicit_counter_loop` ×2 — also reproduce on `cargo clippy` with no changes)
- `cargo fmt -- --check` passes
- `cargo doc --no-deps --all-features` builds; the 3 pre-existing doc warnings on main are not introduced by this PR
- benchmark included (`benches/writer_compression.rs`)
- no breaking changes (`Compression::None` is byte-identical to pre-PR output)
Commits