
feat(writer): add SNAPPY/ZLIB/ZSTD compression support #82

Open

youichi-uda wants to merge 6 commits into datafusion-contrib:main from
youichi-uda:ferroflow/writer-compression

Conversation

@youichi-uda
Contributor

feat(writer): add SNAPPY / ZLIB / ZSTD compression support

Problem

orc-rust's writer always emits uncompressed ORC files even though the
reader fully supports SNAPPY / ZLIB / ZSTD. This breaks interop with
the Java ORC ecosystem (Hive, Spark, Trino, DuckDB) where every
production deployment defaults to compressed tables — a 10-100x
storage cost for downstream consumers and an awkward gap for crates
that need to write ORC for Hive-shaped systems.

The PostScript-level codec selection has carried a
"// TODO: support compression" marker in
src/arrow_writer.rs::serialize_postscript since the writer was added;
this PR removes that TODO.

Solution

Implement per-chunk compression matching the reference Java writer
(org.apache.orc.impl.PhysicalFsWriter) per the ORC v1 spec
(https://orc.apache.org/specification/ORCv1/#compression):

  • Per-stream, per-chunk compression with a configurable block size
    (default 256 KiB, matching OrcConf.BUFFER_SIZE).
  • 3-byte little-endian chunk header: a 23-bit length plus 1 ORIGINAL
    flag bit. Encodes correctly against the reader's existing
    decode_header (the reader's known-answer cases, 5 → [0x0b, 0, 0]
    and 100 000 → [0x40, 0x0d, 0x03], are mirrored as writer-side
    KATs in src/writer/compression.rs::header_kat_*); a sketch of the
    encoding follows this list.
  • Original-fallback when compressed_len >= original_len — the
    spec-mandated and Java-reference behaviour. Verified per codec in
    original_fallback_when_compression_would_expand.
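
The header layout can be sketched as follows (an illustration, not the
exact code in src/writer/compression.rs): the low bit carries the
ORIGINAL flag and the remaining 23 bits carry the chunk length, stored
little-endian.

// Illustrative sketch of the chunk-header layout per the ORC v1 spec;
// the function name and signature are assumptions, not the crate's API.
fn chunk_header(length: usize, is_original: bool) -> [u8; 3] {
    debug_assert!(length < (1 << 23), "chunk length must fit in 23 bits");
    let value = ((length as u32) << 1) | (is_original as u32);
    [value as u8, (value >> 8) as u8, (value >> 16) as u8]
}

// Reproduces the known-answer cases above:
//   chunk_header(5, true)        == [0x0b, 0x00, 0x00]
//   chunk_header(100_000, false) == [0x40, 0x0d, 0x03]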

Compression is applied to every column stream (Present / Data /
Length / Secondary / DictionaryData), to every stripe footer, and to
the file footer. The PostScript itself is not compressed (it
lives at a fixed offset from EOF so readers can locate it without
first knowing the codec) and now records both compression
(CompressionKind) and compression_block_size so any conformant
reader runs the matching decompressor.

Public API

use orc_rust::arrow_writer::{ArrowWriterBuilder, Compression};

let writer = ArrowWriterBuilder::new(file, schema)
    .with_compression(Compression::Snappy)
    // optional — defaults to 256 KiB
    .with_compression_block_size(64 * 1024)
    .try_build()?;

Compression is exposed as:

pub enum Compression {
    None,                       // default — byte-identical to pre-PR output
    Snappy,
    Zlib { level: u32 },        // raw DEFLATE; default level 6
    Zstd { level: i32 },        // default level 3
}

with convenience constructors Compression::zlib() /
Compression::zstd() for the spec-default levels, and
DEFAULT_ZLIB_LEVEL / DEFAULT_ZSTD_LEVEL /
DEFAULT_COMPRESSION_BLOCK_SIZE re-exports so call sites can derive
their own configuration without duplicating constants.
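
A call site can derive its own configuration from those re-exports,
for example (import paths assumed to match the builder snippet above):

use orc_rust::arrow_writer::{ArrowWriterBuilder, Compression, DEFAULT_COMPRESSION_BLOCK_SIZE};

// Illustrative only: Compression::zstd() is the spec-default-level (3)
// variant, and the block size is derived from the re-exported 256 KiB constant.
let writer = ArrowWriterBuilder::new(file, schema)
    .with_compression(Compression::zstd())
    .with_compression_block_size(DEFAULT_COMPRESSION_BLOCK_SIZE / 4) // 64 KiB
    .try_build()?;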

Design

The compression machinery lives in a new module
src/writer/compression.rs with three internal functions:

  • write_header(out, length, original) — emits the 3-byte little-
    endian chunk header. Debug-asserts the 23-bit length cap.
  • encode_chunk(codec, chunk) — codec-specific compression of one
    chunk's payload.
  • compress_stream(codec, block_size, payload) — splits the payload
    on block_size boundaries and frames each chunk via write_header
    and encode_chunk, falling back to ORIGINAL when the codec doesn't
    shrink the chunk (a sketch follows this list).
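
A minimal sketch of that loop, assuming the helper signatures listed
above and the crate's Result alias (the real compress_stream may
differ in buffering and error handling):

fn compress_stream(codec: Compression, block_size: usize, payload: &[u8]) -> Result<Vec<u8>> {
    let mut out = Vec::new();
    for chunk in payload.chunks(block_size) {
        let encoded = encode_chunk(codec, chunk)?;
        if encoded.len() < chunk.len() {
            // Codec shrank the chunk: emit the compressed body.
            write_header(&mut out, encoded.len(), false);
            out.extend_from_slice(&encoded);
        } else {
            // Spec-mandated ORIGINAL fallback: emit the input bytes verbatim.
            write_header(&mut out, chunk.len(), true);
            out.extend_from_slice(chunk);
        }
    }
    Ok(out)
}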

StripeWriter carries a pub(crate) StripeCompression { compression, block_size } and feeds every emitted stream + the stripe footer
through compress_stream. ArrowWriter::close() does the same for
the file footer before serialising the PostScript.

Compression::None is represented as Option::None at the
StripeWriter level so the no-compression code path stays
branchless and produces byte-identical output to the pre-PR writer.
This is a verified invariant — see the
backward_compat_default_no_compression_byte_identical test.
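
In other words, roughly (names taken from the description above, not
the literal code):

// Builder-level Compression::None maps to Option::None at the stripe level,
// so the uncompressed path never consults a codec or a block size.
let stripe_compression: Option<StripeCompression> = match compression {
    Compression::None => None,
    codec => Some(StripeCompression { compression: codec, block_size }),
};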

Alternatives considered

  1. Whole-stripe compression — rejected: non-spec-compliant, would
    break every existing ORC reader.
  2. Per-stream global (no chunks) — rejected: same problem; the
    spec mandates per-chunk framing because readers stream-decode.
  3. ZSTD-only first, SNAPPY/ZLIB later — rejected: Hive and Trino
    both default to SNAPPY, so an MVP without it has zero deployment
    value. The three codecs all share the chunked-frame envelope and
    only differ in the codec call inside encode_chunk, so doing all
    three in one PR is no extra design risk.
  4. Fork a Codec trait for extensibility — deferred. Adding a
    trait now would commit us to a particular extension shape (e.g.
    how to plumb encoder context for streaming codecs) before there's
    a concrete second-implementation use case. The current enum
    covers every codec the ORC spec defines.

Breaking changes

None. Compression::None is the default. Existing call sites
continue to produce byte-identical output (verified by the
backward_compat_default_no_compression_byte_identical integration
test; the PostScript's compression_block_size field is
deliberately omitted when the codec is NONE so we match Java's
writer which also omits it).

StripeWriter::new is dropped (it was an internal artifact; the writer
module itself is declared privately as mod writer;, so no public
callers exist).
StripeWriter::with_compression replaces it as the only constructor
and is also pub(crate) since the public surface is
ArrowWriterBuilder.

Tests

29 new tests, all green:

Unit tests in src/writer/compression.rs (11):

  • header_round_trip — fuzzed length/flag combinations including
    the 23-bit boundary case
  • header_kat_matches_reader_decode — known-answer matching the
    reader's decode_compressed test (100 000 → [0x40, 0x0d, 0x03])
  • header_kat_uncompressed_5_bytes — matching the reader's
    decode_uncompressed test (5 → [0x0b, 0, 0])
  • empty_stream_emits_no_chunks — ORC streams of 0 length carry no
    headers, mirrors the reader's empty-stream short-circuit
  • snappy_roundtrip_via_reader_decoder / zlib_roundtrip_via_… /
    zstd_roundtrip_via_… — feed the writer's output back through the
    matching reader codec
  • zstd_high_level_round_trip — exercises ZSTD level 19
  • original_fallback_when_compression_would_expand — spec-mandated
    fallback verified across all three codecs
  • block_size_chunks_input_into_multiple_frames — 5000-byte input
    with 1024-byte block size produces exactly 5 chunks of [1024,
    1024, 1024, 1024, 904] input bytes
  • compress_stream_panics_on_compression_none_in_debug — defence
    in depth against future refactor mistakes

Integration tests in tests/writer_compression.rs (16):

  • Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default +
    level 19) on a mixed Int32 + Utf8 batch
  • PostScript inspection (parse the file tail with prost): verify
    CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that
    compression_block_size is populated with the user's value or the
    documented 256 KiB default (a tail-parsing sketch follows this
    list)
  • Backward compat: Compression::None produces a byte-stream
    bit-identical to a builder built without any compression call
  • Incompressible payload (xorshift bytes) survives round-trip via
    the spec's "fall back to original chunk" code path
  • Tiny block size (4 KiB) over a multi-megabyte stream forces
    multiple compression chunks per stream and round-trips
  • API hardening: oversize block sizes are clamped under the 23-bit
    spec ceiling; zero falls back to the 256 KiB default
  • Multi-stripe writes with compression round-trip cleanly
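
The PostScript-inspection tests amount to a tail parse like the
following sketch, assuming the generated protos are reachable as
orc_rust::proto (the path and helper name are assumptions):

use prost::Message;

// The last byte of an ORC file is the PostScript length; the PostScript
// itself sits immediately before that byte and is never compressed.
fn read_postscript(file_bytes: &[u8]) -> orc_rust::proto::PostScript {
    let ps_len = *file_bytes.last().expect("non-empty file") as usize;
    let ps_start = file_bytes.len() - 1 - ps_len;
    orc_rust::proto::PostScript::decode(&file_bytes[ps_start..file_bytes.len() - 1])
        .expect("valid PostScript")
}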

cargo test --all-features passes 425 tests total (151 unit +
16 new integration + 13 doc + the rest unchanged) — zero failures.

Benchmarks

cargo bench --bench writer_compression on a 10 000-row Int64 +
Utf8 batch (Apple silicon, rustc 1.95, debug build):

codec     output bytes   ratio    write time
none           246 698   1.30x    110 µs
snappy          59 346   5.39x    293 µs
zlib_1          34 318   9.32x    375 µs
zlib_6          32 461   9.86x    3.4 ms
zlib_9          32 461   9.86x    11.8 ms
zstd_1          12 538   25.52x   264 µs
zstd_3           8 834   36.22x   314 µs
zstd_9          12 981   24.65x   1.2 ms
zstd_19          4 823   66.35x   81.0 ms

ZSTD level 3 dominates the speed/ratio Pareto frontier on this
workload, matching the upstream Java ORC default of
orc.compress.zstd.level = 3.
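
The bench has roughly this shape (a hedged sketch; make_batch_10k and
write_orc are hypothetical helpers standing in for the batch
construction and the writer round, and the real
benches/writer_compression.rs may be organized differently):

use criterion::{criterion_group, criterion_main, Criterion};
use orc_rust::arrow_writer::Compression;

fn bench_codecs(c: &mut Criterion) {
    // Hypothetical helpers: build the 10 000-row Int64 + Utf8 batch once,
    // then write one in-memory ORC file per codec inside the timing loop.
    let batch = make_batch_10k();
    for (name, codec) in [
        ("none", Compression::None),
        ("snappy", Compression::Snappy),
        ("zstd_3", Compression::Zstd { level: 3 }),
    ] {
        c.bench_function(name, |b| b.iter(|| write_orc(&batch, codec)));
    }
}

criterion_group!(benches, bench_codecs);
criterion_main!(benches);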

Cross-implementation interop

The compressed chunk format is byte-for-byte identical to the
existing reader's expectations — verified by the round-trip-via-
reader-decoder unit tests, which feed the writer's output directly
to the reader's flate2::read::DeflateDecoder /
zstd::stream::decode_all / snap::raw::Decoder::decompress_vec.

I have not yet cross-validated against an external Java orc-tools
install, but the on-wire format follows the spec exactly and the
PostScript fields (compression, compression_block_size) are the
only knobs a Java reader needs to find the streams. If a reviewer
wants orc-tools meta output on benchmark samples, I'm happy to
attach it in the PR discussion.

Checklist

  • cargo test --all-features passes (425 tests)
  • cargo clippy --all-features -- -D warnings passes for the new
    code (3 pre-existing warnings on main are unrelated to this PR:
    row_index.rs::useless_conversion and
    delta.rs::explicit_counter_loop ×2 — they also reproduce with
    cargo clippy on an unchanged checkout)
  • cargo fmt -- --check passes
  • cargo doc --no-deps --all-features builds; the 3 pre-existing
    doc warnings on main are not introduced by this PR
  • Benchmarks added (benches/writer_compression.rs)
  • Backward compat verified (Compression::None is byte-identical
    to pre-PR output)
  • Apache 2.0 license header on every new file
  • Conventional commit messages, signed off

Commits

feat(writer): add per-chunk compression module (SNAPPY/ZLIB/ZSTD)
feat(writer): wire compression through ArrowWriter and StripeWriter
test(writer): end-to-end compression round-trip + PostScript inspection
bench(writer): codec comparison benchmark on a 10k-row mixed batch

Implements the ORC v1 spec's per-chunk compression framing
(https://orc.apache.org/specification/ORCv1/#compression) at the
writer side: a 3-byte little-endian header per chunk encoding the
chunk's payload length and an "is original (uncompressed)" flag.
When the codec output is no smaller than the input, the chunk is
emitted in its original form with the flag bit set, matching the
reference Java writer's behaviour
(`org.apache.orc.impl.PhysicalFsWriter`).

The new `Compression` enum exposes:
- `Compression::None` (default — byte-identical to pre-feature output)
- `Compression::Snappy`
- `Compression::Zlib { level }` (raw DEFLATE; defaults to level 6)
- `Compression::Zstd { level }` (defaults to level 3)

Convenience constructors `Compression::zlib()` / `Compression::zstd()`
yield the spec-default-level variants, and the canonical defaults
are also exposed as `DEFAULT_ZLIB_LEVEL` / `DEFAULT_ZSTD_LEVEL` /
`DEFAULT_COMPRESSION_BLOCK_SIZE` (256 KiB) so call sites can derive
their own configuration without duplicating constants.

Two new error variants (`SnappyEncode`, `ZstdEncode`) surface codec
failures distinctly from generic I/O errors.

11 unit tests cover header round-trips, the spec's known-answer
encoding (5-byte / 100 000-byte cases that match the existing reader
test), per-codec round-trips through the matching reader codec, the
"compression expanded the chunk → fall back to original" invariant
across all three codecs, multi-chunk splitting, and the debug-only
panic guard against accidentally calling `compress_stream` with
`Compression::None`.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds the public `with_compression(Compression)` and
`with_compression_block_size(usize)` builder methods on
`ArrowWriterBuilder`, and threads the configuration through to:

- every emitted column stream (Present / Data / Length / Secondary /
  DictionaryData), via `StripeWriter`
- the per-stripe footer
- the file footer

The PostScript now records both `compression` (`CompressionKind`) and
`compression_block_size`, so any conformant ORC reader (Java ORC,
DuckDB, Spark, orc-rust's own reader) decompresses the file
correctly. The block-size field is omitted from the PostScript when
the codec is `NONE` to preserve byte-identical output for the
pre-feature default — verified by the
`backward_compat_default_no_compression_byte_identical` integration
test in the next commit.

Per the spec the PostScript itself is NEVER compressed (it lives at
a fixed offset from EOF so readers can locate it without first
knowing the codec); this is honoured by writing the PostScript
through the raw inner writer rather than the compression wrapper.

Block size is silently clamped to the spec's 23-bit ceiling
(2^23 - 1 bytes) and zero is treated as "use the 256 KiB default",
making the API impossible to misuse into a runtime panic.
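
The clamp amounts to something like the following (illustrative names,
not the committed code):

const DEFAULT_COMPRESSION_BLOCK_SIZE: usize = 256 * 1024; // 256 KiB
const MAX_CHUNK_LEN: usize = (1 << 23) - 1;               // 23-bit header ceiling

fn effective_block_size(requested: usize) -> usize {
    match requested {
        0 => DEFAULT_COMPRESSION_BLOCK_SIZE, // zero means "use the default"
        n => n.min(MAX_CHUNK_LEN),           // silently clamp oversize values
    }
}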

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds tests/writer_compression.rs (16 integration tests) covering:

- Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default +
  level 19) on a mixed Int32 + Utf8 batch.
- PostScript inspection by parsing the file tail with `prost`:
  asserts CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that
  `compression_block_size` is populated (and respects the user's
  value or the documented 256 KiB default).
- Backward compatibility: `Compression::None` produces a byte-stream
  bit-identical to a builder built without any compression call,
  capturing the no-default-change invariant.
- Incompressible payload (xorshift bytes) still round-trips — exercises
  the spec-mandated "fall back to original chunk" code path through
  the public reader.
- Tiny block size (4 KiB) over a multi-megabyte stream forces multiple
  compression chunks per stream and verifies the round-trip.
- API hardening: oversize block sizes are clamped under the 23-bit
  spec ceiling; zero falls back to the 256 KiB default.
- Multi-stripe writes with compression round-trip cleanly.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds a Criterion benchmark comparing on-disk size and write time for
None / Snappy / ZLIB (levels 1, 6, 9) / ZSTD (levels 1, 3, 9, 19)
on a 10 000-row Int64 + Utf8 batch — the workload representative of
production Hive / Trino tables.

Each codec's resulting file size is printed alongside the benchmark
so reviewers can sanity-check the size / speed trade-off without
re-running the bench themselves. Sample numbers from a single run
on Apple-silicon (debug rustc 1.95):

  codec    output_bytes  ratio   time
  none           246698  1.30x   110 us
  snappy          59346  5.39x   293 us
  zlib_1          34318  9.32x   375 us
  zlib_6          32461  9.86x   3.4 ms
  zlib_9          32461  9.86x   11.8 ms
  zstd_1          12538  25.5x   264 us
  zstd_3           8834  36.2x   314 us
  zstd_9          12981  24.6x   1.2 ms
  zstd_19          4823  66.4x   81 ms

(zstd_3 and zstd_1 dominate the speed / ratio Pareto frontier on this
workload; this matches Java ORC's choice of zstd level 3 as the
default `orc.compress.zstd.level`.)

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Cross-validate the writer-side compression work in this PR against
the Java reference implementation (Apache ORC 1.9.5 `orc-tools`):

- Rust writer → Java reader for all 3 codecs: `orc-tools meta`
  parses the PostScript, reports `Compression: SNAPPY/ZLIB/ZSTD`
  correctly, row count matches. `orc-tools data` decompresses and
  decodes every stripe, emitting the exact rows we wrote.
- Java writer → Rust reader: a file produced by `orc-tools convert`
  (default ZLIB) round-trips through the Rust reader with
  byte-identical column values.

The tests are `#[ignore]`d and gated on the `ORC_TOOLS_JAR`
environment variable, so they don't break CI on machines without a
JDK. Run with:

    export ORC_TOOLS_JAR=/path/to/orc-tools-<version>-uber.jar
    cargo test --test java_interop -- --ignored

PR_BODY.md updated with the new evidence section and a runnable
reviewer recipe.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

- src/row_index.rs: drop redundant `.into_iter()` in `zip`
  (useless_conversion)
- src/encoding/integer/rle_v2/delta.rs: rewrite two manual counter
  loops with `.enumerate()` (explicit_counter_loop)
- tests/java_interop.rs: take `&Path` instead of `&PathBuf` in
  `run_meta` / `run_data` / `write_orc` (ptr_arg)

Required to keep CI green on Rust 1.95.0 with `-D warnings`.

Copilot AI left a comment


Pull request overview

Adds ORC writer-side per-chunk compression support (SNAPPY/ZLIB/ZSTD) to match the ORC v1 spec framing and align output with common Java ORC ecosystem defaults, including PostScript codec metadata and configurable block size.

Changes:

  • Introduces a new writer compression module implementing ORC chunk framing + codec payload encoding with original-fallback behavior.
  • Wires compression through ArrowWriterBuilder/ArrowWriter and StripeWriter for all streams, stripe footers, and file footer (PostScript remains uncompressed).
  • Adds extensive integration/unit tests (including optional Java orc-tools interop) and a Criterion benchmark for codec comparisons.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/writer_compression.rs: End-to-end compression round-trip + PostScript assertions + API hardening tests
  • tests/java_interop.rs: Optional (#[ignore]) Java orc-tools cross-implementation validation tests
  • src/writer/stripe.rs: Applies optional compression to streams and stripe footer; introduces StripeCompression
  • src/writer/mod.rs: Registers the new writer compression module
  • src/writer/compression.rs: Implements ORC chunk framing and SNAPPY/ZLIB/ZSTD encoding + unit tests
  • src/row_index.rs: Minor iteration change (clippy-related)
  • src/error.rs: Adds writer-side encode error variants for snappy/zstd
  • src/encoding/integer/rle_v2/delta.rs: Test loop refactor (clippy-related)
  • src/arrow_writer.rs: Public compression API on builder; compresses file footer; writes PostScript compression metadata
  • Cargo.toml: Adds writer_compression benchmark target
  • benches/writer_compression.rs: Criterion benchmark comparing codecs and output size
  • .github/PR_BODY.md: Adds a detailed PR body template/documentation for this feature


Comment thread src/writer/stripe.rs
Comment on lines +164 to +172
let bytes_to_write = match self.compression {
    Some(StripeCompression {
        compression,
        block_size,
    }) => compress_stream(compression, block_size, &bytes)?,
    None => bytes.to_vec(),
};
let length = bytes_to_write.len();
self.writer.write_all(&bytes_to_write).context(IoSnafu)?;
Comment thread src/writer/stripe.rs
Comment on lines +46 to +49
    pub compression: Compression,
    pub block_size: usize,
}

Comment thread src/writer/compression.rs
Comment on lines +59 to +62
/// Levels are clamped to each codec's valid range. ZSTD accepts negative
/// levels (faster than level 1, lower ratio) and levels above 19 require
/// the `zstd-long` distant-match window.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
Comment thread src/writer/compression.rs
Comment on lines +392 to +415
let chunk_lens: Vec<usize> = chunks
    .iter()
    .map(|(_, original, body)| if *original { body.len() } else { 1024 })
    .collect();
// We can't easily assert decoded chunk size when compressed
// (snappy may have shrunk one), but every chunk's *original*
// size must fit in [1, 1024]. Verify boundary: the first 4
// chunks each represented exactly 1024 bytes of input.
for (i, expected_input_size) in [1024, 1024, 1024, 1024, 904].iter().enumerate() {
    let (_, original, body) = chunks[i];
    let input_size = if original {
        body.len()
    } else {
        snap::raw::Decoder::new()
            .decompress_vec(body)
            .unwrap()
            .len()
    };
    assert_eq!(
        input_size, *expected_input_size,
        "chunk {i} wraps {expected_input_size} input bytes"
    );
    let _ = chunk_lens; // silence unused
}