
feat(writer): add SNAPPY/ZLIB/ZSTD compression support #82

Open

youichi-uda wants to merge 6 commits into datafusion-contrib:main from
youichi-uda:ferroflow/writer-compression

Conversation

@youichi-uda
Contributor

feat(writer): add SNAPPY / ZLIB / ZSTD compression support

Problem

orc-rust's writer always emits uncompressed ORC files even though the
reader fully supports SNAPPY / ZLIB / ZSTD. This breaks interop with
the Java ORC ecosystem (Hive, Spark, Trino, DuckDB) where every
production deployment defaults to compressed tables — a 10-100x
storage cost for downstream consumers and an awkward gap for crates
that need to write ORC for Hive-shaped systems.

The PostScript-level codec selection has carried a
"// TODO: support compression" marker in
src/arrow_writer.rs::serialize_postscript since the writer was added;
this PR removes that TODO.

Solution

Implement per-chunk compression matching the reference Java writer
(org.apache.orc.impl.PhysicalFsWriter) per the ORC v1 spec
(https://orc.apache.org/specification/ORCv1/#compression):

  • Per-stream, per-chunk compression with a configurable block size
    (default 256 KiB, matching OrcConf.BUFFER_SIZE).
  • 3-byte little-endian chunk header: a 23-bit length plus 1 ORIGINAL
    flag bit. Encodes correctly against the reader's existing
    decode_header (the reader's known-answer cases, 5 → [0x0b, 0, 0]
    and 100 000 → [0x40, 0x0d, 0x03], are mirrored as writer-side
    KATs in src/writer/compression.rs::header_kat_*); a sketch of the
    encoding follows this list.
  • Original-fallback when compressed_len >= original_len — the
    spec-mandated and Java-reference behaviour. Verified per codec in
    original_fallback_when_compression_would_expand.
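
The header layout can be sketched as follows (an illustration, not the
exact code in src/writer/compression.rs): the low bit carries the
ORIGINAL flag and the remaining 23 bits carry the chunk length, stored
little-endian.

// Illustrative sketch of the chunk-header layout per the ORC v1 spec;
// the function name and signature are assumptions, not the crate's API.
fn chunk_header(length: usize, is_original: bool) -> [u8; 3] {
    debug_assert!(length < (1 << 23), "chunk length must fit in 23 bits");
    let value = ((length as u32) << 1) | (is_original as u32);
    [value as u8, (value >> 8) as u8, (value >> 16) as u8]
}

// Reproduces the known-answer cases above:
//   chunk_header(5, true)        == [0x0b, 0x00, 0x00]
//   chunk_header(100_000, false) == [0x40, 0x0d, 0x03]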

Compression is applied to every column stream (Present / Data /
Length / Secondary / DictionaryData), to every stripe footer, and to
the file footer. The PostScript itself is not compressed (it
lives at a fixed offset from EOF so readers can locate it without
first knowing the codec) and now records both compression
(CompressionKind) and compression_block_size so any conformant
reader runs the matching decompressor.

Public API

use orc_rust::arrow_writer::{ArrowWriterBuilder, Compression};

let writer = ArrowWriterBuilder::new(file, schema)
    .with_compression(Compression::Snappy)
    // optional — defaults to 256 KiB
    .with_compression_block_size(64 * 1024)
    .try_build()?;

Compression is exposed as:

pub enum Compression {
    None,                       // default — byte-identical to pre-PR output
    Snappy,
    Zlib { level: u32 },        // raw DEFLATE; default level 6
    Zstd { level: i32 },        // default level 3
}

with convenience constructors Compression::zlib() /
Compression::zstd() for the spec-default levels, and
DEFAULT_ZLIB_LEVEL / DEFAULT_ZSTD_LEVEL /
DEFAULT_COMPRESSION_BLOCK_SIZE re-exports so call sites can derive
their own configuration without duplicating constants.
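
A call site can derive its own configuration from those re-exports,
for example (import paths assumed to match the builder snippet above):

use orc_rust::arrow_writer::{ArrowWriterBuilder, Compression, DEFAULT_COMPRESSION_BLOCK_SIZE};

// Illustrative only: Compression::zstd() is the spec-default-level (3)
// variant, and the block size is derived from the re-exported 256 KiB constant.
let writer = ArrowWriterBuilder::new(file, schema)
    .with_compression(Compression::zstd())
    .with_compression_block_size(DEFAULT_COMPRESSION_BLOCK_SIZE / 4) // 64 KiB
    .try_build()?;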

Design

The compression machinery lives in a new module
src/writer/compression.rs with three internal functions:

  • write_header(out, length, original) — emits the 3-byte little-
    endian chunk header. Debug-asserts the 23-bit length cap.
  • encode_chunk(codec, chunk) — codec-specific compression of one
    chunk's payload.
  • compress_stream(codec, block_size, payload) — splits the payload
    on block_size boundaries and frames each chunk via write_header
    and encode_chunk, falling back to ORIGINAL when the codec doesn't
    shrink the chunk (a sketch follows this list).
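
A minimal sketch of that loop, assuming the helper signatures listed
above and the crate's Result alias (the real compress_stream may
differ in buffering and error handling):

fn compress_stream(codec: Compression, block_size: usize, payload: &[u8]) -> Result<Vec<u8>> {
    let mut out = Vec::new();
    for chunk in payload.chunks(block_size) {
        let encoded = encode_chunk(codec, chunk)?;
        if encoded.len() < chunk.len() {
            // Codec shrank the chunk: emit the compressed body.
            write_header(&mut out, encoded.len(), false);
            out.extend_from_slice(&encoded);
        } else {
            // Spec-mandated ORIGINAL fallback: emit the input bytes verbatim.
            write_header(&mut out, chunk.len(), true);
            out.extend_from_slice(chunk);
        }
    }
    Ok(out)
}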

StripeWriter carries a pub(crate) StripeCompression { compression, block_size } and feeds every emitted stream + the stripe footer
through compress_stream. ArrowWriter::close() does the same for
the file footer before serialising the PostScript.

Compression::None is represented as Option::None at the
StripeWriter level so the no-compression code path stays
branchless and produces byte-identical output to the pre-PR writer.
This is a verified invariant — see the
backward_compat_default_no_compression_byte_identical test.
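
In other words, roughly (names taken from the description above, not
the literal code):

// Builder-level Compression::None maps to Option::None at the stripe level,
// so the uncompressed path never consults a codec or a block size.
let stripe_compression: Option<StripeCompression> = match compression {
    Compression::None => None,
    codec => Some(StripeCompression { compression: codec, block_size }),
};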

Alternatives considered

  1. Whole-stripe compression — rejected: non-spec-compliant, would
    break every existing ORC reader.
  2. Per-stream global (no chunks) — rejected: same problem; the
    spec mandates per-chunk framing because readers stream-decode.
  3. ZSTD-only first, SNAPPY/ZLIB later — rejected: Hive and Trino
    both default to SNAPPY, so an MVP without it has zero deployment
    value. The three codecs all share the chunked-frame envelope and
    only differ in the codec call inside encode_chunk, so doing all
    three in one PR is no extra design risk.
  4. Fork a Codec trait for extensibility — deferred. Adding a
    trait now would commit us to a particular extension shape (e.g.
    how to plumb encoder context for streaming codecs) before there's
    a concrete second-implementation use case. The current enum
    covers every codec the ORC spec defines.

Breaking changes

None. Compression::None is the default. Existing call sites
continue to produce byte-identical output (verified by the
backward_compat_default_no_compression_byte_identical integration
test; the PostScript's compression_block_size field is
deliberately omitted when the codec is NONE so we match Java's
writer which also omits it).

StripeWriter::new is dropped (it was an internal artifact; the writer
module itself is declared privately as mod writer;, so no public
callers exist).
StripeWriter::with_compression replaces it as the only constructor
and is also pub(crate) since the public surface is
ArrowWriterBuilder.

Tests

29 new tests, all green:

Unit tests in src/writer/compression.rs (11):

  • header_round_trip — fuzzed length/flag combinations including
    the 23-bit boundary case
  • header_kat_matches_reader_decode — known-answer matching the
    reader's decode_compressed test (100 000 → [0x40, 0x0d, 0x03])
  • header_kat_uncompressed_5_bytes — matching the reader's
    decode_uncompressed test (5 → [0x0b, 0, 0])
  • empty_stream_emits_no_chunks — ORC streams of 0 length carry no
    headers, mirrors the reader's empty-stream short-circuit
  • snappy_roundtrip_via_reader_decoder / zlib_roundtrip_via_… /
    zstd_roundtrip_via_… — feed the writer's output back through the
    matching reader codec
  • zstd_high_level_round_trip — exercises ZSTD level 19
  • original_fallback_when_compression_would_expand — spec-mandated
    fallback verified across all three codecs
  • block_size_chunks_input_into_multiple_frames — 5000-byte input
    with 1024-byte block size produces exactly 5 chunks of [1024,
    1024, 1024, 1024, 904] input bytes
  • compress_stream_panics_on_compression_none_in_debug — defence
    in depth against future refactor mistakes

Integration tests in tests/writer_compression.rs (16):

  • Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default +
    level 19) on a mixed Int32 + Utf8 batch
  • PostScript inspection (parse the file tail with prost): verify
    CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that
    compression_block_size is populated with the user's value or the
    documented 256 KiB default (a tail-parsing sketch follows this
    list)
  • Backward compat: Compression::None produces a byte-stream
    bit-identical to a builder built without any compression call
  • Incompressible payload (xorshift bytes) survives round-trip via
    the spec's "fall back to original chunk" code path
  • Tiny block size (4 KiB) over a multi-megabyte stream forces
    multiple compression chunks per stream and round-trips
  • API hardening: oversize block sizes are clamped under the 23-bit
    spec ceiling; zero falls back to the 256 KiB default
  • Multi-stripe writes with compression round-trip cleanly
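
The PostScript-inspection tests amount to a tail parse like the
following sketch, assuming the generated protos are reachable as
orc_rust::proto (the path and helper name are assumptions):

use prost::Message;

// The last byte of an ORC file is the PostScript length; the PostScript
// itself sits immediately before that byte and is never compressed.
fn read_postscript(file_bytes: &[u8]) -> orc_rust::proto::PostScript {
    let ps_len = *file_bytes.last().expect("non-empty file") as usize;
    let ps_start = file_bytes.len() - 1 - ps_len;
    orc_rust::proto::PostScript::decode(&file_bytes[ps_start..file_bytes.len() - 1])
        .expect("valid PostScript")
}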

cargo test --all-features passes 425 tests total (151 unit +
16 new integration + 13 doc + the rest unchanged) — zero failures.

Benchmarks

cargo bench --bench writer_compression on a 10 000-row Int64 +
Utf8 batch (Apple silicon, rustc 1.95, debug build):

codec     output bytes   ratio    write time
none           246 698   1.30x    110 µs
snappy          59 346   5.39x    293 µs
zlib_1          34 318   9.32x    375 µs
zlib_6          32 461   9.86x    3.4 ms
zlib_9          32 461   9.86x    11.8 ms
zstd_1          12 538   25.52x   264 µs
zstd_3           8 834   36.22x   314 µs
zstd_9          12 981   24.65x   1.2 ms
zstd_19          4 823   66.35x   81.0 ms

ZSTD level 3 dominates the speed/ratio Pareto frontier on this
workload, matching the upstream Java ORC default of
orc.compress.zstd.level = 3.
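
The bench has roughly this shape (a hedged sketch; make_batch_10k and
write_orc are hypothetical helpers standing in for the batch
construction and the writer round, and the real
benches/writer_compression.rs may be organized differently):

use criterion::{criterion_group, criterion_main, Criterion};
use orc_rust::arrow_writer::Compression;

fn bench_codecs(c: &mut Criterion) {
    // Hypothetical helpers: build the 10 000-row Int64 + Utf8 batch once,
    // then write one in-memory ORC file per codec inside the timing loop.
    let batch = make_batch_10k();
    for (name, codec) in [
        ("none", Compression::None),
        ("snappy", Compression::Snappy),
        ("zstd_3", Compression::Zstd { level: 3 }),
    ] {
        c.bench_function(name, |b| b.iter(|| write_orc(&batch, codec)));
    }
}

criterion_group!(benches, bench_codecs);
criterion_main!(benches);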

Cross-implementation interop

The compressed chunk format is byte-for-byte identical to the
existing reader's expectations — verified by the round-trip-via-
reader-decoder unit tests, which feed the writer's output directly
to the reader's flate2::read::DeflateDecoder /
zstd::stream::decode_all / snap::raw::Decoder::decompress_vec.

I have not yet cross-validated against an external Java orc-tools
install, but the on-wire format follows the spec exactly and the
PostScript fields (compression, compression_block_size) are the
only knobs a Java reader needs to find the streams. If a reviewer
wants orc-tools meta output on benchmark samples, I'm happy to
attach it in the PR discussion.

Checklist

  • cargo test --all-features passes (425 tests)
  • cargo clippy --all-features -- -D warnings passes for the new
    code (3 pre-existing warnings on main are unrelated to this PR:
    row_index.rs::useless_conversion and
    delta.rs::explicit_counter_loop ×2 — they also reproduce with
    cargo clippy on an unchanged checkout)
  • cargo fmt -- --check passes
  • cargo doc --no-deps --all-features builds; the 3 pre-existing
    doc warnings on main are not introduced by this PR
  • Benchmarks added (benches/writer_compression.rs)
  • Backward compat verified (Compression::None is byte-identical
    to pre-PR output)
  • Apache 2.0 license header on every new file
  • Conventional commit messages, signed off

Commits

feat(writer): add per-chunk compression module (SNAPPY/ZLIB/ZSTD)
feat(writer): wire compression through ArrowWriter and StripeWriter
test(writer): end-to-end compression round-trip + PostScript inspection
bench(writer): codec comparison benchmark on a 10k-row mixed batch

Implements the ORC v1 spec's per-chunk compression framing
(https://orc.apache.org/specification/ORCv1/#compression) at the
writer side: a 3-byte little-endian header per chunk encoding the
chunk's payload length and an "is original (uncompressed)" flag.
When the codec output is no smaller than the input, the chunk is
emitted in its original form with the flag bit set, matching the
reference Java writer's behaviour
(`org.apache.orc.impl.PhysicalFsWriter`).

The new `Compression` enum exposes:
- `Compression::None` (default — byte-identical to pre-feature output)
- `Compression::Snappy`
- `Compression::Zlib { level }` (raw DEFLATE; defaults to level 6)
- `Compression::Zstd { level }` (defaults to level 3)

Convenience constructors `Compression::zlib()` / `Compression::zstd()`
yield the spec-default-level variants, and the canonical defaults
are also exposed as `DEFAULT_ZLIB_LEVEL` / `DEFAULT_ZSTD_LEVEL` /
`DEFAULT_COMPRESSION_BLOCK_SIZE` (256 KiB) so call sites can derive
their own configuration without duplicating constants.

Two new error variants (`SnappyEncode`, `ZstdEncode`) surface codec
failures distinctly from generic I/O errors.

11 unit tests cover header round-trips, the spec's known-answer
encoding (5-byte / 100 000-byte cases that match the existing reader
test), per-codec round-trips through the matching reader codec, the
"compression expanded the chunk → fall back to original" invariant
across all three codecs, multi-chunk splitting, and the debug-only
panic guard against accidentally calling `compress_stream` with
`Compression::None`.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds the public `with_compression(Compression)` and
`with_compression_block_size(usize)` builder methods on
`ArrowWriterBuilder`, and threads the configuration through to:

- every emitted column stream (Present / Data / Length / Secondary /
  DictionaryData), via `StripeWriter`
- the per-stripe footer
- the file footer

The PostScript now records both `compression` (`CompressionKind`) and
`compression_block_size`, so any conformant ORC reader (Java ORC,
DuckDB, Spark, orc-rust's own reader) decompresses the file
correctly. The block-size field is omitted from the PostScript when
the codec is `NONE` to preserve byte-identical output for the
pre-feature default — verified by the
`backward_compat_default_no_compression_byte_identical` integration
test in the next commit.

Per the spec the PostScript itself is NEVER compressed (it lives at
a fixed offset from EOF so readers can locate it without first
knowing the codec); this is honoured by writing the PostScript
through the raw inner writer rather than the compression wrapper.

Block size is silently clamped to the spec's 23-bit ceiling
(2^23 - 1 bytes) and zero is treated as "use the 256 KiB default",
making the API impossible to misuse into a runtime panic.
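
The clamp amounts to something like the following (illustrative names,
not the committed code):

const DEFAULT_COMPRESSION_BLOCK_SIZE: usize = 256 * 1024; // 256 KiB
const MAX_CHUNK_LEN: usize = (1 << 23) - 1;               // 23-bit header ceiling

fn effective_block_size(requested: usize) -> usize {
    match requested {
        0 => DEFAULT_COMPRESSION_BLOCK_SIZE, // zero means "use the default"
        n => n.min(MAX_CHUNK_LEN),           // silently clamp oversize values
    }
}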

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds tests/writer_compression.rs (16 integration tests) covering:

- Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default +
  level 19) on a mixed Int32 + Utf8 batch.
- PostScript inspection by parsing the file tail with `prost`:
  asserts CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that
  `compression_block_size` is populated (and respects the user's
  value or the documented 256 KiB default).
- Backward compatibility: `Compression::None` produces a byte-stream
  bit-identical to a builder built without any compression call,
  capturing the no-default-change invariant.
- Incompressible payload (xorshift bytes) still round-trips — exercises
  the spec-mandated "fall back to original chunk" code path through
  the public reader.
- Tiny block size (4 KiB) over a multi-megabyte stream forces multiple
  compression chunks per stream and verifies the round-trip.
- API hardening: oversize block sizes are clamped under the 23-bit
  spec ceiling; zero falls back to the 256 KiB default.
- Multi-stripe writes with compression round-trip cleanly.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Adds a Criterion benchmark comparing on-disk size and write time for
None / Snappy / ZLIB (levels 1, 6, 9) / ZSTD (levels 1, 3, 9, 19)
on a 10 000-row Int64 + Utf8 batch — the workload representative of
production Hive / Trino tables.

Each codec's resulting file size is printed alongside the benchmark
so reviewers can sanity-check the size / speed trade-off without
re-running the bench themselves. Sample numbers from a single run
on Apple-silicon (debug rustc 1.95):

  codec    output_bytes  ratio   time
  none           246698  1.30x   110 us
  snappy          59346  5.39x   293 us
  zlib_1          34318  9.32x   375 us
  zlib_6          32461  9.86x   3.4 ms
  zlib_9          32461  9.86x   11.8 ms
  zstd_1          12538  25.5x   264 us
  zstd_3           8834  36.2x   314 us
  zstd_9          12981  24.6x   1.2 ms
  zstd_19          4823  66.4x   81 ms

(zstd_3 and zstd_1 dominate the speed / ratio Pareto frontier on this
workload; this matches Java ORC's choice of zstd level 3 as the
default `orc.compress.zstd.level`.)

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

Cross-validate the writer-side compression work in this PR against
the Java reference implementation (Apache ORC 1.9.5 `orc-tools`):

- Rust writer → Java reader for all 3 codecs: `orc-tools meta`
  parses the PostScript, reports `Compression: SNAPPY/ZLIB/ZSTD`
  correctly, row count matches. `orc-tools data` decompresses and
  decodes every stripe, emitting the exact rows we wrote.
- Java writer → Rust reader: a file produced by `orc-tools convert`
  (default ZLIB) round-trips through the Rust reader with
  byte-identical column values.

The tests are `#[ignore]`d and gated on the `ORC_TOOLS_JAR`
environment variable, so they don't break CI on machines without a
JDK. Run with:

    export ORC_TOOLS_JAR=/path/to/orc-tools-<version>-uber.jar
    cargo test --test java_interop -- --ignored

PR_BODY.md updated with the new evidence section and a runnable
reviewer recipe.

Signed-off-by: Youichi Uda <youichi.uda@gmail.com>

- src/row_index.rs: drop redundant `.into_iter()` in `zip`
  (useless_conversion)
- src/encoding/integer/rle_v2/delta.rs: rewrite two manual counter
  loops with `.enumerate()` (explicit_counter_loop)
- tests/java_interop.rs: take `&Path` instead of `&PathBuf` in
  `run_meta` / `run_data` / `write_orc` (ptr_arg)

Required to keep CI green on Rust 1.95.0 with `-D warnings`.

Copilot AI left a comment


Pull request overview

Adds ORC writer-side per-chunk compression support (SNAPPY/ZLIB/ZSTD) to match the ORC v1 spec framing and align output with common Java ORC ecosystem defaults, including PostScript codec metadata and configurable block size.

Changes:

  • Introduces a new writer compression module implementing ORC chunk framing + codec payload encoding with original-fallback behavior.
  • Wires compression through ArrowWriterBuilder/ArrowWriter and StripeWriter for all streams, stripe footers, and file footer (PostScript remains uncompressed).
  • Adds extensive integration/unit tests (including optional Java orc-tools interop) and a Criterion benchmark for codec comparisons.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/writer_compression.rs: End-to-end compression round-trip + PostScript assertions + API hardening tests
  • tests/java_interop.rs: Optional (#[ignore]) Java orc-tools cross-implementation validation tests
  • src/writer/stripe.rs: Applies optional compression to streams and stripe footer; introduces StripeCompression
  • src/writer/mod.rs: Registers the new writer compression module
  • src/writer/compression.rs: Implements ORC chunk framing and SNAPPY/ZLIB/ZSTD encoding + unit tests
  • src/row_index.rs: Minor iteration change (clippy-related)
  • src/error.rs: Adds writer-side encode error variants for snappy/zstd
  • src/encoding/integer/rle_v2/delta.rs: Test loop refactor (clippy-related)
  • src/arrow_writer.rs: Public compression API on builder; compresses file footer; writes PostScript compression metadata
  • Cargo.toml: Adds writer_compression benchmark target
  • benches/writer_compression.rs: Criterion benchmark comparing codecs and output size
  • .github/PR_BODY.md: Adds a detailed PR body template/documentation for this feature


Comment thread src/writer/stripe.rs
Comment on lines +164 to +172
let bytes_to_write = match self.compression {
    Some(StripeCompression {
        compression,
        block_size,
    }) => compress_stream(compression, block_size, &bytes)?,
    None => bytes.to_vec(),
};
let length = bytes_to_write.len();
self.writer.write_all(&bytes_to_write).context(IoSnafu)?;
Comment thread src/writer/stripe.rs
Comment on lines +46 to +49
    pub compression: Compression,
    pub block_size: usize,
}

Comment thread src/writer/compression.rs
Comment on lines +59 to +62
/// Levels are clamped to each codec's valid range. ZSTD accepts negative
/// levels (faster than level 1, lower ratio) and levels above 19 require
/// the `zstd-long` distant-match window.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
Comment thread src/writer/compression.rs
Comment on lines +392 to +415
let chunk_lens: Vec<usize> = chunks
    .iter()
    .map(|(_, original, body)| if *original { body.len() } else { 1024 })
    .collect();
// We can't easily assert decoded chunk size when compressed
// (snappy may have shrunk one), but every chunk's *original*
// size must fit in [1, 1024]. Verify boundary: the first 4
// chunks each represented exactly 1024 bytes of input.
for (i, expected_input_size) in [1024, 1024, 1024, 1024, 904].iter().enumerate() {
    let (_, original, body) = chunks[i];
    let input_size = if original {
        body.len()
    } else {
        snap::raw::Decoder::new()
            .decompress_vec(body)
            .unwrap()
            .len()
    };
    assert_eq!(
        input_size, *expected_input_size,
        "chunk {i} wraps {expected_input_size} input bytes"
    );
    let _ = chunk_lens; // silence unused
}