[VL] Add lazy per-column deserialization for Columnar Table Cache #12211
jackylee-ch wants to merge 1 commit
@yaooqinn PTAL

Thanks @jackylee-ch, V3 layout is a sensible extension of the cache-stats wire we landed in #12092 / #12196. Several things to discuss before this lands:
1. Benchmark needs to be re-run. The checked-in […]
2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry […]
3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via […]
4. Smaller items.

Happy to file any of these as separate issues if it helps.
Pull request overview
This PR updates the Velox-backed Spark table cache to default to a new V3 framed wire format that stores per-column serialized payloads, enabling lazy materialization (projected native deserialization) while keeping partition stats as an optional pruning payload.
Changes:
- Add V3 per-column framed cache serialization (with and without stats) plus native projected deserialization path.
- Update cache filtering/pruning behavior and tests to reflect new lazy/V3 default and revised float NaN stats semantics.
- Add cross-language golden framing tests, end-to-end lazy serde tests, and a benchmark to compare V2 vs V3 (with/without stats).
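The per-column framing idea in the first bullet can be illustrated with a minimal sketch. The layout used here (a 4-byte magic, a `statsLen` field, an optional stats blob, a column count, per-column lengths, then the column payloads) and the names `MAGIC`, `write_frame`, and `read_projected` are assumptions made for illustration only; the actual V3 wire format is defined in `VeloxColumnarBatchSerializer.cc` and may differ.

```python
import struct

MAGIC = b"GCV3"  # hypothetical magic bytes, not the real Gluten value


def write_frame(columns, stats=b""):
    """Build: magic | statsLen | stats | numCols | colLen[i]... | column payloads."""
    header = MAGIC + struct.pack("<I", len(stats)) + stats
    header += struct.pack("<I", len(columns))
    header += b"".join(struct.pack("<I", len(c)) for c in columns)
    return header + b"".join(columns)


def read_projected(frame, wanted):
    """Return {column index: payload bytes} for the requested columns only."""
    if frame[:4] != MAGIC:
        raise ValueError("not a V3-style frame")
    pos = 4
    (stats_len,) = struct.unpack_from("<I", frame, pos)
    pos += 4 + stats_len  # skip the optional stats blob (statsLen may be 0)
    (num_cols,) = struct.unpack_from("<I", frame, pos)
    pos += 4
    lengths = struct.unpack_from("<%dI" % num_cols, frame, pos)
    pos += 4 * num_cols
    out = {}
    for idx, length in enumerate(lengths):
        if idx in wanted:
            out[idx] = frame[pos:pos + length]  # slice only projected columns
        pos += length  # unprojected columns are skipped by offset
    return out
```

Because each column carries its own length, a reader can step over unprojected columns without decoding them, which is the property the projected native deserialization path relies on.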
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| gluten-ut/spark41/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 4.1 tests. |
| gluten-ut/spark40/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 4.0 tests. |
| gluten-ut/spark35/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.5 tests. |
| gluten-ut/spark34/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.4 tests. |
| gluten-ut/spark33/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.3 tests. |
| gluten-substrait/src/main/scala/org/apache/gluten/config/GlutenConfig.scala | Update config description to reflect V3 lazy-by-default behavior and stats as optional payload. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchSerializerJniWrapper.java | Add JNI APIs for V3 serialize (no-stats / with-stats) and projected deserialization. |
| docs/Configuration.md | Update rendered config table/description for table cache partition stats behavior with V3 default. |
| cpp/velox/tests/VeloxColumnarBatchSerializerTest.cc | Extend native tests: NaN stats semantics, V3 golden frames, V3 writer layout + projected/lazy read fixtures. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.h | Define V3 framed serialization/deserialization interfaces and projection-aware V3 deserializer. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc | Implement V3 per-column framing, lazy column loaders, projected V3 deserialization, and refine stats logic. |
| cpp/core/operators/serializer/ColumnarBatchSerializer.h | Add V3 virtual hooks and default behavior (unsupported backends return empty/throw). |
| cpp/core/jni/JniWrapper.cc | Add JNI entry points for V3 serialization and projection-aware V3 deserialization; refactor byte[] creation helper. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchSerializerHelperSuite.scala | Add unit coverage for V3 capability latch and fallback behaviors. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchLazySerdeTest.scala | New E2E tests validating V3 lazy cache bytes, projected reads, pruning, and compatibility behaviors. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchFramedBytesSuite.scala | Add JVM-side V3 frame parsing tests + cross-language V3 golden fixtures. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchE2ESuite.scala | Update/add E2E tests for NaN behavior and V3 no-stats default path correctness. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchBuildFilterPruneSuite.scala | Add regression tests for null-bound referenced-column pruning bypass. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/benchmark/ColumnarTableCacheLazyDeserBenchmark.scala | New benchmark comparing V2 vs V3 (with/without stats), including cache footprint + read phases. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/ColumnarCachedBatchSerializer.scala | Add V3 magic parsing/routing, V3 serialization path, and V3 projected native deserialization integration. |
Write V3 per-column cache bytes by default for Velox table cache.

Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes.

Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results.

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
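The write-path routing described in this commit message can be summarized as a small decision function. This is a sketch under stated assumptions: the function name, the boolean capability flag, and the returned labels are hypothetical, and the pairing of "stats on" with the V2 fallback and "stats off" with legacy bytes is one reading of the message, not a confirmed detail of `ColumnarCachedBatchSerializer`.

```python
def choose_write_format(native_supports_v3: bool, stats_enabled: bool) -> str:
    """Pick a cache wire format label (labels are illustrative, not real constants)."""
    if native_supports_v3:
        # V3 is the default; stats only add the optional pruning payload.
        return "V3_WITH_STATS" if stats_enabled else "V3_NO_STATS"
    # Assumed fallback for older native libraries without V3 support.
    return "V2_WITH_STATS" if stats_enabled else "LEGACY"
```

Keeping the choice in one place makes the fallback behavior easy to cover with a table-driven test over the four flag combinations.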
cc @yaooqinn PTAL
What changes
This PR makes Velox table cache write V3 per-column framed bytes by default. Lazy materialization is a base table-cache capability; `spark.gluten.sql.columnar.tableCache.partitionStats.enabled` now only controls the optional stats/pruning payload.
- […] `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled`.
- […] (`statsLen=0`) for the default lazy path.

Performance
Four-environment comparison: eager `V2` vs lazy `V3`, each without and with the optional partition-stats payload (`ColumnarTableCacheLazyDeserBenchmark`):
- `V2 without stats` = legacy raw Presto payload (eager full-batch decode, no pruning).
- `V2 with stats` = `framedSerializeWithStats` (eager full-batch decode + partition-stats pruning).
- `V3 without stats` = per-column lazy payload (default; lazy projected decode).
- `V3 with stats` = per-column lazy payload + partition-stats pruning.

100M rows / 32 partitions / 16 columns / 3 iterations, Apple M5 Pro, JDK 8 runtime, real Gluten (off-heap enabled, `ColumnarCachedBatchSerializer`). Read phases build one mode's cache at a time so the full 100M fits. Times are avg ms, lower is better; `relative` is vs `V2 without stats`.

Cache footprint (storage memory)
Footprint is identical across all four modes: V3 per-column framing does not regress cache size for flat data, and the stats payload is negligible.
Read latency (avg ms / relative speedup vs V2 no-stats)
- […] `sum(c0)` […] 1 of 16 columns and 3.5x faster reading 4 of 16, versus eager V2, which decodes all 16.
- […] additionally lazy-decodes only the surviving batches' projected columns, giving the best result at 136x (`V3 with stats`). Lazy column-skip alone (`V3 no-stats`) is 6.8x.
- […] on par with / slightly faster than V2 (V3 ~1.3x at 2M), confirming `LazyVector` adds no overhead when every column is materialized. It is omitted from the 100M table because the eager-V2 path decodes the full 100M x 16 off-heap and does not fit this 64 GiB laptop.
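The combination of partition-stats pruning and lazy decode mentioned in the bullets above can be pictured with a small min/max pruning sketch. Everything here is illustrative: `BatchStats`, `may_match_greater_than`, and `surviving_batches` are hypothetical names, the filter is a simple `col > threshold`, and treating missing or null bounds as "cannot prune" is an assumption about the null-bound bypass covered by the PR's regression tests.

```python
from typing import List, NamedTuple, Optional


class BatchStats(NamedTuple):
    lower: Optional[int]  # min of the referenced column, None if unknown
    upper: Optional[int]  # max of the referenced column, None if unknown


def may_match_greater_than(stats: Optional[BatchStats], threshold: int) -> bool:
    """Conservative check for `col > threshold` against a batch's bounds."""
    if stats is None or stats.lower is None or stats.upper is None:
        return True  # no usable bounds: keep the batch rather than risk dropping rows
    return stats.upper > threshold


def surviving_batches(all_stats: List[Optional[BatchStats]], threshold: int) -> List[int]:
    """Indices of cached batches that still need (lazy, projected) decoding."""
    return [i for i, s in enumerate(all_stats) if may_match_greater_than(s, threshold)]
```

Only the batches returned by `surviving_batches` would then have their projected columns decoded, so pruning and column skipping reduce work along two independent axes.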
Net: V3 lazy per-column is a large win on projected/filtered reads (the common table-cache access pattern) with identical cache footprint and no full-scan regression.

A GitHub Actions run on a larger-RAM runner can reproduce the same 100M comparison via the `Velox Backend (x86)` `workflow_dispatch` benchmark job.

How was this patch tested?
- `./dev/format-scala-code.sh`
- `PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh`
- `git diff --check upstream/main..HEAD`
- `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'`
- `./.github/workflows/util/check.sh upstream/main`
- `env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o`
- `./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip`
- `ColumnarTableCacheLazyDeserBenchmark` with `1000` rows, `4` partitions, `1` iteration, phases `build,read1,read4,readAll,filter`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex GPT-5