[GLUTEN-12187][VL] Port AttachDistributedSequenceExec to Velox backend by baibaichen · Pull Request #12188 · apache/gluten

baibaichen · 2026-05-29T06:15:50Z

What changes were proposed in this pull request?

Adds a Velox implementation of Spark's AttachDistributedSequenceExec, which prepends a contiguous Long id column to the child output. The operator is used by the pandas-on-Spark distributed-sequence index and by DataFrame.zipWithIndex.

How is this implemented?

Plan-level

New abstract base ColumnarAttachDistributedSequenceBaseExec in gluten-substrait/ with factory from(plan) delegating to the backend API.
New offload rule case + validator gate in OffloadSingleNodeRules / Validators.
New backend hook genColumnarAttachDistributedSequenceExec on SparkPlanExecApi. Velox override returns the columnar impl; the CH override throws GlutenNotSupportException until that backend is ported.
Config: spark.gluten.sql.columnar.attachDistributedSequence (default true) lets users disable the offload.
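
As a rough illustration of the plan-level wiring above, the sketch below models the factory delegating to a per-backend hook. Only the hook name genColumnarAttachDistributedSequenceExec comes from this PR; Plan, PlanExecApi, NotSupportedException, and the offloadEnabled parameter are hypothetical stand-ins, not the real Gluten types.

```scala
object PlanLevelSketch {
  // Hypothetical stand-in for a SparkPlan node.
  final case class Plan(name: String)

  // Hypothetical stand-in for GlutenNotSupportException.
  class NotSupportedException(msg: String) extends RuntimeException(msg)

  // Backend hook: each backend decides how (or whether) to build the columnar operator.
  trait PlanExecApi {
    def genColumnarAttachDistributedSequenceExec(plan: Plan): Plan
  }

  object VeloxApi extends PlanExecApi {
    def genColumnarAttachDistributedSequenceExec(plan: Plan): Plan =
      Plan(s"Columnar${plan.name}")
  }

  object ClickHouseApi extends PlanExecApi {
    def genColumnarAttachDistributedSequenceExec(plan: Plan): Plan =
      throw new NotSupportedException(s"${plan.name} is not ported to this backend")
  }

  // Factory: delegates to the backend, and leaves the plan alone when the
  // (assumed) config switch turns the offload off.
  def from(plan: Plan, api: PlanExecApi, offloadEnabled: Boolean): Plan =
    if (offloadEnabled) api.genColumnarAttachDistributedSequenceExec(plan) else plan
}
```

In this model the ClickHouse hook is only reached when the offload is attempted, which is why a rule that calls the factory unconditionally would surface the exception.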

Velox runtime (ColumnarAttachDistributedSequenceExec)

For >1 partition:

Materialize the child output once via Gluten's existing ColumnarCachedBatchSerializer, persisted at MEMORY_AND_DISK_SER. The cache blob is Velox-native serialization (CachedColumnarBatch) — kryo-friendly and typically much more compact than unsafe-row SER.
Count pass: read CachedColumnarBatch.numRows of partitions [0, numPartitions - 1) — no native deserialization.
Assign pass: convertCachedBatchToColumnarBatch → Velox-native batch → ColumnarBatches.load (zero-copy Arrow C-Data ABI handoff) → prepend one ArrowWritableColumnVector with the id column.

Single-partition queries skip caching entirely (startOffset = 0).
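
The count and assign passes can be sketched in plain Scala, under the assumption that each partition is modeled as a sequence of per-batch row counts (the real operator reads CachedColumnarBatch.numRows instead). The names TwoPassSketch, startOffsets, and assignIds here are illustrative only.

```scala
object TwoPassSketch {
  // Each inner Seq is one partition; each Int is the row count of one cached batch.
  def startOffsets(batchRowsPerPartition: Seq[Seq[Int]]): Seq[Long] = {
    // Count pass: only partitions [0, numPartitions - 1) are needed, because the
    // last partition's size never contributes to any start offset.
    val counts = batchRowsPerPartition.dropRight(1).map(_.map(_.toLong).sum)
    // Running sum: partition i starts where partitions 0..i-1 end.
    counts.scanLeft(0L)(_ + _)
  }

  // Assign pass: ids for every batch of one partition, given that partition's start offset.
  def assignIds(partitionStart: Long, batchRows: Seq[Int]): Seq[Seq[Long]] = {
    var next = partitionStart
    batchRows.map { n =>
      val ids = (0 until n).map(next + _)
      next += n
      ids
    }
  }
}
```

With a single partition, startOffsets returns just the offset 0, which corresponds to the single-partition shortcut described above.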

Memory hygiene

The base class exposes a doColumnarCleanup() hook called from cleanupResources(). The Velox impl uses it to unpersist the cached RDD when the query finishes, so BlockManager does not hold the serialized batches beyond the operator's lifetime.
The persisted RDD is cached behind a synchronized accessor so repeated doExecuteColumnar() calls share a single persist() handle.
assignIds wraps the per-batch build in a try/catch that closes the freshly-loaded heavy batch on failure, so a mid-build OOM (e.g. while allocating the id vector) cannot leak Arrow buffers.
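
The two patterns above (a shared, lock-guarded handle and close-on-failure) might look roughly like the following. This is a simplified sketch: HeavyBatch, buildOrClose, and PersistedHandle are hypothetical names, and a plain value stands in for the persisted RDD.

```scala
object MemoryHygieneSketch {
  // Hypothetical stand-in for an off-heap batch that must be closed explicitly.
  final class HeavyBatch extends AutoCloseable {
    var closed = false
    override def close(): Unit = closed = true
  }

  // Close the freshly loaded batch if building the output fails, then rethrow.
  def buildOrClose[T](batch: HeavyBatch)(build: HeavyBatch => T): T =
    try build(batch)
    catch {
      case t: Throwable =>
        batch.close()
        throw t
    }

  // Share one handle across repeated calls, guarded by a lock.
  final class PersistedHandle[T](create: () => T) {
    private var handle: Option[T] = None

    def get: T = synchronized {
      if (handle.isEmpty) handle = Some(create())
      handle.get
    }

    // Mirrors the cleanup hook: release the handle once the query is done.
    def cleanup(release: T => Unit): Unit = synchronized {
      handle.foreach(release)
      handle = None
    }
  }
}
```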

Known overhead: cache write/read crosses the heap boundary

Per batch, the cache path does two full data copies:

Write: JNI serialize copies the off-heap Velox batch into an on-heap Array[Byte] (CachedColumnarBatch.bytes).
Read: JNI deserialize copies the bytes back into a fresh off-heap Velox batch.

This is the price for zipWithIndex's two-pass semantics (count + assign) without re-executing the child plan. The Velox↔Arrow hand-offs elsewhere in the operator are zero-copy ABI transfers and not relevant to this cost.

Alternative considered

We considered the row-mode pipeline Velox → C2R → RDD[InternalRow] cached → R2C. For Gluten that costs a full C2R/R2C transition on every row, and the unsafe-row serialized cache is typically 2–5× larger than Velox-native serialization for wide / nested data. The columnar path keeps everything columnar and pays serialization only once.

How was this patch tested?

New VeloxAttachDistributedSequenceExecSuite in backends-velox.

Does this PR introduce any user-facing change?

Yes — a new config:

spark.gluten.sql.columnar.attachDistributedSequence (default true).

When enabled, df.zipWithIndex and pandas-on-Spark distributed-sequence index materialize the id column columnarly on Velox instead of falling back.
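
For reference, turning the offload off could look like the snippet below. It assumes Gluten is already configured for the session and that the key can be supplied through the session builder like other Spark configs; only the config key itself is taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("attach-distributed-sequence-off")
  // Assumption: the Gluten plugin and backend are configured elsewhere.
  .config("spark.gluten.sql.columnar.attachDistributedSequence", "false")
  .getOrCreate()
```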

github-actions · 2026-05-29T06:16:21Z

Run Gluten Clickhouse CI on x86

baibaichen · 2026-06-02T04:30:36Z

@zhztheplayer would you please review this PR?

Adds a Velox implementation of Spark's AttachDistributedSequenceExec that prepends a contiguous, globally increasing Long id column to its child output. Used by pandas-on-Spark distributed-sequence index and DataFrame.zipWithIndex.

Plan-level

- New abstract base ColumnarAttachDistributedSequenceBaseExec in gluten-substrait/, with factory from(plan) delegating to the backend API and a doColumnarCleanup() hook called from cleanupResources().
- New offload rule case + validator gate in OffloadSingleNodeRules / Validators.
- New backend hook genColumnarAttachDistributedSequenceExec on SparkPlanExecApi. Velox override returns the columnar impl; CH override throws GlutenNotSupportException until that backend is ported.
- Config spark.gluten.sql.columnar.attachDistributedSequence (default true) lets users disable the offload.

Velox runtime

- For >1 partition, materialize the child output once via Gluten's existing ColumnarCachedBatchSerializer, persisted at MEMORY_AND_DISK_SER. The cache blob is Velox-native serialization (CachedColumnarBatch), much more compact than unsafe-row SER for wide / nested data.
- Count pass reads CachedColumnarBatch.numRows for partitions [0, numPartitions - 1) -- no native deserialization required.
- Assign pass: convertCachedBatchToColumnarBatch -> ColumnarBatches.load (zero-copy Arrow C-Data ABI handoff) -> prepend one ArrowWritableColumnVector with the id column.
- Single-partition queries skip caching entirely.

Memory hygiene

- doColumnarCleanup() unpersists the cached RDD when the query finishes so BlockManager does not hold the serialized batches beyond the operator's lifetime.
- The persisted RDD is cached behind a synchronized accessor so repeated doExecuteColumnar() calls share a single persist() handle.
- assignIds wraps the per-batch build in a try/catch that closes the freshly-loaded heavy batch on failure, preventing Arrow buffer leaks on mid-build OOM.

Tests

- New VeloxAttachDistributedSequenceExecSuite in backends-velox.

Closes apache#12187

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tedSequence config

The original implementation persisted the child's columnar output via ColumnarCachedBatchSerializer to avoid re-executing the child plan twice. That path fails on zero-column batches that can result from column pruning (e.g. df.select("id") prunes every input column away): ensureVeloxBatch -> isVeloxBatch -> getIndicatorVector throws because the batch is neither LIGHT nor HEAVY.

Switch to the simpler vanilla-Spark-style approach (matches the pandas-on-Spark cache="NONE" option): run the child once to count rows per partition for [0, numPartitions - 1), then run it again to attach the id column. One extra child execution; full robustness across arbitrary projections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
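
A minimal model of the run-twice approach described in this commit message, assuming the child plan is represented by a hypothetical Child class that returns per-batch row counts for each partition. Because only row counts are consulted, a batch with zero columns is handled the same way as any other batch in this model.

```scala
object RecomputeSketch {
  // Hypothetical child plan: every execute() call re-runs it and yields,
  // per partition, the row count of each batch.
  final class Child(data: Seq[Seq[Int]]) {
    var executions = 0
    def execute(): Seq[Seq[Int]] = { executions += 1; data }
  }

  def attachIds(child: Child): Seq[Seq[Long]] = {
    // First run: count rows for partitions [0, numPartitions - 1) and turn
    // the counts into per-partition start offsets.
    val starts = child.execute().dropRight(1).map(_.sum.toLong).scanLeft(0L)(_ + _)
    // Second run: walk each partition again and hand out ids from its start offset.
    child.execute().zip(starts).map { case (batches, start) =>
      (0L until batches.sum.toLong).map(start + _)
    }
  }
}
```

The executions counter makes the trade-off visible: the child runs exactly twice, and nothing is cached between the two runs.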

…t batch

Round-3 (drop-cache) made our output batch flow through OffloadArrowDataExec when a downstream Velox consumer (e.g. shuffle after repartition) follows our op. That path calls ColumnarBatches.offload -> getRefCntHeavy, which asserts every column in the heavy batch shares the same reference count. Our previous output mixed a freshly allocated id column (refCnt=1) with retain'd input columns (refCnt=2), tripping the assertion -- which surfaced as the SPARK-36338 regression in the inherited GlutenDataFrameSuite.

Allocate fresh ArrowWritableColumnVectors for every output column and copy input values per row via ValueVector.copyFromSafe. All output columns then have refCnt=1, the input batch is untouched, and the offload path works regardless of which transition the planner inserts.

Test additions:

- New 'output survives a downstream Velox shuffle (offload path)' test reproduces the bug locally (repartition after attach).
- Set spark.sql.ansi.enabled=false in suite sparkConf so the columnar exec is actually selected under Spark 4.x where ANSI is default true (Gluten falls back the whole plan otherwise).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
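
The reference-count mismatch can be illustrated with a toy model. Column, sharedRefCnt, attachByRetain, and attachByCopy are hypothetical names; sharedRefCnt only imitates the kind of check the commit message attributes to getRefCntHeavy and is not the real implementation.

```scala
object RefCntSketch {
  // Hypothetical column with a reference count, standing in for an Arrow-backed vector.
  final class Column(val values: Vector[Long]) {
    var refCnt = 1
    def retain(): Column = { refCnt += 1; this }
  }

  // Stand-in for the offload-path assertion: all columns must share one reference count.
  def sharedRefCnt(columns: Seq[Column]): Int = {
    val counts = columns.map(_.refCnt).distinct
    require(counts.size == 1, s"mixed reference counts: $counts")
    counts.head
  }

  // Earlier output shape: a fresh id column (refCnt 1) next to retained inputs (refCnt 2).
  def attachByRetain(id: Column, input: Seq[Column]): Seq[Column] =
    id +: input.map(_.retain())

  // Fixed output shape: every column is freshly allocated and filled by copying values.
  def attachByCopy(id: Column, input: Seq[Column]): Seq[Column] =
    id +: input.map(c => new Column(c.values))
}
```

In this model, copying costs one pass over the input values but leaves the input columns' reference counts unchanged.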

…support

The new case in OffloadSingleNodeRules unconditionally called ColumnarAttachDistributedSequenceBaseExec.from(plan) for every backend, which triggers CHSparkPlanExecApi.genColumnarAttachDistributedSequenceExec to throw GlutenNotSupportException on the ClickHouse backend during the inherited SPARK-36338 test.

Add a backend-level flag supportColumnarAttachDistributedSequenceExec (default false in BackendSettingsApi, true for Velox). Only attempt the offload when the active backend opts in. CH plans now stay as vanilla AttachDistributedSequenceExec and execute via Spark's row-based path (C2R/R2C), restoring the previous CH behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
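
A simplified sketch of the opt-in gate, assuming a stripped-down settings trait. Only the flag name supportColumnarAttachDistributedSequenceExec is taken from the commit message; BackendSettings, the two settings objects, and offload are hypothetical.

```scala
object BackendFlagSketch {
  // Hypothetical, reduced settings trait: backends are opted out by default.
  trait BackendSettings {
    def supportColumnarAttachDistributedSequenceExec: Boolean = false
  }

  object VeloxSettings extends BackendSettings {
    override def supportColumnarAttachDistributedSequenceExec: Boolean = true
  }

  // Inherits the default, so the offload is never attempted for this backend.
  object ClickHouseSettings extends BackendSettings

  // Offload only when the backend opts in; otherwise keep the vanilla operator.
  def offload(planName: String, settings: BackendSettings): String =
    if (settings.supportColumnarAttachDistributedSequenceExec) s"Columnar$planName"
    else planName
}
```

Defaulting the flag to false means a backend that has not been ported never reaches the throwing hook.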

philo-he

Looks good overall. Could you check whether any of Spark's own related test suites are enabled in Gluten? Thanks.

philo-he · 2026-06-02T07:27:49Z

+   * materialized during execution. Called from [[cleanupResources]] after children have been
+   * cleaned up. The default implementation is a no-op.
+   */
+  protected def doColumnarCleanup(): Unit = {}


Do we still need this? It seems to be useless.

philo-he · 2026-06-02T07:37:28Z

+    super.sparkConf
+      .set("spark.sql.shuffle.partitions", "3")
+      .set("spark.default.parallelism", "3")
+      .set(SQLConf.ANSI_ENABLED.key, "false")


Can we also set Spark master "local[3]" explicitly?

It seems that Gluten columnar shuffle is not set. Do we need to set it in the base class WholeStageTransformerSuite?

github-actions Bot added CORE works for Gluten Core VELOX CLICKHOUSE labels May 29, 2026

baibaichen force-pushed the snt/attach-dist-seq branch from 0d5f602 to a50a94a Compare May 29, 2026 06:16

baibaichen force-pushed the snt/attach-dist-seq branch from a50a94a to fe3bf16 Compare May 29, 2026 06:17

github-actions Bot added the DOCS label May 29, 2026

baibaichen requested review from ArnavBalyan and philo-he June 1, 2026 02:12

baibaichen and others added 5 commits June 2, 2026 15:31

[GLUTEN-12187][VL] Regenerate Configuration.md for new attachDistribu…

5cf4d9a

…tedSequence config

baibaichen force-pushed the snt/attach-dist-seq branch from 5350118 to d0eff70 Compare June 2, 2026 07:31

philo-he reviewed Jun 2, 2026
