
perf: avoid JVM shuffle when sandwiched between non-Comet operators [WIP]#4010

Draft
andygrove wants to merge 7 commits into apache:main from andygrove:avoid-jvm-shuffle-for-jvm-parent

Conversation

@andygrove
Member

@andygrove andygrove commented Apr 20, 2026

Which issue does this PR close?

Closes #4004.

Rationale for this change

In TPC-DS plans such as q1, when the partial/final hash aggregates cannot be converted to Comet (e.g. due to unsupported aggregate expressions or types), the shuffle between them is still wrapped as CometColumnarExchange (columnar shuffle). With a JVM operator on both sides of the shuffle, this adds a row -> arrow -> shuffle -> arrow -> row round trip, with no Comet operator on either side able to consume columnar output:

HashAggregate
  +- CometNativeColumnarToRow
     +- CometColumnarExchange
        +- HashAggregate

The extra conversion is pure overhead compared to a vanilla Spark row-based shuffle.

What changes are included in this PR?

  • Added a post-transform pass revertRedundantColumnarShuffle in CometExecRule that detects CometShuffleExchangeExec with CometColumnarShuffle whose child is not a Comet plan and whose parent is not a Comet plan, and reverts it to the original Spark ShuffleExchangeExec (preserved in the originalPlan field).
  • The pass runs only in the COMET_EXEC_ENABLED=true branch of _apply, so users running with COMET_EXEC_ENABLED=false (shuffle-only mode) are unaffected.
  • Regenerated TPC-DS plan-stability golden files for Spark 3.4, 3.5, and 4.0 to reflect the new pattern.

After the fix, the q1 pattern above becomes:

HashAggregate
  +- Exchange
     +- HashAggregate
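The detect-and-revert logic can be sketched on a toy plan tree. All node types below are simplified stand-ins for the real Spark/Comet operators (they are not the actual classes), and `originalPlan` mirrors the field mentioned above that preserves the pre-conversion Spark exchange:

```scala
// Stand-in plan nodes; NOT the real Spark/Comet classes.
sealed trait Plan
case object Leaf extends Plan
case class HashAggregate(child: Plan) extends Plan
case class Exchange(child: Plan) extends Plan // row-based Spark shuffle
// Comet columnar shuffle that preserves the Spark exchange it replaced.
case class CometColumnarExchange(child: Plan, originalPlan: Plan) extends Plan

// When a columnar shuffle sits between two JVM aggregates, restore the
// original row-based exchange; otherwise recurse into the child.
def revertRedundantColumnarShuffle(plan: Plan): Plan = plan match {
  case HashAggregate(CometColumnarExchange(HashAggregate(_), original)) =>
    HashAggregate(original)
  case HashAggregate(c) => HashAggregate(revertRedundantColumnarShuffle(c))
  case other => other
}
```

Applied to the q1-style pattern, the pass turns HashAggregate over CometColumnarExchange over HashAggregate into HashAggregate over Exchange over HashAggregate, matching the trees above.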

How are these changes tested?

  • New test `CometExecRule should not wrap shuffle in CometColumnarShuffle when both sides are JVM` in CometExecRuleSuite disables partial hash aggregate to force both aggregates to stay JVM, then asserts that the shuffle remains a plain ShuffleExchangeExec.
  • Existing CometShuffleSuite, CometExecSuite, CometShuffleFallbackStickinessSuite, and both TPC-DS plan-stability suites (v1.4 and v2.7) continue to pass against the regenerated golden files.

Replace the separate STAGE_FALLBACK_TAG with explain-info-based stickiness.
Shuffle path checks (`nativeShuffleFailureReasons`, `columnarShuffleFailureReasons`)
are now pure and return reasons instead of tagging eagerly. A new
`shuffleSupported` coordinator short-circuits on `hasExplainInfo`, tries native
then columnar, and tags via `withInfos` only on total failure. DPP fallback,
which disqualifies both paths, moves into the coordinator. This removes the
need for `CometFallback` and eliminates the semantic split where `withInfo`
could fire for a path-specific failure while the node still converted via a
different path.
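The coordinator pattern described above can be illustrated with a self-contained sketch. All names and failure reasons here are illustrative stand-ins, not the actual Comet API: checks are pure and return reasons, the coordinator short-circuits on existing explain info, tries one path then the other, and tags only when every path fails.

```scala
// Stand-in for a plan node carrying extended-explain fallback info.
case class Node(var explainInfo: Seq[String] = Nil) {
  def hasExplainInfo: Boolean = explainInfo.nonEmpty
  def withInfos(reasons: Seq[String]): Unit = explainInfo = explainInfo ++ reasons
}

// Pure checks: return failure reasons instead of tagging eagerly.
def nativeShuffleFailureReasons(supported: Boolean): Seq[String] =
  if (supported) Nil else Seq("unsupported native shuffle partitioning")
def columnarShuffleFailureReasons(supported: Boolean): Seq[String] =
  if (supported) Nil else Seq("unsupported columnar shuffle type")

def shuffleSupported(node: Node, nativeOk: Boolean, columnarOk: Boolean): Boolean = {
  if (node.hasExplainInfo) return false          // short-circuit: already failed
  val nativeReasons = nativeShuffleFailureReasons(nativeOk)
  if (nativeReasons.isEmpty) return true         // native path works
  val columnarReasons = columnarShuffleFailureReasons(columnarOk)
  if (columnarReasons.isEmpty) return true       // columnar path works
  node.withInfos(nativeReasons ++ columnarReasons) // tag only on total failure
  false
}
```

Because tagging happens in one place, a path-specific failure can no longer leave explain info on a node that ultimately converts via the other path.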
…llup

Expand the doc comments on withInfo/withInfos/hasExplainInfo to make clear
that these record fallback reasons surfaced in extended explain output,
and that any call to withInfo is a signal that the node falls back to
Spark.

Also restore the child-expression rollup for native range-partitioning
sort orders that was lost in the earlier refactor: when exprToProto fails
on a sort-order expression, its own fallback reasons (e.g. strict
floating-point sort) are now copied onto the shuffle's reasons so they
surface alongside 'unsupported range partitioning sort order'.
When a CometShuffleExchangeExec with CometColumnarShuffle has a non-Comet
child and a non-Comet parent, the columnar shuffle only adds
row->arrow->shuffle->arrow->row conversion overhead with no Comet operator
on either side to consume columnar output. Revert such shuffles to the
original Spark ShuffleExchangeExec after the main transform pass.

Closes apache#4004
…-jvm-parent

# Conflicts:
#	spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala
The broader match that checked any non-Comet parent broke object-mode
Dataset plans in CometIcebergNativeSuite (DeserializeToObject around a
CometColumnarExchange over encoder nodes). A CometNativeColumnarToRowExec
elsewhere in the plan had its assertion child.supportsColumnar violated
when transform bubbled up the new row-based Exchange.

Restrict the match to the exact reported pattern: HashAggregateExec or
ObjectHashAggregateExec on both sides of the shuffle. Golden TPC-DS plans
are unchanged by this narrowing.
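A minimal sketch of the narrowed guard (stand-in classes, not the real operators): the revert now requires a JVM hash aggregate on both sides of the shuffle, so object-mode plans such as those containing DeserializeToObject are left untouched.

```scala
// Stand-ins for the Spark operators named above; NOT the real classes.
sealed trait SparkOp
case class HashAggregateExec() extends SparkOp
case class ObjectHashAggregateExec() extends SparkOp
case class DeserializeToObjectExec() extends SparkOp

// The revert fires only when this holds for BOTH parent and child.
def isJvmHashAggregate(op: SparkOp): Boolean = op match {
  case _: HashAggregateExec | _: ObjectHashAggregateExec => true
  case _ => false
}
```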
   * nor the child is a Comet plan that can consume columnar output, that conversion is pure
   * overhead (row->arrow->shuffle->arrow->row vs. row->shuffle->row).
   */
  private def revertRedundantColumnarShuffle(plan: SparkPlan): SparkPlan = {
Contributor


Thanks for the optimization @andygrove.
I guess this is the first time that CometShuffleExchangeExec is reverted back to a plain ShuffleExchangeExec.

The two shuffle paths use different memory systems:

  • Comet columnar shuffle uses Comet's own (off-heap) memory pool.
  • Spark vanilla shuffle uses the JVM execution memory pool, with spills managed by ExternalSorter.

Users who have tuned their clusters for Comet (smaller JVM heap) could see unexpected spills after this change, shifting shuffle memory pressure back to the JVM.
Additionally, Comet's Arrow IPC columnar format typically compresses better than Spark's row-based UnsafeRowSerializer path, so shuffle I/O may also increase.
It would be good to document or log when a shuffle is reverted so users can correlate any unexpected behavior with this optimization.

Member Author


Thanks for the feedback @karuppayya. I'm assuming the cost of doing two transitions (r2c then c2r) would outweigh the benefits of using Comet shuffle? I agree that it would be worth adding documentation.

Contributor


I agree that this optimization will improve performance and compute efficiency.

My main concern is determining the best recommendation for users to tune memory, particularly since they cannot explicitly disable it.

Also, could this be a separate rule in itself, applied only in org.apache.comet.CometSparkSessionExtensions.CometExecColumnar#postColumnarTransitions?

@andygrove changed the title from "perf: avoid JVM shuffle when sandwiched between non-Comet operators" to "perf: avoid JVM shuffle when sandwiched between non-Comet operators [WIP]" on Apr 20, 2026
@andygrove
Member Author

I may need to address the known issues with mixed Spark/Comet partial/final aggregates (#1267, #1389) before making progress on this issue.

…ce Comet columnar shuffle

Without the tag, AQE re-plans each stage in isolation, and the isolated
subplan (which no longer shows the parent aggregate) converts the reverted
ShuffleExchangeExec back into a CometShuffleExchangeExec. Subsequent plan
canonicalization then fails because a ColumnarToRowExec ends up with a
non-columnar child.

Persist the revert decision via a TreeNodeTag on the ShuffleExchangeExec.
Both applyCometShuffle and the main transform now short-circuit when the
tag is set, so the decision survives re-entrancy.
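The tag-based short-circuit can be sketched as follows. `TreeNodeTag` here is a simplified stand-in for Spark's org.apache.spark.sql.catalyst.trees.TreeNodeTag, and the tag name and helper are hypothetical, not the identifiers used in the PR:

```scala
import scala.collection.mutable

// Stand-in for Spark's TreeNodeTag and its node-attached tag storage.
case class TreeNodeTag[T](name: String)
class PlanNode {
  private val tags = mutable.Map.empty[TreeNodeTag[_], Any]
  def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = tags(tag) = value
  def getTagValue[T](tag: TreeNodeTag[T]): Option[T] =
    tags.get(tag).map(_.asInstanceOf[T])
}

// Hypothetical tag recording that this exchange was deliberately reverted.
val REVERTED_SHUFFLE = TreeNodeTag[Boolean]("comet.revertedColumnarShuffle")

// Both applyCometShuffle and the main transform would short-circuit like this,
// so AQE's per-stage re-planning cannot re-wrap the reverted exchange.
def shouldSkipConversion(exchange: PlanNode): Boolean =
  exchange.getTagValue(REVERTED_SHUFFLE).contains(true)
```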


Development

Successfully merging this pull request may close these issues.

Avoid JVM shuffle when parent stage will just convert back to rows
