Skip to content

fix: broadcast exchange bypasses AQE partition coalescing#4163

Merged
mbutrovich merged 5 commits intoapache:mainfrom
andygrove:fix-broadcast-partition-count
May 5, 2026
Merged

fix: broadcast exchange bypasses AQE partition coalescing#4163
mbutrovich merged 5 commits intoapache:mainfrom
andygrove:fix-broadcast-partition-count

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

When a shuffle feeds into a broadcast exchange, AQE may coalesce the shuffle partitions (e.g. from 200 to 7). CometBroadcastExchangeExec was extracting the underlying plan from AQEShuffleReadExec and executing it directly, bypassing the coalescing. This caused the broadcast collect to run against all original shuffle partitions instead of the coalesced count, inflating task overhead and shuffle data.

Execute through the AQEShuffleReadExec to respect partition coalescing.

What changes are included in this PR?

How are these changes tested?

When a shuffle feeds into a broadcast exchange, AQE may coalesce the
shuffle partitions (e.g. from 200 to 7). CometBroadcastExchangeExec
was extracting the underlying plan from AQEShuffleReadExec and
executing it directly, bypassing the coalescing. This caused the
broadcast collect to run against all original shuffle partitions
instead of the coalesced count, inflating task overhead and shuffle
data.

Execute through the AQEShuffleReadExec to respect partition coalescing.
@andygrove andygrove marked this pull request as ready for review May 1, 2026 19:26
@andygrove andygrove changed the title fix: broadcast exchange bypasses AQE partition coalescing [WIP] fix: broadcast exchange bypasses AQE partition coalescing May 1, 2026
andygrove and others added 3 commits May 1, 2026 13:55
Use REBALANCE so AQE actually inserts AQEShuffleReadExec, then verify the
read exec's `numPartitions` driver metric reflects coalescing. The metric
is only set when AQEShuffleReadExec.executeColumnar runs, so a broadcast
that bypasses the wrapper leaves it at 0 and fails the test.
@mbutrovich mbutrovich self-requested a review May 4, 2026 13:45
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove! I think this fix lines up with how vanilla Spark handles this: BroadcastExchangeExec.relationFuture just calls child.executeCollectIterator() on whatever the direct child is (sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala:182), so when AQE wraps the shuffle stage in AQEShuffleReadExec, coalescing comes for free via AQEShuffleReadExec.shuffleRDD building from partitionSpecs.toArray (AQEShuffleReadExec.scala:250-258). The previous Comet code was reaching past that wrapper to grab s.plan, which dropped the partition specs. Executing through the wrapper restores parity with Spark.

One thought worth considering: the new getByteArrayRdd(plan: SparkPlan) helper is byte-for-byte identical to CometExec.getByteArrayRdd (spark/src/main/scala/org/apache/spark/sql/comet/operators.scala:376) apart from the parameter type. Now that the helper accepts any SparkPlan, every branch in the match block in CometBroadcastExchangeExec.scala:126-151 reduces to the same call. Could the whole match collapse to roughly:

val countsAndBytes = if (child.supportsColumnar) {
  getByteArrayRdd(child).collect()
} else {
  // existing RowToColumnar fallback for the LocalTableScan case
  ...
}

That mirrors Spark's "dispatch by execution, not by type peeking" pattern and would make it harder to reintroduce the same class of bug the next time AQE adds a wrapper (skew read, local read, or some future variant). Happy to defer if there is a reason the per-case branches need to stay.

Minor: the test comment explaining the numPartitions metric mechanism is great. Might be worth noting in the PR description that the fix incidentally also restores correct behavior for skew splits and local reads, since those paths also flow through AQEShuffleReadExec. Reviewers will appreciate knowing the blast radius is broader than just coalescing.

…eeking

Replace the per-wrapper-type match block with a single
getByteArrayRdd(child) call. This executes through AQE wrappers
(coalescing, skew splits, local reads) instead of pattern-matching
past them, preventing future regressions when AQE adds new wrappers.
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks right to me now, thanks @andygrove!

@mbutrovich mbutrovich merged commit a8f3560 into apache:main May 5, 2026
178 of 179 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants