Skip to content

chore: Improve shuffle fallback logic#3989

Merged
mbutrovich merged 3 commits intoapache:mainfrom
andygrove:refactor-shuffle-fallback-coordinator
Apr 20, 2026
Merged

chore: Improve shuffle fallback logic#3989
mbutrovich merged 3 commits intoapache:mainfrom
andygrove:refactor-shuffle-fallback-coordinator

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Apr 18, 2026

Which issue does this PR close?

Closes #3984

Rationale for this change

What changes are included in this PR?

  • Remove STAGE_FALLBACK_TAG since this was duplicating the existing withInfo mechanism for tagging plans with fallback reasons
  • Improve shuffle serde logic to avoid tagging the plan until native and columnar compatibility checks have both run
  • Improve documentation on withInfo methods (which should really be renamed to something like withFallbackReason, IMO)

How are these changes tested?

Existing tests, especially those added in #3982.

Replace the separate STAGE_FALLBACK_TAG with explain-info-based stickiness.
Shuffle path checks (`nativeShuffleFailureReasons`, `columnarShuffleFailureReasons`)
are now pure and return reasons instead of tagging eagerly. A new
`shuffleSupported` coordinator short-circuits on `hasExplainInfo`, tries native
then columnar, and tags via `withInfos` only on total failure. DPP fallback,
which disqualifies both paths, moves into the coordinator. This removes the
need for `CometFallback` and eliminates the semantic split where `withInfo`
could fire for a path-specific failure while the node still converted via a
different path.
…llup

Expand the doc comments on withInfo/withInfos/hasExplainInfo to make clear
that these record fallback reasons surfaced in extended explain output,
and that any call to withInfo is a signal that the node falls back to
Spark.

Also restore the child-expression rollup for native range-partitioning
sort orders that was lost in the earlier refactor: when exprToProto fails
on a sort-order expression, its own fallback reasons (e.g. strict
floating-point sort) are now copied onto the shuffle's reasons so they
surface alongside 'unsupported range partitioning sort order'.
@andygrove andygrove changed the title feat: Improve shuffle fallback logic and reduce planning overhead feat: Improve shuffle fallback logic Apr 18, 2026
@andygrove andygrove changed the title feat: Improve shuffle fallback logic chore: Improve shuffle fallback logic Apr 18, 2026
@andygrove andygrove marked this pull request as ready for review April 18, 2026 14:56
@mbutrovich
Copy link
Copy Markdown
Contributor

Is this a pure refactor, or does it fix any bugs related to the fallback logic? Asked a different way: should this be a branch-15 backport candidate?

@andygrove
Copy link
Copy Markdown
Member Author

Is this a pure refactor, or does it fix any bugs related to the fallback logic? Asked a different way: should this be a branch-15 backport candidate?

This is a pure refactor. No functional changes.

If I had been in less of a hurry to get a build out on Friday, this is how I would have implemented the fix for #3949

@andygrove
Copy link
Copy Markdown
Member Author

Is this a pure refactor, or does it fix any bugs related to the fallback logic? Asked a different way: should this be a branch-15 backport candidate?

This is a pure refactor. No functional changes.

If I had been in less of a hurry to get a build out on Friday, this is how I would have implemented the fix for #3949

The reason why withInfo did not work initially is because of the way we were tagging shuffles when native shuffle was not supported, even though we did not fall back, because columnar shuffle was supported.

We now figure out all the reasons for both native and columnar before calling withInfo.

if (!isCometPlan(s.child)) {
// we do not need to report a fallback reason if the child plan is not a Comet plan
return false
reasons += "Comet native shuffle not enabled"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we show an example to how to enable native shuffle?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They must have intentionally done it since it's enabled by default.

throw new IllegalStateException(
"shuffleSupported chose native shuffle but children are not all CometNativeExec")
case None =>
throw new IllegalStateException()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it is time to add some meaningful message here?

Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove overall it is LGTM

// shuffle falls back to Spark and tagged it. Preserve that decision - re-deriving it against
// a possibly-reshaped subtree (e.g. AQE stage-wrapping) can flip the answer and produce
// inconsistent plans across passes (see #3949).
if (hasExplainInfo(s)) return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not insisting on this but I actually think having a separate tag for fallback was better than using the info tags. Info is just that, a bit of information. Its presence or absence should not really be used for making decisions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, Comet only use withInfo to indicate that an operator should fall back, AFAIK.

What would be an example of adding explain info to an operator where we do not also want to fall back to Spark? I could see that it could be added to note minor incompatibilities.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add info, for instance, to show why Comet chose a hash join instead of a sort merge join (and maybe point the user to a config that allows them to override a comet decision). The idea is that explain info is information that goes with an explain plan and is essentially user facing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. Let me write up an issue

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove! Let's keep iterating in this area and improving the user experience!

@mbutrovich mbutrovich merged commit 5efd972 into apache:main Apr 20, 2026
135 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clean up stage fallback rules

4 participants