chore: Improve shuffle fallback logic#3989
Conversation
Replace the separate STAGE_FALLBACK_TAG with explain-info-based stickiness. Shuffle path checks (`nativeShuffleFailureReasons`, `columnarShuffleFailureReasons`) are now pure and return reasons instead of tagging eagerly. A new `shuffleSupported` coordinator short-circuits on `hasExplainInfo`, tries native then columnar, and tags via `withInfos` only on total failure. DPP fallback, which disqualifies both paths, moves into the coordinator. This removes the need for `CometFallback` and eliminates the semantic split where `withInfo` could fire for a path-specific failure while the node still converted via a different path.
…llup Expand the doc comments on withInfo/withInfos/hasExplainInfo to make clear that these record fallback reasons surfaced in extended explain output, and that any call to withInfo is a signal that the node falls back to Spark. Also restore the child-expression rollup for native range-partitioning sort orders that was lost in the earlier refactor: when exprToProto fails on a sort-order expression, its own fallback reasons (e.g. strict floating-point sort) are now copied onto the shuffle's reasons so they surface alongside 'unsupported range partitioning sort order'.
|
Is this a pure refactor, or does it fix any bugs related to the fallback logic? Asked a different way: should this be a |
This is a pure refactor. No functional changes. If I had been in less of a hurry to get a build out on Friday, this is how I would have implemented the fix for #3949 |
The reason why withInfo did not work initially is because of the way we were tagging shuffles when native shuffle was not supported, even though we did not fall back, because columnar shuffle was supported. We now figure out all the reasons for both native and columnar before calling withInfo. |
| if (!isCometPlan(s.child)) { | ||
| // we do not need to report a fallback reason if the child plan is not a Comet plan | ||
| return false | ||
| reasons += "Comet native shuffle not enabled" |
There was a problem hiding this comment.
should we show an example to how to enable native shuffle?
There was a problem hiding this comment.
They must have intentionally done it since it's enabled by default.
| throw new IllegalStateException( | ||
| "shuffleSupported chose native shuffle but children are not all CometNativeExec") | ||
| case None => | ||
| throw new IllegalStateException() |
There was a problem hiding this comment.
perhaps it is time to add some meaningful message here?
comphead
left a comment
There was a problem hiding this comment.
Thanks @andygrove overall it is LGTM
| // shuffle falls back to Spark and tagged it. Preserve that decision - re-deriving it against | ||
| // a possibly-reshaped subtree (e.g. AQE stage-wrapping) can flip the answer and produce | ||
| // inconsistent plans across passes (see #3949). | ||
| if (hasExplainInfo(s)) return None |
There was a problem hiding this comment.
Not insisting on this but I actually think having a separate tag for fallback was better than using the info tags. Info is just that, a bit of information. Its presence or absence should not really be used for making decisions.
There was a problem hiding this comment.
Currently, Comet only use withInfo to indicate that an operator should fall back, AFAIK.
What would be an example of adding explain info to an operator where we do not also want to fall back to Spark? I could see that it could be added to note minor incompatibilities.
There was a problem hiding this comment.
We could add info, for instance, to show why Comet chose a hash join instead of a sort merge join (and maybe point the user to a config that allows them to override a comet decision). The idea is that explain info is information that goes with an explain plan and is essentially user facing.
There was a problem hiding this comment.
I like the idea. Let me write up an issue
mbutrovich
left a comment
There was a problem hiding this comment.
Thanks @andygrove! Let's keep iterating in this area and improving the user experience!
Which issue does this PR close?
Closes #3984
Rationale for this change
What changes are included in this PR?
STAGE_FALLBACK_TAGsince this was duplicating the existingwithInfomechanism for tagging plans with fallback reasonswithInfomethods (which should really be renamed to something likewithFallbackReason, IMO)How are these changes tested?
Existing tests, especially those added in #3982.