What is the problem the feature request solves?
Describe the bug
Not a bug — a performance / correctness-of-decisions enhancement discovered
while doing the shuffle fallback cleanup (follow-up to PR
refactor-shuffle-fallback-coordinator).
Describe the solution you'd like
Comet's plan rules (CometScanRule, CometExecRule) run at both initial
planning and on every AQE stage-prep pass. On each invocation the serde
framework redoes the same compatibility checks and tagging:
CometExecRule.isOperatorEnabled calls handler.getSupportLevel(op)
per operator, per pass.
QueryPlanSerde.exprToProto walks the expression tree and calls
handler.getSupportLevel + handler.convert on every sub-expression,
per pass.
When a node was already tagged as falling back in a prior pass (carries
reasons on CometExplainInfo.EXTENSION_INFO via withInfo / withInfos),
re-deriving the same decision is wasted work — and in some cases (#3949)
risks re-deriving against a reshaped subtree and flipping the answer.
Additional context
The naive approach — if (hasExplainInfo(op)) return None at the top of
convertToComet / exprToProto — is too coarse:
-
Multiple handlers per node. A scan tagged by CometScanRule
("Comet Scan is not enabled") can still legitimately convert via
CometSparkToColumnarExec in CometExecRule. The fallback tag was
recorded by a different rule/handler and does not mean the current
handler should give up. CometExpressionSuite."get_struct_field"
breaks with a coarse short-circuit.
-
Partial-path tagging. Some serde code paths record fallback
reasons even when another path could still succeed. The shuffle
refactor fixes this for shuffle by only tagging on total failure;
other sites (expression rollups, etc.) would need the same discipline
before a global short-circuit is safe.
Suggested directions:
- Per-handler attempt record (
TreeNodeTag[Set[String]] keyed on handler
name) so convertToComet(op, handlerX) can skip re-running
handlerX.getSupportLevel(op) / handlerX.convert(op, ...) when this
handler has already tried and failed, without blocking other handlers.
- Or: standardize all serde paths on "only tag on final fallback" (as the
shuffle coordinator now does) so hasExplainInfo becomes a reliable
sticky marker, then add the coarse short-circuit safely.
- For expressions:
exprToProto is called many times per pass on the
same sub-expression trees (partitioning expressions in shuffle checks
are the hottest). Caching needs to handle DecimalPrecision.promote
possibly returning a fresh instance.
Hotspots to target first:
- Expression serialization inside shuffle/partitioning checks (many
exprToProto repeats).
CometExecRule.isOperatorEnabled on every operator, per pass.
- Scan-level expression walks in
CometScanRule.
Component(s)
spark
Describe the potential solution
No response
Additional context
No response
What is the problem the feature request solves?
Describe the bug
Not a bug — a performance / correctness-of-decisions enhancement discovered
while doing the shuffle fallback cleanup (follow-up to PR
refactor-shuffle-fallback-coordinator).Describe the solution you'd like
Comet's plan rules (
CometScanRule,CometExecRule) run at both initialplanning and on every AQE stage-prep pass. On each invocation the serde
framework redoes the same compatibility checks and tagging:
CometExecRule.isOperatorEnabledcallshandler.getSupportLevel(op)per operator, per pass.
QueryPlanSerde.exprToProtowalks the expression tree and callshandler.getSupportLevel+handler.converton every sub-expression,per pass.
When a node was already tagged as falling back in a prior pass (carries
reasons on
CometExplainInfo.EXTENSION_INFOviawithInfo/withInfos),re-deriving the same decision is wasted work — and in some cases (#3949)
risks re-deriving against a reshaped subtree and flipping the answer.
Additional context
The naive approach —
if (hasExplainInfo(op)) return Noneat the top ofconvertToComet/exprToProto— is too coarse:Multiple handlers per node. A scan tagged by
CometScanRule("Comet Scan is not enabled") can still legitimately convert via
CometSparkToColumnarExecinCometExecRule. The fallback tag wasrecorded by a different rule/handler and does not mean the current
handler should give up.
CometExpressionSuite."get_struct_field"breaks with a coarse short-circuit.
Partial-path tagging. Some serde code paths record fallback
reasons even when another path could still succeed. The shuffle
refactor fixes this for shuffle by only tagging on total failure;
other sites (expression rollups, etc.) would need the same discipline
before a global short-circuit is safe.
Suggested directions:
TreeNodeTag[Set[String]]keyed on handlername) so
convertToComet(op, handlerX)can skip re-runninghandlerX.getSupportLevel(op)/handlerX.convert(op, ...)when thishandler has already tried and failed, without blocking other handlers.
shuffle coordinator now does) so
hasExplainInfobecomes a reliablesticky marker, then add the coarse short-circuit safely.
exprToProtois called many times per pass on thesame sub-expression trees (partitioning expressions in shuffle checks
are the hottest). Caching needs to handle
DecimalPrecision.promotepossibly returning a fresh instance.
Hotspots to target first:
exprToProtorepeats).CometExecRule.isOperatorEnabledon every operator, per pass.CometScanRule.Component(s)
spark
Describe the potential solution
No response
Additional context
No response