Skip to content

Short-circuit serde compatibility checks when a node is already tagged as falling back #3990

@andygrove

Description

@andygrove

What is the problem the feature request solves?

Describe the bug

Not a bug — a performance / correctness-of-decisions enhancement discovered
while doing the shuffle fallback cleanup (follow-up to PR
refactor-shuffle-fallback-coordinator).

Describe the solution you'd like

Comet's plan rules (CometScanRule, CometExecRule) run at both initial
planning and on every AQE stage-prep pass. On each invocation the serde
framework redoes the same compatibility checks and tagging:

  • CometExecRule.isOperatorEnabled calls handler.getSupportLevel(op)
    per operator, per pass.
  • QueryPlanSerde.exprToProto walks the expression tree and calls
    handler.getSupportLevel + handler.convert on every sub-expression,
    per pass.

When a node was already tagged as falling back in a prior pass (carries
reasons on CometExplainInfo.EXTENSION_INFO via withInfo / withInfos),
re-deriving the same decision is wasted work — and in some cases (#3949)
risks re-deriving against a reshaped subtree and flipping the answer.

Additional context

The naive approach — if (hasExplainInfo(op)) return None at the top of
convertToComet / exprToProto — is too coarse:

  1. Multiple handlers per node. A scan tagged by CometScanRule
    ("Comet Scan is not enabled") can still legitimately convert via
    CometSparkToColumnarExec in CometExecRule. The fallback tag was
    recorded by a different rule/handler and does not mean the current
    handler should give up. CometExpressionSuite."get_struct_field"
    breaks with a coarse short-circuit.

  2. Partial-path tagging. Some serde code paths record fallback
    reasons even when another path could still succeed. The shuffle
    refactor fixes this for shuffle by only tagging on total failure;
    other sites (expression rollups, etc.) would need the same discipline
    before a global short-circuit is safe.

Suggested directions:

  • Per-handler attempt record (TreeNodeTag[Set[String]] keyed on handler
    name) so convertToComet(op, handlerX) can skip re-running
    handlerX.getSupportLevel(op) / handlerX.convert(op, ...) when this
    handler has already tried and failed, without blocking other handlers.
  • Or: standardize all serde paths on "only tag on final fallback" (as the
    shuffle coordinator now does) so hasExplainInfo becomes a reliable
    sticky marker, then add the coarse short-circuit safely.
  • For expressions: exprToProto is called many times per pass on the
    same sub-expression trees (partitioning expressions in shuffle checks
    are the hottest). Caching needs to handle DecimalPrecision.promote
    possibly returning a fresh instance.

Hotspots to target first:

  • Expression serialization inside shuffle/partitioning checks (many
    exprToProto repeats).
  • CometExecRule.isOperatorEnabled on every operator, per pass.
  • Scan-level expression walks in CometScanRule.

Component(s)

spark

Describe the potential solution

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestperformancepriority:mediumFunctional bugs, performance regressions, broken features

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions