|
| 1 | +# Short-circuit serde compatibility checks via per-handler attempt tag |
| 2 | + |
| 3 | +Tracking issue: [#3990](https://github.com/apache/datafusion-comet/issues/3990) |
| 4 | + |
| 5 | +## Goal |
| 6 | + |
| 7 | +Comet's plan rules run at both initial planning and on every AQE stage-prep |
| 8 | +pass. On each invocation the serde framework redoes the same operator |
| 9 | +compatibility checks. For operators that already fell back on a prior pass, |
| 10 | +`CometExecRule.isOperatorEnabled` re-invokes `handler.getSupportLevel(op)` — |
| 11 | +which for shuffle, aggregate, and window handlers also walks child |
| 12 | +expressions. This is wasted work and, per #3949, risks re-deriving against |
| 13 | +a reshaped subtree and flipping the answer. |
| 14 | + |
| 15 | +Eliminate the repeat cost for the operator path by recording, per operator |
| 16 | +node, which handler classes have already tried `getSupportLevel` and |
| 17 | +returned a non-matching support level. Subsequent passes short-circuit |
| 18 | +those handlers without re-running the check. |
| 19 | + |
| 20 | +## Non-goals |
| 21 | + |
| 22 | +- Expression-level short-circuit in `exprToProto` / `exprToProtoInternal.convert`. |
| 23 | + The hot partitioning/aggregate expression walks called out in #3990 are |
| 24 | + reached through operator handlers' `getSupportLevel`, so the operator-level |
| 25 | + cache covers them indirectly. If profiling after this lands still shows |
| 26 | + hot `exprToProto` walks, an expression-level cache is a follow-up. |
| 27 | +- Scan-level short-circuit in `CometScanRule`. |
| 28 | +- Positive caching of successful protobuf conversions. Successful conversions |
| 29 | + replace the original operator with `CometNativeExec` after pass 1, so |
| 30 | + pass 2 does not re-encounter them. |
| 31 | +- Standardizing every `withInfo` site on "tag only on total failure" |
| 32 | + (direction B from #3990). That is a separate, larger audit. |
| 33 | + |
| 34 | +## Design |
| 35 | + |
| 36 | +### New tag |
| 37 | + |
| 38 | +Add a new `TreeNodeTag[Set[String]]` on `CometExplainInfo`: |
| 39 | + |
| 40 | +```scala |
| 41 | +val FAILED_HANDLERS = new TreeNodeTag[Set[String]]("CometFailedHandlers") |
| 42 | +``` |
| 43 | + |
| 44 | +The set holds `handler.getClass.getName` for each handler that has already |
| 45 | +run `getSupportLevel` on this operator and returned `Unsupported` or |
| 46 | +`Incompatible` (without `allowIncompat`). The tag is orthogonal to the |
| 47 | +existing `EXTENSION_INFO` tag used by `withInfo` / explain output. |
| 48 | + |
| 49 | +### Change site 1 — `CometExecRule.convertToComet` |
| 50 | + |
| 51 | +At the top of `convertToComet(op, handler)`, before calling |
| 52 | +`isOperatorEnabled`, check whether `handler.getClass.getName` is already in |
| 53 | +`op`'s `FAILED_HANDLERS` set. If yes, return `None` immediately. No call to |
| 54 | +`isOperatorEnabled`, no call to `handler.getSupportLevel`, no expression |
| 55 | +walks inside `getSupportLevel`. |
| 56 | + |
| 57 | +### Change site 2 — `CometExecRule.isOperatorEnabled` |
| 58 | + |
| 59 | +On the two return-`false` paths inside the `getSupportLevel` match — |
| 60 | +`Unsupported` and `Incompatible` without `allowIncompat` — add |
| 61 | +`handler.getClass.getName` to `op`'s `FAILED_HANDLERS` tag. |
| 62 | + |
| 63 | +Do **not** record on the `enabledConfig` short-circuit path. That check |
| 64 | +is a fast config lookup and Spark config can toggle between passes; caching |
| 65 | +it yields no speedup and risks stale decisions. |
| 66 | + |
| 67 | +### Why this preserves the `get_struct_field` / multi-handler case |
| 68 | + |
| 69 | +The cache is keyed per handler class. A scan tagged by `CometScanRule`'s |
| 70 | +scan handler only records *that* handler class in `FAILED_HANDLERS`. When |
| 71 | +`CometExecRule` later tries `CometSparkToColumnarExec` on the same node, |
| 72 | +its handler class is not in the set, so it proceeds normally. This is the |
| 73 | +case that #3990 explicitly warned would break under a coarse |
| 74 | +`hasExplainInfo` short-circuit, and it is preserved here. |
| 75 | + |
| 76 | +### Interaction with existing `withInfo` / `EXTENSION_INFO` |
| 77 | + |
| 78 | +Unchanged. `withInfo` continues to record fallback reasons for extended |
| 79 | +explain output. The new `FAILED_HANDLERS` tag is internal, written and |
| 80 | +read only by `convertToComet` / `isOperatorEnabled`, and does not affect |
| 81 | +explain output or `hasExplainInfo`. |
| 82 | + |
| 83 | +## Testing |
| 84 | + |
| 85 | +- **Unit test** (new, in `CometExecRuleSuite` or adjacent): construct an |
| 86 | + operator and a counting stub handler that returns `Unsupported`. Call |
| 87 | + `convertToComet` twice; assert `getSupportLevel` is invoked exactly once |
| 88 | + and the second call returns `None` directly. |
| 89 | +- **Regression**: `CometExpressionSuite "get_struct_field"` must still pass. |
| 90 | + This is the test #3990 explicitly flagged as breaking under a coarse |
| 91 | + short-circuit. It validates that the per-handler keying preserves the |
| 92 | + scan-tagged-node + `CometSparkToColumnarExec` case. |
| 93 | +- **Behavioral**: a query that goes through two rule passes (initial plan + |
| 94 | + one AQE stage-prep pass) on an operator that falls back. Assert final |
| 95 | + plan is identical and `withInfo` reasons don't double-accumulate. |
| 96 | + `withInfos` already dedups via `Set`, so this is a guard against |
| 97 | + regression rather than new behavior. |
| 98 | +- **CI**: full JVM suite runs in CI. Local verification uses a targeted |
| 99 | + subset (primarily `CometExpressionSuite` plus any suite exercising AQE |
| 100 | + stage-prep re-runs). |
| 101 | + |
| 102 | +## Risks |
| 103 | + |
| 104 | +- **AQE re-planning and node identity.** If AQE produces a fresh operator |
| 105 | + instance rather than re-running rules on the same instance, the tag does |
| 106 | + not carry and the work is redone. That matches current behavior — no |
| 107 | + regression, just less speedup than if identity were preserved. Worth |
| 108 | + noting; not blocking. |
| 109 | +- **Config changes between passes.** Handled explicitly by skipping the |
| 110 | + `enabledConfig` path. |
| 111 | +- **Partial-path tagging elsewhere.** #3990 warns that some serde paths |
| 112 | + record fallback reasons even when another path could still succeed. |
| 113 | + This design sidesteps that risk by being per-handler, not relying on |
| 114 | + `hasExplainInfo`. The fix for partial-path tagging is out of scope. |
0 commit comments