Skip to content

Commit b389ae5

Browse files
committed
docs: design for short-circuiting serde compatibility checks (#3990)
1 parent cee98e3 commit b389ae5

File tree

1 file changed

+114
-0
lines changed

1 file changed

+114
-0
lines changed
Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
# Short-circuit serde compatibility checks via per-handler attempt tag
2+
3+
Tracking issue: [#3990](https://github.com/apache/datafusion-comet/issues/3990)
4+
5+
## Goal
6+
7+
Comet's plan rules run at both initial planning and on every AQE stage-prep
8+
pass. On each invocation the serde framework redoes the same operator
9+
compatibility checks. For operators that already fell back on a prior pass,
10+
`CometExecRule.isOperatorEnabled` re-invokes `handler.getSupportLevel(op)`
11+
which for shuffle, aggregate, and window handlers also walks child
12+
expressions. This is wasted work and, per #3949, risks re-deriving against
13+
a reshaped subtree and flipping the answer.
14+
15+
Eliminate the repeat cost for the operator path by recording, per operator
16+
node, which handler classes have already tried `getSupportLevel` and
17+
returned a non-matching support level. Subsequent passes short-circuit
18+
those handlers without re-running the check.
19+
20+
## Non-goals
21+
22+
- Expression-level short-circuit in `exprToProto` / `exprToProtoInternal.convert`.
23+
The hot partitioning/aggregate expression walks called out in #3990 are
24+
reached through operator handlers' `getSupportLevel`, so the operator-level
25+
cache covers them indirectly. If profiling after this lands still shows
26+
hot `exprToProto` walks, an expression-level cache is a follow-up.
27+
- Scan-level short-circuit in `CometScanRule`.
28+
- Positive caching of successful protobuf conversions. Successful conversions
29+
replace the original operator with `CometNativeExec` after pass 1, so
30+
pass 2 does not re-encounter them.
31+
- Standardizing every `withInfo` site on "tag only on total failure"
32+
(direction B from #3990). That is a separate, larger audit.
33+
34+
## Design
35+
36+
### New tag
37+
38+
Add a new `TreeNodeTag[Set[String]]` on `CometExplainInfo`:
39+
40+
```scala
41+
val FAILED_HANDLERS = new TreeNodeTag[Set[String]]("CometFailedHandlers")
42+
```
43+
44+
The set holds `handler.getClass.getName` for each handler that has already
45+
run `getSupportLevel` on this operator and returned `Unsupported` or
46+
`Incompatible` (without `allowIncompat`). The tag is orthogonal to the
47+
existing `EXTENSION_INFO` tag used by `withInfo` / explain output.
48+
49+
### Change site 1 — `CometExecRule.convertToComet`
50+
51+
At the top of `convertToComet(op, handler)`, before calling
52+
`isOperatorEnabled`, check whether `handler.getClass.getName` is already in
53+
`op`'s `FAILED_HANDLERS` set. If yes, return `None` immediately. No call to
54+
`isOperatorEnabled`, no call to `handler.getSupportLevel`, no expression
55+
walks inside `getSupportLevel`.
56+
57+
### Change site 2 — `CometExecRule.isOperatorEnabled`
58+
59+
On the two return-`false` paths inside the `getSupportLevel` match —
60+
`Unsupported` and `Incompatible` without `allowIncompat` — add
61+
`handler.getClass.getName` to `op`'s `FAILED_HANDLERS` tag.
62+
63+
Do **not** record on the `enabledConfig` short-circuit path. That check
64+
is a fast config lookup and Spark config can toggle between passes; caching
65+
it yields no speedup and risks stale decisions.
66+
67+
### Why this preserves the `get_struct_field` / multi-handler case
68+
69+
The cache is keyed per handler class. A scan tagged by `CometScanRule`'s
70+
scan handler only records *that* handler class in `FAILED_HANDLERS`. When
71+
`CometExecRule` later tries `CometSparkToColumnarExec` on the same node,
72+
its handler class is not in the set, so it proceeds normally. This is the
73+
case that #3990 explicitly warned would break under a coarse
74+
`hasExplainInfo` short-circuit, and it is preserved here.
75+
76+
### Interaction with existing `withInfo` / `EXTENSION_INFO`
77+
78+
Unchanged. `withInfo` continues to record fallback reasons for extended
79+
explain output. The new `FAILED_HANDLERS` tag is internal, written and
80+
read only by `convertToComet` / `isOperatorEnabled`, and does not affect
81+
explain output or `hasExplainInfo`.
82+
83+
## Testing
84+
85+
- **Unit test** (new, in `CometExecRuleSuite` or adjacent): construct an
86+
operator and a counting stub handler that returns `Unsupported`. Call
87+
`convertToComet` twice; assert `getSupportLevel` is invoked exactly once
88+
and the second call returns `None` directly.
89+
- **Regression**: `CometExpressionSuite "get_struct_field"` must still pass.
90+
This is the test #3990 explicitly flagged as breaking under a coarse
91+
short-circuit. It validates that the per-handler keying preserves the
92+
scan-tagged-node + `CometSparkToColumnarExec` case.
93+
- **Behavioral**: a query that goes through two rule passes (initial plan +
94+
one AQE stage-prep pass) on an operator that falls back. Assert final
95+
plan is identical and `withInfo` reasons don't double-accumulate.
96+
`withInfos` already dedups via `Set`, so this is a guard against
97+
regression rather than new behavior.
98+
- **CI**: full JVM suite runs in CI. Local verification uses a targeted
99+
subset (primarily `CometExpressionSuite` plus any suite exercising AQE
100+
stage-prep re-runs).
101+
102+
## Risks
103+
104+
- **AQE re-planning and node identity.** If AQE produces a fresh operator
105+
instance rather than re-running rules on the same instance, the tag does
106+
not carry and the work is redone. That matches current behavior — no
107+
regression, just less speedup than if identity were preserved. Worth
108+
noting; not blocking.
109+
- **Config changes between passes.** Handled explicitly by skipping the
110+
`enabledConfig` path.
111+
- **Partial-path tagging elsewhere.** #3990 warns that some serde paths
112+
record fallback reasons even when another path could still succeed.
113+
This design sidesteps that risk by being per-handler, not relying on
114+
`hasExplainInfo`. The fix for partial-path tagging is out of scope.

0 commit comments

Comments
 (0)