[SPARK-57176][SQL] Extend nested column pruning through array-returning functions#56227
[SPARK-57176][SQL] Extend nested column pruning through array-returning functions#56227sunchao wants to merge 7 commits into
Conversation
f1fb146 to
7499630
Compare
|
cc @viirya @peter-toth @cloud-fan @dongjoon-hyun please take a look, thanks! |
viirya
left a comment
There was a problem hiding this comment.
Read through the full change against the existing SchemaPruning/ProjectionOverSchema/SelectedField machinery, the NamedLambdaVariable/HigherOrderFunction binding model, and ArraySort/ArrayCompact semantics. This is a careful, well-tested extension of the SPARK-57022 work — the hard edge cases are handled deliberately and I didn't find a correctness bug. A few consistency nits below.
The semanticEquals -> exprId change is the most important behavioral shift and it's a real fix, not just a cleanup. NamedLambdaVariable is a plain case class with no canonicalized/equals override, so semanticEquals ends up comparing all fields including the value: AtomicReference and dataType. The codebase's own convention for matching lambda variables is by exprId — see HigherOrderFunction.functionsForEval (keyed on exprId, with the "variables [may be] instantiated separately during serialization" comment) and canonicalized. Switching to exprId aligns with that and fixes the case where a body reference is a separately-instantiated copy that semanticEquals would miss. The "match separately instantiated array lambda variables by exprId" test exercises exactly this via .copy(value = new AtomicReference[Any]()), and it correctly also covers the pre-existing ArrayTransform path.
The edge-case handling holds up:
- Retaining the full schema for strict
array_sortwhen!allowNullComparisonResult && lambda.function.nullableis correct —comparatorthrowscomparatorReturnsNull(o1.toString, ...), so pruning would change the observable error parameters, and gating onlambda.function.nullablekeeps it precise rather than over-conservative. (Unrelated to this PR, but that error call passeso1.toStringtwice instead ofo1, o2— looks like a pre-existing typo worth a separate fix.) - For
array_compact(->KnownNotContainsNull(ArrayFilter(child, x -> IsNotNull(x)))), theSelectedFieldcase correctly restores the child'scontainsNullwhen pushing the pruned type down, so the rebuiltArrayFilterdoesn't end upcontainsNull = falseover a nullable array. The newIsNotNull/IsNull(variable) => Some(Seq.empty)cases are precise — they fire only for a null-check on the whole element, whileIsNotNull(x.a)still collectsa. - Returning
Seq.empty(rather than full schema) when the lambda needs no element fields is right for filter/sort, since the element itself flows to the result and the downstreamSelectedFieldalready contributes the result requirements. That's a meaningful difference from the transform path, where the lambda output is consumed instead.
Three nits, all consistency/readability rather than correctness:
-
ProjectionOverLambdaVariable.projectstill usessemanticEqualswhile theSchemaPruningside switched toexprId. It happens to be equivalent here becauseoriginal/projectedare built from the lambda's own argument instances. But given the PR's own finding that lambda vars can be separately instantiated — and that this object is documented as needing to stay "in sync" with the collection side — matching byexprIdhere too would be more robust. Notably the PR added the "matched by exprId because they may be instantiated separately" comment to this object's doc, but the code underneath still usessemanticEquals, so the comment and code disagree. This is the one I'd most want aligned before merge. -
The
Slicefoldability guard is inconsistent between the two files.SelectedFieldguardsstart.foldable && length.foldable, butProjectionOverSchema'sSlicecase doesn't — unlike the adjacentElementAtcase there, which guardsright.foldable. I traced it and it's safe in practice (non-foldable bounds ->SelectedFieldreturns None -> full schema collected -> the unguarded rewrite is a no-op, confirmed by the non-foldable-Slice test), but adding the same guard would make the invariant local instead of emergent and matchElementAt. -
projectArrayHigherOrderFunctionnow relies onelementVar ne projectedElementVar(reference inequality) to decide whether to rewrite a given argument's references. That works because non-element args are passed through as the same object, but a one-line comment stating that invariant (only the firstnumElementVariablesNamedLambdaVariables are re-typed, identified by reference) would help the next reader.
Tests are thorough — both the prune-happens and prune-doesn't-happen directions are covered for each risky case, including an end-to-end eval for the exprId fix and checkScan/checkAnswer across filter, sort (custom + default ordering), reverse/shuffle/slice, non-foldable slice, transform-over-HOF, and array_compact. Nice work.
|
Thanks for the review!
Updated the PR to add matching
The current head already matches lambda variables by
I left the reference-identity check as-is since it directly follows from |
Why are the changes needed?
SPARK-57176 follows SPARK-57022, which added nested column pruning for
transformoverarray<struct>inputs.Array-returning functions still retain the complete input element struct even when downstream expressions and lambdas only require a subset of nested fields. For example:
If
friendscontainsfirst,middle, andlast, Spark currently reads all three fields even though the query only requiresfirstandlast.What changes were proposed in this PR?
filterand comparator-basedarray_sort.reverse,shuffle,slice, andarray_compact.array_sortnatural ordering requires the full struct.Functions that inspect full element equality or natural ordering remain out of scope because dropping nested fields could change results.
Does this PR introduce any user-facing change?
Yes. Eligible queries using array-returning functions over arrays of structs can read a narrower input schema. Query results and SQL APIs are unchanged.
How was this patch tested?
JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite" "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z Array"JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt catalyst/scalastyle sql/scalastylegit diff --checkWas this patch authored or co-authored using generative AI tooling?
Generated-by: Codex (GPT-5)