[SPARK-57176][SQL] Extend nested column pruning through array-returning functions by sunchao · Pull Request #56227 · apache/spark

sunchao · 2026-05-30T18:13:28Z

Why are the changes needed?

SPARK-57176 follows SPARK-57022, which added nested column pruning for transform over array<struct> inputs.

Array-returning functions still retain the complete input element struct even when downstream expressions and lambdas only require a subset of nested fields. For example:

SELECT filter(friends, friend -> friend.last = 'Smith').first
FROM contacts

If friends contains first, middle, and last, Spark currently reads all three fields even though the query only requires first and last.

What changes were proposed in this PR?

Merge downstream result-field requirements with lambda requirements for filter and comparator-based array_sort.
Propagate projected element schemas through reverse, shuffle, slice, and array_compact.
Rewrite bound lambda variable types and nested field ordinals after pruning.
Retain the complete element schema when the whole result is used, when a lambda consumes the whole element, or when default array_sort natural ordering requires the full struct.

Functions that inspect full element equality or natural ordering remain out of scope because dropping nested fields could change results.

Does this PR introduce any user-facing change?

Yes. Eligible queries using array-returning functions over arrays of structs can read a narrower input schema. Query results and SQL APIs are unchanged.

How was this patch tested?

JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite" "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z Array"
JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt catalyst/scalastyle sql/scalastyle
git diff --check

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

…ng functions

sunchao · 2026-06-04T16:54:50Z

cc @viirya @peter-toth @cloud-fan @dongjoon-hyun please take a look, thanks!

viirya

Read through the full change against the existing SchemaPruning/ProjectionOverSchema/SelectedField machinery, the NamedLambdaVariable/HigherOrderFunction binding model, and ArraySort/ArrayCompact semantics. This is a careful, well-tested extension of the SPARK-57022 work — the hard edge cases are handled deliberately and I didn't find a correctness bug. A few consistency nits below.

The semanticEquals -> exprId change is the most important behavioral shift and it's a real fix, not just a cleanup. NamedLambdaVariable is a plain case class with no canonicalized/equals override, so semanticEquals ends up comparing all fields including the value: AtomicReference and dataType. The codebase's own convention for matching lambda variables is by exprId — see HigherOrderFunction.functionsForEval (keyed on exprId, with the "variables [may be] instantiated separately during serialization" comment) and canonicalized. Switching to exprId aligns with that and fixes the case where a body reference is a separately-instantiated copy that semanticEquals would miss. The "match separately instantiated array lambda variables by exprId" test exercises exactly this via .copy(value = new AtomicReference[Any]()), and it correctly also covers the pre-existing ArrayTransform path.

The edge-case handling holds up:

Retaining the full schema for strict array_sort when !allowNullComparisonResult && lambda.function.nullable is correct — comparator throws comparatorReturnsNull(o1.toString, ...), so pruning would change the observable error parameters, and gating on lambda.function.nullable keeps it precise rather than over-conservative. (Unrelated to this PR, but that error call passes o1.toString twice instead of o1, o2 — looks like a pre-existing typo worth a separate fix.)
For array_compact (-> KnownNotContainsNull(ArrayFilter(child, x -> IsNotNull(x)))), the SelectedField case correctly restores the child's containsNull when pushing the pruned type down, so the rebuilt ArrayFilter doesn't end up containsNull = false over a nullable array. The new IsNotNull/IsNull(variable) => Some(Seq.empty) cases are precise — they fire only for a null-check on the whole element, while IsNotNull(x.a) still collects a.
Returning Seq.empty (rather than full schema) when the lambda needs no element fields is right for filter/sort, since the element itself flows to the result and the downstream SelectedField already contributes the result requirements. That's a meaningful difference from the transform path, where the lambda output is consumed instead.

Three nits, all consistency/readability rather than correctness:

ProjectionOverLambdaVariable.project still uses semanticEquals while the SchemaPruning side switched to exprId. It happens to be equivalent here because original/projected are built from the lambda's own argument instances. But given the PR's own finding that lambda vars can be separately instantiated — and that this object is documented as needing to stay "in sync" with the collection side — matching by exprId here too would be more robust. Notably the PR added the "matched by exprId because they may be instantiated separately" comment to this object's doc, but the code underneath still uses semanticEquals, so the comment and code disagree. This is the one I'd most want aligned before merge.
The Slice foldability guard is inconsistent between the two files. SelectedField guards start.foldable && length.foldable, but ProjectionOverSchema's Slice case doesn't — unlike the adjacent ElementAt case there, which guards right.foldable. I traced it and it's safe in practice (non-foldable bounds -> SelectedField returns None -> full schema collected -> the unguarded rewrite is a no-op, confirmed by the non-foldable-Slice test), but adding the same guard would make the invariant local instead of emergent and match ElementAt.
projectArrayHigherOrderFunction now relies on elementVar ne projectedElementVar (reference inequality) to decide whether to rewrite a given argument's references. That works because non-element args are passed through as the same object, but a one-line comment stating that invariant (only the first numElementVariables NamedLambdaVariables are re-typed, identified by reference) would help the next reader.

Tests are thorough — both the prune-happens and prune-doesn't-happen directions are covered for each risky case, including an end-to-end eval for the exprId fix and checkScan/checkAnswer across filter, sort (custom + default ordering), reverse/shuffle/slice, non-foldable slice, transform-over-HOF, and array_compact. Nice work.

sunchao · 2026-06-05T04:07:55Z

Thanks for the review!

The Slice foldability guard is inconsistent between the two files ...

Updated the PR to add matching start.foldable && length.foldable guard to the Slice
case in ProjectionOverSchema to keep it aligned with SelectedField and the
adjacent ElementAt handling.

ProjectionOverLambdaVariable.project still uses semanticEquals ...

The current head already matches lambda variables by exprId
in ProjectionOverLambdaVariable.project, so no additional change is needed
there.

projectArrayHigherOrderFunction now relies ..

I left the reference-identity check as-is since it directly follows from
projectedArguments: only the retyped element variables are copied, while the
remaining arguments retain their original instances.

sunchao added 6 commits June 2, 2026 22:20

[SPARK-57176][SQL] Extend nested column pruning through array-returni…

edb3e03

…ng functions

[SPARK-57176][SQL] Preserve dependencies in array-returning pruning

d762bc0

[SPARK-57176][SQL] Preserve wrapper dependencies and input nullability

d89124f

[SPARK-57176][SQL] Preserve strict ArraySort error values

c8c2fe0

[SPARK-57176][SQL] Match array lambda variables by expression ID

4eeb3e2

[SPARK-57176][SQL] Preserve nested array lambda dependencies

7499630

sunchao force-pushed the dev/chao/codex/spark-array-returning-function-pruning branch from f1fb146 to 7499630 Compare June 3, 2026 05:48

viirya reviewed Jun 4, 2026

View reviewed changes

[SPARK-57176][SQL] Align Slice pruning guard

9c3bc3b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57176][SQL] Extend nested column pruning through array-returning functions#56227

[SPARK-57176][SQL] Extend nested column pruning through array-returning functions#56227
sunchao wants to merge 7 commits into
apache:masterfrom
sunchao:dev/chao/codex/spark-array-returning-function-pruning

sunchao commented May 30, 2026

Uh oh!

sunchao commented Jun 4, 2026

Uh oh!

viirya left a comment

Uh oh!

sunchao commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sunchao commented May 30, 2026

Why are the changes needed?

What changes were proposed in this PR?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

sunchao commented Jun 4, 2026

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

sunchao commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants