[SPARK-56660][SQL] Decompose struct equality into field-level predicates for filter pushdown #56244

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/SPARK-56660-struct-predicate-decompose

@yadavay-amzn
Contributor

What changes were proposed in this pull request?

Add optimizer rule DecomposeStructComparison that rewrites struct-level equality (= and <=>) into a conjunction of field-level equalities. This enables filter pushdown for individual struct fields.

For example, struct_col = struct(1, 'a') becomes struct_col.field1 = 1 AND struct_col.field2 = 'a'.
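
As a rough illustration of the rewrite described above (not the Catalyst rule itself), the following toy Python model turns a struct-literal equality into a list of field-level predicates. The `FieldEq` class and both helper functions are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass(frozen=True)
class FieldEq:
    """Hypothetical scalar predicate of the form `column.field = value`."""
    column: str
    field: str
    value: Any


def decompose_struct_equality(
    column: str, field_names: Sequence[str], literal: Sequence[Any]
) -> List[FieldEq]:
    """Rewrite `column = struct(literal...)` into one predicate per field."""
    if len(field_names) != len(literal):
        raise ValueError("struct literal arity does not match the schema")
    return [FieldEq(column, name, value) for name, value in zip(field_names, literal)]


def to_sql(predicates: Sequence[FieldEq]) -> str:
    """Render the conjunction of field predicates as SQL-like text."""
    return " AND ".join(f"{p.column}.{p.field} = {p.value!r}" for p in predicates)


preds = decompose_struct_equality("struct_col", ["field1", "field2"], (1, "a"))
print(to_sql(preds))  # struct_col.field1 = 1 AND struct_col.field2 = 'a'
```

Each element of the resulting list is a scalar comparison on a single leaf column, which is the shape that data source filter pushdown can work with.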

Why are the changes needed?

Struct literal comparisons and tuple comparisons are treated as opaque predicates by the optimizer. Data source filter pushdown only understands scalar predicates, so struct equality cannot be pushed down for file pruning (Parquet row group skipping, partition pruning, etc.), even though the equivalent scalar predicates would be pushed.
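
To make the pruning argument concrete, here is a minimal sketch of min/max-based row group skipping. The statistics layout and function names are assumptions made for illustration; they are not Parquet's or Spark's actual API.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-row-group min/max statistics, keyed by leaf column path.
RowGroupStats = Dict[str, Tuple[int, int]]


def may_contain(stats: RowGroupStats, column: str, value: int) -> bool:
    """Return False only when the min/max stats prove no row can equal `value`."""
    if column not in stats:
        return True  # no statistics for this column: the group must be read
    lo, hi = stats[column]
    return lo <= value <= hi


def row_groups_to_read(
    groups: Sequence[RowGroupStats], predicates: Sequence[Tuple[str, int]]
) -> List[int]:
    """Indices of row groups that every pushed-down equality might match."""
    return [
        i
        for i, stats in enumerate(groups)
        if all(may_contain(stats, col, val) for col, val in predicates)
    ]


groups = [
    {"struct_col.field1": (0, 9), "struct_col.field2": (100, 199)},
    {"struct_col.field1": (10, 19), "struct_col.field2": (100, 199)},
    {"struct_col.field1": (0, 9), "struct_col.field2": (200, 299)},
]

# Field-level predicates: struct_col.field1 = 5 AND struct_col.field2 = 150
pruned = row_groups_to_read(groups, [("struct_col.field1", 5), ("struct_col.field2", 150)])
print(pruned)  # [0]

# Opaque struct predicate: nothing scalar is pushed down, so nothing is skipped.
unpruned = row_groups_to_read(groups, [])
print(unpruned)  # [0, 1, 2]
```

In this toy setup the field-level form lets two of the three row groups be skipped, while the opaque struct comparison forces a read of all of them.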

Does this PR introduce any user-facing change?

Yes — queries filtering on struct equality will now benefit from file pruning and filter pushdown, improving performance on large tables.

How was this patch tested?

Added StructPredicateDecomposeSuite with tests covering EqualTo, EqualNullSafe, nested structs, single-field structs, empty structs, tuple comparisons, non-deterministic guard, and GreaterThan exclusion.

Was this patch authored or co-authored using generative AI tooling?

Yes.

@yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch 3 times, most recently from 857e4be to a9a74c4 (June 3, 2026 01:15)
Contributor

@yyanyy left a comment

Thanks for making this change!

Comment on lines +600 to +602
val fields = left.dataType.asInstanceOf[StructType].fields
fields.indices.map { i =>
  cmp(GetStructField(left, i), GetStructField(right, i))

There's a nuance in Spark here: while 1 = NULL returns NULL, struct(1, null) = struct(1, null) returns true. With this code change, that behavior is lost. I think we need to be careful with null handling in general.
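
The difference can be reproduced with a small three-valued-logic model, where Python's `None` stands in for SQL NULL. This is only a sketch of the semantics described in the comment, for non-NULL structs; all function names are made up for the example.

```python
from typing import Any, Optional, Sequence


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """SQL `=`: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: FALSE dominates, then NULL, otherwise TRUE."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def struct_eq(left: Sequence[Any], right: Sequence[Any]) -> bool:
    """Struct equality as described above: NULL fields compare as equal."""
    return tuple(left) == tuple(right)


def naive_decomposed_eq(left: Sequence[Any], right: Sequence[Any]) -> Optional[bool]:
    """Field-by-field `=` joined with AND, i.e. the rewrite under review."""
    result: Optional[bool] = True
    for l, r in zip(left, right):
        result = sql_and(result, sql_eq(l, r))
    return result


print(sql_eq(1, None))                             # None (NULL)
print(struct_eq((1, None), (1, None)))             # True
print(naive_decomposed_eq((1, None), (1, None)))   # None (NULL)
```

Under this model a row that the struct comparison keeps would be dropped by the naive field-wise rewrite, because a filter only keeps rows whose predicate is TRUE.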

 * `struct_col.field1 = 1 AND struct_col.field2 = 'a'`.
 */
object DecomposeStructComparison extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(

I'm a little worried that combining (1) the recursive transform of nested structs and (2) unconditionally AND'ing all fields of a struct together, regardless of how many fields it holds, could cause a stack overflow for deeply nested and/or very wide structs, and in the worst case pose a DoS threat to the hosting system. Should we guard against this by setting an upper limit, and/or put the feature behind a config flag to guard against both this and the null-handling issue?

@yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch from a9a74c4 to 76dca41 (June 5, 2026 01:53)
@yadavay-amzn
Contributor Author

@yyanyy Great catch on the NULL semantics — you're right. Spark's struct equality uses InterpretedOrdering which treats null=null within fields as equal (returns TRUE), while EqualTo(null, null) returns NULL.

Fixed: the decomposition now uses EqualNullSafe (<=>) for per-field comparisons, which matches the struct equality semantics exactly:

  • null <=> null → true (matches struct behavior)
  • null <=> 2 → false (matches struct behavior)

The only remaining discrepancy is when the entire struct itself is null (original returns NULL, decomposed returns FALSE) — but since our rule only fires in Filter context, this is harmless (both NULL and FALSE exclude the row from WHERE).
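
A small model of the behavior described above, with `None` standing in for SQL NULL. It is a sketch under two assumptions that are not spelled out in the PR: the comparison is a top-level conjunct of the WHERE clause, and the struct literal has at least one non-NULL field. All names are hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence


def null_safe_eq(a: Any, b: Any) -> bool:
    """SQL `<=>`: never returns NULL; NULL <=> NULL is TRUE."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def original_eq(struct: Optional[Sequence[Any]], literal: Sequence[Any]) -> Optional[bool]:
    """Struct-level `=` as described in the thread: NULL if the struct is NULL."""
    if struct is None:
        return None
    return tuple(struct) == tuple(literal)


def decomposed_eq(struct: Optional[Sequence[Any]], literal: Sequence[Any]) -> bool:
    """Field-wise `<=>`; extracting a field from a NULL struct yields NULL."""
    fields = struct if struct is not None else (None,) * len(literal)
    return all(null_safe_eq(f, v) for f, v in zip(fields, literal))


def where(rows: Sequence[Any], predicate: Callable[[Any], Optional[bool]]) -> List[Any]:
    """A filter keeps a row only when the predicate is exactly TRUE."""
    return [row for row in rows if predicate(row) is True]


literal = (1, None)
rows = [(1, None), (1, 2), None]  # the last row is a NULL struct

print(original_eq(None, literal))    # None (NULL)
print(decomposed_eq(None, literal))  # False

kept_original = where(rows, lambda r: original_eq(r, literal))
kept_decomposed = where(rows, lambda r: decomposed_eq(r, literal))
print(kept_original == kept_decomposed)  # True: both keep only (1, None)
```

The two predicates disagree on the NULL struct (NULL versus FALSE), yet the filter output is the same in this setup because neither value is TRUE.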

Also added a width guard (max 100 fields) to prevent stack overflow on very wide/deeply nested structs, per your second concern.
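
One possible shape for such a guard, sketched on a toy schema model. How the PR actually counts fields is not shown in this thread, so the leaf-counting strategy, the schema representation, and the names below are assumptions; only the limit of 100 comes from the reply above.

```python
from typing import Dict, List, Union

# Hypothetical schema model: a struct is a dict of field name -> type,
# and a leaf is just a type-name string.
Schema = Union[str, Dict[str, "Schema"]]

MAX_FIELDS = 100  # limit mentioned in the reply; the counting rule is assumed


def count_leaf_fields(schema: Schema, limit: int) -> int:
    """Count leaf fields with an explicit stack, stopping early past `limit`."""
    count = 0
    stack: List[Schema] = [schema]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(node.values())  # descend without recursion
        else:
            count += 1
            if count > limit:
                return count  # already too wide; no need to keep walking
    return count


def should_decompose(schema: Schema, limit: int = MAX_FIELDS) -> bool:
    """Allow the rewrite only for structs within the field limit."""
    return isinstance(schema, dict) and count_leaf_fields(schema, limit) <= limit


narrow = {"a": "int", "b": {"c": "string", "d": "int"}}
wide = {f"f{i}": "int" for i in range(101)}

print(should_decompose(narrow))  # True
print(should_decompose(wide))    # False
```

Walking the schema with an explicit stack keeps the check itself from recursing deeply, and the early exit bounds the work done on very wide structs.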
