[SPARK-56660][SQL] Decompose struct equality into field-level predicates for filter pushdown #56244
857e4be to
a9a74c4
Compare
yyanyy
left a comment
Thanks for making this change!
val fields = left.dataType.asInstanceOf[StructType].fields
fields.indices.map { i =>
  cmp(GetStructField(left, i), GetStructField(right, i))
Apparently there's a nuance in Spark: while `1 = NULL` returns NULL, `struct(1, null) = struct(1, null)` returns true. With this code change, that behavior is lost. I think we need to be careful with null handling in general.
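To make the concern concrete, here is a toy model of SQL three-valued logic in plain Scala. It is not Spark code: `NullSemanticsSketch`, `sqlEq`, `sqlEqNullSafe`, `sqlAnd`, and `decompose` are hypothetical names, `None` stands in for SQL NULL, and struct fields are simplified to nullable integers. Under those assumptions, it shows how AND'ing per-field `=` results differs from AND'ing per-field null-safe results.

```scala
// Toy model only; None stands for SQL NULL. Not Spark's implementation.
object NullSemanticsSketch {
  type Sql[A] = Option[A]

  // Scalar `=`: the result is NULL if either operand is NULL.
  def sqlEq(a: Sql[Int], b: Sql[Int]): Sql[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe `<=>`: never NULL; two NULLs compare as equal.
  def sqlEqNullSafe(a: Sql[Int], b: Sql[Int]): Sql[Boolean] = Some(a == b)

  // SQL AND: FALSE dominates; otherwise NULL if any operand is NULL.
  def sqlAnd(l: Sql[Boolean], r: Sql[Boolean]): Sql[Boolean] = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // Field-by-field comparison of two structs, AND'ed together.
  def decompose(
      left: Seq[Sql[Int]],
      right: Seq[Sql[Int]],
      cmp: (Sql[Int], Sql[Int]) => Sql[Boolean]): Sql[Boolean] =
    left.zip(right)
      .map { case (l, r) => cmp(l, r) }
      .foldLeft[Sql[Boolean]](Some(true))(sqlAnd)
}
```

With `val s = Seq(Some(1), None)` modelling `struct(1, null)`, `decompose(s, s, sqlEq)` yields `None` (NULL), while `decompose(s, s, sqlEqNullSafe)` yields `Some(true)`, which matches the struct-level result described in the comment.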
 * `struct_col.field1 = 1 AND struct_col.field2 = 'a'`.
 */
object DecomposeStructComparison extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
I'm a little worried that combining (1) the recursive transform of nested structs with (2) unconditionally AND'ing all fields of a struct together, regardless of how many fields the struct holds, could cause a stack overflow error for deeply nested and/or very wide structs, and in the worst case pose a DoS threat to the hosting system. Should we guard against this by setting an upper limit, and/or put this feature behind a config flag to guard against both this and the issue below?
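One possible shape for such a guard, sketched in plain Scala rather than against Catalyst's `StructType`: walk the type with an explicit stack (so the check itself cannot overflow) and bail out once a leaf-count or nesting budget is exceeded. The `DType` hierarchy, `withinBudget`, and both limits are hypothetical and chosen only for illustration.

```scala
// Illustrative stand-in for a data type tree; not Spark's DataType classes.
object StructGuardSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class StructT(fields: Seq[DType]) extends DType

  // Returns true if the type has at most `maxLeaves` scalar fields in total
  // and structs are nested at most `maxDepth` levels deep.
  // Uses an explicit work list instead of recursion.
  def withinBudget(root: DType, maxLeaves: Int, maxDepth: Int): Boolean = {
    var stack: List[(DType, Int)] = List((root, 0))
    var leaves = 0
    while (stack.nonEmpty) {
      val (t, depth) = stack.head
      stack = stack.tail
      t match {
        case StructT(fields) =>
          // Entering a struct adds one nesting level.
          if (depth + 1 > maxDepth) return false
          stack = fields.map(f => (f, depth + 1)).toList ::: stack
        case _ =>
          leaves += 1
          if (leaves > maxLeaves) return false
      }
    }
    true
  }
}
```

A rule could call a check like this before rewriting and leave the comparison untouched when it returns false. A config flag would then only need to supply the two limits, or disable the rule outright.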
…tes for filter pushdown
a9a74c4 to
76dca41
Compare
@yyanyy Great catch on the NULL semantics — you're right. Spark's struct equality uses Fixed: the decomposition now uses
The only remaining discrepancy is when the entire struct itself is null (the original returns NULL, the decomposed form returns FALSE), but since our rule only fires in a Filter context, this is harmless: both NULL and FALSE exclude the row from WHERE. Also added a width guard (max 100 fields) to prevent stack overflow on very wide/deeply nested structs, per your second concern.
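The Filter-context argument can be checked with a small three-valued-logic model in plain Scala (not Spark code; `FilterContextSketch`, `keepsRow`, and `sqlNot` are hypothetical names, and `None` models SQL NULL). It also shows where the argument stops holding: NULL and FALSE behave the same at the top level of a WHERE clause, but not once the predicate is negated. Whether the rule can fire beneath a NOT is not stated here.

```scala
// Toy model only; None stands for SQL NULL.
object FilterContextSketch {
  // A WHERE clause keeps a row only when the predicate evaluates to TRUE.
  def keepsRow(predicate: Option[Boolean]): Boolean = predicate.contains(true)

  // SQL NOT: NULL stays NULL, otherwise the boolean is flipped.
  def sqlNot(predicate: Option[Boolean]): Option[Boolean] = predicate.map(!_)
}
```

Here `keepsRow(None)` and `keepsRow(Some(false))` are both false, so the two results are interchangeable directly under a filter. In contrast, `keepsRow(sqlNot(None))` is false while `keepsRow(sqlNot(Some(false)))` is true.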
What changes were proposed in this pull request?
Add optimizer rule `DecomposeStructComparison` that rewrites struct-level equality (`=` and `<=>`) into a conjunction of field-level equalities. This enables filter pushdown for individual struct fields.
For example, `struct_col = struct(1, 'a')` becomes `struct_col.field1 = 1 AND struct_col.field2 = 'a'`.
Why are the changes needed?
Struct literal comparisons and tuple comparisons are treated as opaque predicates by the optimizer. Data source filter pushdown only understands scalar predicates, so struct equality cannot be pushed down for file pruning (Parquet row group skipping, partition pruning, etc.), even though the equivalent scalar predicates would be pushed.
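The rewrite described in this PR can be illustrated on a tiny expression tree in plain Scala. This is a sketch under simplifying assumptions, not the rule itself: the `Expr` classes below are hypothetical stand-ins for Catalyst expressions, fields are addressed by name rather than by ordinal, only `=` is modelled, and null handling and the width guard are left out.

```scala
// Minimal stand-in AST; not Catalyst's expression classes.
object DecomposeSketch {
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class Field(child: Expr, name: String) extends Expr
  final case class Struct(fields: Seq[(String, Expr)]) extends Expr
  final case class Eq(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Rewrites `x = struct(f1 -> v1, ...)` into `x.f1 = v1 AND ...`,
  // recursing so that nested struct values are decomposed as well.
  // Any other expression is returned unchanged.
  def decompose(e: Expr): Expr = e match {
    case Eq(left, Struct(fields)) if fields.nonEmpty =>
      fields
        .map { case (name, value) => decompose(Eq(Field(left, name), value)) }
        .reduceLeft[Expr](And(_, _))
    case other => other
  }
}
```

Applied to `Eq(Col("struct_col"), Struct(Seq("field1" -> Lit(1), "field2" -> Lit("a"))))`, this produces an `And` of two `Eq` nodes over `Field(Col("struct_col"), "field1")` and `Field(Col("struct_col"), "field2")`. Each of those is a scalar predicate of the kind a data source filter can evaluate on its own.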
Does this PR introduce any user-facing change?
Yes — queries filtering on struct equality will now benefit from file pruning and filter pushdown, improving performance on large tables.
How was this patch tested?
Added `StructPredicateDecomposeSuite` with tests covering EqualTo, EqualNullSafe, nested structs, single-field structs, empty structs, tuple comparisons, non-deterministic guard, and GreaterThan exclusion.
Was this patch authored or co-authored using generative AI tooling?
Yes.