
[VL][Delta] Guard DV DML row-index scans#12215

Draft
malinjawi wants to merge 8 commits into apache:main from malinjawi:split/delta-dv-dml-scan-safety


@malinjawi malinjawi commented Jun 1, 2026

What changes are proposed in this pull request?

This PR is the next draft split in the Delta deletion-vector MoR stack. It is stacked after #12197 and #12198, and is opened early to get CI signal while those reviewer-requested scan splits continue through review.

It adds a correctness-first DML row-index scan guard for Delta DV DELETE/MoR planning. The goal is to preserve safe fallback behavior for DML target scans until native row-index scan execution is proven for the required Delta table shapes.

Main changes:

  • add DeltaDeletionVectorDmlUtils to detect Delta DML row-index scan shapes
  • guard Delta post-transform planning so DML row-index scans do not accidentally move under native execution when the native DML path is disabled
  • preserve Delta internal row-index/file-path columns needed by DML target scans
  • add focused Delta 3.3 and Delta 4.0 coverage for fallback plan shape
  • add repeated DELETE coverage over an existing deletion vector, verifying the active DV cardinality advances and final read results remain correct
  • add DeltaDmlRowIndexScanBenchmark for Delta 3.3 and Delta 4.0 so scan-guard evidence can be collected separately from the later DELETE command benchmark
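The shape check and column preservation described in the list above can be pictured with a small dependency-free model. This is a sketch only, not the PR's `DeltaDeletionVectorDmlUtils`: the helper names are hypothetical, `filePath` is a stand-in for whatever file-path column the DML target scan carries, and `_tmp_metadata_row_index` is assumed to be the literal behind Spark's temporary row-index column. The real utility inspects Spark plan nodes rather than plain lists.

```python
# Illustrative model only; every name here is an assumption, not the PR's code.

# Assumed literal for Spark's temporary Parquet row-index column.
ROW_INDEX_TEMPORARY_COLUMN_NAME = "_tmp_metadata_row_index"
# Hypothetical stand-in for the file-path column a DML target scan carries.
FILE_PATH_COLUMN_NAME = "filePath"

INTERNAL_COLUMNS = {ROW_INDEX_TEMPORARY_COLUMN_NAME, FILE_PATH_COLUMN_NAME}


def is_dml_row_index_scan(is_delta_scan, output_columns):
    """Return True when a Delta scan exposes both internal columns,
    which is the shape this sketch treats as a DML target scan."""
    return bool(is_delta_scan) and INTERNAL_COLUMNS.issubset(output_columns)


def preserve_internal_columns(pruned, original):
    """Re-add internal columns that pruning dropped, keeping original order."""
    keep = set(pruned) | (INTERNAL_COLUMNS & set(original))
    return [c for c in original if c in keep]
```

For example, pruning `["id", "value", "_tmp_metadata_row_index", "filePath"]` down to `["id"]` and then calling `preserve_internal_columns` yields `["id", "_tmp_metadata_row_index", "filePath"]`, so the scan still matches `is_dml_row_index_scan`.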

This PR is intentionally safety-only:

  • no DELETE command routing
  • no native bitmap aggregation enablement
  • no plain Parquet target-scan optimization
  • no performance shortcut for DML scan planning

Benchmark status:

This PR currently preserves Spark fallback for the protected DML row-index scan shape, so it does not claim a native performance win. The benchmark harness reports plan shape, DELETE timing, validation timing, active files, files with DVs, DV cardinality, DV payload bytes, final row count, and deleted-row pattern.

Local Spark baseline evidence on 2026-06-01, Apple M2 Pro, Spark 3.5 profile:

org.apache.spark.sql.delta.DeltaDmlRowIndexScanBenchmark 10000 2 1 all spark

Results:

| Case | Rows | Files | Plan | DELETE ms | Validation ms | Files with DVs | DV cardinality | Payload bytes | Final rows |
|---|---|---|---|---|---|---|---|---|---|
| create DV | 10000 | 2 | 1 DML row-index scan | 1708.115 | 340.978 | 2 | 1000 | 2064 | 9000 |
| update existing DV | 10000 | 2 | 1 DML row-index scan | 1260.559 | 268.221 | 2 | 2000 | 4064 | 8000 |
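As a sanity check on the two benchmark rows above, the reported figures can be cross-checked with simple arithmetic. The sketch below only restates numbers from the results; the dictionary keys are illustrative names, and the per-row payload figure is a difference between two data points, not a general statement about deletion-vector encoding.

```python
# Figures copied from the two benchmark rows; key names are illustrative.
runs = [
    {"case": "create DV", "rows": 10000, "dv_cardinality": 1000,
     "payload_bytes": 2064, "final_rows": 9000},
    {"case": "update existing DV", "rows": 10000, "dv_cardinality": 2000,
     "payload_bytes": 4064, "final_rows": 8000},
]

for run in runs:
    # Every row marked in a deletion vector should be absent from the final read.
    assert run["rows"] - run["dv_cardinality"] == run["final_rows"], run["case"]

# The second DELETE advances the active DV cardinality rather than resetting it.
added_deletes = runs[1]["dv_cardinality"] - runs[0]["dv_cardinality"]
added_bytes = runs[1]["payload_bytes"] - runs[0]["payload_bytes"]
bytes_per_added_delete = added_bytes / added_deletes
```

Here `added_deletes` is 1000 and `added_bytes` is 2000, so the second DELETE grew the payload by 2 bytes per newly deleted row in this run.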

The guarded Gluten fallback benchmark mode is intentionally present in the harness, but it still needs native runtime evidence from CI or a local macOS native build, because the local machine does not currently have darwin/aarch64/libgluten.dylib.

Issue: #11901

How was this patch tested?

Validation run locally on 2026-06-01:

  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • git diff --check
  • Spark-only benchmark command above, output saved locally at /tmp/delta-dml-row-index-scan-benchmark-spark-all.txt

Local focused runtime execution with Gluten enabled is still blocked by the missing local macOS native Gluten library (darwin/aarch64/libgluten.dylib), so this draft PR relies on CI or a compatible local native build for the guarded fallback runtime lane.

Follow-up CI compatibility fix on 2026-06-01 at f3135560c7a3d7a0dfedc1f64267c9a529e7a970:

  • Replaced direct use of ParquetFileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME in common gluten-delta code with the stable temporary column name literal so Spark 3.3 and Spark 3.4 compile.
  • git diff --check
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl gluten-delta -am -Pjava-17,spark-3.3,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl gluten-delta -am -Pjava-17,spark-3.4,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl gluten-delta -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl gluten-delta -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

Mohammad Linjawi and others added 4 commits May 31, 2026 12:06
Keep Delta DV DML row-index target scans on Spark unless native DML row-index scanning and native write are explicitly enabled. Preserve the Spark Project/Filter subtree above the fallback scan and add Delta 3.3/4.0 plan-shape coverage for metadata row-index on and off.

Validation: JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Validation: JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Validation: git diff --cached --check
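
The rule in this commit message can be sketched as a bottom-up walk over a single-child plan chain. This is a hedged model under stated assumptions, not the PR's planner code: `Node`, `assign_backends`, and both flag parameters are hypothetical names, and "native" here only means "left to Gluten's normal offload rules".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    kind: str                          # e.g. "Project", "Filter", "Scan"
    child: Optional["Node"] = None
    dml_row_index_scan: bool = False   # only meaningful for Scan nodes


def assign_backends(node: Optional[Node],
                    native_dml_scan: bool,
                    native_write: bool) -> List[Tuple[str, str]]:
    """Return (kind, backend) pairs from the root down to the scan."""
    # Flatten the single-child chain, root first.
    chain = []
    while node is not None:
        chain.append(node)
        node = node.child

    # Native execution of the DML scan requires both switches to be on.
    native_allowed = native_dml_scan and native_write

    result = []
    keep_on_spark = False
    # Walk bottom-up so a fallback scan can pin the operators above it.
    for n in reversed(chain):
        if n.kind == "Scan":
            keep_on_spark = n.dml_row_index_scan and not native_allowed
        elif n.kind not in ("Project", "Filter"):
            keep_on_spark = False      # the guard only covers Project/Filter
        result.append((n.kind, "spark" if keep_on_spark else "native"))
    result.reverse()
    return result
```

With `Node("Project", Node("Filter", Node("Scan", dml_row_index_scan=True)))`, the whole chain stays on Spark unless both flags are true; an operator of another kind placed above the Project is no longer pinned by the guard.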
github-actions Bot added the CORE (works for Gluten Core), VELOX, and DATA_LAKE labels Jun 1, 2026

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

