[VL][Delta] Add DELETE DV diagnostics benchmark #12217

Draft
malinjawi wants to merge 16 commits into apache:main from malinjawi:split/delta-dv-delete-diagnostics-benchmark

@malinjawi
Contributor

What changes

This is the next stacked Delta deletion-vector (DV) merge-on-read (MoR) slice after #12216. It adds a focused benchmark harness for persistent deletion-vector DELETE so we can measure the current correctness path before enabling native bitmap construction or target-scan shortcuts.

Stack order:

  1. #12197 - DV scan info extraction utility
  2. #12198 - JVM Delta DV scan handoff
  3. #12215 - DML row-index scan safety
  4. #12216 - persistent DV DELETE correctness path
  5. This PR - focused DELETE DV diagnostics benchmark

This PR should remain a draft until the earlier correctness PR (#12216) has passed native CI and we have attached runtime benchmark output from CI or a local native build.

Scope

  • Adds DeltaDeleteDeletionVectorBenchmark for Delta 3.3 and Delta 4.0.
  • Measures Spark DELETE DV baseline against Gluten DELETE DV with native write and DML row-index scan enabled.
  • Covers create-DV and update-existing-DV modes.
  • Validates correctness during the benchmark by checking active files, files with DVs, DV cardinality, and payload bytes.
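
The correctness check in the last bullet can be pictured as reducing a table snapshot to four numbers. The following is a minimal, language-neutral sketch in Python, not the Scala benchmark itself: `ActiveFile`, `DvDescriptor`, `DvSummary`, `summarize`, and `assert_same_outcome` are all hypothetical names. It assumes "payload bytes" means the serialized DV size, and that the Spark and Gluten runs are compared against each other. Neither assumption is confirmed by the PR description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DvDescriptor:
    """Hypothetical stand-in for the deletion vector attached to a file."""
    cardinality: int   # number of rows the DV marks as deleted
    size_bytes: int    # assumed meaning of "payload bytes": serialized DV size


@dataclass(frozen=True)
class ActiveFile:
    """Hypothetical stand-in for one active data file in a table snapshot."""
    path: str
    dv: Optional[DvDescriptor] = None


@dataclass(frozen=True)
class DvSummary:
    active_files: int
    files_with_dvs: int
    dv_cardinality: int
    payload_bytes: int


def summarize(files: Sequence[ActiveFile]) -> DvSummary:
    """Collapse a snapshot into the four quantities listed in the Scope."""
    dvs = [f.dv for f in files if f.dv is not None]
    return DvSummary(
        active_files=len(files),
        files_with_dvs=len(dvs),
        dv_cardinality=sum(d.cardinality for d in dvs),
        payload_bytes=sum(d.size_bytes for d in dvs),
    )


def assert_same_outcome(baseline: DvSummary, candidate: DvSummary) -> None:
    """Raise if the two runs disagree on any summarized quantity."""
    if baseline != candidate:
        raise AssertionError(f"DV summary mismatch: {baseline} != {candidate}")
```

Comparing summaries rather than timings keeps the check deterministic, which fits the decision below to avoid CI assertions on noisy speedup numbers.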

Intentionally deferred

  • Native bitmap aggregation as the default DELETE bitmap construction path.
  • Plain Parquet target-scan optimization.
  • Production timing hooks.
  • Checksum or stats shortcuts.
  • Any CI performance assertion on noisy speedup numbers.

Validation

Local validation after rebasing onto origin/split/delta-dv-delete-correctness at bbb971f71cd4fc690258c54c59333b392f90a8aa:

  • git diff --check origin/split/delta-dv-delete-correctness...HEAD
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • env JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH ./build/mvn -q test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Runtime benchmark execution is still pending: this local Mac checkout cannot start the Velox backend because darwin/aarch64/libgluten.dylib is not available. The benchmark class is intentionally added as a draft diagnostic harness so that native CI or a compatible local native build can provide measured output before the PR is marked ready for review.

Mohammad Linjawi and others added 6 commits May 31, 2026 12:06
Keep Delta DV DML row-index target scans on Spark unless native DML row-index scanning and native write are explicitly enabled. Preserve the Spark Project/Filter subtree above the fallback scan and add Delta 3.3/4.0 plan-shape coverage for metadata row-index on and off.

Validation: JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Validation: JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Validation: git diff --cached --check
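
The guard described in this commit message amounts to a two-flag conjunction. The sketch below is a hypothetical illustration in Python, not the Gluten implementation: `ScanContext` and its field names are invented for this example and are not real Gluten configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanContext:
    """Hypothetical flags; these are not Gluten configuration keys."""
    is_dv_dml_row_index_scan: bool
    native_dml_row_index_scan_enabled: bool
    native_write_enabled: bool


def keep_scan_on_spark(ctx: ScanContext) -> bool:
    """Return True when this guard would keep the target scan on Spark."""
    if not ctx.is_dv_dml_row_index_scan:
        # The guard only concerns DV DML row-index target scans;
        # other scans are left to whatever rules apply elsewhere.
        return False
    both_enabled = (
        ctx.native_dml_row_index_scan_enabled and ctx.native_write_enabled
    )
    # Fall back unless BOTH native features are explicitly enabled.
    return not both_enabled
```

Under this reading, enabling only one of the two native features still leaves the scan on Spark.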
@github-actions github-actions Bot added the CORE, VELOX, and DATA_LAKE labels Jun 1, 2026
@github-actions

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi force-pushed the split/delta-dv-delete-diagnostics-benchmark branch from 96ac465 to 6c1aff8 Compare June 1, 2026 14:29
@malinjawi malinjawi force-pushed the split/delta-dv-delete-diagnostics-benchmark branch from 6c1aff8 to 8c10d31 Compare June 1, 2026 14:54
@malinjawi malinjawi force-pushed the split/delta-dv-delete-diagnostics-benchmark branch from d399df9 to 0a1b008 Compare June 1, 2026 16:22
@malinjawi malinjawi force-pushed the split/delta-dv-delete-diagnostics-benchmark branch from 0a1b008 to 016838d Compare June 1, 2026 17:20
malinjawi added 9 commits June 1, 2026 20:53
Route Delta DELETE commands with persistent deletion vectors through the Gluten-specific command while leaving metadata-only, full-table, and non-DV cases on the existing Delta path.

Add Delta 3.3 and Delta 4.0 coverage for persistent DV DELETE routing and repeated deletion-vector updates.

Validation: git diff --cached --check; mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests; mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests.
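
The routing rule in this commit message can be read as a short decision function. The following Python sketch is an assumption-laden illustration rather than the actual Scala rule: `DeleteRequest`, `DeletePath`, and `route_delete` are hypothetical names, and how the real code detects each case is not stated here.

```python
from dataclasses import dataclass
from enum import Enum


class DeletePath(Enum):
    GLUTEN_DV_DELETE = "gluten-specific command"
    EXISTING_DELTA = "existing Delta path"


@dataclass(frozen=True)
class DeleteRequest:
    """Hypothetical description of one DELETE command."""
    persistent_dvs_enabled: bool  # table writes persistent deletion vectors
    has_condition: bool           # False means a full-table DELETE
    metadata_only: bool           # delete resolvable without rewriting rows


def route_delete(req: DeleteRequest) -> DeletePath:
    """Pick the command path for a DELETE, following the commit message."""
    if not req.persistent_dvs_enabled:
        return DeletePath.EXISTING_DELTA   # non-DV case
    if not req.has_condition:
        return DeletePath.EXISTING_DELTA   # full-table case
    if req.metadata_only:
        return DeletePath.EXISTING_DELTA   # metadata-only case
    return DeletePath.GLUTEN_DV_DELETE     # persistent DV DELETE
```

For example, `route_delete(DeleteRequest(True, True, False))` selects the Gluten-specific path, while any of the three excluded cases stays on the existing Delta path.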
@malinjawi malinjawi force-pushed the split/delta-dv-delete-diagnostics-benchmark branch from 016838d to 104a356 Compare June 1, 2026 17:59