build: Enable Spark SQL tests for Spark 4.1.1 #4093
Conversation
Add a spark-4.1 Maven profile (root pom.xml and spark/pom.xml) targeting Spark 4.1.1, generate dev/diffs/4.1.1.diff from the 4.0.1 diff, and add a Spark 4.1.1 entry to the spark_sql_test workflow matrix. The spark-4.1 profile reuses the spark-4.0 shim sources for now. Iceberg and Jetty deps mirror the spark-4.0 profile.
…ment new-version workflow

Spark 4.1 widened IndexShuffleBlockResolver's third constructor parameter from java.util.Map to java.util.concurrent.ConcurrentMap, so reflectively invoking it with Collections.emptyMap() now fails with "argument type mismatch". Pass a ConcurrentHashMap instead, which is assignable to both.

Also expand docs/source/contributor-guide/spark-sql-tests.md with the profile/shim setup and local sbt workflow lessons (NOLINT_ON_COMPILE, fresh -Dmaven.repo.local, common API-shape changes between Spark minors).
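For illustration, a minimal self-contained sketch of the assignability point (the object and value names are made up; the real fix sits inside Comet's reflective construction of IndexShuffleBlockResolver):

```scala
import java.util.Collections
import java.util.concurrent.{ConcurrentHashMap, ConcurrentMap}

object ArgTypeSketch extends App {
  // Collections.emptyMap() is a java.util.Map but not a ConcurrentMap, so a
  // reflective call against Spark 4.1's widened parameter type fails with
  // "argument type mismatch".
  val before = Collections.emptyMap[String, String]()
  println(before.isInstanceOf[ConcurrentMap[_, _]]) // false

  // ConcurrentHashMap satisfies both parameter types, so one call site
  // works for Spark <= 4.0 (java.util.Map) and 4.1 (ConcurrentMap).
  val after = new ConcurrentHashMap[String, String]()
  println(after.isInstanceOf[java.util.Map[_, _]]) // true
  println(after.isInstanceOf[ConcurrentMap[_, _]]) // true
}
```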
Mirror spark-4.0 test shim tree so JVM tests can compile under -Pspark-4.1.
Wire isSpark41Plus into CometPlanStabilitySuite and update regenerate-golden-files.sh to accept --spark-version 4.1. Generated golden files for tpcds v1.4 and v2.7 under approved-plans-{v1_4,v2_7}-spark4_1.
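As a sketch, the version gate can be as simple as a string comparison (the helper below is a stand-in; Comet's actual shim helper may be defined differently):

```scala
object VersionGateSketch extends App {
  // Plain string comparison is enough for the version shapes in the test
  // matrix (3.4.x through 4.1.x); it would mis-order a hypothetical "4.10".
  def isSpark41Plus(sparkVersion: String): Boolean = sparkVersion >= "4.1"

  assert(!isSpark41Plus("4.0.2"))
  assert(isSpark41Plus("4.1.1"))

  // For example, selecting the golden-file directory suffix:
  val suffix = if (isSpark41Plus("4.1.1")) "spark4_1" else "spark4_0"
  println(s"approved-plans-v1_4-$suffix")
}
```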
Mirrors the spark-4.0 CometTypeShim helper that apache#4084 added to the scan rule. VariantMetadata.isVariantStruct exists in Spark 4.1.1 (in PushVariantIntoScan.scala) so the implementation is identical to spark-4.0.
Comet's Maven phase resolves the transitive dependency graph and pulls POMs for artifacts whose JARs it never needs. When sbt then resolves Spark's deps, sbt-coursier sees the POM in mavenLocal, declares the artifact found locally, and fails on the missing JAR without falling back to Maven Central. Delete pom-only entries (where packaging is jar/bundle) between Comet's install and sbt's update so sbt re-fetches the full artifact remotely. Hit on Spark 4.1 jobs because the cached scala-xml_2.13:2.1.0 pom blocks Spark's tags subproject from resolving the JAR.
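A minimal sketch of the cleanup idea, assuming the default ~/.m2/repository location (the real step also checks that the POM's packaging is jar/bundle, which this sketch skips):

```scala
import java.io.File

object PurgePomOnlyEntries extends App {
  val repo = new File(sys.props("user.home"), ".m2/repository")

  // Recursively collect every file and directory under the repo.
  def walk(dir: File): Seq[File] = {
    val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    children ++ children.filter(_.isDirectory).flatMap(walk)
  }

  walk(repo)
    .filter(f => f.isFile && f.getName.endsWith(".pom"))
    .map(_.getParentFile)
    .distinct
    // Keep only version directories that have a POM but no JAR.
    .filter(dir => !dir.listFiles().exists(_.getName.endsWith(".jar")))
    .foreach { dir =>
      println(s"pom-only entry, removing: $dir")
      dir.listFiles().foreach(_.delete()) // files first, then the directory
      dir.delete()
    }
}
```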
Adds a Spark 4.1, JDK 17 entry to lint-java and linux-test in pr_build_linux.yml and a Spark 4.1, Scala 2.13 entry to pr_build_macos.yml so PR builds exercise the new spark-4.1 profile alongside 3.4/3.5/4.0.
semanticdb-scalac_2.13.17:4.13.6 is not published; the most recent semanticdb build for the 2.13 branch is for 2.13.16. Scala 2.13 is binary-stable across patch versions, so Comet compiled with 2.13.16 still links against Spark 4.1.1's 2.13.17 runtime.
Reverts the 2.13.16 pin: Spark 4.1.1 is compiled against 2.13.17 and emits calls into MurmurHash3 stdlib methods that don't exist in 2.13.16, so any TreeNode hashCode at runtime throws NoSuchMethodError. semanticdb-scalac_2.13.17 isn't published yet, so drop the spark-4.1 entry from the lint-java matrix (which runs -Psemanticdb scalafix); the linux-test matrix entry stays. Verified locally: ./mvnw test -Pspark-4.1 -Dsuites=org.apache.comet.CometFuzzMathSuite passes 30/30.
The scala-2.13 profile sets scala.version=2.13.16, overriding the spark-4.1 profile's 2.13.17 pin. Activating both produced a 2.13.16 runtime that hits NoSuchMethodError on Spark 4.1.1's TreeNode.hashCode (calls into MurmurHash3 stdlib methods added in 2.13.17). The spark-4.1 profile already sets scala.binary.version=2.13, so -Pscala-2.13 is redundant.
1. CometNativeWriteExec: Spark 4.1 made newTaskTempFile(taskContext, dir, ext: String) throw mustOverrideOneMethodError by default; the FileNameSpec overload is now the supported one. Switch the call to use FileNameSpec("", ""), which exists from 3.4 onward, so the call works across all profiles (sketched after this list).
2. CometExpressionSuite remainder function test: Spark 4.1 introduced REMAINDER_BY_ZERO for the % operator instead of reusing DIVIDE_BY_ZERO. Branch the expected error message on isSpark41Plus.
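To make the shape of fix 1 concrete, here is a toy, self-contained model; CommitProtocolToy and its FileNameSpec are stand-ins, not Spark's FileCommitProtocol API:

```scala
// Stand-in for the spec type added in Spark 3.4.
final case class FileNameSpec(prefix: String, suffix: String)

abstract class CommitProtocolToy {
  // Spark 4.1 analogue: the String-extension overload now throws unless a
  // subclass overrides it.
  def newTaskTempFile(dir: Option[String], ext: String): String =
    throw new UnsupportedOperationException("must override one newTaskTempFile")

  // The supported entry point in this toy, as in Spark 3.4+.
  def newTaskTempFile(dir: Option[String], spec: FileNameSpec): String
}

object ToyProtocol extends CommitProtocolToy {
  override def newTaskTempFile(dir: Option[String], spec: FileNameSpec): String =
    s"${dir.getOrElse("/tmp")}/${spec.prefix}part-00000${spec.suffix}"
}

object FileNameSpecSketch extends App {
  // The fix amounts to calling the overload that exists and works on every
  // supported Spark version:
  println(ToyProtocol.newTaskTempFile(None, FileNameSpec("", "")))
}
```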
```scala
import org.apache.spark.sql.types.{DataType, StringType, StructType}

trait CometTypeShim {
  // A `StringType` carries collation metadata in Spark 4.0. Only non-default (non-UTF8_BINARY)
```
If this is common for Spark 4.0 and Spark 4.1, we can move it from spark-4.0 to spark-4.x.
It looks like most shims in Spark 4.1 are identical to Spark 4.0 except for CometExprShim. I added a CometSumShim for Spark 4.1 in #2829 and moved other shims from spark-4.0 to spark-4.x.
This was addressed in another PR that is now merged, and this PR was rebased.
@andygrove Has this been superseded by #4097 and #4106?
This PR has been rebased and now just enables the Spark SQL tests.
Patches dev/diffs/4.1.1.diff to handle the four failing CI jobs:

- CollationSuite 'hash agg is not used for non binary collations': recognise CometHashAggregateExec in the binary-collation plan-shape assertion (the existing patch already handled the join-shape cases).
- StreamRealTimeModeAllowlistSuite: mix in IgnoreCometSuite. The suite asserts on hard-coded operator-name strings that Comet replaces, and the allowlist mechanism is orthogonal to Comet.
- Disable Comet for hll.sql, except-all.sql, intersect-all.sql, having-and-order-by-recursive-type-name-resolution.sql via --SET spark.comet.enabled = false, and tag SPARK-53968 in SQLViewSuite with IgnoreComet.

Each carries a TODO link to the follow-up issue tracking the underlying Comet bug:

- issues/4121: native scan rejects invalid UTF-8 in STRING column
- issues/4122: EXCEPT ALL / INTERSECT ALL with GROUP BY incorrect
- issues/4123: native sort lacks row-format support for Struct(Map)
- issues/4124: decimal arithmetic 10x discrepancy through view CTE
Skip 9 tests failing in Spark 4.1.1 SQL test CI:

- ParquetIOSuite: 5 "missing all struct fields" tests (apache#4136)
- CachedBatchSerializerNoUnwrapSuite: plan replacement (apache#4137)
- JoinSuite: SPARK-49386 SMJ spill by size threshold
- SQLQueryTestSuite: intersect-all.sql trailing whitespace in golden files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark 4.1.1 SQL tests are being killed in CI, likely from the JVM hitting the 16 GB ubuntu-24.04 runner ceiling. Raise SBT to 6 GB via -mem 6144 and set _JAVA_OPTIONS so forked test JVMs get 4 GB heap, 1 GB metaspace, and 512 MB code cache. Apply to all Spark versions since the headroom is harmless for 3.4/3.5/4.0.
_JAVA_OPTIONS applies to every JVM on the runner, including the SBT launcher started by build/sbt. With -mem 6144, the launcher already sets -Xms6g -Xmx6g; _JAVA_OPTIONS=-Xmx4g then overrides Xmx, leaving Xms (6g) > Xmx (4g) and the JVM refuses to start. Keep -mem 6144 for the SBT JVM and drop _JAVA_OPTIONS.
Standard ubuntu-24.04 GitHub-hosted runners (4 vCPU / 16 GB) are running out of memory on the Spark 4.1.1 SQL test matrix. Switch the spark-sql-test job to the same runs-on.com self-hosted pool already used by pr_build_linux.yml: family=m8a+m7a+c8a, cpu=16, which yields 32-64 GB RAM depending on family selection. Forks without access to the Apache pool fall back to ubuntu-latest. Drop the now-unused matrix.os axis (only ubuntu-24.04 was listed).
…L files
Two follow-ups from the first runs-on.com run on spark-4.1.1:
1. HiveSparkSubmitSuite (SPARK-8020) failed on the runs-on.com
ubuntu24-full-x64 image because spark-submit's Ivy config references a
'local-m2-cache' resolver that the image does not provide. Pin every
sql_hive-* matrix entry back to the standard ubuntu-24.04 runner where
the resolver is preconfigured. Catalyst and sql_core entries continue
to use the runs-on.com 16-cpu pool.
2. Three SQLQueryTestSuite cases newly visible after the OOMs were
resolved are added to ignoreList in the 4.1.1 spark diff:
- replacing-missing-expression-with-alias.sql and in-set-operations.sql
have ORDER BY ties whose row order is non-deterministic under high
parallelism (revealed by the faster runner, not a Comet bug).
- thetasketch.sql fails with a CometNativeException because
theta_sketch binary output is read as UTF-8 over collated strings;
skip pending Comet support or fallback for theta-sketch on collation.
After resolving the OOMs and pinning Hive jobs, two more failing jobs surfaced on the runs-on.com run, all rooted in the same plan-shape divergence: tests collect on concrete Spark types (UnionExec, ShuffleExchangeExec, ReusedExchangeExec) that Comet replaces with its own operators, so the collected list is empty and .head / .size == 1 fails. Two follow-up edits to the 4.1.1 diff:

1. StreamingQuerySuite: extend the existing ShuffleExchangeExec -> ShuffleExchangeLike substitution to the SPARK-53942 cases (stateless partition change and stateful partitions retained from checkpoint). CometColumnarExchange satisfies ShuffleExchangeLike, so the numPartitions assertions still validate the expected behavior.

2. DataFrameSetOperationsSuite: tag the three SPARK-52921 union partitioning tests with IgnoreComet. UnionExec and ReusedExchangeExec have no equivalent broad trait, so skipping is the cleanest path until Comet exposes compatible plan markers.
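A toy, self-contained model of why matching the trait keeps the assertion in item 1 meaningful (the case classes below are stand-ins, not Spark's or Comet's operators):

```scala
trait ShuffleExchangeLike { def numPartitions: Int }
final case class ShuffleExchangeExec(numPartitions: Int) extends ShuffleExchangeLike
final case class CometColumnarExchange(numPartitions: Int) extends ShuffleExchangeLike

object PlanShapeSketch extends App {
  // After Comet rewrites the plan, only its own exchange operator remains.
  val cometPlan: Seq[Any] = Seq(CometColumnarExchange(numPartitions = 10))

  // Matching the concrete Spark class now collects nothing...
  assert(cometPlan.collect { case s: ShuffleExchangeExec => s }.isEmpty)

  // ...while matching the shared trait still validates the partition count.
  val exchanges = cometPlan.collect { case s: ShuffleExchangeLike => s }
  assert(exchanges.map(_.numPartitions) == Seq(10))
}
```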
The three SPARK-52921 IgnoreComet tags added in the previous commit described the symptom in the reason argument; convention is to point to a tracking issue instead so the reason stays useful as Comet evolves. Use the umbrella issue apache#4098.
To iterate faster on the spark-4.1.1 work, narrow the CI surface to the Spark SQL Tests workflow with only the 4.1.1 matrix entry, and gate the heavier workflows behind a never-matching paths filter so they no longer trigger on PRs from this branch:

- spark_sql_test.yml: 3.4 / 3.5 / 4.0 matrix entries commented out
- pr_build_linux.yml, pr_build_macos.yml, iceberg_spark_test.yml, miri.yml, codeql.yml: pull_request trigger restricted to 'DISABLED_FOR_spark-4.1.1_iteration'

Each change carries a TEMP comment explaining how to revert before this branch merges back to main.
After the previous commit replaced every ShuffleExchangeExec match with ShuffleExchangeLike in StreamingQuerySuite, the import became unused and Spark's fatal-warnings policy turned it into a build break:

    StreamingQuerySuite.scala:46:100: Unused import [fatal]

Remove ShuffleExchangeExec from the exchange import list; the other imports (REQUIRED_BY_STATEFUL_OPERATOR, ReusedExchangeExec) are still used elsewhere in the file.
This reverts commit fff9158.
The 'bounded memory usage calculation - with colFamiliesEnabled=true (with changelog checkpointing)' case fails the assertion at line 420 (getNumRocksDBInstances(false) == 0) when run on the runs-on.com 16-cpu pool. The failure is upstream test isolation: the suite shares RocksDBMemoryManager state with other tests in the JVM and only the first of four parameterized variants flakes, depending on what was left registered by earlier tests. Comet does not touch state stores, so this is not a Comet-specific behavior. Mix in IgnoreCometSuite on RocksDBStateStoreIntegrationSuite to skip the whole suite while Comet is enabled. Track under the umbrella issue apache#4098.
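For reference, a hedged sketch of what an IgnoreCometSuite-style mixin typically looks like (the trait body and the ENABLE_COMET check are assumptions for illustration; Comet's real trait lives in the Spark diff):

```scala
import org.scalactic.source.Position
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

trait IgnoreCometSuiteSketch extends AnyFunSuite {
  private def cometEnabled: Boolean =
    sys.env.get("ENABLE_COMET").contains("true")

  // Register every test in the mixing suite as ignored while Comet is on.
  override protected def test(name: String, tags: Tag*)(body: => Any)(
      implicit pos: Position): Unit =
    if (cometEnabled) ignore(name, tags: _*)(body)
    else super.test(name, tags: _*)(body)
}
```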
# Conflicts:
#   .github/workflows/spark_sql_test.yml
This finally passes CI 😅 We're going to wait until #4112 is merged before merging this one.
```yaml
- {spark-short: '3.5', spark-full: '3.5.8', java: 11, scan-impl: 'auto'}
- {spark-short: '4.0', spark-full: '4.0.2', java: 17, scan-impl: 'auto'}
- {spark-short: '4.0', spark-full: '4.0.2', java: 21, scan-impl: 'auto'}
- {spark-short: '4.1', spark-full: '4.1.1', java: 17, scan-impl: 'auto'}
```
Eventually, yes. This PR is for 4.1.1 but we can go through the same process for 4.2 in a separate PR.
| - test("Explain formatted output for scan operator for datasource V2") { | ||
| + test("Explain formatted output for scan operator for datasource V2", | ||
| + IgnoreComet("Comet explain output is different")) { |
I'm not at my desk, but this is likely a test we already ignored for other Spark versions rather than one new to 4.1.1.
The tests ignored for 4.1.1 are linked to tracking issues.
Same ignore for 3.4.3:

```
% grep -B 1 "IgnoreComet.*(\"" dev/diffs/3.4.3.diff | grep -B 1 "explain output is different"
+ test("Explain formatted output for scan operator for datasource V2",
+ IgnoreComet("Comet explain output is different")) {
--
+ test("explain with table on DSv1 data source",
+ IgnoreComet("Comet explain output is different")) {
```
| test("Runtime bloom filter join: add bloom filter if dpp filter exists on " + | ||
| - "a different column") { | ||
| + "a different column", | ||
| + IgnoreComet("TODO: Support SubqueryBroadcastExec in Comet: #242")) { |
oh, this is pretty old and looks useful
Also ignored since 3.4.3, but we can probably enable these now that we have DPP:

```
% grep -B 1 "IgnoreComet.*(\"" dev/diffs/3.4.3.diff | grep -B 1 "242"
+ "on the same column",
+ IgnoreComet("TODO: Support SubqueryBroadcastExec in Comet: #242")) {
--
+ "a different column",
+ IgnoreComet("TODO: Support SubqueryBroadcastExec in Comet: #242")) {
```
| test("INCONSISTENT_BEHAVIOR_CROSS_VERSION: " + | ||
| - "compatibility with Spark 2.4/3.2 in reading/writing dates") { | ||
| + "compatibility with Spark 2.4/3.2 in reading/writing dates", | ||
| + IgnoreComet("Comet doesn't completely support datetime rebase mode yet")) { |
Ignored since 3.4.3:

```
% grep -B 1 "IgnoreComet.*(\"" dev/diffs/3.4.3.diff | grep -B 1 "datetime rebase"
+ "compatibility with Spark 2.4/3.2 in reading/writing dates",
+ IgnoreComet("Comet doesn't completely support datetime rebase mode yet")) {
```
| - test("SPARK-30201 HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT") { | ||
| + test("SPARK-30201 HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT", | ||
| + IgnoreComet("Comet does not support reading non UTF-8 strings")) { |
Yes, from the official Rust docs:

> The String type, which is provided by Rust’s standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type
Thanks @comphead, we should probably audit some of these older ignores.
Which issue does this PR close?
Rationale for this change
We currently run the Spark SQL test suites against 3.4.3, 3.5.8, and 4.0.1 in CI. This PR adds Spark 4.1.1 to that matrix so we can catch incompatibilities introduced by that release as early as possible.
What changes are included in this PR?
- New `dev/diffs/4.1.1.diff` generated by applying `dev/diffs/4.0.1.diff` to the `v4.1.1` Spark tag with `git apply --reject` and resolving rejects manually. Most rejects were import-only differences caused by surrounding-context changes in 4.1.1 (for example, `ShuffleExchangeExec` vs `ShuffleExchangeLike` and the additional Comet imports). The `pom.xml` portion of the diff sets `spark.version.short=4.1` so the patched Spark build pulls `comet-spark-spark4.1_2.13`.
- Workflow update: `{spark-short: '4.1', spark-full: '4.1.1', java: 17, scan-impl: 'auto'}` entry added to `.github/workflows/spark_sql_test.yml`.
- Docs update: expanded `docs/source/contributor-guide/spark-sql-tests.md` with two new sections covering (1) adding a profile and shim tree for a brand-new Spark minor, including the API-shape patterns above, and (2) running the SQL tests locally against a patched Spark clone (`NOLINT_ON_COMPILE=true` to skip Spark's scalastyle, and the `-Dmaven.repo.local=/tmp/spark-m2-repo` workaround for partially-cached `~/.m2` entries that confuse Coursier).

How are these changes tested?

Locally:

- `make release PROFILES=-Pspark-4.1` builds cleanly.
- Against the patched Spark 4.1.1 source with `ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`:
  - `catalyst/test`: 8472 tests passed.
  - `sql/testOnly org.apache.spark.sql.MathFunctionsSuite`: 61 tests passed.

The full SQL/Hive matrix will run as part of Spark SQL Tests on this PR; the run itself is the broader test plan.