
build: Enable Spark SQL tests for Spark 4.1.1 #4093

Draft

andygrove wants to merge 24 commits into apache:main from andygrove:spark-4.1.1

Conversation

@andygrove (Member) commented Apr 26, 2026

Which issue does this PR close?

Rationale for this change

We currently run the Spark SQL test suites against 3.4.3, 3.5.8, and 4.0.1 in CI. This PR adds Spark 4.1.1 to that matrix so we can catch incompatibilities introduced by that release as early as possible.

What changes are included in this PR?

A new dev/diffs/4.1.1.diff, generated by applying dev/diffs/4.0.1.diff to the v4.1.1 Spark tag with git apply --reject and resolving the rejects manually. Most rejects were import-only differences caused by surrounding-context changes in 4.1.1 (for example, ShuffleExchangeExec vs. ShuffleExchangeLike, and the additional Comet imports). The pom.xml portion of the diff sets spark.version.short=4.1 so that the patched Spark build pulls comet-spark-spark4.1_2.13.

Workflow update: a {spark-short: '4.1', spark-full: '4.1.1', java: 17, scan-impl: 'auto'} entry added to .github/workflows/spark_sql_test.yml.

Docs update: expanded docs/source/contributor-guide/spark-sql-tests.md with two new sections covering (1) adding a profile and shim tree for a brand-new Spark minor version, including the API-shape patterns above, and (2) running the SQL tests locally against a patched Spark clone (NOLINT_ON_COMPILE=true to skip Spark's scalastyle checks, and the -Dmaven.repo.local=/tmp/spark-m2-repo workaround for partially cached ~/.m2 entries that confuse Coursier).

How are these changes tested?

Locally: make release PROFILES=-Pspark-4.1 builds cleanly. Against the patched Spark 4.1.1 source with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true:

  • catalyst/test: 8472 tests passed.
  • sql/testOnly org.apache.spark.sql.MathFunctionsSuite: 61 tests passed.

The full SQL/Hive matrix will run as part of Spark SQL Tests on this PR; that CI run itself serves as the broader test plan.

Add a spark-4.1 Maven profile (root pom.xml and spark/pom.xml) targeting
Spark 4.1.1, generate dev/diffs/4.1.1.diff from the 4.0.1 diff, and add a
Spark 4.1.1 entry to the spark_sql_test workflow matrix.

The spark-4.1 profile reuses the spark-4.0 shim sources for now. Iceberg
and Jetty deps mirror the spark-4.0 profile.
…ment new-version workflow

Spark 4.1 widened IndexShuffleBlockResolver's third constructor parameter
from java.util.Map to java.util.concurrent.ConcurrentMap, so reflectively
invoking it with Collections.emptyMap() now fails with "argument type
mismatch". Pass a ConcurrentHashMap instead, which is assignable to both.

Also expand docs/source/contributor-guide/spark-sql-tests.md with the
profile/shim setup and local sbt workflow lessons (NOLINT_ON_COMPILE,
fresh -Dmaven.repo.local, common API-shape changes between Spark
minors).
Mirror spark-4.0 test shim tree so JVM tests can compile under -Pspark-4.1.
Wire isSpark41Plus into CometPlanStabilitySuite and update regenerate-golden-files.sh to accept --spark-version 4.1. Generated golden files for tpcds v1.4 and v2.7 under approved-plans-{v1_4,v2_7}-spark4_1.
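For reference, a minimal sketch of what a version gate like isSpark41Plus can look like (Comet's real helper lives in its shim/test utilities and may be implemented differently; this stand-alone version is an assumption):

```scala
import org.apache.spark.SPARK_VERSION

// Illustrative only: parses org.apache.spark.SPARK_VERSION (e.g. "4.1.1").
def isSpark41Plus: Boolean = {
  val Array(major, minor) = SPARK_VERSION.split('.').take(2).map(_.toInt)
  major > 4 || (major == 4 && minor >= 1)
}
```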
Mirrors the spark-4.0 CometTypeShim helper that apache#4084 added to the scan rule. VariantMetadata.isVariantStruct exists in Spark 4.1.1 (in PushVariantIntoScan.scala), so the implementation is identical to spark-4.0.
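As a shape reference only, a hedged sketch of such a shim (the delegate is stubbed here so the sketch compiles stand-alone; the real shim forwards to Spark's internal VariantMetadata.isVariantStruct):

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Assumed shape, not the actual Comet source.
trait CometTypeShim {
  def isVariantStruct(dt: DataType): Boolean = dt match {
    case s: StructType => isVariantTagged(s)
    case _             => false
  }
  // Stand-in for VariantMetadata.isVariantStruct (Spark-internal, 4.0/4.1).
  protected def isVariantTagged(s: StructType): Boolean
}
```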
Comet's Maven phase resolves the transitive dependency graph and pulls POMs for artifacts whose JARs it never needs. When sbt then resolves Spark's dependencies, sbt-coursier sees the POM in mavenLocal, declares the artifact found locally, and fails on the missing JAR without falling back to Maven Central. Delete pom-only entries (where the packaging is jar/bundle) between Comet's install and sbt's update so that sbt re-fetches the full artifact remotely. This hit the Spark 4.1 jobs because the cached scala-xml_2.13:2.1.0 POM blocked Spark's tags subproject from resolving the JAR.
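A hedged Scala sketch of that cleanup idea (the actual change is a CI shell step; the repository path and the packaging heuristic below are assumptions):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

object CleanPomOnlyEntries extends App {
  val repo = Paths.get(sys.props("user.home"), ".m2", "repository")
  Files.walk(repo).iterator().asScala
    .filter(_.toString.endsWith(".pom"))
    .foreach { pom =>
      val jar = Paths.get(pom.toString.stripSuffix(".pom") + ".jar")
      val xml = new String(Files.readAllBytes(pom), "UTF-8")
      // Maven's default packaging is jar when <packaging> is absent.
      val producesJar = !xml.contains("<packaging>") ||
        xml.contains("<packaging>jar</packaging>") ||
        xml.contains("<packaging>bundle</packaging>")
      // Drop pom-only entries so sbt-coursier re-fetches the full artifact.
      if (producesJar && !Files.exists(jar)) Files.delete(pom)
    }
}
```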
Adds a Spark 4.1, JDK 17 entry to lint-java and linux-test in pr_build_linux.yml and a Spark 4.1, Scala 2.13 entry to pr_build_macos.yml so PR builds exercise the new spark-4.1 profile alongside 3.4/3.5/4.0.
semanticdb-scalac_2.13.17:4.13.6 is not published; the most recent semanticdb build for the 2.13 branch is for 2.13.16. Scala 2.13 is binary-stable across patch versions, so Comet compiled with 2.13.16 still links against Spark 4.1.1's 2.13.17 runtime.
Reverts the 2.13.16 pin: Spark 4.1.1 is compiled against 2.13.17 and emits calls into MurmurHash3 stdlib methods that don't exist in 2.13.16, so any TreeNode hashCode at runtime throws NoSuchMethodError. semanticdb-scalac_2.13.17 isn't published yet, so drop the spark-4.1 entry from the lint-java matrix (which runs -Psemanticdb scalafix); the linux-test matrix entry stays. Verified locally: ./mvnw test -Pspark-4.1 -Dsuites=org.apache.comet.CometFuzzMathSuite passes 30/30.
The scala-2.13 profile sets scala.version=2.13.16, overriding the spark-4.1 profile's 2.13.17 pin. Activating both produced a 2.13.16 runtime that hits NoSuchMethodError on Spark 4.1.1's TreeNode.hashCode (calls into MurmurHash3 stdlib methods added in 2.13.17). The spark-4.1 profile already sets scala.binary.version=2.13, so -Pscala-2.13 is redundant.
1. CometNativeWriteExec: Spark 4.1 made the newTaskTempFile(taskContext, dir, ext: String) overload throw mustOverrideOneMethodError by default; the FileNameSpec overload is now the supported one. Switch the call to FileNameSpec("", ""), which exists from 3.4 onward, so it works across all profiles (see the sketch after this list).

2. CometExpressionSuite remainder function test: Spark 4.1 introduced REMAINDER_BY_ZERO for the % operator instead of reusing DIVIDE_BY_ZERO. Branch the expected error message on isSpark41Plus.
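A hedged sketch of the call-site change in (1); the helper and its surroundings are assumptions, not Comet's actual CometNativeWriteExec code:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.{FileCommitProtocol, FileNameSpec}

def newTempFile(committer: FileCommitProtocol, ctx: TaskAttemptContext): String = {
  // Before (the String-ext overload's default impl throws on Spark 4.1):
  //   committer.newTaskTempFile(ctx, None, "")
  // After: the FileNameSpec overload, available on 3.4 through 4.1.
  committer.newTaskTempFile(ctx, None, FileNameSpec("", ""))
}
```

And for (2), the expected error class can branch on the version gate (test scaffolding assumed):

```scala
// Spark 4.1 reports % by zero as REMAINDER_BY_ZERO; earlier versions reuse
// DIVIDE_BY_ZERO. The flag is the isSpark41Plus gate sketched earlier.
def expectedModByZeroErrorClass(isSpark41Plus: Boolean): String =
  if (isSpark41Plus) "REMAINDER_BY_ZERO" else "DIVIDE_BY_ZERO"
```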
@andygrove changed the title from "build: add spark-4.1 profile and enable Spark 4.1.1 SQL tests" to "build: add spark-4.1 profile and enable Spark 4.1.1 SQL tests [WIP]" on Apr 26, 2026
import org.apache.spark.sql.types.{DataType, StringType, StructType}

trait CometTypeShim {
  // A `StringType` carries collation metadata in Spark 4.0. Only non-default (non-UTF8_BINARY)
Member

If this is common for Spark 4.0 and Spark 4.1, we can move it from spark-4.0 to spark-4.x.

Member

It looks like most shims in Spark 4.1 are identical to Spark 4.0 except for CometExprShim. I added a CometSumShim for Spark 4.1 in #2829 and moved the other shims from spark-4.0 to spark-4.x.
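For orientation, the shim trees discussed here are laid out roughly like this (an assumed sketch of Comet's source layout, not verbatim):

```
spark/src/main/
  spark-3.x/   shims shared across the 3.4/3.5 profiles
  spark-4.x/   shims shared by 4.0 and 4.1 (the move suggested above)
  spark-4.0/   4.0-only shims
  spark-4.1/   4.1-only shims, e.g. CometExprShim and the CometSumShim from #2829
```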

@manuzhang (Member)

@andygrove Has this been superseded by #4097 and #4106?

@andygrove changed the title from "build: add spark-4.1 profile and enable Spark 4.1.1 SQL tests [WIP]" to "build: Enable Spark SQL test for Spark 4.1.1" on Apr 28, 2026
@andygrove changed the title from "build: Enable Spark SQL test for Spark 4.1.1" to "build: Enable Spark SQL tests for Spark 4.1.1" on Apr 28, 2026
@andygrove andygrove marked this pull request as ready for review April 28, 2026 03:52
@andygrove (Member, Author)

> @andygrove Has this been superseded by #4097 and #4106?

This PR has been rebased and now just enables the Spark SQL tests.

Patches dev/diffs/4.1.1.diff to handle the four failing CI jobs:

- CollationSuite 'hash agg is not used for non binary collations':
  recognise CometHashAggregateExec in the binary-collation plan-shape
  assertion (the existing patch already handled the join-shape cases;
  a sketch follows this list).

- StreamRealTimeModeAllowlistSuite: mix in IgnoreCometSuite. The
  suite asserts on hard-coded operator-name strings that Comet
  replaces, and the allowlist mechanism is orthogonal to Comet.

- Disable Comet for hll.sql, except-all.sql, intersect-all.sql,
  having-and-order-by-recursive-type-name-resolution.sql via
  --SET spark.comet.enabled = false, and tag SPARK-53968 in
  SQLViewSuite with IgnoreComet. Each carries a TODO link to the
  follow-up issue tracking the underlying Comet bug:

  - issues/4121: native scan rejects invalid UTF-8 in STRING column
  - issues/4122: EXCEPT ALL / INTERSECT ALL with GROUP BY incorrect
  - issues/4123: native sort lacks row-format support for Struct(Map)
  - issues/4124: decimal arithmetic 10x discrepancy through view CTE
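A hedged sketch of the relaxed plan-shape check for the CollationSuite case above (the real change is a hunk in dev/diffs/4.1.1.diff; this helper and its name are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.aggregate.HashAggregateExec

def usesHashAggregate(df: DataFrame): Boolean =
  df.queryExecution.executedPlan.collectFirst {
    case p: HashAggregateExec => p
    // Comet replaces Spark's operator; matching by simple class name keeps
    // this sketch compilable without Comet on the classpath.
    case p if p.getClass.getSimpleName == "CometHashAggregateExec" => p
  }.isDefined
```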
@andygrove andygrove marked this pull request as draft April 28, 2026 18:12
andygrove and others added 5 commits April 28, 2026 16:33
Skip 9 tests failing in Spark 4.1.1 SQL test CI:
- ParquetIOSuite: 5 "missing all struct fields" tests (apache#4136)
- CachedBatchSerializerNoUnwrapSuite: plan replacement (apache#4137)
- JoinSuite: SPARK-49386 SMJ spill by size threshold
- SQLQueryTestSuite: intersect-all.sql trailing whitespace in golden files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark 4.1.1 SQL tests are being killed in CI, likely from the JVM
hitting the 16 GB ubuntu-24.04 runner ceiling. Raise SBT to 6 GB via
-mem 6144 and set _JAVA_OPTIONS so forked test JVMs get 4 GB heap,
1 GB metaspace, and 512 MB code cache. Apply to all Spark versions
since the headroom is harmless for 3.4/3.5/4.0.
_JAVA_OPTIONS applies to every JVM on the runner, including the SBT
launcher started by build/sbt. With -mem 6144, the launcher already
sets -Xms6g -Xmx6g; _JAVA_OPTIONS=-Xmx4g then overrides Xmx, leaving
Xms (6g) > Xmx (4g) and the JVM refuses to start. Keep -mem 6144 for
the SBT JVM and drop _JAVA_OPTIONS.
Standard ubuntu-24.04 GitHub-hosted runners (4 vCPU / 16 GB) are
running out of memory on the Spark 4.1.1 SQL test matrix. Switch
the spark-sql-test job to the same runs-on.com self-hosted pool
already used by pr_build_linux.yml: family=m8a+m7a+c8a, cpu=16,
which yields 32-64 GB RAM depending on family selection. Forks
without access to the Apache pool fall back to ubuntu-latest.

Drop the now-unused matrix.os axis (only ubuntu-24.04 was listed).
…L files

Two follow-ups from the first runs-on.com run on spark-4.1.1:

1. HiveSparkSubmitSuite (SPARK-8020) failed on the runs-on.com
   ubuntu24-full-x64 image because spark-submit's Ivy config references a
   'local-m2-cache' resolver that the image does not provide. Pin every
   sql_hive-* matrix entry back to the standard ubuntu-24.04 runner where
   the resolver is preconfigured. Catalyst and sql_core entries continue
   to use the runs-on.com 16-cpu pool.

2. Three SQLQueryTestSuite cases newly visible after the OOMs were
   resolved are added to ignoreList in the 4.1.1 spark diff:
   - replacing-missing-expression-with-alias.sql and in-set-operations.sql
     have ORDER BY ties whose row order is non-deterministic under high
     parallelism (revealed by the faster runner, not a Comet bug).
   - thetasketch.sql fails with a CometNativeException because
     theta_sketch binary output is read as UTF-8 over collated strings;
     skip pending Comet support or fallback for theta-sketch on collation.