
Add Apache Spark 4.0 support (#787) #941

Open
imarios wants to merge 7 commits into typelevel:master from imarios:spark-4-support

imarios (Contributor) commented May 20, 2026

Summary

Adds an in-tree frameless-*-spark40 module set targeting Spark 4.0.2, addressing #787. Spark 4 is built for Scala 2.13 only (2.12 was dropped) and requires JDK 17+.

The approach keeps the existing per-Spark-version source-overlay pattern (spark-3 / spark-3.4+) and adds a spark-4 overlay — no external shim dependency. All version-divergent Catalyst access is isolated behind FramelessInternals.

Key adaptations for Spark 4:

  • Column no longer wraps a Catalyst Expression. Bridge through classic.ExpressionUtils.column and an eager ColumnNodeToExpressionConverter. (Spark's ExpressionUtils.expression returns a lazy ColumnNodeExpression that is Unevaluable and exposes no children, which broke self-join disambiguation and codegen.)
  • Dataset/SparkSession split into abstract API + classic impl. Internal helpers downcast to classic for logicalPlan / sessionState / sqlContext.
  • ExpressionEncoder now takes a leading AgnosticEncoder (SPARK-49025). We supply a metadata-only JavaBeanEncoder stand-in carrying the correct ClassTag — the encoder field is only read for clsTag and the Option-wrapping check; serializer/deserializer/schema still come from frameless's own expressions.
  • AnalysisException is now errorClass-based; MapGroups gets a spark-4 variant.
  • joinCross re-encodes its result via TypedExpressionEncoder, consistent with the other joins.
  • Hide the new catalyst.expressions.With from TypedColumn's wildcard import.
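
The last bullet relies on Scala's import-hiding syntax. Below is a minimal, Spark-free sketch of that mechanism. The objects `expressionsLike` and `otherLib` and their members are hypothetical stand-ins, not Spark's or frameless's actual classes; only the `{With => _, _}` import form is the point.

```scala
// Hypothetical stand-in for a package whose wildcard import gained a new
// member named `With`. These are toy types, not Spark's Catalyst classes.
object expressionsLike {
  final case class Literal(value: Int)
  final case class With(value: Int)
}

// Hypothetical second source that already provides a `With` the code relies on.
object otherLib {
  final case class With(label: String)
}

object ImportHidingDemo {
  import otherLib._
  // `With => _` hides that single name; the trailing `_` imports everything else.
  import expressionsLike.{With => _, _}

  // Still visible through the wildcard part of the import.
  def literal: Literal = Literal(1)

  // Unambiguous: only otherLib.With is in scope under the simple name `With`.
  def chosen: With = With("kept")
}
```

Hiding one name keeps the rest of the wildcard import intact, so existing references need no other changes.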

Test harness (no-ops on Spark 3.x): disable ANSI mode (Spark 4 default) so property generators keep wrap-around/null semantics, and strip field metadata in SchemaTests.
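
To illustrate why ANSI mode matters for the property generators, here is a Spark-free sketch of the two overflow behaviours. It models the semantics only: in Spark the switch is the `spark.sql.ansi.enabled` setting, and how the harness sets it is not shown in this description. The object and method names are illustrative.

```scala
object AnsiSemanticsSketch {
  // Non-ANSI style: 32-bit addition silently wraps around on overflow.
  def wrappingAdd(a: Int, b: Int): Int = a + b

  // ANSI style: overflow is reported as an error instead of wrapping.
  // java.lang.Math.addExact throws ArithmeticException on overflow.
  def checkedAdd(a: Int, b: Int): Either[String, Int] =
    try Right(java.lang.Math.addExact(a, b))
    catch { case e: ArithmeticException => Left(e.getMessage) }
}
```

A generator that produces `Int.MaxValue` exercises the wrap-around path in the first function but hits the error path in the second, which is why properties written against the old defaults would need ANSI mode turned off.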

CI: adds a JDK 17 leg and pins root-spark40 to Scala 2.13 / JDK 17; the 3.x roots stay on JDK 8.
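
A rough sketch of what such a CI matrix could look like in `build.sbt`, assuming the sbt-typelevel / sbt-github-actions keys. This is not the PR's actual diff: the `root-spark35` project name and the exact matrix value strings (`"2.12"`, `"temurin@8"`, `"temurin@17"`) are assumptions and depend on the plugin version in use.

```scala
// build.sbt fragment (sketch only).
ThisBuild / githubWorkflowJavaVersions := Seq(
  JavaSpec.temurin("8"),  // existing leg for the Spark 3.x roots
  JavaSpec.temurin("17")  // new leg required by Spark 4
)

ThisBuild / githubWorkflowBuildMatrixExclusions ++= Seq(
  // Spark 4 has no Scala 2.12 build, so drop that combination.
  MatrixExclude(Map("project" -> "root-spark40", "scala" -> "2.12")),
  // Spark 4 requires JDK 17, so drop the JDK 8 leg for it.
  MatrixExclude(Map("project" -> "root-spark40", "java" -> "temurin@8")),
  // Keep a 3.x root on JDK 8 only (repeat per 3.x root; name is illustrative).
  MatrixExclude(Map("project" -> "root-spark35", "java" -> "temurin@17"))
)
```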

Test plan

  • dataset-spark40 compiles against Spark 4.0.2 (Scala 2.13, JDK 17)
  • dataset-spark40 test suite: 414/414 passing
  • cats-spark40 / refined-spark40 / ml-spark40 test suites passing
  • No regression: existing Spark 3.3/3.4/3.5 modules still compile and pass on Scala 2.12/2.13
  • End-to-end verification on a 2-worker standalone Spark 4.0.2 cluster (Docker): groupBy/agg, self-join (colLeft/colRight), joinWith tuple decode, and executor-side closures all produce correct results across separate executor JVMs — confirming cross-node serialization
  • Spark Connect support (4+ AgnosticEncoder support & Spark Connect #701) is intentionally out of scope

Co-authored with Claude Code.

imarios and others added 6 commits May 19, 2026 22:33
Adds an in-tree `frameless-*-spark40` module set targeting Spark 4.0.2,
cross-built for Scala 2.13 only (Spark 4 dropped 2.12) and requiring JDK 17.
No external shim dependency: version-divergent Catalyst access is isolated
behind FramelessInternals in a `src/main/spark-4` source overlay, mirroring
the existing spark-3 / spark-3.4+ pattern.

Key adaptations for Spark 4:
- Column no longer wraps a Catalyst Expression; bridge through
  classic.ExpressionUtils.column and an eager ColumnNodeToExpressionConverter
  (the lazy ColumnNodeExpression is Unevaluable and hides children, which broke
  self-join disambiguation and codegen).
- Dataset/SparkSession split into abstract API + classic impl; internal
  helpers downcast to classic for logicalPlan/sessionState/sqlContext.
- ExpressionEncoder now takes a leading AgnosticEncoder (SPARK-49025); supply a
  metadata-only JavaBeanEncoder stand-in carrying the right ClassTag.
- AnalysisException is errorClass-based; MapGroups gets a spark-4 variant.
- joinCross re-encodes its result via TypedExpressionEncoder, consistent with
  the other joins.
- Hide the new catalyst expressions.With from TypedColumn's wildcard import.

Test harness: disable ANSI mode (Spark 4 default) so the property generators
keep their wrap-around/null semantics, and strip field metadata in
SchemaTests. All changes are no-ops on Spark 3.x.

CI: add a JDK 17 leg and pin root-spark40 to Scala 2.13 / JDK 17.

dataset-spark40 passes 414/414 tests; verified end-to-end on a 2-worker
standalone Spark 4.0.2 cluster (groupBy/agg, self-join, joinWith, executor
closures) to confirm cross-node serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adding a JDK 17 CI leg for Spark 4 made sbt-typelevel run the Generate Site
job on JDK 17 (it picks the last configured Java). mdoc executes Spark code,
which needs the module --add-opens flags on JDK 17. Fork the docs run, pass the
flags through (extracted into sparkJava17Options, shared with the test config),
and pin the forked run's working directory to the repo root so docs keep
finding their relative data files (docs/iris.data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
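
For orientation, the change described in the commit above might look roughly like the following `build.sbt` fragment. It is a sketch under assumptions, not the commit's contents: the `docs` project definition is simplified, the flag list is abbreviated (Spark's own list of `--add-opens` options is longer), and whether sbt-mdoc honours settings scoped to `Compile / run` is assumed rather than confirmed here.

```scala
// build.sbt fragment (sketch only).

// Shared JVM flags so Spark can reflect into JDK internals on JDK 17.
// Abbreviated: only a few of the flags Spark needs are shown.
lazy val sparkJava17Options: Seq[String] = Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.nio=ALL-UNNAMED",
  "--add-opens=java.base/java.util=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
)

lazy val docs = project
  .settings(
    // Run the docs in a forked JVM so javaOptions take effect.
    Compile / run / fork := true,
    Compile / run / javaOptions ++= sparkJava17Options,
    // Forked runs default to the subproject directory; point them at the
    // repo root so relative paths such as docs/iris.data keep resolving.
    Compile / run / baseDirectory := (LocalRootProject / baseDirectory).value,
    // The same flags are reused for forked tests.
    Test / fork := true,
    Test / javaOptions ++= sparkJava17Options
  )
```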
The Spark 4 port reworked FramelessInternals (internal version-compat plumbing,
not intended public API): `column` is now the Expression->Column bridge and
`mkDataset` derives the session from the source Dataset instead of taking a
SQLContext. Both are binary-incompatible signature changes flagged by MiMa
against the 3.x baselines (0.14.0/0.14.1), so exclude them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
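
MiMa exclusions of this kind are usually expressed as problem filters. The fragment below is a sketch, not the commit's contents: the fully qualified name `org.apache.spark.sql.FramelessInternals` and the broad `Problem` filter type are assumptions, and a narrower problem type may be what the commit actually uses.

```scala
// build.sbt fragment (sketch only).
import com.typesafe.tools.mima.core._

mimaBinaryIssueFilters ++= Seq(
  // Internal plumbing whose signatures changed for the Spark 4 port.
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.FramelessInternals.column"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.FramelessInternals.mkDataset")
)
```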
Scala 2.12's scaladoc fails (fatally) on [[Expression]] / [[Column]] /
[[ExpressionEncoder]] doc links in FramelessInternals because those Spark types
aren't resolvable in the doc scope. Use backticks (code spans) instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
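
The difference between the two doc-comment forms can be sketched as follows. `DocLinkSketch.bridge` is a toy method with an invented signature, used only to carry the comment; it is not part of frameless.

```scala
object DocLinkSketch {
  // Link form, which scaladoc must resolve to a symbol visible in the doc scope:
  //   /** Converts an [[Expression]] into a [[Column]]. */
  //
  // Code-span form, which is rendered verbatim and needs no symbol resolution:

  /** Converts an `Expression` into a `Column` (toy signature for illustration). */
  def bridge(expression: String): String = s"column($expression)"
}
```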
The existing self-join tests only compare row counts. This collects and
verifies the decoded (T, U) tuples through the colLeft/colRight disambiguation
path - a regression guard for the Spark 4 ColumnNode rework, which broke that
path (only count-level coverage would have missed a subtly wrong decode).
Passes unchanged on Spark 3.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
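
The reasoning in the commit above (a count-only check can miss a wrong decode) can be modelled without Spark. The sketch below uses plain collections and a hypothetical faulty decode; it is not the PR's test, which goes through frameless's `colLeft`/`colRight` path.

```scala
object SelfJoinCoverageSketch {
  final case class Rec(key: Int, tag: String)

  // Reference self-join on `key`, modelled with plain collections.
  def expectedJoin(rows: List[Rec]): List[(Rec, Rec)] =
    for { l <- rows; r <- rows if l.key == r.key } yield (l, r)

  // Hypothetical faulty decode: the right side is read from the left columns.
  def faultyJoin(rows: List[Rec]): List[(Rec, Rec)] =
    for { l <- rows; r <- rows if l.key == r.key } yield (l, l)

  // Two rows share key 1, so the two sides of a joined pair can differ.
  val rows: List[Rec] = List(Rec(1, "a"), Rec(1, "b"), Rec(2, "c"))
}
```

Both functions return the same number of pairs for `rows`, so a count comparison passes for the faulty version; only comparing the collected tuples exposes the difference.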
Comment thread dataset/src/test/scala/frameless/SelfJoinTests.scala
Comment thread dataset/src/test/scala/frameless/forward/SQLContextTests.scala
Comment thread dataset/src/main/scala/frameless/TypedDataset.scala
Comment thread dataset/src/main/scala/frameless/TypedDataset.scala

Revert the opinionated merge of the standalone `import ...Encoder` into a
braced group; add FramelessInternals as a separate plain import instead.
scalafmt does not merge imports, so this stays linter-clean while staying
closer to the original source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pomadchin self-requested a review May 20, 2026 18:00
imarios (Contributor, Author) commented May 20, 2026

Hey @pomadchin, thank you for taking the time to review. I have tried to keep this super focused on the most basic Spark 4.0 version with minimal scope creep.

I followed the same shim approach we have historically taken to unblock Frameless adoption. I believe most folks today use Spark 4.x, and the lack of support was preventing frameless from being an option for them. I obviously used Claude Code here, but I did review the decisions closely. I am here to answer any questions.
