
feat: support VectorUDT writes via UDT unwrapping in LanceArrowWriter #471

Open

LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat/vectorudt-roundtrip

Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Apr 22, 2026

Summary

LanceArrowWriter.createFieldWriter had no case for UserDefinedType, so writing a DataFrame with a UDT column (most commonly MLlib's VectorUDT — the type produced by VectorAssembler, returned by many ML transformers) threw UnsupportedOperationException. This PR adds a single case that unwraps the UDT to its sqlType and recurses, matching the pattern Spark's native ArrowWriter uses. Read-back already worked because Arrow has no UDT concept — the column comes back as the underlying struct sqlType — so the write path is the only gap, and that's what this fixes.
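For illustration, here is a minimal sketch of the kind of DataFrame that hits this path. The class and column names are hypothetical, and the actual Lance write call is omitted since the connector options aren't the point; what matters is that VectorAssembler's output column is a VectorUDT.

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class VectorUdtWriteRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

    Dataset<Row> raw = spark.createDataFrame(
        Arrays.asList(RowFactory.create(1, 0.5, 1.5), RowFactory.create(2, 2.0, 3.0)),
        new StructType().add("id", "int").add("x", "double").add("y", "double"));

    // VectorAssembler emits a VectorUDT column ("features") -- the column type
    // this PR makes writable through LanceArrowWriter.
    Dataset<Row> withVec = new VectorAssembler()
        .setInputCols(new String[] {"x", "y"})
        .setOutputCol("features")
        .transform(raw);

    withVec.printSchema();
    // Before this change, writing withVec through the Lance Spark connector threw
    // UnsupportedOperationException from LanceArrowWriter.createFieldWriter.
  }
}
```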

Changes

One new case in LanceArrowWriter.createFieldWriter:

case (udt: UserDefinedType[_], _) =>
  createFieldWriter(vector, udt.sqlType, metadata)

Placed just before the catch-all case (dt, _) => throw new UnsupportedOperationException(...) so specific matchers run first and anything that's still a UDT at that point gets unwrapped. Metadata is threaded through so nested cases that need it — e.g. FixedSizeListWriter when a UDT's sqlType resolves to ArrayType with embedding metadata — still receive the right field metadata.

Tests

BaseSparkDataTypeRoundtripTest picks up testVectorUDTRoundtrip: creates a DataFrame with an id INT column and a vec VectorUDT column, writes two rows (a DenseVector and a SparseVector) as Lance, reads back, and asserts two things:

  1. Schema contract. Arrow has no UDT, so the read-back schema must NOT carry VectorUDT — it must come back as the underlying sqlType (StructType). Asserts both halves: !instanceof VectorUDT and instanceof StructType. Locks this behavior so a future change that accidentally preserves UDT in field metadata is caught loudly.

  2. Value fidelity. Reconstructs each row's vector from the read-back struct and checks equals against the original. A small reconstructVector(Row) helper does the manual reconstruction because VectorUDT.deserialize expects InternalRow, not the public Row that Spark hands back in test assertions. The helper switches on the type byte — 0 = sparse, 1 = dense — and throws IllegalArgumentException for anything else rather than silently misinterpreting bogus data as sparse. A sketch of the helper follows this list.
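For reference, a minimal sketch of the reconstruction logic item 2 describes. The wrapper class name is hypothetical, the field positions assume VectorUDT's sqlType layout (type: tinyint, size: int, indices: array<int>, values: array<double>), and the actual helper lives in BaseSparkDataTypeRoundtripTest.

```java
import java.util.List;

import org.apache.spark.ml.linalg.DenseVector;
import org.apache.spark.ml.linalg.SparseVector;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Row;

// Hypothetical holder class for the sketch; the real helper is a test method.
final class VectorReconstructionSketch {

  static Vector reconstructVector(Row vecStruct) {
    byte type = vecStruct.getByte(0); // 0 = sparse, 1 = dense
    switch (type) {
      case 0: { // sparse: size + indices + values
        int size = vecStruct.getInt(1);
        List<Integer> indices = vecStruct.getList(2);
        List<Double> values = vecStruct.getList(3);
        int[] idx = indices.stream().mapToInt(Integer::intValue).toArray();
        double[] vals = values.stream().mapToDouble(Double::doubleValue).toArray();
        return new SparseVector(size, idx, vals);
      }
      case 1: { // dense: values only
        List<Double> values = vecStruct.getList(3);
        double[] vals = values.stream().mapToDouble(Double::doubleValue).toArray();
        return new DenseVector(vals);
      }
      default:
        // Anything else is bogus serialization; fail loudly instead of guessing sparse.
        throw new IllegalArgumentException("Unexpected vector type byte: " + type);
    }
  }
}
```

The test passes each read-back vec struct Row through this helper and asserts equals against the original DenseVector / SparseVector.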

A null-VectorUDT row is intentionally omitted and documented in the test's Javadoc: VectorUDT.sqlType marks its inner type field as non-nullable, and writing a parent-null struct produces null placeholders in the non-nullable children that Lance's native writer rejects. That's a Lance-side limitation with struct-of-non-nullable-children, not specific to UDTs; out of scope here.

spark-mllib is added as a test-scope dependency across all eight non-bundle modules (-base_2.12, -base_2.13, -3.4_2.12/2.13, -3.5_2.12/2.13, -4.0_2.13, -4.1_2.13). Each module pins its own spark<version>.version so cross-compile stays version-accurate. Bundle modules don't run tests so they don't need it. The 4.0/4.1 modules inherit the concrete SparkDataTypeRoundtripTest subclass via cross-compile from lance-spark-3.5_2.12/src/test/java, so the VectorUDT test executes on 4.0 and 4.1 too — spark-mllib is needed there for both compile and runtime.
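For illustration, the shape of the dependency block added to each test-running module's POM; the property names here are placeholders, since each module actually references its own pinned spark<version>.version property and matching Scala suffix.

```xml
<!-- test-scope only: needed to compile and run testVectorUDTRoundtrip -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>test</scope>
</dependency>
```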

LanceArrowWriter.createFieldWriter lacked a UserDefinedType branch,
causing VectorUDT columns to throw UnsupportedOperationException on
write. Add a UDT unwrapping case that delegates to udt.sqlType,
consistent with Spark's native ArrowWriter.

Add an E2E roundtrip test covering DenseVector and SparseVector via
VectorUDT, with spark-mllib added as a test-scope dependency across
all 8 module POMs.

Multi-persona review of the roundtrip test surfaced three issues:

1. reconstructVector fell through to the sparse branch for any type
   byte other than 1 — silently misinterpreting bogus serialization as
   a sparse vector. Switch on the type byte and throw
   IllegalArgumentException for unknown values (0=sparse, 1=dense).

2. The test's Javadoc documented that the read-back schema loses the
   VectorUDT wrapper, but the test didn't verify it. Added two schema
   assertions (not instanceof VectorUDT, is instanceof StructType) so
   a future change that accidentally preserves UDT is caught loudly.

3. The original Javadoc pointed at VectorUDT.deserialize as a recovery
   path, but deserialize expects InternalRow while the test works with
   public Row. Clarified why the helper reconstructs manually.
@github-actions github-actions bot added the enhancement (New feature or request) label on Apr 22, 2026
@LuciferYang
Contributor Author

Thanks for your review. @jiaoew1991, can we merge this PR? I don't have merge permission.
