
feat: support VectorUDT writes via UDT unwrapping in LanceArrowWriter #471

Open

LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat/vectorudt-roundtrip

Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Apr 22, 2026

Summary

LanceArrowWriter.createFieldWriter had no case for UserDefinedType, so writing a DataFrame with a UDT column (most commonly MLlib's VectorUDT — the type produced by VectorAssembler, returned by many ML transformers) threw UnsupportedOperationException. This PR adds a single case that unwraps the UDT to its sqlType and recurses, matching the pattern Spark's native ArrowWriter uses. Read-back already worked because Arrow has no UDT concept — the column comes back as the underlying struct sqlType — so the write path is the only gap, and that's what this fixes.
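For illustration, here is a minimal sketch of the kind of DataFrame that hits this path. The class and column names are hypothetical, and the actual Lance write call is omitted since the connector options aren't the point; what matters is that VectorAssembler's output column is a VectorUDT.

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class VectorUdtWriteRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

    Dataset<Row> raw = spark.createDataFrame(
        Arrays.asList(RowFactory.create(1, 0.5, 1.5), RowFactory.create(2, 2.0, 3.0)),
        new StructType().add("id", "int").add("x", "double").add("y", "double"));

    // VectorAssembler emits a VectorUDT column ("features") -- the column type
    // this PR makes writable through LanceArrowWriter.
    Dataset<Row> withVec = new VectorAssembler()
        .setInputCols(new String[] {"x", "y"})
        .setOutputCol("features")
        .transform(raw);

    withVec.printSchema();
    // Before this change, writing withVec through the Lance Spark connector threw
    // UnsupportedOperationException from LanceArrowWriter.createFieldWriter.
  }
}
```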

Changes

One new case in LanceArrowWriter.createFieldWriter:

case (udt: UserDefinedType[_], _) =>
  createFieldWriter(vector, udt.sqlType, metadata)

Placed just before the catch-all case (dt, _) => throw new UnsupportedOperationException(...) so specific matchers run first and anything that's still a UDT at that point gets unwrapped. Metadata is threaded through so nested cases that need it — e.g. FixedSizeListWriter when a UDT's sqlType resolves to ArrayType with embedding metadata — still receive the right field metadata.

Tests

BaseSparkDataTypeRoundtripTest picks up testVectorUDTRoundtrip: creates a DataFrame with an id INT column and a vec VectorUDT column, writes two rows (a DenseVector and a SparseVector) as Lance, reads back, and asserts two things:

  1. Schema contract. Arrow has no UDT, so the read-back schema must NOT carry VectorUDT — it must come back as the underlying sqlType (StructType). Asserts both halves: !instanceof VectorUDT and instanceof StructType. Locks this behavior so a future change that accidentally preserves UDT in field metadata is caught loudly.

  2. Value fidelity. Reconstructs each row's vector from the read-back struct and checks equals against the original. A small reconstructVector(Row) helper does the manual reconstruction because VectorUDT.deserialize expects InternalRow, not the public Row that Spark hands back in test assertions. The helper switches on the type byte — 0 = sparse, 1 = dense — and throws IllegalArgumentException for anything else rather than silently misinterpreting bogus data as sparse. A sketch of the helper follows this list.
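For reference, a minimal sketch of the reconstruction logic item 2 describes. The wrapper class name is hypothetical, the field positions assume VectorUDT's sqlType layout (type: tinyint, size: int, indices: array<int>, values: array<double>), and the actual helper lives in BaseSparkDataTypeRoundtripTest.

```java
import java.util.List;

import org.apache.spark.ml.linalg.DenseVector;
import org.apache.spark.ml.linalg.SparseVector;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Row;

// Hypothetical holder class for the sketch; the real helper is a test method.
final class VectorReconstructionSketch {

  static Vector reconstructVector(Row vecStruct) {
    byte type = vecStruct.getByte(0); // 0 = sparse, 1 = dense
    switch (type) {
      case 0: { // sparse: size + indices + values
        int size = vecStruct.getInt(1);
        List<Integer> indices = vecStruct.getList(2);
        List<Double> values = vecStruct.getList(3);
        int[] idx = indices.stream().mapToInt(Integer::intValue).toArray();
        double[] vals = values.stream().mapToDouble(Double::doubleValue).toArray();
        return new SparseVector(size, idx, vals);
      }
      case 1: { // dense: values only
        List<Double> values = vecStruct.getList(3);
        double[] vals = values.stream().mapToDouble(Double::doubleValue).toArray();
        return new DenseVector(vals);
      }
      default:
        // Anything else is bogus serialization; fail loudly instead of guessing sparse.
        throw new IllegalArgumentException("Unexpected vector type byte: " + type);
    }
  }
}
```

The test passes each read-back vec struct Row through this helper and asserts equals against the original DenseVector / SparseVector.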

A null-VectorUDT row is intentionally omitted and documented in the test's Javadoc: VectorUDT.sqlType marks its inner type field as non-nullable, and writing a parent-null struct produces null placeholders in the non-nullable children that Lance's native writer rejects. That's a Lance-side limitation with struct-of-non-nullable-children, not specific to UDTs; out of scope here.

spark-mllib is added as a test-scope dependency across all eight non-bundle modules (-base_2.12, -base_2.13, -3.4_2.12/2.13, -3.5_2.12/2.13, -4.0_2.13, -4.1_2.13). Each module pins its own spark<version>.version so cross-compile stays version-accurate. Bundle modules don't run tests so they don't need it. The 4.0/4.1 modules inherit the concrete SparkDataTypeRoundtripTest subclass via cross-compile from lance-spark-3.5_2.12/src/test/java, so the VectorUDT test executes on 4.0 and 4.1 too — spark-mllib is needed there for both compile and runtime.
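For illustration, the shape of the dependency block added to each test-running module's POM; the property names here are placeholders, since each module actually references its own pinned spark<version>.version property and matching Scala suffix.

```xml
<!-- test-scope only: needed to compile and run testVectorUDTRoundtrip -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>test</scope>
</dependency>
```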

LanceArrowWriter.createFieldWriter lacked a UserDefinedType branch,
causing VectorUDT columns to throw UnsupportedOperationException on
write. Add a UDT unwrapping case that delegates to udt.sqlType,
consistent with Spark's native ArrowWriter.

Add an E2E roundtrip test covering DenseVector and SparseVector via
VectorUDT, with spark-mllib added as a test-scope dependency across
all 8 module POMs.

Multi-persona review of the roundtrip test surfaced three issues:

1. reconstructVector fell through to the sparse branch for any type
   byte other than 1 — silently misinterpreting bogus serialization as
   a sparse vector. Switch on the type byte and throw
   IllegalArgumentException for unknown values (0=sparse, 1=dense).

2. The test's Javadoc documented that the read-back schema loses the
   VectorUDT wrapper, but the test didn't verify it. Added two schema
   assertions (not instanceof VectorUDT, is instanceof StructType) so
   a future change that accidentally preserves UDT is caught loudly.

3. The original Javadoc pointed at VectorUDT.deserialize as a recovery
   path, but deserialize expects InternalRow while the test works with
   public Row. Clarified why the helper reconstructs manually.
@github-actions github-actions bot added the enhancement (New feature or request) label on Apr 22, 2026
@LuciferYang
Contributor Author

Thanks for your review. @jiaoew1991, can we merge this PR? I don't have merge permission.
