feat: support VectorUDT writes via UDT unwrapping in LanceArrowWriter #471

Open

LuciferYang wants to merge 2 commits into lance-format:main
Conversation
`LanceArrowWriter.createFieldWriter` lacked a `UserDefinedType` branch, causing `VectorUDT` columns to throw `UnsupportedOperationException` on write. Add a UDT unwrapping case that delegates to `udt.sqlType`, consistent with Spark's native `ArrowWriter`. Add an E2E roundtrip test covering `DenseVector` and `SparseVector` via `VectorUDT`, with `spark-mllib` added as a test-scope dependency across all 8 module POMs.
Multi-persona review of the roundtrip test surfaced three issues:

1. `reconstructVector` fell through to the sparse branch for any type byte other than 1, silently misinterpreting bogus serialization as a sparse vector. Fixed by switching on the type byte (0 = sparse, 1 = dense) and throwing `IllegalArgumentException` for unknown values.
2. The test's Javadoc documented that the read-back schema loses the `VectorUDT` wrapper, but the test didn't verify it. Added two schema assertions (not `instanceof VectorUDT`, is `instanceof StructType`) so a future change that accidentally preserves the UDT is caught loudly; a sketch of these assertions follows the list.
3. The original Javadoc pointed at `VectorUDT.deserialize` as a recovery path, but `deserialize` expects `InternalRow` while the test works with the public `Row`. Clarified why the helper reconstructs manually.
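A minimal sketch of the two schema assertions from issue 2, written in Scala (the actual test is Java); `readBack` and the `vec` column name are assumptions taken from the test description, and `SQLDataTypes.VectorType` is used as Spark's public handle for the vector UDT:

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructType

// `readBack` is assumed to be the DataFrame read back from Lance; the
// column name `vec` comes from the roundtrip test described further down.
val vecType = readBack.schema("vec").dataType

// Half 1: the UDT wrapper must not survive the roundtrip.
assert(vecType != SQLDataTypes.VectorType, "read-back schema must not preserve VectorUDT")
// Half 2: the column must surface as the underlying sqlType, a StructType.
assert(vecType.isInstanceOf[StructType], "read-back column should be the underlying StructType")
```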
jiaoew1991 (Contributor) approved these changes on May 6, 2026.

LuciferYang (Author):
Thanks for your review. @jiaoew1991, can we merge this PR? I don't have merge permission.
Summary
`LanceArrowWriter.createFieldWriter` had no case for `UserDefinedType`, so writing a DataFrame with a UDT column (most commonly MLlib's `VectorUDT` — the type produced by `VectorAssembler` and returned by many ML transformers) threw `UnsupportedOperationException`. This PR adds a single case that unwraps the UDT to its `sqlType` and recurses, matching the pattern Spark's native `ArrowWriter` uses. Read-back already worked because Arrow has no UDT concept — the column comes back as the underlying struct sqlType — so the write path was the only gap, and that is what this fixes.

Changes
One new case in `LanceArrowWriter.createFieldWriter`, placed just before the catch-all `case (dt, _) => throw new UnsupportedOperationException(...)` so specific matchers run first and anything that is still a UDT at that point gets unwrapped. Metadata is threaded through so nested cases that need it — e.g. `FixedSizeListWriter` when a UDT's sqlType resolves to `ArrayType` with embedding metadata — still receive the right field metadata.
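A minimal sketch of the shape of the new case, assuming the writer dispatches on a `(DataType, Metadata)` pair as the quoted catch-all suggests; the signature, the `ArrowFieldWriter` result type, and parameter names are illustrative, not the PR's exact code:

```scala
import org.apache.spark.sql.types.{DataType, Metadata, UserDefinedType}

trait ArrowFieldWriter // stub standing in for the connector's writer type

// Illustrative shape only — the real createFieldWriter also takes the
// target Arrow vector and returns a concrete per-type writer.
def createFieldWriter(dt: DataType, metadata: Metadata): ArrowFieldWriter =
  (dt, metadata) match {
    // ...specific matchers (primitives, arrays, structs, ...) run first...

    // New case: anything still a UDT here is unwrapped to its sqlType and
    // recursed, with the field metadata threaded through unchanged so
    // nested matchers that need it still see the right field metadata.
    case (udt: UserDefinedType[_], md) =>
      createFieldWriter(udt.sqlType, md)

    case (other, _) =>
      throw new UnsupportedOperationException(s"Unsupported data type: $other")
  }
```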
Tests

`BaseSparkDataTypeRoundtripTest` picks up `testVectorUDTRoundtrip`: it creates a DataFrame with an `id INT` column and a `vec VectorUDT` column, writes two rows (a `DenseVector` and a `SparseVector`) as Lance, reads back, and asserts two things:

- Schema contract. Arrow has no UDT, so the read-back schema must NOT carry `VectorUDT` — it must come back as the underlying sqlType (`StructType`). The test asserts both halves, `!instanceof VectorUDT` and `instanceof StructType`, locking this behavior in so a future change that accidentally preserves UDT in field metadata is caught loudly.
- Value fidelity. The test reconstructs each row's vector from the read-back struct and checks `equals` against the original. A small `reconstructVector(Row)` helper does the manual reconstruction because `VectorUDT.deserialize` expects `InternalRow`, not the public `Row` that Spark hands back in test assertions. The helper switches on the type byte — 0 = sparse, 1 = dense — and throws `IllegalArgumentException` for anything else rather than silently misinterpreting bogus data as sparse. A hedged sketch of this helper follows the list.
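A sketch of that helper, transliterated to Scala (the test itself is Java); field positions follow `VectorUDT.sqlType`, i.e. `struct<type:tinyint, size:int, indices:array<int>, values:array<double>>`:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Rebuilds an ml Vector from the read-back struct Row. Field order follows
// VectorUDT.sqlType: (type: byte, size: int, indices: array<int>, values: array<double>).
def reconstructVector(struct: Row): Vector = struct.getByte(0) match {
  case 0 => // sparse: size, indices and values are all populated
    Vectors.sparse(
      struct.getInt(1),
      struct.getSeq[Int](2).toArray,
      struct.getSeq[Double](3).toArray)
  case 1 => // dense: only values is populated
    Vectors.dense(struct.getSeq[Double](3).toArray)
  case other => // fail loudly instead of misreading bogus data as sparse
    throw new IllegalArgumentException(s"Unknown vector type byte: $other")
}
```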
A null-`VectorUDT` row is intentionally omitted and documented in the test's Javadoc: `VectorUDT.sqlType` marks its inner `type` field as non-nullable, and writing a parent-null struct produces null placeholders in the non-nullable children, which Lance's native writer rejects. That is a Lance-side limitation with structs of non-nullable children, not specific to UDTs, and out of scope here.

`spark-mllib` is added as a test-scope dependency across all eight non-bundle modules (`-base_2.12`, `-base_2.13`, `-3.4_2.12/2.13`, `-3.5_2.12/2.13`, `-4.0_2.13`, `-4.1_2.13`). Each module pins its own `spark<version>.version` property so cross-compilation stays version-accurate. Bundle modules don't run tests, so they don't need it. The 4.0/4.1 modules inherit the concrete `SparkDataTypeRoundtripTest` subclass via cross-compile from `lance-spark-3.5_2.12/src/test/java`, so the VectorUDT test executes on 4.0 and 4.1 too — `spark-mllib` is needed there for both compile and runtime.
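For reference, the added stanza has roughly this shape in each module POM (a sketch: the artifact's Scala suffix and the version property name vary per module, e.g. a `spark3.5.version`-style property rather than the generic name shown here):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.12</artifactId>
  <!-- each module pins its own Spark version property; name illustrative -->
  <version>${spark.version}</version>
  <scope>test</scope>
</dependency>
```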