Spark: Avoid String and java.util.UUID allocations on UUID read/write paths by wombatu-kun · Pull Request #16349 · apache/iceberg

wombatu-kun · 2026-05-15T11:20:03Z

What & why

This implements the existing `// TODO: direct conversion from string to byte buffer` in `SparkValueWriters`. The Spark data layer converted every UUID value through an intermediate `String` and a `java.util.UUID` object, in both directions. The write path called `UUID.fromString(s.toString())` and then re-serialized the two longs to 16 bytes. The read path called `UUIDUtil.convert(buf).toString()` and then wrapped the result as a `UTF8String`. On write, the UUID arrives as the ASCII bytes of its canonical string and leaves as 16 raw bytes; on read it is the reverse. The `String` and `UUID` objects in between are therefore pure per-row allocation overhead.
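For reference, the round trip being removed can be approximated with plain JDK calls. This is only a sketch of the allocation pattern described above, not the Iceberg code: `LegacyUuidPath` and its method names are hypothetical, and `byte[]` stands in for Spark's `UTF8String`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration of the old per-value conversion path.
final class LegacyUuidPath {
  private LegacyUuidPath() {}

  // Write direction: 36 ASCII bytes -> String -> UUID -> 16 raw bytes.
  static byte[] writeViaUuid(byte[] asciiUuid) {
    String text = new String(asciiUuid, StandardCharsets.US_ASCII); // intermediate String
    UUID uuid = UUID.fromString(text); // intermediate UUID object
    ByteBuffer buf = ByteBuffer.allocate(16); // big-endian by default
    buf.putLong(0, uuid.getMostSignificantBits());
    buf.putLong(8, uuid.getLeastSignificantBits());
    return buf.array();
  }

  // Read direction: 16 raw bytes -> UUID -> String -> 36 ASCII bytes.
  static byte[] readViaUuid(byte[] raw) {
    ByteBuffer buf = ByteBuffer.wrap(raw);
    UUID uuid = new UUID(buf.getLong(0), buf.getLong(8)); // intermediate UUID object
    return uuid.toString().getBytes(StandardCharsets.US_ASCII); // intermediate String
  }
}
```

Each call creates at least one `String` and one `UUID` that are discarded immediately, which is the overhead this PR targets.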

Changes

- Add `UUIDUtil.convertToByteBuffer(byte[] uuidStringBytes, ByteBuffer reuse)`: parses the 36 ASCII bytes of a canonical UUID string directly into the 16-byte big-endian form.
- Add `UUIDUtil.convertToStringBytes(ByteBuffer uuidBytes, byte[] reuse)`: renders 16 bytes back to the 36 ASCII bytes of the canonical string.
- Rewire all UUID read/write sites in the Avro/Parquet/ORC `Spark*Readers`/`Spark*Writers` for Spark 3.4, 3.5, 4.0, and 4.1 to use these helpers.
- Add `TestUUIDUtil` coverage for the new methods.
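A minimal sketch of what a direct string-bytes-to-buffer parser could look like follows. It is not the code from this PR: `UuidParseSketch`, `parse`, and `hex` are hypothetical names, and the validation shown (length and dash positions) is an assumption about what such a helper would check.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for a helper like UUIDUtil.convertToByteBuffer(byte[], ByteBuffer).
final class UuidParseSketch {
  private UuidParseSketch() {}

  static ByteBuffer parse(byte[] ascii, ByteBuffer reuse) {
    if (ascii.length != 36
        || ascii[8] != '-' || ascii[13] != '-' || ascii[18] != '-' || ascii[23] != '-') {
      throw new IllegalArgumentException("Not a canonical 36-character UUID string");
    }

    // Same field layout as java.util.UUID: 8-4-4 hex digits form the MSB, 4-12 form the LSB.
    long msb = hex(ascii, 0, 8) << 32 | hex(ascii, 9, 13) << 16 | hex(ascii, 14, 18);
    long lsb = hex(ascii, 19, 23) << 48 | hex(ascii, 24, 36);

    // Assumes a non-null reuse buffer has a capacity of at least 16 bytes.
    ByteBuffer buf = reuse != null ? reuse : ByteBuffer.allocate(16);
    buf.order(ByteOrder.BIG_ENDIAN);
    buf.putLong(0, msb); // absolute puts leave the buffer position untouched
    buf.putLong(8, lsb);
    return buf;
  }

  // Accumulates the hex digits in [from, to) into a long without creating a String.
  private static long hex(byte[] ascii, int from, int to) {
    long acc = 0;
    for (int i = from; i < to; i++) {
      int digit = Character.digit(ascii[i], 16);
      if (digit < 0) {
        throw new IllegalArgumentException("Invalid hex digit at index " + i);
      }
      acc = (acc << 4) | digit;
    }
    return acc;
  }
}
```

Passing the same `ByteBuffer` as `reuse` on every call avoids allocating on the write path; passing `null` allocates a new 16-byte buffer.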

Correctness

Both helpers pivot on the (mostSigBits, leastSigBits) long pair. The parser reproduces `java.util.UUID.fromString`: for the MSB it parses the hex digits at [0,8), shifts left by 16 and ORs in [9,13), then shifts left by 16 again and ORs in [14,18); for the LSB it parses [19,23), shifts left by 48, and ORs in [24,36). It then calls `putLong(0, msb); putLong(8, lsb)` exactly as the previous `convertToByteBuffer(UUID, reuse)` did. The formatter is the inverse and matches `UUID.toString()`. The output is therefore byte-for-byte identical to the previous code. The write side keeps the reusable thread-local buffer. The read side must allocate a fresh array, because `UTF8String.fromBytes` wraps the array without copying, so a reused buffer would alias across rows.
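The inverse direction can be sketched the same way. Again, this is an illustration under stated assumptions rather than the PR's implementation: `UuidFormatSketch`, `format`, and `writeHex` are hypothetical names, and the sketch always returns a new array instead of taking a `reuse` parameter, to mirror the read-side constraint described above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a helper like UUIDUtil.convertToStringBytes(ByteBuffer, byte[]).
final class UuidFormatSketch {
  private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

  private UuidFormatSketch() {}

  static byte[] format(ByteBuffer uuidBytes) {
    // Read through a big-endian duplicate so the caller's position and order are untouched.
    ByteBuffer view = uuidBytes.duplicate().order(ByteOrder.BIG_ENDIAN);
    long msb = view.getLong(view.position());
    long lsb = view.getLong(view.position() + 8);

    // Fresh array per call: a wrapper that does not copy would otherwise see later rows.
    byte[] out = new byte[36];
    writeHex(out, 0, msb >>> 32, 8);
    out[8] = '-';
    writeHex(out, 9, msb >>> 16, 4);
    out[13] = '-';
    writeHex(out, 14, msb, 4);
    out[18] = '-';
    writeHex(out, 19, lsb >>> 48, 4);
    out[23] = '-';
    writeHex(out, 24, lsb, 12);
    return out;
  }

  // Writes the low `digits` nibbles of value as lowercase hex, most significant first.
  private static void writeHex(byte[] out, int offset, long value, int digits) {
    long acc = value; // local accumulator; the parameter itself is never reassigned
    for (int i = offset + digits - 1; i >= offset; i--) {
      out[i] = HEX[(int) (acc & 0xF)];
      acc >>>= 4;
    }
  }
}
```

Comparing the result with `UUID.toString()` for the same bit pair is a straightforward way to check the byte-for-byte claim.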

Spark: Avoid String and java.util.UUID allocations on UUID read/write…

776eaf3

… paths Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added API spark labels May 15, 2026

wombatu-kun marked this pull request as draft May 15, 2026 11:26

Spark: Avoid parameter reassignment in UUIDUtil.formatHex

0b34b4f

Use a local accumulator instead of mutating the `value` parameter so checkstyle's ParameterAssignment rule passes. Behavior is byte-for-byte unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wombatu-kun marked this pull request as ready for review May 15, 2026 12:46

wombatu-kun wants to merge 2 commits into apache:main from wombatu-kun:spark-uuid-direct-conversion
