Spark: Avoid String and java.util.UUID allocations on UUID read/write paths #16349

Open
wombatu-kun wants to merge 2 commits into apache:main from wombatu-kun:spark-uuid-direct-conversion

Conversation

@wombatu-kun
Contributor

What & why

This implements the existing // TODO: direct conversion from string to byte buffer in SparkValueWriters. The Spark data layer converted UUID values through an intermediate String and a java.util.UUID object on every value, in both directions: write did UUID.fromString(s.toString()) then re-serialized the two longs to 16 bytes; read did UUIDUtil.convert(buf).toString() then wrapped that as UTF8String. The UUID arrives as the ASCII bytes of its canonical string and leaves as 16 raw bytes (and vice versa), so the String/UUID objects are pure per-row allocation overhead.

Changes

  • Add UUIDUtil.convertToByteBuffer(byte[] uuidStringBytes, ByteBuffer reuse) — parses the 36 ASCII bytes of a canonical UUID string directly into the 16-byte big-endian form.
  • Add UUIDUtil.convertToStringBytes(ByteBuffer uuidBytes, byte[] reuse) — renders 16 bytes back to the 36 ASCII bytes of the canonical string.
  • Rewire all UUID read/write sites in Avro/Parquet/ORC Spark*Readers/Spark*Writers for Spark 3.4, 3.5, 4.0, 4.1 to use these helpers.
  • Add TestUUIDUtil coverage for the new methods.
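As an illustration of the first helper in the list above, here is a minimal sketch of parsing the 36 ASCII bytes of a canonical UUID string straight into 16 big-endian bytes. It is not the PR's code: the class name `UuidParseSketch` and the methods `hexValue`, `parseHex`, and `toByteBuffer` are hypothetical names chosen for this example, and the validation shown is an assumption about what a reasonable parser would check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidParseSketch {

  // Maps one ASCII hex character to its value 0..15 (hypothetical helper).
  static int hexValue(byte b) {
    if (b >= '0' && b <= '9') {
      return b - '0';
    } else if (b >= 'a' && b <= 'f') {
      return b - 'a' + 10;
    } else if (b >= 'A' && b <= 'F') {
      return b - 'A' + 10;
    }
    throw new IllegalArgumentException("Invalid hex digit: " + b);
  }

  // Parses the hex digits in [start, end) into a long.
  static long parseHex(byte[] ascii, int start, int end) {
    long result = 0L;
    for (int i = start; i < end; i++) {
      result = (result << 4) | hexValue(ascii[i]);
    }
    return result;
  }

  // Hypothetical stand-in for the new helper: 36 ASCII bytes in, 16 bytes out.
  // Writes into `reuse` when given, otherwise allocates a new 16-byte buffer.
  static ByteBuffer toByteBuffer(byte[] ascii, ByteBuffer reuse) {
    if (ascii.length != 36
        || ascii[8] != '-' || ascii[13] != '-' || ascii[18] != '-' || ascii[23] != '-') {
      throw new IllegalArgumentException("Not a canonical UUID string");
    }
    long msb =
        (parseHex(ascii, 0, 8) << 32) | (parseHex(ascii, 9, 13) << 16) | parseHex(ascii, 14, 18);
    long lsb = (parseHex(ascii, 19, 23) << 48) | parseHex(ascii, 24, 36);
    ByteBuffer buf = reuse != null ? reuse : ByteBuffer.allocate(16);
    // Absolute puts; ByteBuffer is big-endian by default.
    buf.putLong(0, msb);
    buf.putLong(8, lsb);
    return buf;
  }

  public static void main(String[] args) {
    String text = "123e4567-e89b-12d3-a456-426614174000";
    ByteBuffer direct =
        toByteBuffer(text.getBytes(StandardCharsets.US_ASCII), ByteBuffer.allocate(16));
    UUID expected = UUID.fromString(text);
    boolean same =
        direct.getLong(0) == expected.getMostSignificantBits()
            && direct.getLong(8) == expected.getLeastSignificantBits();
    System.out.println(same); // prints "true"
  }
}
```

No `String` or `java.util.UUID` is created on the parse path itself; `main` uses `UUID.fromString` only as a reference to compare against.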

Correctness

Both helpers pivot on the (mostSigBits, leastSigBits) long pair. The parser reproduces java.util.UUID.fromString: it parses the hex fields and combines them as [0,8)<<32 | [9,13)<<16 | [14,18) for MSB and [19,23)<<48 | [24,36) for LSB, then calls putLong(0, msb); putLong(8, lsb) exactly as the previous convertToByteBuffer(UUID, reuse) did. The formatter is the inverse and matches UUID.toString(). The output is therefore byte-for-byte identical to the previous code. The write side keeps the reusable thread-local buffer; the read side must allocate a fresh array because UTF8String.fromBytes wraps without copying, so a reused buffer would alias across rows.
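The inverse direction can be sketched the same way. This is a minimal illustration, not the PR's implementation: `UuidFormatSketch`, `writeHex`, and `toStringBytes` are hypothetical names, the sketch assumes a big-endian buffer holding the UUID at its current position, and it always allocates a fresh 36-byte array (as the read side is described as doing) rather than taking a `reuse` argument.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidFormatSketch {
  private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

  // Writes the lowest `digits` hex digits of `value` into out[offset, offset + digits).
  // Uses a local accumulator rather than reassigning the parameter.
  static void writeHex(long value, int digits, byte[] out, int offset) {
    long remaining = value;
    for (int i = offset + digits - 1; i >= offset; i--) {
      out[i] = HEX[(int) (remaining & 0xF)];
      remaining >>>= 4;
    }
  }

  // Hypothetical stand-in for the formatter: 16 bytes in, 36 ASCII bytes out.
  // Always returns a new array so that callers wrapping it without a copy do not alias.
  static byte[] toStringBytes(ByteBuffer uuidBytes) {
    int pos = uuidBytes.position();
    long msb = uuidBytes.getLong(pos);
    long lsb = uuidBytes.getLong(pos + 8);
    byte[] out = new byte[36];
    writeHex(msb >>> 32, 8, out, 0);
    out[8] = '-';
    writeHex(msb >>> 16, 4, out, 9);
    out[13] = '-';
    writeHex(msb, 4, out, 14);
    out[18] = '-';
    writeHex(lsb >>> 48, 4, out, 19);
    out[23] = '-';
    writeHex(lsb, 12, out, 24);
    return out;
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(0, uuid.getMostSignificantBits());
    buf.putLong(8, uuid.getLeastSignificantBits());
    String rendered = new String(toStringBytes(buf), StandardCharsets.US_ASCII);
    System.out.println(rendered.equals(uuid.toString())); // prints "true"
  }
}
```

Returning a new array on every call trades one small allocation per value for safety: any consumer that keeps a reference to the bytes instead of copying them sees a stable value for that row.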

… paths

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun marked this pull request as draft May 15, 2026 11:26
Use a local accumulator instead of mutating the `value` parameter so checkstyle's ParameterAssignment rule passes. Behavior is byte-for-byte unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun marked this pull request as ready for review May 15, 2026 12:46