[WIP] Support range-based reads for deletion vectors #3478

Open
KaiqiJinWow wants to merge 1 commit into apache:main from KaiqiJinWow:fix-dv-content-range-read

@KaiqiJinWow

Summary

  • Expose Iceberg V3 deletion vector content range fields on DataFile
  • Read Puffin deletion vectors from manifest-described content ranges when content_offset/content_size_in_bytes are present
  • Validate deletion vector blobs for length, magic number, CRC, and cardinality while preserving existing whole-file Puffin reads
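The framing checks named in the last bullet (length, magic number, CRC) can be sketched roughly as follows. This is not the code in this PR: the helper names `read_range`, `wrap_dv_blob`, and `unwrap_dv_blob` are hypothetical, and the layout assumed here (4-byte big-endian length, magic bytes D1 D3 39 64, the serialized vector, then a 4-byte big-endian CRC-32 over magic plus vector) is my reading of the `deletion-vector-v1` blob description in the Puffin spec. The cardinality check is left out because it requires decoding the Roaring bitmap.

```python
import io
import struct
import zlib

# Assumed magic sequence for deletion-vector-v1 blobs (from the Puffin spec as I read it).
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def read_range(fileobj, offset: int, size: int) -> bytes:
    """Read exactly `size` bytes starting at `offset` from a seekable file object."""
    fileobj.seek(offset)
    data = fileobj.read(size)
    if len(data) != size:
        raise ValueError(f"expected {size} bytes at offset {offset}, got {len(data)}")
    return data


def wrap_dv_blob(vector: bytes) -> bytes:
    """Hypothetical helper: frame a serialized vector the way the assumed layout describes."""
    body = DV_MAGIC + vector
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + body + struct.pack(">I", crc)


def unwrap_dv_blob(blob: bytes) -> bytes:
    """Validate length, magic number, and CRC, then return the serialized vector."""
    if len(blob) < 12:  # 4 (length) + 4 (magic) + 4 (CRC)
        raise ValueError("blob too short to be a deletion vector")
    (length,) = struct.unpack(">I", blob[:4])
    if length != len(blob) - 8:  # length covers magic + vector only
        raise ValueError("length prefix does not match blob size")
    body = blob[4 : 4 + length]
    if body[:4] != DV_MAGIC:
        raise ValueError("bad magic number")
    (expected_crc,) = struct.unpack(">I", blob[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != expected_crc:
        raise ValueError("CRC mismatch")
    return body[4:]
```

With a manifest entry that carries `content_offset` and `content_size_in_bytes`, a reader along these lines would call `read_range(fi, content_offset, content_size_in_bytes)` and pass the result to `unwrap_dv_blob`, without parsing the Puffin footer first.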

Testing

  • .venv/bin/python -m pytest -q tests/table/test_puffin.py tests/io/test_pyarrow.py::test_read_deletion_vector_blob_from_content_range tests/io/test_pyarrow.py::test_read_deletes

@KaiqiJinWow KaiqiJinWow changed the title Support range-based reads for deletion vectors [WIP]Support range-based reads for deletion vectors Jun 11, 2026
@KaiqiJinWow KaiqiJinWow force-pushed the fix-dv-content-range-read branch 2 times, most recently from fdc8d3b to 859efdc Compare June 11, 2026 21:37
@KaiqiJinWow KaiqiJinWow force-pushed the fix-dv-content-range-read branch from 859efdc to 118c561 Compare June 11, 2026 23:07
Comment thread pyiceberg/io/pyarrow.py
Comment on lines 1163 to +1167
elif data_file.file_format == FileFormat.PUFFIN:
with io.new_input(data_file.file_path).open() as fi:
content_offset = getattr(data_file, "content_offset", None)
content_size_in_bytes = getattr(data_file, "content_size_in_bytes", None)
if content_offset is not None or content_size_in_bytes is not None:

Member

If the file format is Puffin, these two fields are never None, right?

https://iceberg.apache.org/spec/#data-file-fields

The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.
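One possible way to reconcile this with the PR's stated goal of preserving whole-file Puffin reads is to treat the two fields as a pair: both present means a range read, both absent means the existing whole-file path, and exactly one present is rejected as malformed metadata. The sketch below only illustrates that idea; `resolve_dv_range` is a hypothetical helper, not something in pyiceberg or in this PR.

```python
from typing import Optional, Tuple


def resolve_dv_range(
    content_offset: Optional[int],
    content_size_in_bytes: Optional[int],
) -> Optional[Tuple[int, int]]:
    """Return (offset, size) for a range read, or None to fall back to a whole-file read."""
    if content_offset is None and content_size_in_bytes is None:
        # Neither field set: keep the existing whole-file Puffin read.
        return None
    if content_offset is None or content_size_in_bytes is None:
        # Only one field set: the quoted spec text requires both for deletion vectors.
        raise ValueError("content_offset and content_size_in_bytes must be set together")
    if content_offset < 0 or content_size_in_bytes <= 0:
        raise ValueError("content range must have a non-negative offset and positive size")
    return content_offset, content_size_in_bytes
```

If the fields really are never None for Puffin files, the first branch becomes dead code and could instead raise, which is the stricter reading of the spec text quoted above.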


2 participants