Skip to content

[WIP]Support range-based reads for deletion vectors#3478

Open
KaiqiJinWow wants to merge 1 commit into
apache:mainfrom
KaiqiJinWow:fix-dv-content-range-read
Open

[WIP]Support range-based reads for deletion vectors#3478
KaiqiJinWow wants to merge 1 commit into
apache:mainfrom
KaiqiJinWow:fix-dv-content-range-read

Conversation

@KaiqiJinWow

Copy link
Copy Markdown

Summary

  • Expose Iceberg V3 deletion vector content range fields on DataFile
  • Read Puffin deletion vectors from manifest-described content ranges when content_offset/content_size_in_bytes are present
  • Validate deletion vector blobs for length, magic number, CRC, and cardinality while preserving existing whole-file Puffin reads

Testing

  • .venv/bin/python -m pytest -q tests/table/test_puffin.py tests/io/test_pyarrow.py::test_read_deletion_vector_blob_from_content_range tests/io/test_pyarrow.py::test_read_deletes

@KaiqiJinWow KaiqiJinWow changed the title Support range-based reads for deletion vectors [WIP]Support range-based reads for deletion vectors Jun 11, 2026
@KaiqiJinWow KaiqiJinWow force-pushed the fix-dv-content-range-read branch 2 times, most recently from fdc8d3b to 859efdc Compare June 11, 2026 21:37
@KaiqiJinWow KaiqiJinWow force-pushed the fix-dv-content-range-read branch from 859efdc to 118c561 Compare June 11, 2026 23:07
Comment thread pyiceberg/io/pyarrow.py
Comment on lines 1163 to +1167
elif data_file.file_format == FileFormat.PUFFIN:
with io.new_input(data_file.file_path).open() as fi:
content_offset = getattr(data_file, "content_offset", None)
content_size_in_bytes = getattr(data_file, "content_size_in_bytes", None)
if content_offset is not None or content_size_in_bytes is not None:

@ebyhr ebyhr Jun 12, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two fields are never None when the file format is Puffin, right?

https://iceberg.apache.org/spec/#data-file-fields

The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector.
For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct @ebyhr , these should always exist. so I think we can do a stricter read on these fields.

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KaiqiJinWow, main comment is that I think we should introduce a new deletion_vector module which exposes a read_deletion_vector API and hides all the I/O, deserialization, validation. Looks like currently that's all kinda spread out over different classes.

Also just for transparency on what's driving this change to others, currently Databricks Runtime produces deletion vectors that are Iceberg spec compliant DV blobs but they are not neccessarily written in literal Puffin files (they're written in .bin files as a single blob) . The current PyIceberg implementation has strict checks that the DVs must be in literal Puffin files but that's not strictly neccessary. As long as the blob is spec compliant I think there's a reasonable argument that we can consume them regardless of what kind of literal file the blob is stored in. For context, the Java implementation also just works off a similar principle of just reading a spec compliant blob from a range.

Comment thread pyiceberg/table/puffin.py
Comment on lines +35 to +40
_DV_BLOB_LENGTH = struct.Struct(">I")
_DV_BLOB_MAGIC = struct.Struct("<I")
_DV_BLOB_CRC = struct.Struct(">I")
_DV_BLOB_MAGIC_NUMBER = 1681511377
_ROARING_BITMAP_COUNT_SIZE_BYTES = 8
_DV_BLOB_MIN_SIZE_BYTES = _DV_BLOB_LENGTH.size + _DV_BLOB_MAGIC.size + _ROARING_BITMAP_COUNT_SIZE_BYTES + _DV_BLOB_CRC.size

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd want to decouple the DV specific parts into a separate module, table/deletion_vector.py? That would expose something like read_deletion_vector(io, file) -> [bytes] and that handles deserialization and delegates to the io to do the range read.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to decoupling the logic from this class. I submitted a PR yesterday (#3491) that addresses this issue. The code still resides in the same file, but I'll create a separate file named deletion_vector.py if necessary.

Comment thread pyiceberg/io/pyarrow.py
raise ValueError(f"Unsupported file format: {file_format}")


def _validate_deletion_vector(data_file: DataFile) -> tuple[int, int, str]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a helper in the deletion_vector module I mentioned below that I think we should introduce. This class looks like it's kinda become a bit of a kitchen sink of different things but the original intent I think was for fileIO and so I don't think we would want to have the low level DV validation here.

Comment thread pyiceberg/io/pyarrow.py
Comment on lines 1163 to +1167
elif data_file.file_format == FileFormat.PUFFIN:
with io.new_input(data_file.file_path).open() as fi:
content_offset = getattr(data_file, "content_offset", None)
content_size_in_bytes = getattr(data_file, "content_size_in_bytes", None)
if content_offset is not None or content_size_in_bytes is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct @ebyhr , these should always exist. so I think we can do a stricter read on these fields.

Comment thread pyiceberg/io/pyarrow.py
Comment on lines +1168 to +1179
# A DV is declared as PUFFIN in the manifest, but the content range points directly
# to the serialized bitmap blob, so avoid parsing the entire file as a Puffin file.
content_offset, content_size_in_bytes, referenced_data_file = _validate_deletion_vector(data_file)

fi.seek(content_offset)
payload = fi.read(content_size_in_bytes)
if len(payload) != content_size_in_bytes:
raise ValueError(
f"Could not read deletion vector, expected {content_size_in_bytes} bytes, got {len(payload)}"
)
bitmaps = _deserialize_dv_blob(payload, data_file.record_count)
return {referenced_data_file: _bitmaps_to_chunked_array(bitmaps)}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below, I think if we abstract away a read_deletion_vector(io, delete_file) -> bytes API it'll be a bit cleaner. That takes care of all the offset handling, io, deserialization, validation etc .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants