[WIP]Support range-based reads for deletion vectors#3478
Conversation
fdc8d3b to
859efdc
Compare
859efdc to
118c561
Compare
| elif data_file.file_format == FileFormat.PUFFIN: | ||
| with io.new_input(data_file.file_path).open() as fi: | ||
| content_offset = getattr(data_file, "content_offset", None) | ||
| content_size_in_bytes = getattr(data_file, "content_size_in_bytes", None) | ||
| if content_offset is not None or content_size_in_bytes is not None: |
There was a problem hiding this comment.
These two fields are never None when the file format is Puffin, right?
https://iceberg.apache.org/spec/#data-file-fields
The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector.
For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.
There was a problem hiding this comment.
Correct @ebyhr , these should always exist. so I think we can do a stricter read on these fields.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @KaiqiJinWow, main comment is that I think we should introduce a new deletion_vector module which exposes a read_deletion_vector API and hides all the I/O, deserialization, validation. Looks like currently that's all kinda spread out over different classes.
Also just for transparency on what's driving this change to others, currently Databricks Runtime produces deletion vectors that are Iceberg spec compliant DV blobs but they are not neccessarily written in literal Puffin files (they're written in .bin files as a single blob) . The current PyIceberg implementation has strict checks that the DVs must be in literal Puffin files but that's not strictly neccessary. As long as the blob is spec compliant I think there's a reasonable argument that we can consume them regardless of what kind of literal file the blob is stored in. For context, the Java implementation also just works off a similar principle of just reading a spec compliant blob from a range.
| _DV_BLOB_LENGTH = struct.Struct(">I") | ||
| _DV_BLOB_MAGIC = struct.Struct("<I") | ||
| _DV_BLOB_CRC = struct.Struct(">I") | ||
| _DV_BLOB_MAGIC_NUMBER = 1681511377 | ||
| _ROARING_BITMAP_COUNT_SIZE_BYTES = 8 | ||
| _DV_BLOB_MIN_SIZE_BYTES = _DV_BLOB_LENGTH.size + _DV_BLOB_MAGIC.size + _ROARING_BITMAP_COUNT_SIZE_BYTES + _DV_BLOB_CRC.size |
There was a problem hiding this comment.
I think we'd want to decouple the DV specific parts into a separate module, table/deletion_vector.py? That would expose something like read_deletion_vector(io, file) -> [bytes] and that handles deserialization and delegates to the io to do the range read.
There was a problem hiding this comment.
+1 to decoupling the logic from this class. I submitted a PR yesterday (#3491) that addresses this issue. The code still resides in the same file, but I'll create a separate file named deletion_vector.py if necessary.
| raise ValueError(f"Unsupported file format: {file_format}") | ||
|
|
||
|
|
||
| def _validate_deletion_vector(data_file: DataFile) -> tuple[int, int, str]: |
There was a problem hiding this comment.
I think this would be a helper in the deletion_vector module I mentioned below that I think we should introduce. This class looks like it's kinda become a bit of a kitchen sink of different things but the original intent I think was for fileIO and so I don't think we would want to have the low level DV validation here.
| elif data_file.file_format == FileFormat.PUFFIN: | ||
| with io.new_input(data_file.file_path).open() as fi: | ||
| content_offset = getattr(data_file, "content_offset", None) | ||
| content_size_in_bytes = getattr(data_file, "content_size_in_bytes", None) | ||
| if content_offset is not None or content_size_in_bytes is not None: |
There was a problem hiding this comment.
Correct @ebyhr , these should always exist. so I think we can do a stricter read on these fields.
| # A DV is declared as PUFFIN in the manifest, but the content range points directly | ||
| # to the serialized bitmap blob, so avoid parsing the entire file as a Puffin file. | ||
| content_offset, content_size_in_bytes, referenced_data_file = _validate_deletion_vector(data_file) | ||
|
|
||
| fi.seek(content_offset) | ||
| payload = fi.read(content_size_in_bytes) | ||
| if len(payload) != content_size_in_bytes: | ||
| raise ValueError( | ||
| f"Could not read deletion vector, expected {content_size_in_bytes} bytes, got {len(payload)}" | ||
| ) | ||
| bitmaps = _deserialize_dv_blob(payload, data_file.record_count) | ||
| return {referenced_data_file: _bitmaps_to_chunked_array(bitmaps)} |
There was a problem hiding this comment.
Same as below, I think if we abstract away a read_deletion_vector(io, delete_file) -> bytes API it'll be a bit cleaner. That takes care of all the offset handling, io, deserialization, validation etc .
Summary
Testing
.venv/bin/python -m pytest -q tests/table/test_puffin.py tests/io/test_pyarrow.py::test_read_deletion_vector_blob_from_content_range tests/io/test_pyarrow.py::test_read_deletes