[rust] Support schema evolution for log scanner.#654
Conversation
|
@fresh-borzoni , CC |
fresh-borzoni
left a comment
There was a problem hiding this comment.
@loserwang1024 Thank you for the PR, left comments, PTAL
| let guard = self.contexts.read(); | ||
| guard.get(&effective_id).map(|ctx| { | ||
| if is_remote { | ||
| ctx.remote.clone() |
There was a problem hiding this comment.
this returns an owned ReadContext clone per batch, where it used to be a borrow. Fine in the no-projection case, but with projection it deep-copies the Vecs on every batch.
Could we store Arc and return an Arc clone? Then it's a single refcount bump and the projection Vecs aren't copied this way
| remote_log_downloader.clone(), | ||
| log_fetch_buffer.clone(), | ||
| remote_read_context.clone(), | ||
| Arc::clone(&resolver), |
There was a problem hiding this comment.
Schema prewarm only runs on the local path, not here. So reading tiered data written under an older schema fails with No ReadContext found for schema_id N.
The local branch below fetches missing schemas before decoding and the remote path needs the same.
WDYT?
|
|
||
| let log_batch = log_batch_result?; | ||
| let mut record_batch = log_batch.record_batch(&self.read_context)?; | ||
| let read_context = self.resolve_context_for_batch(&log_batch)?; |
There was a problem hiding this comment.
After a schema change, batches here have different column counts, but the reader advertises one fixed schema. Consumers like DataFusion/pyarrow crash on that.
Should we pad each batch up to the advertised schema with null columns for the missing ones.
Also probably needs a test, the IT only covers the row scanner
Purpose
Linked issue: close #547
Brief change log
Tests
API and Format
Documentation