Skip to content

[rust] Support schema evolution for log scanner.#654

Open
loserwang1024 wants to merge 1 commit into
apache:mainfrom
loserwang1024:log_schema
Open

[rust] Support schema evolution for log scanner.#654
loserwang1024 wants to merge 1 commit into
apache:mainfrom
loserwang1024:log_schema

Conversation

@loserwang1024

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #547

Brief change log

Tests

API and Format

Documentation

@loserwang1024

Copy link
Copy Markdown
Contributor Author

@fresh-borzoni , CC

@fresh-borzoni fresh-borzoni left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loserwang1024 Thank you for the PR, left comments, PTAL

let guard = self.contexts.read();
guard.get(&effective_id).map(|ctx| {
if is_remote {
ctx.remote.clone()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this returns an owned ReadContext clone per batch, where it used to be a borrow. Fine in the no-projection case, but with projection it deep-copies the Vecs on every batch.
Could we store Arc and return an Arc clone? Then it's a single refcount bump and the projection Vecs aren't copied this way

remote_log_downloader.clone(),
log_fetch_buffer.clone(),
remote_read_context.clone(),
Arc::clone(&resolver),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema prewarm only runs on the local path, not here. So reading tiered data written under an older schema fails with No ReadContext found for schema_id N.

The local branch below fetches missing schemas before decoding and the remote path needs the same.
WDYT?


let log_batch = log_batch_result?;
let mut record_batch = log_batch.record_batch(&self.read_context)?;
let read_context = self.resolve_context_for_batch(&log_batch)?;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a schema change, batches here have different column counts, but the reader advertises one fixed schema. Consumers like DataFusion/pyarrow crash on that.
Should we pad each batch up to the advertised schema with null columns for the missing ones.
Also probably needs a test, the IT only covers the row scanner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[rust] Apply schema evolution on log scan path (batch + row)

2 participants