perf: unwrap identity casts in schema adapter to enable Parquet stats pruning by mbutrovich · Pull Request #4730 · apache/datafusion-comet

mbutrovich · 2026-06-25T18:31:14Z

Which issue does this PR close?

Closes #.

Rationale for this change

DataFusion's default PhysicalExprAdapter inserts a CastExpr around every Column reference whenever the logical and physical Arrow Fields differ in any attribute, including metadata-only or nullability-only mismatches. DataFusion itself absorbs this because its PruningPredicate analyzer recognizes its own CastExpr and peels it to resolve the column against parquet statistics.

Comet's SparkPhysicalExprAdapter::replace_with_spark_cast then swaps that CastExpr for a datafusion_comet_spark_expr::Cast because Spark cast semantics diverge from arrow-cast for overflow, null handling, ANSI mode, etc. The Spark Cast is a different PhysicalExpr type that DataFusion's pruning analyzer does not understand, so build_pruning_predicates returns None at file open time and no row groups are pruned. With Spark range-derived schemas (non-nullable logical) read from Parquet (nullable physical), this fires on every column reference, silently disabling row-group and page-index stats pruning.

For identical source and target data types there is no Spark-specific cast semantics to preserve, so the swap costs us pruning for no benefit.

What changes are included in this PR?

SparkPhysicalExprAdapter::replace_with_spark_cast now skips the Spark Cast wrap when the physical and target data types are equal. Unwrapping is safe because a Cast with equal source and target types is a value-level identity (it does not null-strip or enforce non-null), and Arrow field nullability and metadata are informational, not computational.

How are these changes tested?

CometNativeReaderSuite adds a regression test that writes a 1000-row Parquet file with parquet.block.size=1024, asserts more than one row group via ParquetFileReader, then runs SELECT ... WHERE c1 > 500 and asserts the scan's numOutputRows is strictly less than 1000. Without the fix the scan reads all rows; with the fix row groups whose max is at most 500 are pruned.

andygrove

Nice catch . Thanks @mbutrovich

mbutrovich · 2026-06-25T20:00:12Z

On TPC-DS Q99, for example, we can see the effect of these filters now firing.

On main:

PR #4730:

unwrap identity casts in schema adapter for filters

d4cecb9

mbutrovich changed the title ~~unwrap identity casts in schema adapter for filters~~ perf: unwrap identity casts in schema adapter to enable Parquet stats pruning Jun 25, 2026

mbutrovich mentioned this pull request Jun 25, 2026

[EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) apache/datafusion#20324

Open

2 tasks

andygrove approved these changes Jun 25, 2026

View reviewed changes

mbutrovich merged commit 2473aaa into apache:main Jun 25, 2026
71 of 72 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: unwrap identity casts in schema adapter to enable Parquet stats pruning#4730

perf: unwrap identity casts in schema adapter to enable Parquet stats pruning#4730
mbutrovich merged 1 commit into
apache:mainfrom
mbutrovich:unwrap_cast

mbutrovich commented Jun 25, 2026 •

edited

Loading

Uh oh!

andygrove left a comment

Uh oh!

mbutrovich commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mbutrovich commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

mbutrovich commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mbutrovich commented Jun 25, 2026 •

edited

Loading