Skip to content

perf: unwrap identity casts in schema adapter to enable Parquet stats pruning#4730

Merged
mbutrovich merged 1 commit into
apache:mainfrom
mbutrovich:unwrap_cast
Jun 25, 2026
Merged

perf: unwrap identity casts in schema adapter to enable Parquet stats pruning#4730
mbutrovich merged 1 commit into
apache:mainfrom
mbutrovich:unwrap_cast

Conversation

@mbutrovich

@mbutrovich mbutrovich commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

DataFusion's default PhysicalExprAdapter inserts a CastExpr around every Column reference whenever the logical and physical Arrow Fields differ in any attribute, including metadata-only or nullability-only mismatches. DataFusion itself absorbs this because its PruningPredicate analyzer recognizes its own CastExpr and peels it to resolve the column against parquet statistics.

Comet's SparkPhysicalExprAdapter::replace_with_spark_cast then swaps that CastExpr for a datafusion_comet_spark_expr::Cast because Spark cast semantics diverge from arrow-cast for overflow, null handling, ANSI mode, etc. The Spark Cast is a different PhysicalExpr type that DataFusion's pruning analyzer does not understand, so build_pruning_predicates returns None at file open time and no row groups are pruned. With Spark range-derived schemas (non-nullable logical) read from Parquet (nullable physical), this fires on every column reference, silently disabling row-group and page-index stats pruning.

For identical source and target data types there is no Spark-specific cast semantics to preserve, so the swap costs us pruning for no benefit.

What changes are included in this PR?

SparkPhysicalExprAdapter::replace_with_spark_cast now skips the Spark Cast wrap when the physical and target data types are equal. Unwrapping is safe because a Cast with equal source and target types is a value-level identity (it does not null-strip or enforce non-null), and Arrow field nullability and metadata are informational, not computational.

How are these changes tested?

CometNativeReaderSuite adds a regression test that writes a 1000-row Parquet file with parquet.block.size=1024, asserts more than one row group via ParquetFileReader, then runs SELECT ... WHERE c1 > 500 and asserts the scan's numOutputRows is strictly less than 1000. Without the fix the scan reads all rows; with the fix row groups whose max is at most 500 are pruned.

@mbutrovich mbutrovich changed the title unwrap identity casts in schema adapter for filters perf: unwrap identity casts in schema adapter to enable Parquet stats pruning Jun 25, 2026

@andygrove andygrove left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch . Thanks @mbutrovich

@mbutrovich

Copy link
Copy Markdown
Contributor Author

On TPC-DS Q99, for example, we can see the effect of these filters now firing.

On main:
main

PR #4730:
4730

@mbutrovich mbutrovich merged commit 2473aaa into apache:main Jun 25, 2026
71 of 72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants