Visitor and evaluator edge cases can over-prune files or mishandle nulls #3498

@kevinjqliu

Several visitor/evaluator edge cases appear unsafe or inconsistent:

  1. _StrictMetricsEvaluator.visit_not_equal / visit_not_in return ROWS_MUST_MATCH whenever a file can contain nulls or NaNs. For example, with column values [null, 5] or [NaN, 5.0] and lower/upper bounds both 5, NotEqualTo("x", 5) and NotIn("x", {5}) evaluate to ROWS_MUST_MATCH even though the row holding 5 does not match. This can incorrectly mark whole files as deleted.

  2. _StrictMetricsEvaluator.eval returns ROWS_MUST_MATCH for record_count <= 0. That is vacuously correct for record_count=0, but per the local comment record_count=-1 means the count is unknown, so nothing can be proven; as written, even AlwaysFalse() evaluates to ROWS_MUST_MATCH.

  3. ResidualVisitor comparison methods compare partition values to literals directly, with no null check. With a nullable identity partition whose value is None, LessThan("x", 1) raises TypeError, whereas row-level evaluation returns false.

  4. ResidualVisitor.visit_not_nan(None) returns AlwaysFalse, while expression evaluation treats NotNaN(None) as true. Existing tests encode both behaviors, so the semantics are inconsistent.
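
To make items 1 and 2 concrete, here is a minimal, self-contained sketch of what a conservative strict not-equal check could look like. It does not use pyiceberg; `ColumnStats` and `strict_not_equal` are hypothetical names, and the sketch assumes that null and NaN rows satisfy `x != literal` and that a negative record count means "unknown". It is meant to illustrate the expected outcomes, not to describe the actual implementation or a definitive fix.

```python
from dataclasses import dataclass
from typing import Optional

ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's file-level metrics."""

    record_count: int  # assumption: a negative value means "unknown"
    null_count: int
    nan_count: int
    lower: Optional[float]
    upper: Optional[float]


def strict_not_equal(stats: ColumnStats, literal: float) -> bool:
    """Return ROWS_MUST_MATCH only when every row provably satisfies x != literal."""
    if stats.record_count < 0:
        # Unknown row count: nothing can be proven about the file.
        return ROWS_MIGHT_NOT_MATCH
    if stats.record_count == 0:
        # Empty file: the predicate holds vacuously.
        return ROWS_MUST_MATCH
    # Assumption: null/NaN rows satisfy x != literal, so a column holding
    # ONLY nulls or ONLY NaNs matches. Merely containing some is not enough.
    if stats.null_count == stats.record_count or stats.nan_count == stats.record_count:
        return ROWS_MUST_MATCH
    # The remaining values lie within [lower, upper]; the literal must fall
    # strictly outside that range for every row to differ from it.
    if stats.lower is not None and stats.lower > literal:
        return ROWS_MUST_MATCH
    if stats.upper is not None and stats.upper < literal:
        return ROWS_MUST_MATCH
    return ROWS_MIGHT_NOT_MATCH


# [null, 5] with bounds 5..5: the row holding 5 fails x != 5.
mixed_null = strict_not_equal(ColumnStats(2, 1, 0, 5.0, 5.0), 5.0)
# record_count=-1: unknown, so the file must not be treated as fully matching.
unknown_count = strict_not_equal(ColumnStats(-1, 0, 0, 1.0, 2.0), 5.0)
```

Under these assumptions both `mixed_null` and `unknown_count` come out as ROWS_MIGHT_NOT_MATCH, which is the outcome items 1 and 2 argue for.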

Validated against the current tree; examples use stats/partition shapes already supported by the repo tests.
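
For items 3 and 4, the following sketch shows one way null-safe handling of a partition value might look. It is plain Python, independent of pyiceberg; `null_safe_compare` and `not_nan` are hypothetical helpers. It assumes that a null partition value should make a comparison false (as item 3 says row evaluation does) and that NotNaN(None) should be true (the expression-evaluation side of item 4). Which side of item 4 is correct is exactly what the issue asks to settle, so treat that choice as an assumption.

```python
import math
import operator
from typing import Any, Callable


def null_safe_compare(op: Callable[[Any, Any], bool], value: Any, literal: Any) -> bool:
    """Evaluate `value <op> literal`, treating a null partition value as no match."""
    if value is None:
        # Comparing None to a literal would raise TypeError in Python 3.
        return False
    return bool(op(value, literal))


def not_nan(value: Any) -> bool:
    """Assumed semantics: null is not NaN, so NotNaN(None) is true."""
    return not (isinstance(value, float) and math.isnan(value))


# The unguarded comparison described in item 3 fails outright.
try:
    None < 1  # type: ignore[operator]
    raised = False
except TypeError:
    raised = True

less_than_null = null_safe_compare(operator.lt, None, 1)  # guarded: no exception
```

The guard keeps the residual path from raising and lets it agree with row-level evaluation for null partition values.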
