aggregate_fn: Return NaN instead of null for mean of all-null input#8365
Conversation
Mean::finalize_scalar returned null when count was zero, while the array finalize path computes sum/count = 0/0 = NaN for the same input. The result of a mean over an all-null group therefore depended on which reduce path it took. Nulls are skipped during accumulation, so an all-null input is an empty mean: let the division produce NaN in both paths. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Merging this PR will improve performance by 32.96%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
35.4 µs | 20.5 µs | +73.06% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] |
213.2 µs | 177.1 µs | +20.42% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] |
308.5 µs | 273.5 µs | +12.8% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing mitko/aggregate/mean-all-null-nan (90d7433) with develop (eda4dd0)
|
This looks incorrect to me. I think this should return Null |
|
I thought that NaN made more semantic sense but you can argue either way. Just have to document what is mean of an empty array |
|
Basically I think NaN as a result composes from partials more naturally without any additional checks |
|
if we combine NaNs, then the associativity of combining partials breaks. both duckdb and datafusion return nulls instead of nans for all-null input. opened #8389 to revert this commit and fix |
Mean::finalize_scalarreturned null when the count was zero, while the arrayfinalizepath computessum/count = 0/0 = NaNfor the same input. A mean over an all-null group therefore gave different results depending on which accumulator we're using.vortex/vortex-array/src/aggregate_fn/fns/mean/mod.rs
Lines 85 to 93 in 90d7433
Since nulls are skipped during accumulation (as in standard SQL aggregation), an all-null input is an empty mean. Both paths now let the division produce NaN.
Note this only matters for the count = 0 case: sum overflow still returns null.