Return null for mean of all-null input#8389
Conversation
Mean::finalize divided the sum and count arrays directly, so an all-null group produced 0/0 = NaN while finalize_scalar returned null for the same input. Nulls are skipped during accumulation, so a zero-count group is an empty mean and should be null, as in SQL. Mask out zero-count entries after the division so both paths agree. The new tests run the same set of groups (including NaN and null mixtures) through both the scalar-partial path and the grouped array path with identical assertions. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Merging this PR will degrade performance by 19.02%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | compare[48] |
213 µs | 300.6 µs | -29.15% |
| ❌ | Simulation | compare[50] |
227.7 µs | 319.2 µs | -28.65% |
| ❌ | Simulation | compare[49] |
228.2 µs | 317.7 µs | -28.18% |
| ❌ | Simulation | compare[44] |
207.5 µs | 287.8 µs | -27.89% |
| ❌ | Simulation | compare[46] |
218.5 µs | 302.5 µs | -27.76% |
| ❌ | Simulation | compare[47] |
223.5 µs | 309.3 µs | -27.73% |
| ❌ | Simulation | compare[40] |
190.7 µs | 263.6 µs | -27.67% |
| ❌ | Simulation | compare[44] |
212.2 µs | 292.3 µs | -27.43% |
| ❌ | Simulation | compare[45] |
218.9 µs | 301 µs | -27.28% |
| ❌ | Simulation | compare[43] |
209.2 µs | 287.6 µs | -27.25% |
| ❌ | Simulation | compare[42] |
204.6 µs | 281.1 µs | -27.23% |
| ❌ | Simulation | compare[40] |
195.6 µs | 268.4 µs | -27.1% |
| ❌ | Simulation | compare[43] |
214.2 µs | 292.5 µs | -26.77% |
| ❌ | Simulation | compare[41] |
204.5 µs | 279.3 µs | -26.76% |
| ❌ | Simulation | compare[42] |
209.4 µs | 286 µs | -26.76% |
| ❌ | Simulation | compare[41] |
209.3 µs | 284 µs | -26.28% |
| ❌ | Simulation | compare[39] |
199.9 µs | 270.8 µs | -26.2% |
| ❌ | Simulation | compare[38] |
195.5 µs | 264.6 µs | -26.11% |
| ❌ | Simulation | compare[36] |
187.3 µs | 252.5 µs | -25.84% |
| ❌ | Simulation | compare[39] |
204.7 µs | 275.6 µs | -25.71% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing mitko/aggregate/mean-all-null-returns-null (561876e) with develop (3731d5b)
Footnotes
-
10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
| // the division produces 0/0 = NaN. The mean of an empty group is null (as in SQL), so | ||
| // mask out zero-count entries. This matches `finalize_scalar`. | ||
| let non_empty = count.binary( | ||
| ConstantArray::new(0u64, count.len()).into_array(), |
There was a problem hiding this comment.
doesn't this need to be target dtype?
|
sorry this was prematurely opened by claude. i'm not sure about the |
| ConstantArray::new(0u64, count.len()).into_array(), | ||
| Operator::NotEq, | ||
| )?; | ||
| mean.mask(non_empty) |
| (vec![Some(f64::NAN), Some(1.0), None], Some(0.5)), | ||
| (vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)), | ||
| (vec![None, None, Some(f64::NAN)], Some(0.0)), | ||
| (vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)), |
Float sums skipped NaN values, so mean(NaN, 1, null) returned 0.5 while count still counted the NaN element. NaN is a value, not an absence: aggregations skip nulls but should let NaN poison the result, matching IEEE 754 addition and other engines. Remove the NaN skip from sum_float_all, which covers the scalar, valid-slices, and grouped sum paths at once; is_saturated already treats a NaN accumulator as saturated (NaN could previously arise from inf + -inf). Also compare the count against a zero of the target dtype rather than u64 in Mean::finalize, per review feedback. The comparison output becomes nullable, so fill nulls (null groups) with false before masking. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
|
I am going to drop this here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html. I think long term we will need |
i just found out about #7009, i think we'll need the option now so we have the default stats adhere to the IEEE standard and our stats be more useful. WIP |
|
we talked about changing only stats to contain the sum as "sum of non-NaNs." But in order to use the sum stat, we have to now always look for the NanCount stat and compute it if it's not present. are we ok with that perf change for old arrays? I suppose we could do the sum and the nan check in one pass and not change it too much, but that'd be a new accumulator implementation; do you guys have better ideas? |
|
We need to compute it anyway to make this correct. I am working through adding skipNaN option to our aggregates |
Why
Reverts #8365 and fixes the inconsistency it addressed in the opposite direction: a mean over zero non-null values should be null (as in SQL, where nulls are skipped and an empty mean is null), not NaN.
Before this change
Mean::finalize(the grouped array path) divided the sum and count arrays directly, so an all-null group produced0/0 = NaN, whileMean::finalize_scalarreturned null for the same input — the result depended on which reduce path the group took. Nowfinalizemasks out zero-count entries after the division so both paths return null.This also changes float sums to propagate NaN instead of skipping it: previously
sumskipped NaN values whilecountcounted them as valid, somean(NaN, 1, null)returned 0.5. NaN is a value, not an absence — aggregations skip nulls but let NaN poison the result, matching IEEE 754 addition and other engines.is_saturatedalready treated a NaN accumulator as saturated (NaN could previously arise frominf + -inf), so only the skip insum_float_allhad to go.The new tests run the same groups (NaN/null mixtures, all-null, all-valid) through both the scalar-partial path and the grouped array path with identical assertions.