Skip to content

Return null for mean of all-null input#8389

Draft
dimitarvdimitrov wants to merge 3 commits into
developfrom
mitko/aggregate/mean-all-null-returns-null
Draft

Return null for mean of all-null input#8389
dimitarvdimitrov wants to merge 3 commits into
developfrom
mitko/aggregate/mean-all-null-returns-null

Conversation

@dimitarvdimitrov

@dimitarvdimitrov dimitarvdimitrov commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Why

Reverts #8365 and fixes the inconsistency it addressed in the opposite direction: a mean over zero non-null values should be null (as in SQL, where nulls are skipped and an empty mean is null), not NaN.

Before this change Mean::finalize (the grouped array path) divided the sum and count arrays directly, so an all-null group produced 0/0 = NaN, while Mean::finalize_scalar returned null for the same input — the result depended on which reduce path the group took. Now finalize masks out zero-count entries after the division so both paths return null.

This also changes float sums to propagate NaN instead of skipping it: previously sum skipped NaN values while count counted them as valid, so mean(NaN, 1, null) returned 0.5. NaN is a value, not an absence — aggregations skip nulls but let NaN poison the result, matching IEEE 754 addition and other engines. is_saturated already treated a NaN accumulator as saturated (NaN could previously arise from inf + -inf), so only the skip in sum_float_all had to go.

The new tests run the same groups (NaN/null mixtures, all-null, all-valid) through both the scalar-partial path and the grouped array path with identical assertions.

… input (#8365)"

This reverts commit 05871c3.

Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Mean::finalize divided the sum and count arrays directly, so an
all-null group produced 0/0 = NaN while finalize_scalar returned null
for the same input. Nulls are skipped during accumulation, so a
zero-count group is an empty mean and should be null, as in SQL. Mask
out zero-count entries after the division so both paths agree.

The new tests run the same set of groups (including NaN and null
mixtures) through both the scalar-partial path and the grouped array
path with identical assertions.

Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
@dimitarvdimitrov dimitarvdimitrov requested a review from a team June 12, 2026 11:47
@codspeed-hq

codspeed-hq Bot commented Jun 12, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 19.02%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 5 improved benchmarks
❌ 93 regressed benchmarks
✅ 1430 untouched benchmarks
⏩ 10 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation compare[48] 213 µs 300.6 µs -29.15%
Simulation compare[50] 227.7 µs 319.2 µs -28.65%
Simulation compare[49] 228.2 µs 317.7 µs -28.18%
Simulation compare[44] 207.5 µs 287.8 µs -27.89%
Simulation compare[46] 218.5 µs 302.5 µs -27.76%
Simulation compare[47] 223.5 µs 309.3 µs -27.73%
Simulation compare[40] 190.7 µs 263.6 µs -27.67%
Simulation compare[44] 212.2 µs 292.3 µs -27.43%
Simulation compare[45] 218.9 µs 301 µs -27.28%
Simulation compare[43] 209.2 µs 287.6 µs -27.25%
Simulation compare[42] 204.6 µs 281.1 µs -27.23%
Simulation compare[40] 195.6 µs 268.4 µs -27.1%
Simulation compare[43] 214.2 µs 292.5 µs -26.77%
Simulation compare[41] 204.5 µs 279.3 µs -26.76%
Simulation compare[42] 209.4 µs 286 µs -26.76%
Simulation compare[41] 209.3 µs 284 µs -26.28%
Simulation compare[39] 199.9 µs 270.8 µs -26.2%
Simulation compare[38] 195.5 µs 264.6 µs -26.11%
Simulation compare[36] 187.3 µs 252.5 µs -25.84%
Simulation compare[39] 204.7 µs 275.6 µs -25.71%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing mitko/aggregate/mean-all-null-returns-null (561876e) with develop (3731d5b)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

// the division produces 0/0 = NaN. The mean of an empty group is null (as in SQL), so
// mask out zero-count entries. This matches `finalize_scalar`.
let non_empty = count.binary(
ConstantArray::new(0u64, count.len()).into_array(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this need to be target dtype?

@dimitarvdimitrov dimitarvdimitrov marked this pull request as draft June 12, 2026 12:57
@dimitarvdimitrov

Copy link
Copy Markdown
Contributor Author

sorry this was prematurely opened by claude. i'm not sure about the mean(nan, 1) == 0.5

ConstantArray::new(0u64, count.len()).into_array(),
Operator::NotEq,
)?;
mean.mask(non_empty)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

(vec![Some(f64::NAN), Some(1.0), None], Some(0.5)),
(vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)),
(vec![None, None, Some(f64::NAN)], Some(0.0)),
(vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong

Float sums skipped NaN values, so mean(NaN, 1, null) returned 0.5 while
count still counted the NaN element. NaN is a value, not an absence:
aggregations skip nulls but should let NaN poison the result, matching
IEEE 754 addition and other engines. Remove the NaN skip from
sum_float_all, which covers the scalar, valid-slices, and grouped sum
paths at once; is_saturated already treats a NaN accumulator as
saturated (NaN could previously arise from inf + -inf).

Also compare the count against a zero of the target dtype rather than
u64 in Mean::finalize, per review feedback. The comparison output
becomes nullable, so fill nulls (null groups) with false before
masking.

Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
@robert3005

Copy link
Copy Markdown
Contributor

I am going to drop this here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html. I think long term we will need skipna flag (or similar) to make this correct we need count to skip nans as well

@dimitarvdimitrov

Copy link
Copy Markdown
Contributor Author

I am going to drop this here pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html. I think long term we will need skipna flag (or similar) to make this correct we need count to skip nans as well

i just found out about #7009, i think we'll need the option now so we have the default stats adhere to the IEEE standard and our stats be more useful. WIP

@dimitarvdimitrov

Copy link
Copy Markdown
Contributor Author

we talked about changing only stats to contain the sum as "sum of non-NaNs." But in order to use the sum stat, we have to now always look for the NanCount stat and compute it if it's not present.

are we ok with that perf change for old arrays? I suppose we could do the sum and the nan check in one pass and not change it too much, but that'd be a new accumulator implementation; do you guys have better ideas?

@robert3005

Copy link
Copy Markdown
Contributor

We need to compute it anyway to make this correct. I am working through adding skipNaN option to our aggregates

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants