Return null for mean of all-null input by dimitarvdimitrov · Pull Request #8389 · vortex-data/vortex

dimitarvdimitrov · 2026-06-12T11:47:41Z

Why

Reverts #8365 and fixes the inconsistency it addressed in the opposite direction: a mean over zero non-null values should be null (as in SQL, where nulls are skipped and an empty mean is null), not NaN.

Before this change Mean::finalize (the grouped array path) divided the sum and count arrays directly, so an all-null group produced 0/0 = NaN, while Mean::finalize_scalar returned null for the same input — the result depended on which reduce path the group took. Now finalize masks out zero-count entries after the division so both paths return null.

This also changes float sums to propagate NaN instead of skipping it: previously sum skipped NaN values while count counted them as valid, so mean(NaN, 1, null) returned 0.5. NaN is a value, not an absence — aggregations skip nulls but let NaN poison the result, matching IEEE 754 addition and other engines. is_saturated already treated a NaN accumulator as saturated (NaN could previously arise from inf + -inf), so only the skip in sum_float_all had to go.

The new tests run the same groups (NaN/null mixtures, all-null, all-valid) through both the scalar-partial path and the grouped array path with identical assertions.

… input (#8365)" This reverts commit 05871c3. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>

Mean::finalize divided the sum and count arrays directly, so an all-null group produced 0/0 = NaN while finalize_scalar returned null for the same input. Nulls are skipped during accumulation, so a zero-count group is an empty mean and should be null, as in SQL. Mask out zero-count entries after the division so both paths agree. The new tests run the same set of groups (including NaN and null mixtures) through both the scalar-partial path and the grouped array path with identical assertions. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>

codspeed-hq · 2026-06-12T11:56:58Z

Merging this PR will degrade performance by 19.02%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 5 improved benchmarks
❌ 93 regressed benchmarks
✅ 1430 untouched benchmarks
⏩ 10 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`compare[48]`	213 µs	300.6 µs	-29.15%
❌	Simulation	`compare[50]`	227.7 µs	319.2 µs	-28.65%
❌	Simulation	`compare[49]`	228.2 µs	317.7 µs	-28.18%
❌	Simulation	`compare[44]`	207.5 µs	287.8 µs	-27.89%
❌	Simulation	`compare[46]`	218.5 µs	302.5 µs	-27.76%
❌	Simulation	`compare[47]`	223.5 µs	309.3 µs	-27.73%
❌	Simulation	`compare[40]`	190.7 µs	263.6 µs	-27.67%
❌	Simulation	`compare[44]`	212.2 µs	292.3 µs	-27.43%
❌	Simulation	`compare[45]`	218.9 µs	301 µs	-27.28%
❌	Simulation	`compare[43]`	209.2 µs	287.6 µs	-27.25%
❌	Simulation	`compare[42]`	204.6 µs	281.1 µs	-27.23%
❌	Simulation	`compare[40]`	195.6 µs	268.4 µs	-27.1%
❌	Simulation	`compare[43]`	214.2 µs	292.5 µs	-26.77%
❌	Simulation	`compare[41]`	204.5 µs	279.3 µs	-26.76%
❌	Simulation	`compare[42]`	209.4 µs	286 µs	-26.76%
❌	Simulation	`compare[41]`	209.3 µs	284 µs	-26.28%
❌	Simulation	`compare[39]`	199.9 µs	270.8 µs	-26.2%
❌	Simulation	`compare[38]`	195.5 µs	264.6 µs	-26.11%
❌	Simulation	`compare[36]`	187.3 µs	252.5 µs	-25.84%
❌	Simulation	`compare[39]`	204.7 µs	275.6 µs	-25.71%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing mitko/aggregate/mean-all-null-returns-null (561876e) with develop (3731d5b)}

10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

robert3005 · 2026-06-12T12:14:43Z

+        // the division produces 0/0 = NaN. The mean of an empty group is null (as in SQL), so
+        // mask out zero-count entries. This matches `finalize_scalar`.
+        let non_empty = count.binary(
+            ConstantArray::new(0u64, count.len()).into_array(),


doesn't this need to be target dtype?

dimitarvdimitrov · 2026-06-12T12:57:58Z

sorry this was prematurely opened by claude. i'm not sure about the mean(nan, 1) == 0.5

joseph-isaacs · 2026-06-12T12:58:27Z

+            ConstantArray::new(0u64, count.len()).into_array(),
+            Operator::NotEq,
+        )?;
+        mean.mask(non_empty)


joseph-isaacs · 2026-06-12T12:59:53Z

+            (vec![Some(f64::NAN), Some(1.0), None], Some(0.5)),
+            (vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)),
+            (vec![None, None, Some(f64::NAN)], Some(0.0)),
+            (vec![Some(f64::NAN), Some(1.0), Some(1.0)], Some(2.0 / 3.0)),


This looks wrong

Float sums skipped NaN values, so mean(NaN, 1, null) returned 0.5 while count still counted the NaN element. NaN is a value, not an absence: aggregations skip nulls but should let NaN poison the result, matching IEEE 754 addition and other engines. Remove the NaN skip from sum_float_all, which covers the scalar, valid-slices, and grouped sum paths at once; is_saturated already treats a NaN accumulator as saturated (NaN could previously arise from inf + -inf). Also compare the count against a zero of the target dtype rather than u64 in Mean::finalize, per review feedback. The comparison output becomes nullable, so fill nulls (null groups) with false before masking. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>

robert3005 · 2026-06-12T13:23:39Z

I am going to drop this here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html. I think long term we will need skipna flag (or similar) to make this correct we need count to skip nans as well

dimitarvdimitrov · 2026-06-12T13:44:21Z

I am going to drop this here pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html. I think long term we will need skipna flag (or similar) to make this correct we need count to skip nans as well

i just found out about #7009, i think we'll need the option now so we have the default stats adhere to the IEEE standard and our stats be more useful. WIP

dimitarvdimitrov · 2026-06-12T16:40:00Z

we talked about changing only stats to contain the sum as "sum of non-NaNs." But in order to use the sum stat, we have to now always look for the NanCount stat and compute it if it's not present.

are we ok with that perf change for old arrays? I suppose we could do the sum and the nan check in one pass and not change it too much, but that'd be a new accumulator implementation; do you guys have better ideas?

robert3005 · 2026-06-12T16:43:23Z

We need to compute it anyway to make this correct. I am working through adding skipNaN option to our aggregates

dimitarvdimitrov added 2 commits June 12, 2026 12:33

Revert "aggregate_fn: Return NaN instead of null for mean of all-null…

58a710c

… input (#8365)" This reverts commit 05871c3. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>

dimitarvdimitrov requested a review from a team June 12, 2026 11:47

robert3005 reviewed Jun 12, 2026

View reviewed changes

dimitarvdimitrov marked this pull request as draft June 12, 2026 12:57

joseph-isaacs reviewed Jun 12, 2026

View reviewed changes

dimitarvdimitrov added the changelog/fix A bug fix label Jun 12, 2026

dimitarvdimitrov mentioned this pull request Jun 12, 2026

aggregate_fn: Return NaN instead of null for mean of all-null input #8365

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return null for mean of all-null input#8389

Return null for mean of all-null input#8389
dimitarvdimitrov wants to merge 3 commits into
developfrom
mitko/aggregate/mean-all-null-returns-null

dimitarvdimitrov commented Jun 12, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

robert3005 Jun 12, 2026

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

joseph-isaacs Jun 12, 2026

Uh oh!

joseph-isaacs Jun 12, 2026

Uh oh!

robert3005 commented Jun 12, 2026

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

robert3005 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dimitarvdimitrov commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

Uh oh!

codspeed-hq Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will degrade performance by 19.02%

Performance Changes

Footnotes

Uh oh!

robert3005 Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

joseph-isaacs Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

joseph-isaacs Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

robert3005 commented Jun 12, 2026

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

dimitarvdimitrov commented Jun 12, 2026

Uh oh!

robert3005 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dimitarvdimitrov commented Jun 12, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 12, 2026 •

edited

Loading