Clamp max_iter in OOC KMeans test by tarang-jain · Pull Request #2256 · NVIDIA/cuvs

tarang-jain · 2026-06-23T12:25:58Z

Addresses the KMeans flaky test in #2100. Since we bit match centroids, we limit the number of iterations to ensure that both the OOC and in-memory run the same number of iters. The flaky test is due to numerical differences arising out of atomic operations (perhaps due to different ordering or adding several batch sums in the OOC case versus a single atomic op in the in-memory case) -- raft::reduce_rows_by_key.

coderabbitai · 2026-06-23T12:27:55Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 869b9e32-3b4c-4549-9b2f-31d8c9827dd1

📥 Commits

Reviewing files that changed from the base of the PR and between c3372d0 and 0d47620.

📒 Files selected for processing (1)

cpp/tests/cluster/kmeans.cu

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/tests/cluster/kmeans.cu

📝 Walkthrough

Summary by CodeRabbit

Tests
- Updated KMeans clustering test parameters to ensure consistent iteration counts across different testing approaches, improving test reliability.

Walkthrough

In KmeansFitBatchedTest::fitBatchedTest(), params.max_iter is now set to 3 (with explanatory comments) immediately after params.oversampling_factor = 0, and the later params.max_iter = 20 assignment that overrode it has been removed.

Changes

Batched KMeans Test Iteration Cap

Layer / File(s)	Summary
Batched KMeans test max_iter fix `cpp/tests/cluster/kmeans.cu`	Adds `params.max_iter = 3` (with explanatory comments) after `params.oversampling_factor = 0`, and removes the later `params.max_iter = 20` override so both the reference and batched `fit` calls use the same iteration count.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Clamp max_iter in OOC KMeans test' accurately describes the main change: limiting the maximum iterations in the out-of-core KMeans test, which directly addresses the test flakiness issue.
Description check	✅ Passed	The description is clearly related to the changeset, explaining the rationale for clamping max_iter to prevent flaky tests due to numerical differences between OOC and in-memory KMeans implementations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

…tolerance

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/cluster/detail/kmeans_common.cuh`:
- Around line 541-603: The diagnostic fingerprinting code in
compute_centroid_adjustments performs expensive host-device memory copies,
stream synchronization, and hashing operations unconditionally in the KMeans hot
path. Wrap both the input diagnostic block (containing the static s_call_id
counter, mat_labels device buffer creation, host vector copies via raft::copy,
and the fnv/fmt_head lambdas with RAFT_LOG_INFO call) and the corresponding
output diagnostic block (mentioned as also applying to lines 626-666) behind a
compile-time or runtime gate macro/flag so these operations only execute when
explicitly enabled for debugging, not during normal operation.

In `@cpp/src/cluster/detail/minClusterDistanceCompute.cu`:
- Around line 151-208: The self-test block starting with s_self_test_done and
the entire diagnostic block containing the call_id, tensor copying, hash
computation via fnv_fp, and RAFT_LOG_INFO call are expensive operations that
should not run in the default hot path. Gate both the
runFusedNNDeterminismSelfTest invocation block and the diagnostic fingerprinting
block (including the h_X, h_c, h_norm allocations and device-to-host copies)
behind a compile-time feature flag or runtime environment variable that is
disabled by default, so these diagnostics only execute when explicitly enabled
for debugging purposes.

In `@cpp/tests/cluster/kmeans.cu`:
- Around line 384-386: The max_iter parameter is being set to 3 to limit
iterations for consistency between reference and batched code paths, but this
value is later overridden to 20 at line 409, making the initial cap ineffective.
Remove or comment out the later params.max_iter = 20 assignment that occurs
after the max_iter = 3 setting so that the iteration limit remains at 3
throughout the test execution and the flakiness mitigation takes effect as
intended.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3577ffe9-d6da-42d2-a562-20698b05d1d4

📥 Commits

Reviewing files that changed from the base of the PR and between 71306ec and c3372d0.

📒 Files selected for processing (3)

cpp/src/cluster/detail/kmeans_common.cuh
cpp/src/cluster/detail/minClusterDistanceCompute.cu
cpp/tests/cluster/kmeans.cu

cjnolet · 2026-06-23T16:00:17Z

    params.oversampling_factor = 0;
+    // Limit the number of iterations to ensure same number of iterations for reference and batched
+    // code paths.
+    params.max_iter = 3;


This fix doesn't seem particularly robust, and appears to me like it'll reduce the coverage for the tests, right? Default max_iters for k-means is much higher. How do we test in a way where we can validate its ability to stop early without having to reduce the max-iters to the point where we are essentially not testing the early stopping at all?

I think the point of the test when we are bit matching centroids is to ensure that the accumulation of partial sums in the OOC setting in a single iteration matches the in-memory setting. The test fails when there is an edge case -- the variation due to floating point precision in the arithmetic is just enough to push the difference in centroids between consecutive iterations to fall below the convergence criteria -- in-memory converges one iteration sooner in than OOC.

tarang-jain added 2 commits June 22, 2026 05:46

checks

1098a78

limit iterations

a84501d

tarang-jain requested a review from a team as a code owner June 23, 2026 12:25

github-project-automation Bot added this to Unstructured Data Processing Jun 23, 2026

Merge branch 'main' into tolerance

c3372d0

tarang-jain added 2 commits June 23, 2026 12:29

revert debug statements

1a42257

Merge branch 'tolerance' of https://github.com/tarang-jain/cuvs into …

0f495f4

…tolerance

tarang-jain self-assigned this Jun 23, 2026

tarang-jain added bug Something isn't working non-breaking Introduces a non-breaking change labels Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread cpp/src/cluster/detail/kmeans_common.cuh Outdated

Comment thread cpp/src/cluster/detail/minClusterDistanceCompute.cu Outdated

Comment thread cpp/tests/cluster/kmeans.cu

fix param reset

0d47620

cjnolet moved this to In Progress in Unstructured Data Processing Jun 23, 2026

cjnolet reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Clamp max_iter in OOC KMeans test#2256

Clamp max_iter in OOC KMeans test#2256
tarang-jain wants to merge 6 commits into
NVIDIA:mainfrom
tarang-jain:tolerance

tarang-jain commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cjnolet Jun 23, 2026

Uh oh!

tarang-jain Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tarang-jain commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cjnolet Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

tarang-jain Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading