Skip to content

Clamp max_iter in OOC KMeans test#2256

Open
tarang-jain wants to merge 6 commits into
NVIDIA:mainfrom
tarang-jain:tolerance
Open

Clamp max_iter in OOC KMeans test#2256
tarang-jain wants to merge 6 commits into
NVIDIA:mainfrom
tarang-jain:tolerance

Conversation

@tarang-jain

Copy link
Copy Markdown
Contributor

Addresses the KMeans flaky test in #2100. Since we bit match centroids, we limit the number of iterations to ensure that both the OOC and in-memory run the same number of iters. The flaky test is due to numerical differences arising out of atomic operations (perhaps due to different ordering or adding several batch sums in the OOC case versus a single atomic op in the in-memory case) -- raft::reduce_rows_by_key.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 869b9e32-3b4c-4549-9b2f-31d8c9827dd1

📥 Commits

Reviewing files that changed from the base of the PR and between c3372d0 and 0d47620.

📒 Files selected for processing (1)
  • cpp/tests/cluster/kmeans.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tests/cluster/kmeans.cu

📝 Walkthrough

Summary by CodeRabbit

  • Tests
    • Updated KMeans clustering test parameters to ensure consistent iteration counts across different testing approaches, improving test reliability.

Walkthrough

In KmeansFitBatchedTest::fitBatchedTest(), params.max_iter is now set to 3 (with explanatory comments) immediately after params.oversampling_factor = 0, and the later params.max_iter = 20 assignment that overrode it has been removed.

Changes

Batched KMeans Test Iteration Cap

Layer / File(s) Summary
Batched KMeans test max_iter fix
cpp/tests/cluster/kmeans.cu
Adds params.max_iter = 3 (with explanatory comments) after params.oversampling_factor = 0, and removes the later params.max_iter = 20 override so both the reference and batched fit calls use the same iteration count.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Clamp max_iter in OOC KMeans test' accurately describes the main change: limiting the maximum iterations in the out-of-core KMeans test, which directly addresses the test flakiness issue.
Description check ✅ Passed The description is clearly related to the changeset, explaining the rationale for clamping max_iter to prevent flaky tests due to numerical differences between OOC and in-memory KMeans implementations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@tarang-jain tarang-jain self-assigned this Jun 23, 2026
@tarang-jain tarang-jain added bug Something isn't working non-breaking Introduces a non-breaking change labels Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/cluster/detail/kmeans_common.cuh`:
- Around line 541-603: The diagnostic fingerprinting code in
compute_centroid_adjustments performs expensive host-device memory copies,
stream synchronization, and hashing operations unconditionally in the KMeans hot
path. Wrap both the input diagnostic block (containing the static s_call_id
counter, mat_labels device buffer creation, host vector copies via raft::copy,
and the fnv/fmt_head lambdas with RAFT_LOG_INFO call) and the corresponding
output diagnostic block (mentioned as also applying to lines 626-666) behind a
compile-time or runtime gate macro/flag so these operations only execute when
explicitly enabled for debugging, not during normal operation.

In `@cpp/src/cluster/detail/minClusterDistanceCompute.cu`:
- Around line 151-208: The self-test block starting with s_self_test_done and
the entire diagnostic block containing the call_id, tensor copying, hash
computation via fnv_fp, and RAFT_LOG_INFO call are expensive operations that
should not run in the default hot path. Gate both the
runFusedNNDeterminismSelfTest invocation block and the diagnostic fingerprinting
block (including the h_X, h_c, h_norm allocations and device-to-host copies)
behind a compile-time feature flag or runtime environment variable that is
disabled by default, so these diagnostics only execute when explicitly enabled
for debugging purposes.

In `@cpp/tests/cluster/kmeans.cu`:
- Around line 384-386: The max_iter parameter is being set to 3 to limit
iterations for consistency between reference and batched code paths, but this
value is later overridden to 20 at line 409, making the initial cap ineffective.
Remove or comment out the later params.max_iter = 20 assignment that occurs
after the max_iter = 3 setting so that the iteration limit remains at 3
throughout the test execution and the flakiness mitigation takes effect as
intended.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3577ffe9-d6da-42d2-a562-20698b05d1d4

📥 Commits

Reviewing files that changed from the base of the PR and between 71306ec and c3372d0.

📒 Files selected for processing (3)
  • cpp/src/cluster/detail/kmeans_common.cuh
  • cpp/src/cluster/detail/minClusterDistanceCompute.cu
  • cpp/tests/cluster/kmeans.cu

Comment thread cpp/src/cluster/detail/kmeans_common.cuh Outdated
Comment thread cpp/src/cluster/detail/minClusterDistanceCompute.cu Outdated
Comment thread cpp/tests/cluster/kmeans.cu
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Jun 23, 2026
params.oversampling_factor = 0;
// Limit the number of iterations to ensure same number of iterations for reference and batched
// code paths.
params.max_iter = 3;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix doesn't seem particularly robust, and appears to me like it'll reduce the coverage for the tests, right? Default max_iters for k-means is much higher. How do we test in a way where we can validate its ability to stop early without having to reduce the max-iters to the point where we are essentially not testing the early stopping at all?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point of the test when we are bit matching centroids is to ensure that the accumulation of partial sums in the OOC setting in a single iteration matches the in-memory setting. The test fails when there is an edge case -- the variation due to floating point precision in the arithmetic is just enough to push the difference in centroids between consecutive iterations to fall below the convergence criteria -- in-memory converges one iteration sooner in than OOC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants