Use KvikIO for CAGRA-ACE#2257
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughIntegrates KvikIO into cuVS build, runtime, file I/O, neighbor disk workflows, tests, and packaging. Large-file I/O now uses kvikio handles, CAGRA and HNSW disk paths use kvikio-based reads and writes, and dependency declarations are updated across CMake, conda, Python, and release tooling. ChangesKvikIO GPU Direct Storage Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/neighbors/detail/hnsw.hpp`:
- Around line 840-846: The code calls get() on graph_future, dataset_future, and
label_future sequentially, but if graph_future.get() throws an exception, the
subsequent calls to dataset_future.get() and label_future.get() are never
executed, leaving those async operations unresolved in the background. To fix
this, ensure all three futures are awaited before any exception can propagate by
either using try-catch blocks to drain all futures before re-throwing, or by
restructuring the code to call all three get() methods within a context that
guarantees all are executed (such as a scoped guard or utility function that
ensures all futures are consumed).
In `@cpp/src/util/file_io.cpp`:
- Around line 26-38: The bulk I/O operations are reopening files by path using
kvikio::FileHandle instead of preserving the already-open file descriptor, which
creates a TOCTOU race condition where the path could be replaced between open
and I/O. Fix this by either using the existing file descriptor directly instead
of reopening via path, or by verifying that the inode and device have not
changed before proceeding with the I/O operation. This change breaks the API
contract for descriptor-only usage (where get_path() is unavailable), so add
proper deprecation warnings at the affected call sites (around line 26-38 and
line 49-61 in file_io.cpp) and update the migration guide to document the
required changes for users relying on descriptor-only semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 300f4901-414b-474c-924a-8a674e657be5
📒 Files selected for processing (24)
conda/environments/all_cuda-129_arch-aarch64.yamlconda/environments/all_cuda-129_arch-x86_64.yamlconda/environments/all_cuda-133_arch-aarch64.yamlconda/environments/all_cuda-133_arch-x86_64.yamlconda/environments/bench_ann_cuda-129_arch-x86_64.yamlconda/environments/bench_ann_cuda-133_arch-aarch64.yamlconda/environments/bench_ann_cuda-133_arch-x86_64.yamlconda/environments/go_cuda-129_arch-x86_64.yamlconda/environments/go_cuda-133_arch-aarch64.yamlconda/environments/go_cuda-133_arch-x86_64.yamlconda/environments/rust_cuda-129_arch-x86_64.yamlconda/environments/rust_cuda-133_arch-aarch64.yamlconda/environments/rust_cuda-133_arch-x86_64.yamlconda/recipes/libcuvs/recipe.yamlcpp/CMakeLists.txtcpp/cmake/thirdparty/get_kvikio.cmakecpp/include/cuvs/util/file_io.hppcpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/hnsw.hppcpp/src/util/file_io.cppcpp/tests/CMakeLists.txtcpp/tests/util/file_io_test.cppdependencies.yamlpython/libcuvs/pyproject.toml
jameslamb
left a comment
There was a problem hiding this comment.
I left a few small initial comments from a packaging-codeowners perspective. In general, not opposed to having kvikio be a dependency of cuvs if cuVS maintainers are ok with it.
I'd be happy to review again once cuVS maintainers have more thoroughly gone through this and given their opinions.
| # CUDA 12 on x86_64 only. | ||
| - if: cuda_major == "13" or linux64 | ||
| then: | ||
| - libcufile-dev |
There was a problem hiding this comment.
It should be possible to depend on libcufile unconditionally. RAPIDS moved its CUDA 12 floor to CUDA 12.2 specifically for that.
libkvikio does that:
Can you please explain why this limitation was added? If it's leftover from older RAPIDS references, please remove these guards and comments throughout this PR and make this an unconditional dependency.
There was a problem hiding this comment.
This also may be totally unnecessary for this recipe... I don't see any direct uses of cufile headers in this PR. If this was just to satisfy libkvikio's transitive dependency on cuFile, leave it out... libkvikio conda packages already carry a runtime dependency on libcufile-dev.
There was a problem hiding this comment.
Thanks, I've missed that this was fixed in CUDA 12.2. I've also removed the outdated libcufile-dev dependency.
| requires-python = ">=3.11" | ||
| dependencies = [ | ||
| "cuda-toolkit[cublas,curand,cusolver,cusparse,nvrtc]==13.*", | ||
| "libkvikio==26.8.*,>=0.0.0a0", |
There was a problem hiding this comment.
RAPIDS wheels runtime dlopen() their RAPIDS shared-library dependency.
Add libkvikio here:
cuvs/python/libcuvs/libcuvs/load.py
Lines 38 to 44 in 71306ec
I don't believe rmm or raft depend on it or that it depends on either of them so it can be in any order (I think).
There was a problem hiding this comment.
Applied. I've added it first in case rmm or raft might use kvikio in the future.
| build-backend = "scikit_build_core.build" | ||
| requires = [ | ||
| "cmake>=4.0", | ||
| "libkvikio==26.8.*,>=0.0.0a0", |
There was a problem hiding this comment.
Please update the script we use when cutting new release branches: https://github.com/rapidsai/cuvs/blob/71306ec8e76f6c104768cfaba301ca2dcba780bd/ci/release/update-version.sh#L107
It now needs libkvikio entries.
# Conflicts: # cpp/src/neighbors/detail/cagra/cagra_build.cuh
Motivation:
raft::sample_rowsin each partition. ACE uses large files leading to high page-cache pressure and memory fragmentation. Direct file I/O skips the page cache and resolves the potentially failing pinned memory allocations.Changes: