BE-629: Add spherical k-means entity clustering endpoint via /entities/embeddings/clusters#8919

Open
indietyp wants to merge 1 commit into main from bm/be-629-implement-kmeans-clustering-in-the-hash-graph

@indietyp

Member

🌟 What is the purpose of this PR?

This PR adds a POST /entities/embeddings/clusters endpoint that groups a set of entities by embedding similarity using spherical k-means clustering. Callers supply a list of entity IDs, a desired cluster count, and an optional embedding dimension (matryoshka truncation). The response contains the cluster assignments with unit-normalized centroids, plus a list of entities that had no stored embedding.

The clustering algorithm is implemented from scratch in Rust using SIMD-accelerated kernels (f32x8), k-means++ seeding, multiple restarts, and parallel assignment via Rayon. Embeddings are truncated server-side in Postgres using subvector before being sent over the wire, keeping network cost proportional to the requested dimension rather than the full stored width.
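
To make the "spherical" part concrete, here is a minimal scalar sketch of one Lloyd iteration on unit-length vectors. It is not the PR's SIMD or Rayon code: the function names `dot`, `normalize`, and `lloyd_step` and their signatures are assumptions chosen for illustration, and empty clusters simply keep their previous centroid here rather than being reseeded.

```rust
/// Dot product of two equal-length slices (scalar stand-in for a SIMD kernel).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scales `v` to unit length in place. Returns `false` and leaves `v`
/// unchanged when its norm is zero.
fn normalize(v: &mut [f32]) -> bool {
    let norm = dot(v, v).sqrt();
    if norm == 0.0 {
        return false;
    }
    v.iter_mut().for_each(|x| *x /= norm);
    true
}

/// One Lloyd iteration of spherical k-means over unit-length `points`.
/// Returns one label per point and overwrites `centroids` in place.
fn lloyd_step(points: &[Vec<f32>], centroids: &mut [Vec<f32>]) -> Vec<usize> {
    if centroids.is_empty() {
        return Vec::new();
    }

    // Assignment: on the unit sphere the nearest centroid is the one with
    // the largest dot product (cosine similarity).
    let labels: Vec<usize> = points
        .iter()
        .map(|p| {
            let mut best = (0, f32::NEG_INFINITY);
            for (j, c) in centroids.iter().enumerate() {
                let similarity = dot(p, c);
                if similarity > best.1 {
                    best = (j, similarity);
                }
            }
            best.0
        })
        .collect();

    // Update: sum the members of each cluster, then project the sum back
    // onto the unit sphere.
    let dim = centroids[0].len();
    let mut sums = vec![vec![0.0_f32; dim]; centroids.len()];
    for (p, &label) in points.iter().zip(&labels) {
        for (s, x) in sums[label].iter_mut().zip(p) {
            *s += x;
        }
    }
    for (c, mut s) in centroids.iter_mut().zip(sums) {
        // A zero sum (for example an empty cluster) keeps the old centroid.
        if normalize(&mut s) {
            *c = s;
        }
    }
    labels
}
```

Repeating `lloyd_step` until the labels stop changing, or until the centroids move less than a tolerance, gives the basic loop; seeding, restarts, and reseeding of empty clusters sit on top of that.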

The implementation is up to 24x faster than existing CPU-based clustering crates.

🔍 What does this change?

  • Adds a Dimension newtype that enforces the positive-multiple-of-8 invariant required by the SIMD kernels.
  • Adds a kernel module with SIMD-accelerated primitives: dot, add_into, scale_into, scale, add_scaled_into, normalize, micro_4x2 (4-point × 2-centroid tiled dot product), and nearest4 (nearest-centroid search for 4 points simultaneously).
  • Adds a clustering module implementing spherical k-means with k-means++ D² seeding, Lloyd iterations, empty-cluster reseeding, convergence tolerance, and configurable restarts via a Config struct.
  • Adds ClusterEntitiesParams, EntityCluster, and ClusterEntitiesResponse types to the entity store API.
  • Adds a ClusterError error type covering invalid dimension, dimension-too-large, and store failure cases.
  • Adds cluster_entities to the EntityStore trait and implements it in the Postgres store, including permission filtering that does not reveal whether a requested entity ID was denied or simply has no stored embedding.
  • Registers the new endpoint at POST /entities/embeddings/clusters and nests the existing POST /entities/embeddings handler under /entities/embeddings/ to keep the routing consistent.
  • Exposes the new types and endpoint in the OpenAPI schema.
  • Forwards cluster_entities through the type-fetcher store wrapper and the integration test DatabaseApi shim.
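
The multiple-of-8 invariant and the 8-lane kernels in the list above fit together roughly as in the following sketch. It assumes a `Dimension::new` constructor that returns `Option` and emulates an f32x8 register with a plain `[f32; 8]` accumulator; the real constructor signature, error type, and SIMD types in the PR may differ.

```rust
/// Embedding width that is guaranteed to be a positive multiple of 8.
/// (Hypothetical shape; the PR's newtype may expose a different API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Dimension(usize);

impl Dimension {
    /// Number of f32 lanes processed per step.
    const LANES: usize = 8;

    /// Returns `None` for zero and for values that are not a multiple of 8.
    fn new(value: usize) -> Option<Self> {
        (value > 0 && value % Self::LANES == 0).then_some(Self(value))
    }

    fn get(self) -> usize {
        self.0
    }
}

/// Dot product that walks both inputs 8 lanes at a time. Because `dim` is a
/// multiple of 8, there is never a scalar remainder loop to handle.
fn dot(dim: Dimension, a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), dim.get());
    assert_eq!(b.len(), dim.get());

    // One partial sum per lane, like a single f32x8 accumulator.
    let mut acc = [0.0_f32; Dimension::LANES];
    for (ca, cb) in a
        .chunks_exact(Dimension::LANES)
        .zip(b.chunks_exact(Dimension::LANES))
    {
        for lane in 0..Dimension::LANES {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    // Horizontal reduction of the lane sums.
    acc.iter().sum()
}
```

Validating the width once at construction lets the kernels skip tail handling. The check belongs at the API boundary, which matches the invalid-dimension error case listed above.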

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • does not modify any publishable blocks or libraries, or modifications do not need publishing

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

🛡 What tests cover this?

  • Unit tests for squared_chord_distance covering identical, orthogonal, opposite, and zero-norm cases.
  • Unit tests for all SIMD kernel functions (dot, add_into, scale_into, scale, add_scaled_into, normalize, micro_4x2, nearest4) verified against scalar reference implementations.
  • Unit tests for the clustering algorithm covering: empty input, k=0, k=1, single point, n < 4, n = k, well-separated blobs (>95% accuracy), determinism with the same seed, unit-normalized centroids, labels in range, labels nearest to assigned centroid, subsampled clustering, more clusters than natural groups, all-identical points, and mixed zero-norm rows.
  • Unit tests for the Dimension newtype covering valid multiples of 8, zero rejection, and non-multiples rejection.
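
For reference, the squared chord distance between two directions is 2 − 2·cos θ, so the identical, orthogonal, and opposite cases come out as 0, 2, and 4. The sketch below uses an assumed signature for `squared_chord_distance`, and its handling of zero-norm input (treated as orthogonal, distance 2) is a guess for illustration, not the behavior confirmed by the PR.

```rust
/// Squared chord distance between the directions of `a` and `b`:
/// |a/|a| - b/|b||^2 = 2 - 2 * cos(theta).
/// ASSUMPTION: a zero-norm input has no direction and is treated as
/// orthogonal to everything, which yields a distance of 2.
fn squared_chord_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 2.0;
    }
    // Clamp to guard against rounding pushing the cosine outside [-1, 1].
    let cosine = (dot / (norm_a * norm_b)).clamp(-1.0, 1.0);
    2.0 - 2.0 * cosine
}
```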

❓ How to test this?

  1. Check out the branch.
  2. Ensure entities with stored embeddings exist in the database.
  3. Send a POST /entities/embeddings/clusters request with a JSON body containing entityIds, clusterCount, and optionally dimension and seed.
  4. Confirm the response contains clusters (each with clusterId, entityIds, and centroid) and missingEmbeddings for any entities without stored embeddings.
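
A request body for step 3 could be assembled as in the sketch below. Only the field names entityIds, clusterCount, dimension, and seed come from the description; the helper `cluster_request_body`, the `u16` cluster count, and the plain-string entity IDs are assumptions, since the actual serialized ID format is not shown here.

```rust
/// Builds a JSON body for POST /entities/embeddings/clusters.
/// ASSUMPTION: entity IDs serialize as plain strings that need no escaping;
/// the optional fields are omitted entirely when not set.
fn cluster_request_body(
    entity_ids: &[&str],
    cluster_count: u16,
    dimension: Option<usize>,
    seed: Option<u64>,
) -> String {
    let ids = entity_ids
        .iter()
        .map(|id| format!("\"{id}\""))
        .collect::<Vec<_>>()
        .join(",");

    let mut body = format!("{{\"entityIds\":[{ids}],\"clusterCount\":{cluster_count}");
    if let Some(dimension) = dimension {
        body.push_str(&format!(",\"dimension\":{dimension}"));
    }
    if let Some(seed) = seed {
        body.push_str(&format!(",\"seed\":{seed}"));
    }
    body.push('}');
    body
}
```

Passing the same seed twice should, per the determinism test listed above, yield the same assignments for the same input.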

feat: checkpoint

feat: embedding clustering

fix: merge

Copilot AI review requested due to automatic review settings June 30, 2026 09:46
@vercel

vercel Bot commented Jun 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hash | Error | | Jun 30, 2026 9:51am |

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hashdotdesign-tokens | Ignored | Preview | Jun 30, 2026 9:51am |
| petrinaut | Skipped | | Jun 30, 2026 9:51am |

@vercel vercel Bot temporarily deployed to Preview – petrinaut June 30, 2026 09:46 Inactive
@cursor

cursor Bot commented Jun 30, 2026

PR Summary

Medium Risk
New authenticated endpoint runs permission-filtered SQL and CPU-heavy parallel clustering on arbitrary entity ID lists; routing nests embeddings under /entities/embeddings/ but paths stay compatible.

Overview
Adds spherical k-means clustering over entity embeddings so callers can group a set of entities by similarity, with optional matryoshka dimension truncation and an optional RNG seed.

API & routing: New POST /entities/embeddings/clusters handler; embedding updates move under a nested /entities/embeddings/ router (/ for updates, /clusters for clustering). OpenAPI documents ClusterEntitiesParams, ClusterEntitiesResponse, and EntityCluster.

Store layer: EntityStore::cluster_entities plus ClusterError for invalid/oversized dimensions and store failures. Postgres loads entity-level embeddings via subvector in SQL, filters by view permission, runs in-process clustering, and returns clusters with centroids while listing request IDs without embeddings in missing_embeddings (without distinguishing permission denials).

New embedding stack: Dimension (multiple-of-8), SIMD kernel helpers, and a Rayon-backed clustering module (k-means++, restarts, subsampling). Wired through type-fetcher and integration test shims.
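
The k-means++ seeding mentioned above picks each new seed with probability proportional to its squared distance from the nearest seed chosen so far. The sketch below is a standalone illustration under assumptions: it uses a small SplitMix64 generator in place of whatever RNG the PR uses, the squared chord distance 2 − 2·dot as the D² weight, and a hypothetical `seed_plus_plus` function that returns point indices.

```rust
/// Tiny deterministic RNG (SplitMix64), used only to keep the sketch
/// dependency-free. The PR's actual RNG is not shown here.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Squared chord distance between two unit vectors, clamped at zero.
fn chord2(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    (2.0 - 2.0 * dot).max(0.0)
}

/// Returns the indices of up to `k` seed points chosen by D^2 sampling.
fn seed_plus_plus(points: &[Vec<f32>], k: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut chosen = Vec::with_capacity(k);
    if points.is_empty() || k == 0 {
        return chosen;
    }

    // First seed: uniform over all points.
    let first = (rng.next_u64() % points.len() as u64) as usize;
    chosen.push(first);
    let mut d2: Vec<f32> = points.iter().map(|p| chord2(p, &points[first])).collect();
    d2[first] = 0.0;

    while chosen.len() < k.min(points.len()) {
        let total: f32 = d2.iter().sum();
        if total <= 0.0 {
            break; // every remaining point coincides with an existing seed
        }
        // Walk the cumulative weights until the random target is reached.
        let mut target = rng.next_f32() * total;
        let mut pick = None;
        let mut last_positive = 0;
        for (i, &weight) in d2.iter().enumerate() {
            if weight <= 0.0 {
                continue;
            }
            last_positive = i;
            if target < weight {
                pick = Some(i);
                break;
            }
            target -= weight;
        }
        // Rounding can run past the end; fall back to the last valid point.
        let pick = pick.unwrap_or(last_positive);
        chosen.push(pick);

        // Each point keeps the distance to its nearest seed.
        for (weight, p) in d2.iter_mut().zip(points) {
            *weight = weight.min(chord2(p, &points[pick]));
        }
        d2[pick] = 0.0;
    }
    chosen
}
```

Running the whole procedure several times with different seeds and keeping the best result is what the restarts amount to.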

Reviewed by Cursor Bugbot for commit bf17f16.

@github-actions github-actions Bot added the area/libs, type/eng > backend, and area/tests labels Jun 30, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Comment thread libs/@local/graph/store/src/embedding/clustering.rs

Copilot AI left a comment

Contributor

Pull request overview

Adds embedding-based spherical k-means clustering to the graph store and exposes it via a new REST endpoint (POST /entities/embeddings/clusters). The implementation introduces a SIMD-accelerated Rust clustering engine and wires it through the Postgres store, API routing/OpenAPI, and relevant store wrappers/shims.

Changes:

  • Introduces a new embedding module in hash_graph_store with a Dimension invariant type, SIMD kernels, and a spherical k-means implementation.
  • Extends the EntityStore API with cluster_entities and implements it in the Postgres store, including permission filtering and embedding truncation via subvector.
  • Adds the REST endpoint and forwards the new store method through type-fetcher and integration test shims.
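
Matryoshka truncation keeps only a prefix of the stored embedding. An in-memory equivalent of that step can be sketched as follows; the helper `truncate_embedding` is hypothetical, and re-normalizing immediately after truncation is an assumption made so the result can be fed straight into a cosine-based algorithm.

```rust
/// Keeps the first `dim` components of `full` and rescales them to unit
/// length. Returns `None` when `dim` is zero or exceeds the stored width,
/// mirroring the invalid-dimension and dimension-too-large error cases.
fn truncate_embedding(full: &[f32], dim: usize) -> Option<Vec<f32>> {
    if dim == 0 || dim > full.len() {
        return None;
    }
    let mut prefix = full[..dim].to_vec();
    let norm = prefix.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        // A truncated prefix of a unit vector is generally shorter than 1,
        // so it has to be rescaled before cosine comparisons.
        prefix.iter_mut().for_each(|x| *x /= norm);
    }
    Some(prefix)
}
```

Doing the prefix cut in the database rather than after the fetch is what keeps the transferred data proportional to the requested dimension.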

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/graph/integration/postgres/lib.rs | Forwards cluster_entities through the DatabaseApi integration shim. |
| libs/@local/graph/type-fetcher/src/store.rs | Forwards cluster_entities through the type-fetcher store wrapper. |
| libs/@local/graph/store/src/lib.rs | Enables required nightly features and registers the new embedding module. |
| libs/@local/graph/store/src/error.rs | Adds ClusterError for clustering-related failures. |
| libs/@local/graph/store/src/entity/store.rs | Adds request/response types and the EntityStore::cluster_entities trait method. |
| libs/@local/graph/store/src/entity/mod.rs | Re-exports the new clustering API types. |
| libs/@local/graph/store/src/embedding/mod.rs | Declares the new embedding submodules and lint expectations. |
| libs/@local/graph/store/src/embedding/kernel.rs | Implements SIMD-accelerated vector primitives and tests. |
| libs/@local/graph/store/src/embedding/dimension.rs | Adds Dimension newtype enforcing “positive multiple of 8”. |
| libs/@local/graph/store/src/embedding/clustering.rs | Implements spherical k-means (+ seeding/restarts/parallel assignment) and tests. |
| libs/@local/graph/postgres-store/src/store/postgres/knowledge/entity/mod.rs | Implements cluster_entities query + permission filtering + clustering execution. |
| libs/@local/graph/api/src/rest/entity.rs | Registers the new REST endpoint and nests existing embeddings routing. |

Comment thread libs/@local/graph/store/src/embedding/clustering.rs
Comment on lines +2584 to +2596
let mut groups: HashMap<u16, Vec<EntityId>> = HashMap::new();
for (index, id) in found_ids.iter().enumerate() {
    groups.entry(result.label(index)).or_default().push(*id);
}

let clusters = groups
    .into_iter()
    .map(|(cluster_id, entity_ids)| EntityCluster {
        cluster_id,
        entity_ids,
        centroid: result.centroid(cluster_id).to_vec(),
    })
    .collect();
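
The grouping step in the excerpt above can be tried in isolation with the sketch below. It is a variation, not the PR's code: entity IDs are stood in for by `u64`, `BTreeMap` replaces `HashMap` so that clusters come out in ascending `cluster_id` order, and the `missing_embeddings` helper is a hypothetical illustration of how requested IDs without a loaded embedding could be derived.

```rust
use std::collections::{BTreeMap, HashSet};

/// Groups entity IDs by their cluster label. `labels[i]` is the label of
/// `found_ids[i]`. BTreeMap yields groups sorted by label, which makes the
/// output order reproducible across runs.
fn group_by_label(found_ids: &[u64], labels: &[u16]) -> Vec<(u16, Vec<u64>)> {
    let mut groups: BTreeMap<u16, Vec<u64>> = BTreeMap::new();
    for (id, &label) in found_ids.iter().zip(labels) {
        groups.entry(label).or_default().push(*id);
    }
    groups.into_iter().collect()
}

/// Requested IDs for which no embedding was loaded, in request order.
/// Denied and genuinely missing entities look identical at this point.
fn missing_embeddings(requested: &[u64], found_ids: &[u64]) -> Vec<u64> {
    let found: HashSet<u64> = found_ids.iter().copied().collect();
    requested
        .iter()
        .copied()
        .filter(|id| !found.contains(id))
        .collect()
}
```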
Comment on lines +766 to +770
store
    .cluster_entities(actor_id, params)
    .await
    .map_err(report_to_response)
    .map(Json)
Comment thread libs/@local/graph/store/src/entity/store.rs