BE-629: Add spherical k-means entity clustering endpoint via /entities/embeddings/clusters #8919
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit bf17f16.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Pull request overview
Adds embedding-based spherical k-means clustering to the graph store and exposes it via a new REST endpoint (POST /entities/embeddings/clusters). The implementation introduces a SIMD-accelerated Rust clustering engine and wires it through the Postgres store, API routing/OpenAPI, and relevant store wrappers/shims.
Changes:
- Introduces a new `embedding` module in `hash_graph_store` with a `Dimension` invariant type, SIMD kernels, and a spherical k-means implementation.
- Extends the `EntityStore` API with `cluster_entities` and implements it in the Postgres store, including permission filtering and embedding truncation via `subvector`.
- Adds the REST endpoint and forwards the new store method through type-fetcher and integration test shims.
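
To illustrate the `Dimension` invariant mentioned above, here is a minimal sketch of a validating newtype. The constant `LANES`, the error enum `DimensionError`, and the method names are assumptions made for this example; the PR's actual type may differ in naming and error handling.

```rust
/// Number of SIMD lanes the kernels are assumed to process at once (f32x8).
const LANES: usize = 8;

/// Hypothetical stand-in for the PR's `Dimension` type: a vector length that
/// is guaranteed to be a positive multiple of `LANES`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dimension(usize);

/// Reasons a raw length cannot become a `Dimension` (names are illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum DimensionError {
    Zero,
    NotMultipleOfLanes(usize),
}

impl Dimension {
    /// Validates the invariant once, at construction time.
    pub fn new(value: usize) -> Result<Self, DimensionError> {
        if value == 0 {
            Err(DimensionError::Zero)
        } else if value % LANES != 0 {
            Err(DimensionError::NotMultipleOfLanes(value))
        } else {
            Ok(Self(value))
        }
    }

    /// Returns the validated length.
    pub fn get(self) -> usize {
        self.0
    }
}
```

Checking the invariant at construction means kernels that accept a `Dimension` can skip remainder handling in their inner loops.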
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/graph/integration/postgres/lib.rs | Forwards cluster_entities through the DatabaseApi integration shim. |
| libs/@local/graph/type-fetcher/src/store.rs | Forwards cluster_entities through the type-fetcher store wrapper. |
| libs/@local/graph/store/src/lib.rs | Enables required nightly features and registers the new embedding module. |
| libs/@local/graph/store/src/error.rs | Adds ClusterError for clustering-related failures. |
| libs/@local/graph/store/src/entity/store.rs | Adds request/response types and the EntityStore::cluster_entities trait method. |
| libs/@local/graph/store/src/entity/mod.rs | Re-exports the new clustering API types. |
| libs/@local/graph/store/src/embedding/mod.rs | Declares the new embedding submodules and lint expectations. |
| libs/@local/graph/store/src/embedding/kernel.rs | Implements SIMD-accelerated vector primitives and tests. |
| libs/@local/graph/store/src/embedding/dimension.rs | Adds Dimension newtype enforcing “positive multiple of 8”. |
| libs/@local/graph/store/src/embedding/clustering.rs | Implements spherical k-means (+ seeding/restarts/parallel assignment) and tests. |
| libs/@local/graph/postgres-store/src/store/postgres/knowledge/entity/mod.rs | Implements cluster_entities query + permission filtering + clustering execution. |
| libs/@local/graph/api/src/rest/entity.rs | Registers the new REST endpoint and nests existing embeddings routing. |
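
The table lists `kernel.rs` as the home of the SIMD primitives. As a rough illustration of why the length must be a multiple of 8, the sketch below computes a dot product with eight independent accumulators, one per lane of an `f32x8` register. It uses plain stable Rust rather than the nightly SIMD types the PR enables, so it should be read as a scalar reference, not as the PR's kernel.

```rust
/// Dot product over two equal-length slices whose length is a multiple of 8.
/// Eight independent accumulators mirror the lanes of an `f32x8` register.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    assert_eq!(a.len() % 8, 0, "length must be a multiple of 8");

    let mut lanes = [0.0_f32; 8];
    for (chunk_a, chunk_b) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        for lane in 0..8 {
            lanes[lane] += chunk_a[lane] * chunk_b[lane];
        }
    }
    // Horizontal reduction of the eight partial sums.
    lanes.iter().sum()
}

/// Scales `v` in place to unit length; a zero vector is left unchanged.
pub fn normalize(v: &mut [f32]) {
    let norm = dot(v, v).sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

Because every chunk is exactly eight wide, the loop needs no tail handling, which is the property the positive-multiple-of-8 invariant is there to guarantee.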
    let mut groups: HashMap<u16, Vec<EntityId>> = HashMap::new();
    for (index, id) in found_ids.iter().enumerate() {
        groups.entry(result.label(index)).or_default().push(*id);
    }

    let clusters = groups
        .into_iter()
        .map(|(cluster_id, entity_ids)| EntityCluster {
            cluster_id,
            entity_ids,
            centroid: result.centroid(cluster_id).to_vec(),
        })
        .collect();

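
The excerpt above groups entity IDs by label through a `HashMap`, whose iteration order is unspecified, so the order of the resulting clusters can vary between runs. If a stable order is wanted, one option is a `BTreeMap`. The sketch below is an assumption-laden illustration: `u64` stands in for `EntityId`, and the `labels` slice stands in for the `result.label(index)` lookups.

```rust
use std::collections::BTreeMap;

/// Groups ids by their cluster label. `labels[i]` is the label of `ids[i]`.
/// `u64` is a placeholder for the real `EntityId` type.
pub fn group_by_label(ids: &[u64], labels: &[u16]) -> Vec<(u16, Vec<u64>)> {
    assert_eq!(ids.len(), labels.len(), "one label per id");

    let mut groups: BTreeMap<u16, Vec<u64>> = BTreeMap::new();
    for (id, label) in ids.iter().zip(labels) {
        groups.entry(*label).or_default().push(*id);
    }
    // BTreeMap iterates in ascending key order, so the output is stable.
    groups.into_iter().collect()
}
```

Whether the response order matters to callers is not stated in the PR; sorting the collected `Vec` by `cluster_id` would be an equivalent alternative.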
    store
        .cluster_entities(actor_id, params)
        .await
        .map_err(report_to_response)
        .map(Json)


🌟 What is the purpose of this PR?
This PR adds a `POST /entities/embeddings/clusters` endpoint that groups a set of entities by embedding similarity using spherical k-means clustering. Callers supply a list of entity IDs, a desired cluster count, and an optional embedding dimension (matryoshka truncation). The response contains the cluster assignments with unit-normalized centroids, plus a list of entities that had no stored embedding.

The clustering algorithm is implemented from scratch in Rust using SIMD-accelerated kernels (`f32x8`), k-means++ seeding, multiple restarts, and parallel assignment via Rayon. Embeddings are truncated server-side in Postgres using `subvector` before being sent over the wire, keeping network cost proportional to the requested dimension rather than the full stored width.

The implementation is up to 24x faster than existing crates that operate on CPUs.
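
For readers unfamiliar with spherical k-means, the sketch below shows a single Lloyd iteration in scalar Rust: each unit-length point is assigned to the centroid with the highest cosine similarity, and each centroid is then replaced by the normalized sum of its members. It is a simplified illustration under stated assumptions, not the PR's implementation: it omits SIMD, k-means++ seeding, restarts, and parallelism, and the function name `lloyd_step` is made up for this example.

```rust
/// Plain scalar dot product.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// One Lloyd iteration of spherical k-means.
/// `points` are assumed to be unit-normalized; `centroids` are updated in place.
/// Returns the label assigned to each point.
pub fn lloyd_step(points: &[Vec<f32>], centroids: &mut [Vec<f32>]) -> Vec<usize> {
    // Assignment: pick the centroid with the highest cosine similarity.
    let labels: Vec<usize> = points
        .iter()
        .map(|point| {
            let mut best = 0;
            let mut best_similarity = f32::NEG_INFINITY;
            for (k, centroid) in centroids.iter().enumerate() {
                let similarity = dot(point, centroid);
                if similarity > best_similarity {
                    best = k;
                    best_similarity = similarity;
                }
            }
            best
        })
        .collect();

    // Update: each centroid becomes the normalized sum of its members.
    for (k, centroid) in centroids.iter_mut().enumerate() {
        let mut sum = vec![0.0_f32; centroid.len()];
        for (point, &label) in points.iter().zip(&labels) {
            if label == k {
                for (s, x) in sum.iter_mut().zip(point) {
                    *s += x;
                }
            }
        }
        let norm = dot(&sum, &sum).sqrt();
        // An empty or degenerate cluster keeps its previous centroid here;
        // the PR describes reseeding empty clusters instead.
        if norm > 0.0 {
            for s in sum.iter_mut() {
                *s /= norm;
            }
            *centroid = sum;
        }
    }
    labels
}
```

Repeating this step until the labels stop changing (or a tolerance is met) gives the basic algorithm; keeping centroids on the unit sphere is what distinguishes it from ordinary Euclidean k-means.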
🔍 What does this change?
- Adds a `Dimension` newtype that enforces the positive-multiple-of-8 invariant required by the SIMD kernels.
- Adds a `kernel` module with SIMD-accelerated primitives: `dot`, `add_into`, `scale_into`, `scale`, `add_scaled_into`, `normalize`, `micro_4x2` (4-point × 2-centroid tiled dot product), and `nearest4` (nearest-centroid search for 4 points simultaneously).
- Adds a `clustering` module implementing spherical k-means with k-means++ D² seeding, Lloyd iterations, empty-cluster reseeding, convergence tolerance, and configurable restarts via a `Config` struct.
- Adds `ClusterEntitiesParams`, `EntityCluster`, and `ClusterEntitiesResponse` types to the entity store API.
- Adds a `ClusterError` error type covering invalid dimension, dimension-too-large, and store failure cases.
- Adds `cluster_entities` to the `EntityStore` trait and implements it in the Postgres store, including permission filtering that avoids leaking which entity IDs were denied versus missing embeddings.
- Adds `POST /entities/embeddings/clusters` and nests the existing `POST /entities/embeddings` handler under `/entities/embeddings/` to keep the routing consistent.
- Forwards `cluster_entities` through the type-fetcher store wrapper and the integration test `DatabaseApi` shim.

Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
🛡 What tests cover this?
- Unit tests for `squared_chord_distance` covering identical, orthogonal, opposite, and zero-norm cases.
- Unit tests for the kernel primitives (`dot`, `add_into`, `scale_into`, `scale`, `add_scaled_into`, `normalize`, `micro_4x2`, `nearest4`) verified against scalar reference implementations.
- Unit tests for the `Dimension` newtype covering valid multiples of 8, zero rejection, and non-multiples rejection.

❓ How to test this?
- Send a `POST /entities/embeddings/clusters` request with a JSON body containing `entityIds`, `clusterCount`, and optionally `dimension` and `seed`.
- Confirm that the response contains `clusters` (each with `clusterId`, `entityIds`, and `centroid`) and `missingEmbeddings` for any entities without stored embeddings.
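
As a starting point for manual testing, the sketch below assembles a request body with the field names given in the PR description (`entityIds`, `clusterCount`, `dimension`, `seed`). The value types, the helper name `cluster_request_body`, and the placeholder entity IDs are assumptions; the generated OpenAPI schema is the authority on the real shapes. It uses only the standard library, whereas real client code would normally serialize with a JSON crate.

```rust
/// Builds a JSON body for `POST /entities/embeddings/clusters`.
/// Field names come from the PR description; the value types are assumptions.
pub fn cluster_request_body(
    entity_ids: &[&str],
    cluster_count: u16,
    dimension: Option<u32>,
    seed: Option<u64>,
) -> String {
    // Entity ids are assumed to need no JSON escaping in this sketch.
    let ids = entity_ids
        .iter()
        .map(|id| format!("\"{id}\""))
        .collect::<Vec<_>>()
        .join(",");

    let mut body = format!("{{\"entityIds\":[{ids}],\"clusterCount\":{cluster_count}");
    // Optional fields are omitted entirely when not set.
    if let Some(dimension) = dimension {
        body.push_str(&format!(",\"dimension\":{dimension}"));
    }
    if let Some(seed) = seed {
        body.push_str(&format!(",\"seed\":{seed}"));
    }
    body.push('}');
    body
}
```

Passing a fixed `seed` should make repeated requests comparable, assuming the server uses it to seed the k-means++ initialization as the parameter name suggests.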