diff --git a/dev/specs/ihs-138-store-merge/decisions.md b/dev/specs/ihs-138-store-merge/decisions.md new file mode 100644 index 00000000..9f8834eb --- /dev/null +++ b/dev/specs/ihs-138-store-merge/decisions.md @@ -0,0 +1,187 @@ +# IHS-138 - Decision sign-off sheet + +One row per open decision from [`plan.md`](./plan.md) section 11. Each has a +recommendation and the reasoning; mark **Accept** / **Override** and note any +change. Nothing in Stage 1 (presence flags) depends on these - they gate Stage 2 +onward. + +Already settled (recorded in the plan, listed here for completeness): + +- **Version:** SDK 1.23.0, semver minor, shipped as a bug fix with a required + opt-out and a required behaviour-change note. +- **Where the option lives:** `Config` default -> store default -> per-call + `merge=` override. + +--- + +## D1 - Returned object vs stored object (C2) + +**Question.** After a re-fetch, the store keeps the merged canonical object. What +does a query method (`get` / `all` / `filters`) return? + +- **(a) Return the freshly built per-query object** (diverges from the store copy). +- **(b) Return the canonical merged store object.** + +**Recommendation: (a).** + +**Why.** + +- Minor-version safe: what `client.get()` returns is byte-for-byte what it returns + today (exactly the fields this query asked for). Only `store.get()` gets richer. + Option (b) changes every query's return value to carry data the caller did not + request - much harder to call non-breaking. +- It shrinks the D2 blast radius: the only "living" object is the internal store + copy; returned objects are stable per-query snapshots. +- Less code: (b) needs the query path to map incoming -> canonical and swap + returned references. +- Clean mental model: "the return value is this query's result; the store + accumulates across queries." + +**Cost / what to verify.** `client.get(id) is client.store.get(id)` stops being +true (it holds today). Value equality `==` still holds (id-based, `node.py:720`). +Add a `save()` test: `save()` calls `store.set(self)`, so `self` merges into the +canonical copy while the caller keeps `self` - confirm no field is revived that +the save did not touch. + +**Decision:** ____________________ + +--- + +## D2 - In-place mutation / living objects (C3) + +**Question.** Merge mutates the existing stored object, so a `store.get()` +reference can change under the caller when an unrelated query re-fetches that node. +Is that the intended model? + +**Recommendation: accept it, documented, with local-edit protection.** + +**Why.** + +- It is the correct property for a cache: every store reference points at the one + canonical object, which is always current. The alternative (merge into a new + object, replace the entry) leaves previously handed-out `store.get()` references + stale and diverging from the store - worse. +- Under D1(a) the surprise is confined to code that deliberately holds a + `store.get()` result across queries; plain query return values never mutate. +- Protect unsaved local edits: merge must skip fields flagged + `value_has_been_mutated` (`attribute.py:100`) / `_peer_has_been_mutated` + (`related_node.py:74`), so re-querying never discards in-memory changes. + +**Cost.** New, documented behaviour. Covered by the section 6 "behaviour change +must be clear" requirement. + +**Decision:** ____________________ + +--- + +## D3 - Default for the public `store.set()` (C5) + +**Question.** The query-population path defaults to merge (the fix). What should the +public `store.set(node=...)` default to? + +- **(a) `merge=False` (replace) by default**, opt into merge. +- **(b) `merge=True` (merge) by default**, same as queries. + +**Recommendation: (a) - public `set()` defaults to replace.** + +**Why.** + +- "set" reads as an imperative "make the store hold this," like `dict[k] = v`. + Merge is the enrichment behaviour the *query* path needs, not what an explicit + `set` implies. +- Preserves the documented contract ("store this object", `store.mdx:147`). That + example sets a fresh object under a custom key, usually with no prior entry, so + replace and merge are identical there - no example breaks. +- The IHS-138 bug lives in the query path, not in manual `set()`. Keep the default + change where the bug is; leave the explicit call predictable. +- A user who wants enrichment passes `merge=True`. + +**Note.** This means the default differs by entry point: queries follow +`Config.store_merge` (default merge); `set()` defaults to replace regardless. That +is intentional and should be stated in the `set()` docstring. + +**Decision:** ____________________ + +--- + +## D4 - Config option name and type + +**Question.** How is the global default expressed on `Config`? + +- **(a) Bool `store_merge` with a full `description`** (drafted in plan section 3). +- **(b) A clearer bool name** (e.g. `merge_store_results`). +- **(c) Enum `store_update_mode: StoreUpdateMode` (`MERGE` / `REPLACE`)**, matching + existing `Config` enums (`InfrahubClientMode`, `RecorderType`). + +**Recommendation: (a) - bool `store_merge` carried by its description.** + +**Why.** + +- Type-consistency with the per-call `merge: bool` argument. An enum on `Config` + plus a bool per call is a mismatch; keeping both bool is simplest to reason about. +- The clarity the user asked for is delivered by the `description` (which states + both behaviours and the default), per the section 6 hard requirement - so the + terse name is acceptable. +- Enum is the fallback if call-site self-documentation is valued over type + consistency, and we are willing to accept the enum/bool split (or make the + per-call arg accept the enum too). + +**Decision:** ____________________ + +--- + +## D5 - Internal presence-flag naming + +**Question.** Name for the new "present in this response" flag on `Attribute` and +`RelatedNode`. + +**Recommendation: `is_fetched`, with a uniform accessor across all three field +types.** + +**Why.** + +- `RelatedNode` already has `initialized` meaning "has a peer" + (`related_node.py:189`) - a *different* concept - so the new flag there cannot be + called `initialized` without a confusing collision. +- `RelationshipManager.initialized` already means "was fetched." Expose + `is_fetched` on it as a thin alias of `initialized`, add a real `is_fetched` + flag to `Attribute` and `RelatedNode`, and the merge code can branch uniformly + on `field.is_fetched` regardless of field type. +- Internal only (not public API), so low stakes - but the uniform accessor removes + per-type special-casing in the merge. + +**Decision:** ____________________ + +--- + +## D6 - Scope of `merge=False` (replace) + +**Question.** When a node is stored with `merge=False`, does it also drop that +node's relationship peers from the store, or only replace the node's own entry? + +**Recommendation: only replace the node's own entry.** + +**Why.** + +- Peers are independent store entries that other nodes may reference; cascading + eviction could break unrelated references. +- Peers fetched in the same query go through their own `set()` calls and follow the + same `merge` flag independently. No special cascade needed. + +**Decision:** ____________________ + +--- + +## Summary table + +| ID | Decision | Recommendation | +|----|----------|----------------| +| D1 | Returned vs stored object | (a) return per-query object; store is canonical | +| D2 | In-place mutation model | Accept, documented, protect local edits | +| D3 | `store.set()` default | (a) replace by default; queries merge by default | +| D4 | Config option name/type | (a) bool `store_merge` + full description | +| D5 | Presence-flag name | `is_fetched`, uniform accessor on all three types | +| D6 | `merge=False` scope | Node entry only; peers handled independently | + +Once D1-D3 are settled, Stage 2 onward is unblocked. D4-D6 can be confirmed during +implementation without reblocking. diff --git a/dev/specs/ihs-138-store-merge/plan.md b/dev/specs/ihs-138-store-merge/plan.md new file mode 100644 index 00000000..3abd937e --- /dev/null +++ b/dev/specs/ihs-138-store-merge/plan.md @@ -0,0 +1,514 @@ +# IHS-138 - Merge nodes in the client store instead of overwriting + +- **Ticket:** [IHS-138](https://opsmill.atlassian.net/browse/IHS-138) (GitHub [#413](https://github.com/opsmill/infrahub-sdk-python/issues/413)) +- **Priority:** High +- **Affected versions:** observed on SDK 1.12.1 +- **Target version:** SDK 1.23.0 (ships alongside Infrahub 1.11.0) +- **Status:** plan / not started + +> The SDK is versioned independently of Infrahub core. 1.23.0 is a semver *minor* +> bump even though it rides the Infrahub 1.11.0 release. Because it changes +> default store behaviour (see sections 3 and 10), shipping it as a minor depends +> on two things being true: the change is framed as a bug fix, and an explicit +> opt-out exists. Both are required deliverables, not nice-to-haves. + +## 1. Problem + +Querying the same Infrahub node more than once can silently drop information +that was previously held in the client store. + +Reproduction from the ticket: + +```python +client = InfrahubClientSync(address="https://demo.infrahub.app/") + +interface = client.get("InfraInterface", device__name__value="jfk1-edge1", + name__value="Ethernet6", prefetch_relationships=True) +client.store.get(interface.id).device.id # works + +client.all("InfraCircuitEndpoint", prefetch_relationships=True) +client.store.get(interface.id).device.id # AttributeError: 'NoneType' has no attribute 'id' +``` + +The first query stores the interface with its `device` relationship. The second +query re-fetches the same interface as a *related node* of each circuit endpoint, +at a depth that does not include the interface's `device`. After the second query +the device link is gone, and the interface is present in the store twice. + +## 2. Root cause + +The store keys objects by a random per-object id, not by their stable UUID. + +- `infrahub_sdk/node/node.py:110` - every node object gets a fresh + `self._internal_id = generate_short_id()` on construction. +- `infrahub_sdk/store.py:47-60` - `NodeStoreBranch.set()` stores under + `self._objs[node._internal_id]` and points `self._uuids[node.id]` / + `self._hfids[...]` at that random id. +- `infrahub_sdk/client.py:1161-1168` - the query paths unconditionally call + `self.store.set(node=node)` for every node and related node, with no check for + whether the UUID is already present. + +So a second fetch of the same UUID: + +1. creates a new Python object with a new `_internal_id`, +2. leaves a duplicate entry in `_objs` (unbounded growth), +3. overwrites `_uuids[node.id]` to point at the newest, possibly poorer copy. + +`store.get(uuid)` then resolves to the poorer copy, whose relationship was never +initialized -> `AttributeError`. + +## 3. Design + +Replace "last write wins, keyed by a random id" with "merge into the existing +object, keyed by UUID, field by field." A field is only overwritten when the new +fetch actually carried it; fields the new fetch did not request are left intact. + +The decision of whether the new fetch carried a field requires a per-field +"was this present in the response" signal. The state of that signal across the +field types is **not uniform today**, and getting it uniform is the heart of the +fix: + +- **Cardinality-many relationships** (`RelationshipManager`, including the + hierarchical `children` / `ancestors` / `descendants`) already have a correct + signal: `initialized = (data is not None)` (`relationship.py:202`). This genuinely + means "was fetched." Usable as-is. +- **Cardinality-one relationships** (`RelatedNode`, including the hierarchical + `parent` and every normal one-relationship) have a *misleading* signal: + `initialized = bool(self.id) or bool(self.hfid)` (`related_node.py:189`) means + "has a peer," NOT "was fetched." A fetched-but-empty one-relationship (node moved + to root, optional relationship cleared) reports `initialized == False` and is + indistinguishable from "not fetched." Using it as the merge gate would keep a + stale parent after a move-to-root. **We must add a real presence flag to + `RelatedNode`** (set from key-presence in `_init_relationships`) and gate the + merge on that, not on `initialized`. +- **Attributes** have no signal at all. `_init_attributes` (`node.py:269`) builds an + `Attribute` for every schema attribute regardless of the query, and a field absent + from the response collapses to `_value = None` (`attribute.py:88-99`), + indistinguishable from "fetched and genuinely null." We add the flag. + +**Unifying principle:** every field type - attribute, cardinality-one, +cardinality-many, hierarchical - must carry a "present in this response" flag, and +the merge replaces present fields (even to empty / None) and keeps absent ones. +This is what makes moves, clears, and partial fetches all behave the way a user +reads them: *the store reflects the latest server state for whatever you actually +re-fetched, and leaves everything else untouched.* + +### Merge rule (per field) + +```text +for each field on the incoming node: + if field was present in this response (initialized): + if the stored field was locally mutated by the user (unsaved): + keep the local value # local edits win + else: + take the incoming value # refresh, even to None + else: + keep the stored value # not queried -> no fresher info +``` + +- Within a fetched cardinality-many relationship, the member list is *replaced*, + never unioned, so a peer removed on the server is correctly dropped. +- For attributes, the cleanest implementation is to swap the whole `Attribute` + object when it should be taken, so metadata (`is_default`, `is_from_profile`, + `id`) refreshes with the value, not just `_value`. Caveat: if the re-fetch + selected the value but not the property metadata, swapping wholesale can null + those flags - preserve them from the stored attribute when the incoming ones are + absent. + +### "Field" means all three relationship buckets + +A node holds relationships in **three** separate containers, and the merge must +iterate all of them or it reintroduces the bug on the ones it skips: + +- `_relationship_cardinality_one_data` (`RelatedNode`) +- `_relationship_cardinality_many_data` (`RelationshipManager`) +- `_hierarchical_data` (parent / children / ancestors / descendants) + +Note `self._relationships` (`node.py:117`) lists **only** `schema.relationships`, +so it does *not* include the hierarchical bucket. Do not drive the merge off +`self._relationships` alone. + +### Object identity + +Merge mutates the *existing* stored object and keeps its `_internal_id`. This +removes the duplicate `_objs` entries and keeps `store.get(id)` returning a stable +object across repeated fetches. Equality is unaffected: `InfrahubNode.__eq__` / +`__hash__` are id-based (`node.py:717-723`), so `node == client.store.get(node.id)` +remains true. + +### Decisions deliberately taken + +1. **Local edits win over a re-fetch.** Honour `value_has_been_mutated` + (`attribute.py:100`) and `_peer_has_been_mutated` so re-collecting the same + info never clobbers unsaved in-memory changes. Least surprising. +2. **Merge is the default.** Re-collecting the same information is therefore + idempotent and only ever refreshes or adds knowledge. +3. **`fetched-None` counts as fetched.** The presence flag means "present in the + response," not "value is non-null," so a genuine server-side clear still + overwrites a stale cached value. + +### Known limitation -> escape hatch + +Merge can never *forget* a field that was cached earlier but not re-fetched (for +example, to observe that a relationship no longer exists when your refresh query +did not select it). This is the one case where a full replace is wanted. Provide +an explicit opt-out rather than guessing. It lives in three layers, mirroring how +`populate_store` / `pagination_size` already work (config default -> per-call +override): + +1. **Implementation:** `NodeStoreBranch.set()` / `NodeStoreBase._set()` in + `store.py` - the only place that writes `_objs`. +2. **Per-call control:** a `merge` argument on `NodeStore.set()` / + `NodeStoreSync.set()` (`store.py:344`/`432`) and on the query methods `get` / + `all` / `filters` (and sync variants), placed next to the existing + `populate_store` argument. It threads into the `store.set(...)` calls at + `client.py:1164`/`1168`/`2903`/`2907`. +3. **Global default / opt-out:** a field on `Config` (`config.py`), passed into the + store at construction (`client.py:358`, currently + `NodeStore(default_branch=self.default_branch)`). Defaults to merge. + `InfrahubClient(config=Config(