Add unit tests for SONiC sync and exporter modules and fix issues found in review#2333
Conversation
07abc30 to
abf328d
Compare
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Both
test_sync.pyandtest_exporter.pydefine identical helpers like_has_log; consider moving these small utilities into a shared test helper/fixture module undertests/unit/tasks/conductor/sonic/to avoid duplication and keep future changes in one place. - Several tests assert on exact log message substrings and precise call sequences (e.g. full cache lifecycle order); if the behavior doesn’t require strict ordering or exact wording, consider loosening these expectations slightly to reduce brittleness against harmless refactors while still checking the important semantics.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Both `test_sync.py` and `test_exporter.py` define identical helpers like `_has_log`; consider moving these small utilities into a shared test helper/fixture module under `tests/unit/tasks/conductor/sonic/` to avoid duplication and keep future changes in one place.
- Several tests assert on exact log message substrings and precise call sequences (e.g. full cache lifecycle order); if the behavior doesn’t require strict ordering or exact wording, consider loosening these expectations slightly to reduce brittleness against harmless refactors while still checking the important semantics.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
ideaship
left a comment
There was a problem hiding this comment.
Nothing major, just some minor issues.
| assert result == (False, None) | ||
|
|
||
|
|
||
| def test_save_config_change_writes_journal_and_saves(mock_nb): |
There was a problem hiding this comment.
Finding 1 — local_context_data wholesale-replace clobbers sibling keys (data loss)
These changed-context save tests only build local_context_data={"sonic_config": ...}, which hides that save_config_to_netbox replaces the device's entire dict (exporter.py:88/:94) and diffs the full existing context against {"sonic_config": ...} (exporter.py:42). Please add a case with a sibling key, e.g. local_context_data={"frr_parameters": {...}, "sonic_config": old}, and assert the sibling survives.
It won't today — verified by running the real exporter against a leaf-role device:
BEFORE: ['frr_parameters', 'sonic_config']
AFTER : ['sonic_config']
The exporter deletes frr_parameters, reports it as a removal in its diff log, and writes a journal entry attributing the deletion to the SONiC update. frr_parameters is not hypothetical — the inventory-reconciler reads it (and netplan_parameters) out of local_context_data via config_context.
Tests-only PR, so the source fix is a separate change: mirror the #2297 partitioned-ownership model — own and replace only the sonic_config key (device.local_context_data = {**(device.local_context_data or {}), "sonic_config": config}) and diff only that key. This test should pin sibling-key survival once that lands.
| assert _has_log(loguru_logs, "ERROR", "Failed to save diff to journal") | ||
|
|
||
|
|
||
| def test_save_config_save_failure_returns_false(mock_nb, loguru_logs): |
There was a problem hiding this comment.
Finding 6 — failed writes reported to the task layer as rc=0
assert result == (False, None) pins an ambiguity: save_config_to_netbox returns the same (False, None) for a failed save (exporter.py:103) as for no change (:48), and export_config_to_file does the same (:226 vs :222). In sync.py (not in this diff), :219 then treats a failed write as "no changes", logs accordingly, and still finishes rc=0 (:244) — failed syncs report success. Source fix: failures should raise or return a distinct sentinel so task status reflects them; this test should assert the distinction once that lands.
| assert _has_log(loguru_logs, "ERROR", "Failed to save local context") | ||
|
|
||
|
|
||
| def test_save_config_save_failure_bool_form(mock_nb): |
There was a problem hiding this comment.
Finding 7 — journal entry created before device.save()
Both save-failure tests pass local_context_data=None, hitting only the first-time branch. Please add a changed-path save failure (existing sonic_config + raising device.save()). In exporter.py (not in this diff) the journal entry is created at :71 before save() at :89, so a failed save returns (False, None) while an orphaned journal entry already claims the update succeeded (and the in-memory device stays mutated). Source fix: move journal creation after a successful save.
| # --- diff handling ---------------------------------------------------------- | ||
|
|
||
|
|
||
| def test_export_no_change_returns_false(tmp_path, export_settings, patch_hostname): |
There was a problem hiding this comment.
Finding 4 — symlink repair gated on config_changed
Symlink reconciliation should be tested independently of a content change. In exporter.py (not in this diff) the whole symlink block is gated by if config_changed: (:177), so an unchanged serial-mode config with a missing/stale hostname link is never repaired — and after a symlink failure, a later unchanged run never retries. Please add a case: existing identical config + missing hostname link, and assert the link gets created. It currently won't.
| # --- symlink handling ------------------------------------------------------- | ||
|
|
||
|
|
||
| def test_export_creates_symlink_when_hostname_absent( |
There was a problem hiding this comment.
Finding 2 — hostname == serial → self-referential symlink (low priority)
Low-priority defensive case. Every symlink test uses distinct hostname (sw-1) and serial (ABC123); add one where they're equal. In exporter.py (not in this diff), when serial == hostname the just-written config file is the hostname path, so :207 os.removes it and :209 symlinks it to itself — config destroyed, yet the function returns True. One-line source fix: skip the symlink when filepath == hostname_filepath.
Unlikely in practice — needs the serial set equal to the device name/inventory_hostname, which doesn't happen with real manufacturer serials or on the seeded testbed (serials unset → safe fallback). But the guard is trivial against silent data loss that reports success.
| # --- error handling --------------------------------------------------------- | ||
|
|
||
|
|
||
| def test_export_makedirs_failure_returns_false( |
There was a problem hiding this comment.
Finding 3 — non-atomic write truncates the previous export (cheap hygiene)
Cheap-hygiene note, not a blocker. Error coverage only injects a makedirs failure; a write/serialize-failure case (patch json.dump to raise, assert the previous export is still intact) would be a nice addition. In exporter.py (not in this diff) the file is opened "w" at :179, truncating before serialization, so a mid-write crash (ENOSPC / killed process) leaves the prior config truncated and returns False. Source fix: temp file + os.replace().
Low priority here — the export self-heals on the next sync and the in-repo reader rejects bad JSON — but it would rank higher if we expected attackers or external consumers of the export dir.
| assert result == {"sw-1": {"PORT": {"Ethernet0": {}}}} | ||
|
|
||
|
|
||
| def test_single_device_disallowed_role_returns_empty( |
There was a problem hiding this comment.
Finding 5 — cleanup / finish_task_output skipped on early returns and mid-loop exceptions (would block)
These early-return tests check the return value and log message but not that caches were cleared or finish_task_output was called. In sync.py (not in this diff), the early returns (:77/:80/:83) and any exception in the per-device loop — e.g. the unguarded len(sonic_config['PORT']) at :225 on a PORT-less config — return after caches load (:42–52) but before cleanup (:237–239) and finish_task_output (:244). Caches are module-level, so the leak corrupts the next run. Please assert cleanup + finalization on these paths and add a mid-loop raising-collaborator case; the source fix is a try/finally.
Cover save_config_to_netbox (NetBox local-context persistence with diff checking and journal logging) and export_config_to_file (on-disk export, filename selection, and the serial-number to hostname symlink). The file-export tests drive a real filesystem under tmp_path where that reads more clearly than asserting on mocked os calls, and patch os.* only to inject the symlink and makedirs failure paths. Part of #2199. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
Cover the sync_sonic entry point: cache lifecycle ordering, single-device and NetBox-filtered device resolution, role/HWSKU filtering, spine and superspine AS-mapping propagation, diff streaming to task output, and cache-stats logging. Every heavy collaborator (generate_sonic_config, find_interconnected_devices, NetBox access) is stubbed at its sync import site so only the orchestration glue is exercised; those modules are covered separately under #2199. Part of #2199. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
save_config_to_netbox previously replaced the device's entire
local_context_data dict and diffed the full existing context against
{"sonic_config": ...}. Sibling keys such as frr_parameters or
netplan_parameters (consumed by the inventory reconciler via
config_context) were silently deleted on every SONiC config update,
with the removal attributed to SONiC in the journal diff.
Mirror the partitioned-ownership model from #2297: merge the new
sonic_config key into the existing context instead of replacing it,
and diff only the owned key so unrelated sibling changes neither
trigger nor pollute SONiC updates.
AI-assisted: Claude Code
Signed-off-by: Christian Berendt <berendt@osism.tech>
The journal entry documenting a configuration update was created before device.save(). When the save subsequently failed, the function reported no change while an orphaned journal entry already claimed the update had happened. Move journal creation after the successful save. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
export_config_to_file opened the target with "w", truncating the previous export before serialization. A mid-write failure (ENOSPC, killed process) left the prior config truncated on disk. Write to a temporary file and os.replace() it into place so the previous export survives a failed write. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
The whole symlink block was gated on config_changed, so an unchanged config with a missing or stale hostname link was never repaired — and after a symlink failure, later unchanged runs never retried. Move the reconciliation out of the gate and make it idempotent (a link already pointing at the config file is left untouched). Also guard against hostname == serial: the just-written config file is the hostname path in that case, and the old code removed it and symlinked it to itself, destroying the config while reporting success. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
The single-device early returns and any exception in the per-device loop returned after the module-level caches were loaded but before the cleanup and finish_task_output calls, leaking cache state into the next run and leaving the task unfinished. Wrap the sync body in try/finally so cleanup and task finalization run on every exit path. Early returns and per-device failures now finish the task with rc=1 instead of silently reporting success; a failing device no longer aborts the sync of the remaining devices. The per-device summary log no longer raises on configs without a PORT section. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
save_config_to_netbox returned the same (False, None) for a failed save as for "no changes", and export_config_to_file returned the same False for a failed write as for an unchanged file. sync_sonic then treated failed writes as "no changes" and finished the task with rc=0, so failed syncs reported success. Log and re-raise in both exporter functions; the per-device handler in sync_sonic catches the failure, continues with the remaining devices, and finishes the task with rc=1. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
89ba47d to
cbf7708
Compare
Closes #2224. Part of the Tier 3 SONiC unit-test effort (#2199), follow-up
to #2192 / #2193.
Adds two test modules covering the SONiC sync pipeline. Both mirror the
existing
tests/unit/tasks/conductor/sonic/conventions: the sharedmock_nb/loguru_logsfixtures,SimpleNamespacedevice stubs, and aper-module fixture that stubs collaborators at their import site (the
patch_orchestrator_helperspattern).The review surfaced seven real defects in
exporter.py/sync.pythat theinitial tests would otherwise have pinned as intended behavior. Since the
requested regression tests only pass against fixed sources, the fixes are
included here as separate commits (see below), so the PR is no longer
tests-only.
Change set (commit order)
Add unit tests for SONiC exporter module —
test_exporter.pysave_config_to_netbox: first-time create, no-change (DeepDiffempty,no save), changed config (unified diff + journal entry + save), journal
failure still saves. Both the tuple (
return_diff=True) and bool returnforms are asserted.
export_config_to_file: filename selection (hostname vs. serial-number,empty/missing serial falls back to hostname with a warning), diff
handling (missing / identical / changed / unreadable file), symlink
handling (absent target, existing regular file, dangling symlink,
os.symlinkfailure, hostname mode / falsy serial skip), and the outeros.makedirsfailure path.tmp_pathfilesystem where that reads moreclearly (the symlink end-state proves remove-then-symlink ordering), and
patch
os.*only to inject the symlink / makedirs failures.Add unit tests for SONiC sync orchestrator —
test_sync.pyattach_mock, andthe metalbox / VIP loads called once per invocation.
role=None,not-found, and lookup-raises each return
{}; single spine triggers thefull spine fetch (
nb.dcim.devices.filter), single leaf does not.config_versionreadfrom custom fields.
generate_sonic_config, and amin_as=Nonegroup is not propagated.no-change still exports,
show_diff=Falsecalls save withoutreturn_diff.task_id; cache-stats logged only when non-empty.generate_sonic_configandfind_interconnected_devicesare stubbed (theyhave their own issues under Meta: Unit test coverage for osism/ #2199), per the issue's instructions.
Review feedback fixes (each with regression tests)
Own only the sonic_config key in NetBox local context (review finding 1)
—
save_config_to_netboxno longer replaces the device's entirelocal_context_datadict; sibling keys such asfrr_parameters/netplan_parameterssurvive, and diffing covers only the ownedsonic_configkey (mirrors the sonic: enforce full config_db.json ownership #2297 partitioned-ownership model).Create SONiC journal entry only after successful save (finding 7)
— a failed
device.save()can no longer leave an orphaned journal entryclaiming the update succeeded.
Write SONiC config exports atomically (finding 3)
— temp file +
os.replace()so a mid-write failure (ENOSPC, killedprocess) cannot truncate the previous export.
Reconcile hostname symlink independently of config changes
(findings 2 + 4) — a missing/stale link is repaired even when the config
content is unchanged, the reconciliation is idempotent, and a
hostname == serialdevice no longer destroys its own config via aself-referential symlink.
Always clean up caches and finish task output in sync_sonic (finding 5)
—
try/finallyaround the sync body; early returns and per-devicefailures no longer leak the module-level caches into the next run, finish
the task with
rc=1, and a failing device no longer aborts the remainingdevices. The summary log no longer raises on
PORT-less configs.Raise on SONiC persistence failures instead of reporting no change
(finding 6) — failed NetBox saves and file exports raise instead of
returning the same value as "no changes";
sync_sonicreports them viarc=1instead of finishing failed syncs withrc=0.Local verification
black --check— passflake8— passpytest— not run locally; the unit-test suite runs in thepython-osism-unit-testsZuul job. The coverage gate(
--cov=...sync/...exporter>= 90 %) is verified there.