You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The SONiC config_db.json ownership policy was settled in #2297. The gaps below are
places that policy does not yet cover, found across the open SONiC PRs — plus
one merge hazard that is invisible when the affected PRs are reviewed in
isolation.
Status
The baseline ownership model is already decided (#2297). Decision C below is
implemented by open PR #2339, pending review. What
remains are two mechanism decisions (B, D), one urgent coordination call-out (A),
and a
small independent finding. This issue asks maintainers to ratify the preferred
option for each and steer the open SONiC PRs — several from external
contributors — toward a consistent model rather than merging divergent
approaches.
Affected
osism/tasks/conductor/sonic/config_generator.py (ownership registry, generate_sonic_config docstring, the ACL helpers)
Once both land, the second helper to run drops the first's entries. Whichever
loses, that service is left unrestricted — SSH or SNMP/gNMI open to every
source. Each PR's tests pass because neither calls the other helper. These two
PRs must not both merge as-is; the fix is Decision A below.
Baseline already decided in #2297 (do not re-litigate)
The generate_sonic_config docstring states the policy: the generator builds on
the device's image-provided base config_db.json and classifies every table it
touches into four categories. It fully owns the tables classified as owned —
those are dropped up front and regenerated from NetBox + hardcoded SONiC policy
every run, so pre-existing values do not survive. It does not own the whole
file: image-provided tables may be inherited (DEVICE_METADATA localhost
attributes, DATABASE/VERSIONS), consumed read-only (e.g. TELEMETRY — see
Decision C and the TELEMETRY finding), or passed through untouched. The
decisive rule holds across all of them: operator hand-edits to config_db.json are unsupported everywhere.
Customizations must be modeled in NetBox or expressed as generator policy. A
direct consequence: the ACL feature deleting an unrelated DATAACL entry from ACL_TABLE is the contract working as intended, not data loss — a
generator-owned table is dropped and rebuilt regardless of any unrelated content
it held. This is a strong
stance worth ratifying: a stock image ACL (e.g. a vendor SSH_ONLY) is likewise
deleted unless modeled in NetBox.
One part of the docstring is in tension with this, and Decision C's proposed
fix — implemented by open PR #2339 — closes it. The pass-through tier says
tables that are neither owned nor inherited "pass through unchanged… not
policed… not a supported place for operator customizations either" — leaving a
door open for unmanaged content to survive, which sits awkwardly against the
model's intent that operator content persist nowhere unmodeled. Once #2339 lands,
Decision C's four-category taxonomy
static guard force every table the generator references to be explicitly
classified (owned / inherited / image-consumed), so no referenced table can sit
in unpoliced limbo. (Tables the generator never touches remain genuinely
pass-through — the guard governs referenced tables, which is where the
contradiction lives.)
Open decisions (options ranked, most-preferred first)
Decision A — How should multiple generator helpers share one owned table?
ACL_TABLE/ACL_RULE is generator-owned (correct), but owned by several
helpers at once (SNMP, gNMI, SSH). This is not about protecting operator
content — it is generator helpers deleting each other's entries.
Per-key (named-entry) ownership (preferred). Each helper purges and
writes only its own keys (SNMP_ONLY, GNMI_ONLY, SSH_ONLY, and their *|RULE_*) and merges into the table rather than reassigning it. Helpers
compose; coexistence is automatic. This is already the in-tree pattern for
scaffolded tables (STATIC_ROUTE, ROUTE_REDISTRIBUTE, VXLAN_TUNNEL_MAP),
so it is a consistency fix, not a new mechanism. No per-helper stale-key
purging is needed: ACL_TABLE/ACL_RULE stay in OWNED_TABLE_KEYS, so the
orchestrator already drops them once up front (the central owned-table reset)
before any helper runs — on the no-OOB path no helper runs and the table is
simply absent, so no stale SNMP_ONLY survives. The only change is that
helpers merge their named entries (setdefault + assign keys) instead of reassigning the whole dict. Note this "per-key" coordination is within a
wholly generator-owned table — it does not preserve image/operator ACLs (the
central drop still removes them); contrast Decision D, where partitioning
preserves an external co-owner's keys.
Single control-plane-ACL builder. One function assembles all
control-plane ACL entries and assigns the table once. Removes the collision
but couples unrelated features (SSH/SNMP/gNMI) into one helper; less modular
and awkward as more services are added.
Decision B — What happens when a security control cannot be generated?
When a helper can't compute a restriction it currently emits nothing and the
service stays open, while generation reports success. Cases seen: IPv6 OOB
address (ACL code is IPv4-only), null/invalid gNMI port (L4_DST_PORT: "None"),
and arguably "no OOB IP at all."
Fail loud (preferred). Raise / have the orchestrator reject the device so
the operator sees that hardening could not be applied. Matches this area's
established pattern (10b8a34, "Raise on SONiC persistence failures instead
of reporting no change").
Fail closed, silently. Emit a default-deny for the service so it is
locked down rather than open. Safe-by-default, but can lock out management
unexpectedly and is not always expressible (e.g. no valid port to bind a
rule to).
Status quo — silent fail-open (rejected). SNMP/gNMI/SSH left reachable
from everywhere while the run reports success; the only trace is a log line.
Decision C — Keep ownership complete as tables are added — implemented by open PR #2339
Implemented by open PR #2339, pending review.
Recorded here for context; the mechanism is realized in that PR rather than left
as an open question.
The problem it closed: the registry was a hand-maintained allowlist with no
enforcement, so a table the generator references but forgets to classify falls
into the unpoliced pass-through tier and silently accumulates stale config — the
bug #2297 fixed. Open PRs already miss it: #1626 adds BFD_PEER_SINGLE_HOP; #2173 adds SAG, SAG_GLOBAL, EVPN_ETHERNET_SEGMENT, EVPN_MH_GLOBAL —
none registered. The prior "exhaustive" guard only iterated the registered
list, so it could not catch the omission, and the model lived in one docstring
far from the helpers that must honour it.
A static guard test that parses the generator source, collects every
top-level table reference on config (config["X"] subscripts, the
defensive config.get/setdefault/pop("X") accessors, and literal-keyed merge
forms such as update({...}) / |=), and fails when any referenced table is
not classified owned / inherited / image-consumed — naming the table and the
list to add it to. Static beats a runtime guard because BFD/EVPN tables only
emit when NetBox carries their data, which a generate-and-inspect test misses.
A mutation backstop that rejects unsupported direct mutation forms, so the
collector's coverage cannot silently erode. Dynamic keys (config[var]) and
indirect / interprocedural mutations (e.g. a table added inside a called
helper) remain documented limitations.
The four-category taxonomy the guard would enforce:
Category
Behavior
Examples
Owned
dropped up front, rebuilt every regen
BGP_GLOBALS, VLAN, ACL_TABLE, STATIC_ROUTE
Inherited
preserves base content, updates selected fields in place
DEVICE_METADATA, VERSIONS
Image-consumed (new)
read from the image base, not managed, accessed defensively via .get()
(TELEMETRY, once the ACL work lands)
Pass-through
untouched, unmanaged, no generator dependency
(everything else)
#2339 introduces the IMAGE_CONSUMED_TABLE_KEYS list empty: its first
consumer is the gNMI listen port read from TELEMETRY, which arrives with the
ACL work, not in #2339 (see the additional finding below).
Decision D — Make the second store an explicit, separate regime
SONiC config is also written to the NetBox device local_context_data, which
— unlike config_db.json — is genuinely shared with other generators
(frr_parameters, netplan_parameters). The exporter used to replace the whole
dict and clobber those siblings every sync (#2333); b8b5c35 fixed it to own
only the sonic_config key and merge. That is the correct discipline for a
co-owned store, and it differs from the drop-and-rebuild discipline used for
owned config_db.json tables.
Document two regimes explicitly (preferred)."Table-level ownership"
for config_db.json — the generator owns selected whole tables (dropped
and rebuilt) while leaving inherited / consumed / pass-through tables in
place; it does not own the file. "Partitioned key ownership" for shared
stores like local_context_data — own only your key, merge, never drop
siblings. The deciding factor is whether the store has an external
co-owner: local_context_data is co-owned by other generators
(frr_parameters, netplan_parameters), so the owned unit must be the key,
not the dict; config_db.json tables have no external co-owner, so a
generator-owned table can be dropped and rebuilt wholesale. Naming both
prevents a future change from "fixing" the local-context merge back toward
wholesale replacement.
Leave it implicit (status quo). Risk: the contradiction is undocumented
and invites regression.
Additional finding — implicit image dependency on TELEMETRY (small, independent)
_get_gnmi_port() reads the gNMI listen port from config.get("TELEMETRY", {})
— an image-provided table the generator depends on but never declares. A static
scan of the generator source confirms it is the only referenced-but-
unclassified top-level table (all other literal table references are owned or
inherited; TELEMETRY is reached solely via .get(), so a subscript-only scan
misses it).
This is the implicit coupling behind #2338's "null gNMI port → fail-open"
finding: the port is silently sourced from an unowned image table, falling back
to "8080". The IMAGE_CONSUMED_TABLE_KEYS category that would make this
explicit is added by open PR #2339, but empty — TELEMETRY only becomes a referenced table once the ACL work (which adds _get_gnmi_port()) lands, and
must be added to that list in the same change so the Decision C guard stays
green.
Overview
The SONiC
config_db.jsonownership policy was settled in#2297. The gaps below are
places that policy does not yet cover, found across the open SONiC PRs — plus
one merge hazard that is invisible when the affected PRs are reviewed in
isolation.
Status
The baseline ownership model is already decided (#2297). Decision C below is
implemented by open PR
#2339, pending review. What
remains are two mechanism decisions (B, D), one urgent coordination call-out (A),
and a
small independent finding. This issue asks maintainers to ratify the preferred
option for each and steer the open SONiC PRs — several from external
contributors — toward a consistent model rather than merging divergent
approaches.
Affected
osism/tasks/conductor/sonic/config_generator.py(ownership registry,generate_sonic_configdocstring, the ACL helpers)osism/tasks/conductor/sonic/exporter.py(NetBox local-context ownership)Related PRs
953be8a); baselinefor everything below.
taxonomy).
per-key pattern.
Pydantic-validator enforcement layer.
currently outside the generated schema set; separate from that issue's
schema-coverage work.
This is not visible when reviewing either PR alone — both pass their own tests
and appear correct when reviewed in isolation.
Both add a control-plane ACL helper, called from the same OOB block in
generate_sonic_config, and both replace the whole ACL table wholesale:Once both land, the second helper to run drops the first's entries. Whichever
loses, that service is left unrestricted — SSH or SNMP/gNMI open to every
source. Each PR's tests pass because neither calls the other helper. These two
PRs must not both merge as-is; the fix is Decision A below.
Baseline already decided in #2297 (do not re-litigate)
The
generate_sonic_configdocstring states the policy: the generator builds onthe device's image-provided base
config_db.jsonand classifies every table ittouches into four categories. It fully owns the tables classified as owned —
those are dropped up front and regenerated from NetBox + hardcoded SONiC policy
every run, so pre-existing values do not survive. It does not own the whole
file: image-provided tables may be inherited (
DEVICE_METADATAlocalhostattributes,
DATABASE/VERSIONS), consumed read-only (e.g.TELEMETRY— seeDecision C and the TELEMETRY finding), or passed through untouched. The
decisive rule holds across all of them: operator hand-edits to
config_db.jsonare unsupported everywhere.Customizations must be modeled in NetBox or expressed as generator policy. A
direct consequence: the ACL feature deleting an unrelated
DATAACLentry fromACL_TABLEis the contract working as intended, not data loss — agenerator-owned table is dropped and rebuilt regardless of any unrelated content
it held. This is a strong
stance worth ratifying: a stock image ACL (e.g. a vendor
SSH_ONLY) is likewisedeleted unless modeled in NetBox.
One part of the docstring is in tension with this, and Decision C's proposed
fix — implemented by open PR #2339 — closes it. The pass-through tier says
tables that are neither owned nor inherited "pass through unchanged… not
policed… not a supported place for operator customizations either" — leaving a
door open for unmanaged content to survive, which sits awkwardly against the
model's intent that operator content persist nowhere unmodeled. Once #2339 lands,
Decision C's four-category taxonomy
classified (owned / inherited / image-consumed), so no referenced table can sit
in unpoliced limbo. (Tables the generator never touches remain genuinely
pass-through — the guard governs referenced tables, which is where the
contradiction lives.)
Open decisions (options ranked, most-preferred first)
Decision A — How should multiple generator helpers share one owned table?
ACL_TABLE/ACL_RULEis generator-owned (correct), but owned by severalhelpers at once (SNMP, gNMI, SSH). This is not about protecting operator
content — it is generator helpers deleting each other's entries.
writes only its own keys (
SNMP_ONLY,GNMI_ONLY,SSH_ONLY, and their*|RULE_*) and merges into the table rather than reassigning it. Helperscompose; coexistence is automatic. This is already the in-tree pattern for
scaffolded tables (
STATIC_ROUTE,ROUTE_REDISTRIBUTE,VXLAN_TUNNEL_MAP),so it is a consistency fix, not a new mechanism. No per-helper stale-key
purging is needed:
ACL_TABLE/ACL_RULEstay inOWNED_TABLE_KEYS, so theorchestrator already drops them once up front (the central owned-table reset)
before any helper runs — on the no-OOB path no helper runs and the table is
simply absent, so no stale
SNMP_ONLYsurvives. The only change is thathelpers merge their named entries (
setdefault+ assign keys) instead ofreassigning the whole dict. Note this "per-key" coordination is within a
wholly generator-owned table — it does not preserve image/operator ACLs (the
central drop still removes them); contrast Decision D, where partitioning
preserves an external co-owner's keys.
control-plane ACL entries and assigns the table once. Removes the collision
but couples unrelated features (SSH/SNMP/gNMI) into one helper; less modular
and awkward as more services are added.
more than one helper writes the table. This is the sonic: Restrict SSH to the OOB management network #2337/Restrict SONiC SNMP and gNMI access to the OOB management network #2338 hazard.
Decision B — What happens when a security control cannot be generated?
When a helper can't compute a restriction it currently emits nothing and the
service stays open, while generation reports success. Cases seen: IPv6 OOB
address (ACL code is IPv4-only), null/invalid gNMI port (
L4_DST_PORT: "None"),and arguably "no OOB IP at all."
the operator sees that hardening could not be applied. Matches this area's
established pattern (
10b8a34, "Raise on SONiC persistence failures insteadof reporting no change").
locked down rather than open. Safe-by-default, but can lock out management
unexpectedly and is not always expressible (e.g. no valid port to bind a
rule to).
from everywhere while the run reports success; the only trace is a log line.
Decision C — Keep ownership complete as tables are added — implemented by open PR #2339
Implemented by open PR
#2339, pending review.
Recorded here for context; the mechanism is realized in that PR rather than left
as an open question.
The problem it closed: the registry was a hand-maintained allowlist with no
enforcement, so a table the generator references but forgets to classify falls
into the unpoliced pass-through tier and silently accumulates stale config — the
bug #2297 fixed. Open PRs already miss it: #1626 adds
BFD_PEER_SINGLE_HOP;#2173 adds
SAG,SAG_GLOBAL,EVPN_ETHERNET_SEGMENT,EVPN_MH_GLOBAL—none registered. The prior "exhaustive" guard only iterated the registered
list, so it could not catch the omission, and the model lived in one docstring
far from the helpers that must honour it.
What #2339 proposes:
top-level table reference on
config(config["X"]subscripts, thedefensive
config.get/setdefault/pop("X")accessors, and literal-keyed mergeforms such as
update({...})/|=), and fails when any referenced table isnot classified owned / inherited / image-consumed — naming the table and the
list to add it to. Static beats a runtime guard because BFD/EVPN tables only
emit when NetBox carries their data, which a generate-and-inspect test misses.
collector's coverage cannot silently erode. Dynamic keys (
config[var]) andindirect / interprocedural mutations (e.g. a table added inside a called
helper) remain documented limitations.
BGP_GLOBALS,VLAN,ACL_TABLE,STATIC_ROUTEDEVICE_METADATA,VERSIONS.get()TELEMETRY, once the ACL work lands)#2339 introduces the
IMAGE_CONSUMED_TABLE_KEYSlist empty: its firstconsumer is the gNMI listen port read from
TELEMETRY, which arrives with theACL work, not in #2339 (see the additional finding below).
Decision D — Make the second store an explicit, separate regime
SONiC config is also written to the NetBox device
local_context_data, which— unlike
config_db.json— is genuinely shared with other generators(
frr_parameters,netplan_parameters). The exporter used to replace the wholedict and clobber those siblings every sync (#2333);
b8b5c35fixed it to ownonly the
sonic_configkey and merge. That is the correct discipline for aco-owned store, and it differs from the drop-and-rebuild discipline used for
owned
config_db.jsontables.for
config_db.json— the generator owns selected whole tables (droppedand rebuilt) while leaving inherited / consumed / pass-through tables in
place; it does not own the file. "Partitioned key ownership" for shared
stores like
local_context_data— own only your key, merge, never dropsiblings. The deciding factor is whether the store has an external
co-owner:
local_context_datais co-owned by other generators(
frr_parameters,netplan_parameters), so the owned unit must be the key,not the dict;
config_db.jsontables have no external co-owner, so agenerator-owned table can be dropped and rebuilt wholesale. Naming both
prevents a future change from "fixing" the local-context merge back toward
wholesale replacement.
and invites regression.
Additional finding — implicit image dependency on
TELEMETRY(small, independent)_get_gnmi_port()reads the gNMI listen port fromconfig.get("TELEMETRY", {})— an image-provided table the generator depends on but never declares. A static
scan of the generator source confirms it is the only referenced-but-
unclassified top-level table (all other literal table references are owned or
inherited;
TELEMETRYis reached solely via.get(), so a subscript-only scanmisses it).
This is the implicit coupling behind #2338's "null gNMI port → fail-open"
finding: the port is silently sourced from an unowned image table, falling back
to
"8080". TheIMAGE_CONSUMED_TABLE_KEYScategory that would make thisexplicit is added by open PR #2339, but empty —
TELEMETRYonly becomes areferenced table once the ACL work (which adds
_get_gnmi_port()) lands, andmust be added to that list in the same change so the Decision C guard stays
green.