Skip to content

SONiC config_db.json ownership model: open decisions + a hidden PR collision #2340

@ideaship

Description

@ideaship

Overview

The SONiC config_db.json ownership policy was settled in
#2297. The gaps below are
places that policy does not yet cover, found across the open SONiC PRs — plus
one merge hazard that is invisible when the affected PRs are reviewed in
isolation.

Status

The baseline ownership model is already decided (#2297). Decision C below is
implemented by open PR
#2339, pending review. What
remains are two mechanism decisions (B, D), one urgent coordination call-out (A),
and a
small independent finding. This issue asks maintainers to ratify the preferred
option for each and steer the open SONiC PRs — several from external
contributors — toward a consistent model rather than merging divergent
approaches.

Affected

  • osism/tasks/conductor/sonic/config_generator.py (ownership registry,
    generate_sonic_config docstring, the ACL helpers)
  • osism/tasks/conductor/sonic/exporter.py (NetBox local-context ownership)

Related PRs

⚠️ Urgent: #2337 and #2338 silently break each other on merge

This is not visible when reviewing either PR alone — both pass their own tests
and appear correct when reviewed in isolation.

Both add a control-plane ACL helper, called from the same OOB block in
generate_sonic_config, and both replace the whole ACL table wholesale:

# #2338 _add_ctrlplane_acls
config["ACL_TABLE"] = { "SNMP_ONLY": ..., "GNMI_ONLY": ... }
config["ACL_RULE"]  = { "SNMP_ONLY|RULE_1": ..., "GNMI_ONLY|RULE_1": ... }

# #2337 _add_ssh_acl_configuration
config["ACL_TABLE"] = { "SSH_ONLY": ... }
config["ACL_RULE"]  = { "SSH_ONLY|RULE_1": ... }

Once both land, the second helper to run drops the first's entries. Whichever
loses, that service is left unrestricted — SSH or SNMP/gNMI open to every
source. Each PR's tests pass because neither calls the other helper. These two
PRs must not both merge as-is
; the fix is Decision A below.

Baseline already decided in #2297 (do not re-litigate)

The generate_sonic_config docstring states the policy: the generator builds on
the device's image-provided base config_db.json and classifies every table it
touches into four categories. It fully owns the tables classified as owned
those are dropped up front and regenerated from NetBox + hardcoded SONiC policy
every run, so pre-existing values do not survive. It does not own the whole
file: image-provided tables may be inherited (DEVICE_METADATA localhost
attributes, DATABASE/VERSIONS), consumed read-only (e.g. TELEMETRY — see
Decision C and the TELEMETRY finding), or passed through untouched. The
decisive rule holds across all of them: operator hand-edits to
config_db.json are unsupported everywhere.

Customizations must be modeled in NetBox or expressed as generator policy. A
direct consequence: the ACL feature deleting an unrelated DATAACL entry from
ACL_TABLE is the contract working as intended, not data loss — a
generator-owned table is dropped and rebuilt regardless of any unrelated content
it held. This is a strong
stance worth ratifying: a stock image ACL (e.g. a vendor SSH_ONLY) is likewise
deleted unless modeled in NetBox.

One part of the docstring is in tension with this, and Decision C's proposed
fix — implemented by open PR #2339 — closes it.
The pass-through tier says
tables that are neither owned nor inherited "pass through unchanged… not
policed… not a supported place for operator customizations either" — leaving a
door open for unmanaged content to survive, which sits awkwardly against the
model's intent that operator content persist nowhere unmodeled. Once #2339 lands,
Decision C's four-category taxonomy

  • static guard force every table the generator references to be explicitly
    classified (owned / inherited / image-consumed), so no referenced table can sit
    in unpoliced limbo. (Tables the generator never touches remain genuinely
    pass-through — the guard governs referenced tables, which is where the
    contradiction lives.)

Open decisions (options ranked, most-preferred first)

Decision A — How should multiple generator helpers share one owned table?

ACL_TABLE/ACL_RULE is generator-owned (correct), but owned by several
helpers at once
(SNMP, gNMI, SSH). This is not about protecting operator
content — it is generator helpers deleting each other's entries.

  1. Per-key (named-entry) ownership (preferred). Each helper purges and
    writes only its own keys (SNMP_ONLY, GNMI_ONLY, SSH_ONLY, and their
    *|RULE_*) and merges into the table rather than reassigning it. Helpers
    compose; coexistence is automatic. This is already the in-tree pattern for
    scaffolded tables (STATIC_ROUTE, ROUTE_REDISTRIBUTE, VXLAN_TUNNEL_MAP),
    so it is a consistency fix, not a new mechanism. No per-helper stale-key
    purging is needed: ACL_TABLE/ACL_RULE stay in OWNED_TABLE_KEYS, so the
    orchestrator already drops them once up front (the central owned-table reset)
    before any helper runs — on the no-OOB path no helper runs and the table is
    simply absent, so no stale SNMP_ONLY survives. The only change is that
    helpers merge their named entries (setdefault + assign keys) instead of
    reassigning the whole dict. Note this "per-key" coordination is within a
    wholly generator-owned table
    — it does not preserve image/operator ACLs (the
    central drop still removes them); contrast Decision D, where partitioning
    preserves an external co-owner's keys.
  2. Single control-plane-ACL builder. One function assembles all
    control-plane ACL entries and assigns the table once. Removes the collision
    but couples unrelated features (SSH/SNMP/gNMI) into one helper; less modular
    and awkward as more services are added.
  3. Status quo — wholesale per helper (rejected). Silently broken whenever
    more than one helper writes the table. This is the sonic: Restrict SSH to the OOB management network #2337/Restrict SONiC SNMP and gNMI access to the OOB management network #2338 hazard.

Decision B — What happens when a security control cannot be generated?

When a helper can't compute a restriction it currently emits nothing and the
service stays open, while generation reports success. Cases seen: IPv6 OOB
address (ACL code is IPv4-only), null/invalid gNMI port (L4_DST_PORT: "None"),
and arguably "no OOB IP at all."

  1. Fail loud (preferred). Raise / have the orchestrator reject the device so
    the operator sees that hardening could not be applied. Matches this area's
    established pattern (10b8a34, "Raise on SONiC persistence failures instead
    of reporting no change").
  2. Fail closed, silently. Emit a default-deny for the service so it is
    locked down rather than open. Safe-by-default, but can lock out management
    unexpectedly and is not always expressible (e.g. no valid port to bind a
    rule to).
  3. Status quo — silent fail-open (rejected). SNMP/gNMI/SSH left reachable
    from everywhere while the run reports success; the only trace is a log line.

Decision C — Keep ownership complete as tables are added — implemented by open PR #2339

Implemented by open PR
#2339, pending review.

Recorded here for context; the mechanism is realized in that PR rather than left
as an open question.

The problem it closed: the registry was a hand-maintained allowlist with no
enforcement, so a table the generator references but forgets to classify falls
into the unpoliced pass-through tier and silently accumulates stale config — the
bug #2297 fixed. Open PRs already miss it: #1626 adds BFD_PEER_SINGLE_HOP;
#2173 adds SAG, SAG_GLOBAL, EVPN_ETHERNET_SEGMENT, EVPN_MH_GLOBAL
none registered. The prior "exhaustive" guard only iterated the registered
list, so it could not catch the omission, and the model lived in one docstring
far from the helpers that must honour it.

What #2339 proposes:

  • A static guard test that parses the generator source, collects every
    top-level table reference on config (config["X"] subscripts, the
    defensive config.get/setdefault/pop("X") accessors, and literal-keyed merge
    forms such as update({...}) / |=), and fails when any referenced table is
    not classified owned / inherited / image-consumed — naming the table and the
    list to add it to. Static beats a runtime guard because BFD/EVPN tables only
    emit when NetBox carries their data, which a generate-and-inspect test misses.
  • A mutation backstop that rejects unsupported direct mutation forms, so the
    collector's coverage cannot silently erode. Dynamic keys (config[var]) and
    indirect / interprocedural mutations (e.g. a table added inside a called
    helper) remain documented limitations.
  • The four-category taxonomy the guard would enforce:
Category Behavior Examples
Owned dropped up front, rebuilt every regen BGP_GLOBALS, VLAN, ACL_TABLE, STATIC_ROUTE
Inherited preserves base content, updates selected fields in place DEVICE_METADATA, VERSIONS
Image-consumed (new) read from the image base, not managed, accessed defensively via .get() (TELEMETRY, once the ACL work lands)
Pass-through untouched, unmanaged, no generator dependency (everything else)

#2339 introduces the IMAGE_CONSUMED_TABLE_KEYS list empty: its first
consumer is the gNMI listen port read from TELEMETRY, which arrives with the
ACL work, not in #2339 (see the additional finding below).

Decision D — Make the second store an explicit, separate regime

SONiC config is also written to the NetBox device local_context_data, which
— unlike config_db.json — is genuinely shared with other generators
(frr_parameters, netplan_parameters). The exporter used to replace the whole
dict and clobber those siblings every sync (#2333); b8b5c35 fixed it to own
only the sonic_config key and merge. That is the correct discipline for a
co-owned store, and it differs from the drop-and-rebuild discipline used for
owned config_db.json tables.

  1. Document two regimes explicitly (preferred). "Table-level ownership"
    for config_db.json — the generator owns selected whole tables (dropped
    and rebuilt) while leaving inherited / consumed / pass-through tables in
    place; it does not own the file. "Partitioned key ownership" for shared
    stores like local_context_data — own only your key, merge, never drop
    siblings. The deciding factor is whether the store has an external
    co-owner
    : local_context_data is co-owned by other generators
    (frr_parameters, netplan_parameters), so the owned unit must be the key,
    not the dict; config_db.json tables have no external co-owner, so a
    generator-owned table can be dropped and rebuilt wholesale. Naming both
    prevents a future change from "fixing" the local-context merge back toward
    wholesale replacement.
  2. Leave it implicit (status quo). Risk: the contradiction is undocumented
    and invites regression.

Additional finding — implicit image dependency on TELEMETRY (small, independent)

_get_gnmi_port() reads the gNMI listen port from config.get("TELEMETRY", {})
— an image-provided table the generator depends on but never declares. A static
scan of the generator source confirms it is the only referenced-but-
unclassified top-level table (all other literal table references are owned or
inherited; TELEMETRY is reached solely via .get(), so a subscript-only scan
misses it).

This is the implicit coupling behind #2338's "null gNMI port → fail-open"
finding: the port is silently sourced from an unowned image table, falling back
to "8080". The IMAGE_CONSUMED_TABLE_KEYS category that would make this
explicit is added by open PR #2339, but empty — TELEMETRY only becomes a
referenced table once the ACL work (which adds _get_gnmi_port()) lands, and
must be added to that list in the same change so the Decision C guard stays
green.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Ready

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions