feat(sandbox): seccomp-notify DNS-pinned allowlist for Platform mode #17
Closed
Ladas wants to merge 3 commits into
Ladas added a commit that referenced this pull request Jun 12, 2026

Add kernel-level network syscall interception using SECCOMP_RET_USER_NOTIF for Platform mode. Provides mandatory, syscall-level enforcement without any capabilities.

DnsPinnedAllowlist: resolve domains to IPs at sandbox creation, freeze for session lifetime (DNS rebinding prevention).

BPF filter intercepts: connect, sendto, sendmsg, recvfrom, recvmsg, bind. Validates AUDIT_ARCH to prevent x32/compat ABI bypass.

Linux syscall wrappers:
- notification fd ioctls
- pidfd_open/pidfd_getfd for on-behalf-of operations (TOCTOU-safe)
- read_process_memory with read_exact (no short reads)
- sockaddr parser (correct endianness for sa_family, port, flowinfo)
- verify_socket_fd (mitigates fd-swap race)
- deny/allow_connect response helpers

Code review fixes applied across all PRs:
- PR #15: gateway propagates network_enforcement to DriverSandboxSpec
- PR #15: driver uses typed enum comparison (not magic integer)
- PR #16: saturating_sub prevents underflow in Landlock skipped count
- PR #16: warn!() on TCP port restriction failure (was debug)
- PR #17: BPF arch check, recvfrom/recvmsg/bind interception, verify_socket_fd, read_exact, allow_connect rename, flowinfo endianness, safety comments on all unsafe blocks

8 tests. Compiles, 949 tests pass, clippy clean.

Ref: NVIDIA#899
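For readers unfamiliar with how a seccomp BPF program routes selected syscalls to a user-space notifier, here is a minimal sketch. It is not the code from this PR: `build_notify_filter` and `simulate` are hypothetical names, the syscall numbers are the x86_64 ones, and killing the process on an unexpected architecture is an assumed policy. Note that x32 calls report AUDIT_ARCH_X86_64 and are distinguished only by bit 30 of the syscall number, so the sketch checks that bit in addition to the architecture. `simulate` is a tiny user-space interpreter for the four opcodes used, included only so the layout can be checked without installing a filter.

```rust
/// Mirror of the kernel's `struct sock_filter` (one classic BPF instruction).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

// Classic BPF opcodes (values from linux/bpf_common.h).
const LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const JGE_K: u16 = 0x35; // BPF_JMP | BPF_JGE | BPF_K
const RET_K: u16 = 0x06; // BPF_RET | BPF_K

// seccomp return actions (values from linux/seccomp.h).
pub const RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const RET_USER_NOTIF: u32 = 0x7fc0_0000;
pub const RET_ALLOW: u32 = 0x7fff_0000;

pub const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const X32_SYSCALL_BIT: u32 = 0x4000_0000;

/// x86_64 numbers for connect, sendto, sendmsg, recvfrom, recvmsg, bind.
pub const NETWORK_SYSCALLS: [u32; 6] = [42, 44, 46, 45, 47, 49];

fn insn(code: u16, jt: u8, jf: u8, k: u32) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// Build a filter that notifies the supervisor for `syscalls` and allows the rest.
/// Assumed policy: kill on a foreign architecture or an x32 syscall number.
pub fn build_notify_filter(syscalls: &[u32]) -> Vec<SockFilter> {
    let n = syscalls.len();
    assert!(n < 255, "BPF jump offsets are 8-bit");
    let mut prog = vec![
        insn(LD_W_ABS, 0, 0, 4), // load seccomp_data.arch (offset 4)
        insn(JEQ_K, 1, 0, AUDIT_ARCH_X86_64), // expected arch: skip the kill
        insn(RET_K, 0, 0, RET_KILL_PROCESS),
        insn(LD_W_ABS, 0, 0, 0), // load seccomp_data.nr (offset 0)
        insn(JGE_K, 0, 1, X32_SYSCALL_BIT), // below the x32 bit: skip the kill
        insn(RET_K, 0, 0, RET_KILL_PROCESS),
    ];
    for (i, &nr) in syscalls.iter().enumerate() {
        // On a match, jump over the remaining comparisons and the ALLOW return.
        prog.push(insn(JEQ_K, (n - i) as u8, 0, nr));
    }
    prog.push(insn(RET_K, 0, 0, RET_ALLOW));
    prog.push(insn(RET_K, 0, 0, RET_USER_NOTIF));
    prog
}

/// User-space interpreter for the opcodes above, for checking the program layout.
pub fn simulate(prog: &[SockFilter], arch: u32, nr: u32) -> u32 {
    let (mut acc, mut pc) = (0u32, 0usize);
    loop {
        let i = prog[pc];
        pc += 1;
        match i.code {
            LD_W_ABS => acc = if i.k == 4 { arch } else { nr },
            JEQ_K => pc += usize::from(if acc == i.k { i.jt } else { i.jf }),
            JGE_K => pc += usize::from(if acc >= i.k { i.jt } else { i.jf }),
            RET_K => return i.k,
            other => panic!("unsupported opcode {:#x}", other),
        }
    }
}
```

In a real supervisor the resulting instruction array would be handed to the seccomp() syscall together with SECCOMP_FILTER_FLAG_NEW_LISTENER to obtain the notification fd; that step is omitted here because it needs unsafe FFI.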
This was referenced Jun 15, 2026
Author
/ok
Author
/ok to test
Add NetworkMode::Platform for running the supervisor without elevated capabilities on Kubernetes platforms enforcing the restricted Pod Security Standard (including OpenShift restricted-v2 SCC).

Platform Mode keeps Landlock filesystem isolation, seccomp syscall filtering, OPA policy evaluation, credential injection, and L7 inspection via a loopback CONNECT proxy. It replaces the network namespace (which requires CAP_SYS_ADMIN + CAP_NET_ADMIN) with:
- Loopback proxy binding (127.0.0.1 instead of veth interface)
- K8s driver: zero capabilities, drop ALL, non-root UID
- seccomp: block SOCK_DGRAM (UDP) on AF_INET/AF_INET6 to match the nftables UDP reject in namespace mode -- the proxy resolves DNS on behalf of the agent, so UDP is not needed
- Landlock scope: restrict abstract Unix sockets and signals (ABI v5+, BestEffort degrades on older kernels)

Security parity with namespace mode:

| Attack                 | Namespace mode         | Platform mode            |
|------------------------|------------------------|--------------------------|
| TCP bypass proxy       | nftables REJECT        | Landlock port 3128 only  |
| UDP exfiltration       | nftables REJECT        | seccomp SOCK_DGRAM block |
| DNS tunneling          | no UDP accept rule     | no SOCK_DGRAM            |
| Abstract Unix sockets  | netns isolation        | Landlock scope           |
| Signals to supervisor  | N/A (same netns)       | Landlock scope           |
| Container escape       | Risk (CAP_SYS_ADMIN)   | Impossible (zero caps)   |

Remaining gap: Landlock NetPort allows port 3128 on any IP (not just loopback). Mitigate with egress NetworkPolicy denying all sandbox pod egress -- loopback traffic is unaffected by NetworkPolicy.

Proto: add NetworkEnforcementMode enum and field to SandboxPolicy and DriverSandboxSpec. Default NAMESPACE (0) preserves existing behavior; PLATFORM (1) activates the new mode.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Add Landlock ABI v4 TCP connect restriction for Platform mode.

When the kernel supports ABI v4, only the proxy port (default 3128) is allowed for outbound TCP connections. On older kernels, BestEffort compat level silently degrades -- the rule has no effect but the proxy still works cooperatively.

Both handle_access(ConnectTcp) and add_rule(NetPort) use the ? operator since BestEffort guarantees they succeed on all kernel versions.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Add kernel-level connect() interception using SECCOMP_RET_USER_NOTIF.

The supervisor intercepts network syscalls, reads the destination sockaddr from the child's memory via /proc/pid/mem, evaluates it against a DNS-pinned allowlist, and either performs the operation on behalf of the child via pidfd_getfd() or denies it with EPERM.

Components:
- DnsPinnedAllowlist: resolve domains to IPs at sandbox creation, freeze for session lifetime to prevent DNS rebinding
- BPF filter with AUDIT_ARCH validation for connect/sendto/sendmsg/recvfrom/recvmsg/bind syscalls
- pidfd_open + pidfd_getfd for TOCTOU-safe on-behalf-of operations
- parse_sockaddr with correct endianness for IPv4/IPv6
- read_process_memory with read_exact for short-read safety

Known limitation: DnsPinnedAllowlist cannot handle wildcard domains (*.example.com) because getaddrinfo does not support wildcards. Callers must skip wildcard endpoints and rely on the proxy OPA glob.match() for wildcard domain enforcement.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
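To illustrate what "correct endianness" means when decoding a sockaddr copied out of the child's memory, here is a hedged sketch. The function name matches the one in the commit message, but the signature and body are assumptions, not the PR's code. It follows the Linux layout: sa_family is in host byte order, while the port and sin6_flowinfo are in network byte order. The AF_INET/AF_INET6 values are the Linux ones, hard-coded to keep the example free of external crates.

```rust
use std::net::{Ipv4Addr, Ipv6Addr, SocketAddr, SocketAddrV4, SocketAddrV6};

const AF_INET: u16 = 2; // Linux value
const AF_INET6: u16 = 10; // Linux value

/// Decode a raw `sockaddr_in` / `sockaddr_in6` byte image.
/// Returns `None` for truncated buffers and for address families we do not handle.
pub fn parse_sockaddr(buf: &[u8]) -> Option<SocketAddr> {
    let head = buf.get(..2)?;
    // sa_family is stored in host (native) byte order.
    let family = u16::from_ne_bytes([head[0], head[1]]);
    match family {
        AF_INET => {
            // family(2) + port(2) + addr(4); trailing padding is not needed.
            let raw = buf.get(..8)?;
            let port = u16::from_be_bytes([raw[2], raw[3]]); // network order
            let ip = Ipv4Addr::new(raw[4], raw[5], raw[6], raw[7]);
            Some(SocketAddr::V4(SocketAddrV4::new(ip, port)))
        }
        AF_INET6 => {
            // family(2) + port(2) + flowinfo(4) + addr(16) + scope_id(4) = 28 bytes.
            let raw = buf.get(..28)?;
            let port = u16::from_be_bytes([raw[2], raw[3]]); // network order
            // Decoded from network order, as the commit message describes.
            let flowinfo = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
            let mut octets = [0u8; 16];
            octets.copy_from_slice(&raw[8..24]);
            // scope_id is an interface index in host byte order.
            let scope_id = u32::from_ne_bytes([raw[24], raw[25], raw[26], raw[27]]);
            let ip = Ipv6Addr::from(octets);
            Some(SocketAddr::V6(SocketAddrV6::new(ip, port, flowinfo, scope_id)))
        }
        _ => None,
    }
}
```

Returning `None` rather than panicking on short input matters here because the bytes come from an untrusted process; what a caller does with `None` (for example, denying the syscall) is left open in this sketch.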
Author
Closing for now. seccomp-notify is optional defense-in-depth -- Platform Mode (PR #15) with UDP seccomp block and Landlock scope achieves security parity with namespace mode without it. Can revisit as a follow-up for standalone deployments without claw-proxy.
Summary
Foundation for kernel-level connect() interception using seccomp-notify.
Adds DnsPinnedAllowlist module: resolves allowed domains to IPs at sandbox creation, freezes them for the session (prevents DNS rebinding).
The notification event loop and on-behalf-of operations (pidfd_getfd)
will be wired once OPA policy integration is complete.
Depends on: #16 (Landlock TCP port restriction) → #15 (Platform mode base)
2 files, +135 lines. 820 tests pass, clippy clean.
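The pin-at-creation idea from this summary can be sketched as follows. This is an illustration under stated assumptions, not the module from the PR: the type name is taken from the description, but the `resolve` and `is_allowed` methods, the use of std's `ToSocketAddrs` for resolution, and skipping wildcard entries inside `resolve` are all hypothetical choices.

```rust
use std::collections::HashSet;
use std::io;
use std::net::{IpAddr, SocketAddr, ToSocketAddrs};

/// Set of IPs resolved once at sandbox creation and never refreshed.
#[derive(Debug)]
pub struct DnsPinnedAllowlist {
    pinned: HashSet<IpAddr>,
}

impl DnsPinnedAllowlist {
    /// Resolve every domain exactly once and freeze the result.
    /// Assumption: wildcard entries are skipped here, since a resolver
    /// cannot expand them; enforcement for those is left to the proxy.
    pub fn resolve(domains: &[&str]) -> io::Result<Self> {
        let mut pinned = HashSet::new();
        for &domain in domains {
            if domain.contains('*') {
                continue;
            }
            // Port 0 is a placeholder; only the resolved IPs are kept.
            for addr in (domain, 0u16).to_socket_addrs()? {
                pinned.insert(addr.ip());
            }
        }
        Ok(Self { pinned })
    }

    /// Check a connect() target against the frozen set. No DNS lookup
    /// happens here, so a later change to the DNS record has no effect.
    pub fn is_allowed(&self, target: &SocketAddr) -> bool {
        self.pinned.contains(&target.ip())
    }
}
```

Because there is no method that mutates `pinned` after construction, a record that is rebound to a different IP mid-session simply fails the `is_allowed` check. This sketch matches on IP only; whether the real module also pins ports is not stated in the description.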
What this PR adds
- DnsPinnedAllowlist: resolve domains, pin IPs, check connect targets
What's NOT in this PR (follow-up)
- ... (SECCOMP_FILTER_FLAG_NEW_LISTENER)
- ... pidfd_getfd()
Ref: NVIDIA#899
Assisted-By: Claude Code