sandbox-runtime/vendor
David Dworken 7ee4ac602d Isolate seccomp workload in nested PID ns and block io_uring (#183)
* Isolate seccomp workload in nested PID namespace and block io_uring

apply-seccomp now creates a nested user+PID+mount namespace before applying
the seccomp filter. The user command runs as PID 2 under a non-dumpable PID 1
reaper, with /proc remounted so only the inner process tree is visible. This
prevents the sandboxed command from ptracing or patching the unfiltered bwrap
init, bash wrapper, or socat helpers via /proc/N/mem, regardless of the host's
kernel.yama.ptrace_scope setting. Namespace setup failure aborts rather than
silently degrading.

The BPF filter now also blocks io_uring_setup, io_uring_enter, and
io_uring_register. IORING_OP_SOCKET (Linux 5.19+) creates sockets without
going through socket(), and seccomp cannot inspect SQEs in the shared ring,
so denying ring creation entirely is the only safe option.

The filter generator now accepts an optional target-arch argument so a single
builder can emit both x64 and arm64 filters. Prebuilt binaries and filters are
regenerated for both architectures.
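
As a rough illustration of the io_uring denial and the target-arch
argument, the sketch below assembles a minimal classic-BPF seccomp program
in Python. It is not the project's generator: build_filter, run_filter, the
"x64"/"arm64" keys, killing on an architecture mismatch, and EPERM as the
deny result are all assumptions made for illustration, and the real filter
carries other rules not shown here. The syscall numbers (425 to 427, shared
by x86_64 and aarch64) and the AUDIT_ARCH values come from the Linux headers.

```python
import struct

# Audit architecture tokens from <linux/audit.h>.
AUDIT_ARCH = {"x64": 0xC000003E, "arm64": 0xC00000B7}

# io_uring syscall numbers; identical on x86_64 and aarch64.
IO_URING_SYSCALLS = {
    "io_uring_setup": 425,
    "io_uring_enter": 426,
    "io_uring_register": 427,
}

# Classic BPF opcodes and seccomp return values.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
EPERM = 1             # assumed deny errno, not taken from the project


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("<HBBI", code, jt, jf, k)


def build_filter(target_arch="x64"):
    """Emit a filter that denies the io_uring syscalls for one architecture."""
    arch = AUDIT_ARCH[target_arch]
    nrs = sorted(IO_URING_SYSCALLS.values())
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, 4),      # load seccomp_data.arch
        insn(BPF_JEQ_K, 1, 0, arch),      # expected arch: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, 0),      # load seccomp_data.nr
    ]
    for i, nr in enumerate(nrs):
        # On a match, jump past the remaining compares and the ALLOW
        # instruction so execution lands on the final deny return.
        remaining = len(nrs) - i - 1
        prog.append(insn(BPF_JEQ_K, remaining + 1, 0, nr))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM))
    return b"".join(prog)


def run_filter(prog, arch, nr):
    """Interpret the three opcodes used above against a fake seccomp_data."""
    data = {0: nr, 4: arch}
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = struct.unpack_from("<HBBI", prog, pc * 8)
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
```

Calling build_filter once per key in AUDIT_ARCH is how a single builder
could produce both blobs; run_filter exists only to check the jump offsets
without loading anything into the kernel.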

* Pass CAP_SYS_ADMIN to apply-seccomp and clear ambient caps before exec

apply-seccomp needs CAP_SYS_ADMIN to unshare PID+mount namespaces. The
original approach obtained it via unshare(CLONE_NEWUSER), but on hosts
where an LSM restricts unprivileged user namespaces (Ubuntu 24.04 with
AppArmor defaults), the nested userns is created without capabilities
and the setgroups write fails.

bwrap now passes --cap-add CAP_SYS_ADMIN (scoped to its user namespace)
so apply-seccomp can unshare directly. The nested-userns path remains as
a fallback for standalone invocation.

apply-seccomp clears the ambient capability set after remounting /proc,
so the sandboxed command's execve drops to zero capabilities and cannot
umount /proc to reveal the outer mount underneath. Two new tests cover
CapEff=0 and umount denial.
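
The ambient-capability step and the CapEff check can be sketched as
follows. This is an illustrative Python rendering under stated assumptions,
not the apply-seccomp code or its tests: clear_ambient_caps and
parse_cap_eff are hypothetical names. The prctl constants (PR_CAP_AMBIENT =
47, PR_CAP_AMBIENT_CLEAR_ALL = 4) are from <linux/prctl.h>, and the CapEff
line format is that of /proc/[pid]/status.

```python
import ctypes
import os

PR_CAP_AMBIENT = 47
PR_CAP_AMBIENT_CLEAR_ALL = 4


def clear_ambient_caps():
    """Drop every ambient capability of the calling thread (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    # prctl is variadic; pass unsigned longs so the unused arguments are
    # really zero, which the kernel requires for PR_CAP_AMBIENT.
    zero = ctypes.c_ulong(0)
    rc = libc.prctl(
        ctypes.c_int(PR_CAP_AMBIENT),
        ctypes.c_ulong(PR_CAP_AMBIENT_CLEAR_ALL),
        zero, zero, zero,
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def parse_cap_eff(status_text):
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")
```

A check in the spirit of the CapEff=0 test would read /proc/self/status
inside the sandboxed command and assert that parse_cap_eff returns 0.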

* chore: bump version to 0.0.44

* Add --unshare-user so --cap-add works with setuid bwrap

Setuid bwrap rejects --cap-add from non-root because it would grant
real host capabilities. --unshare-user forces user-namespace mode so
the capability is scoped to that namespace and the flag is accepted.

* Disable AppArmor userns restriction in CI instead of using setuid bwrap

Setuid bwrap rejects --cap-add from non-root, so that path is a dead end.
Instead, disable kernel.apparmor_restrict_unprivileged_userns in CI so
apply-seccomp's nested-userns path works without any bwrap cooperation.
This matches what production Ubuntu 24.04 users need to do anyway, and
that requirement is now documented in the README.

* Exit inner init as soon as the worker exits

reap_until was waiting for all children including orphaned background
processes reparented to PID 1, which hung the sandbox when the user
command backgrounded something long-running and then exited. Return
immediately when the worker terminates; PID 1 exiting tears down the
namespace and SIGKILLs any stragglers.
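
The fixed reaping behavior can be sketched in Python as below. This is a
simplified stand-in, not the project's reap_until: the signature and the
demo helper are assumptions, and the straggler here is a direct child
rather than an orphan reparented to a real PID 1. The final SIGKILL in demo
plays the role of the namespace teardown.

```python
import os
import signal
import time


def reap_until(worker_pid):
    """Reap children until worker_pid exits, then return its exit code."""
    while True:
        pid, status = os.wait()
        if pid == worker_pid:
            # Return right away; do not wait for other children.
            return os.waitstatus_to_exitcode(status)
        # Any other pid is a background process; it has been reaped,
        # so keep waiting for the worker.


def demo():
    """Show that a long-running straggler does not delay the return."""
    straggler = os.fork()
    if straggler == 0:
        time.sleep(30)          # long-running background process
        os._exit(0)
    worker = os.fork()
    if worker == 0:
        os._exit(7)             # worker finishes immediately
    start = time.monotonic()
    code = reap_until(worker)
    elapsed = time.monotonic() - start
    os.kill(straggler, signal.SIGKILL)   # stands in for PID ns teardown
    os.waitpid(straggler, 0)
    return code, elapsed
```

A loop that instead ran until os.wait() raised ChildProcessError would
block for the straggler's full lifetime, which is the hang described above.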
2026-03-30 17:07:40 -07:00