
Add opt-in seccomp BPF allowlist behind --seccomp #53

Merged
jserv merged 1 commit into master from seccomp on Apr 30, 2026

Conversation

Contributor

@jserv jserv commented Apr 30, 2026

A memory-corruption RCE in device emulation could otherwise pivot to arbitrary host syscalls (open, execve, socket, unlink). The new filter reduces blast radius to the syscalls every device worker, the vcpu loop, and pthread internals actually need.

src/seccomp.c builds an arch-aware BPF allowlist for x86_64 and aarch64. x86_64 explicitly rejects the x32 ABI: x32 shares AUDIT_ARCH_X86_64 but tags syscall numbers with __X32_SYSCALL_BIT (0x40000000), so a naive allowlist that copies an Internet example without this guard would let an attacker pivot to x32 syscall numbers that alias different kernel handlers. Default action is SECCOMP_RET_KILL_PROCESS so a worker thread that takes a denied syscall aborts the whole VMM rather than leaving the device in an unrecoverable state.
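The arch check and x32 guard described above can be sketched as a short classic-BPF program. This is an illustrative sketch, not the contents of src/seccomp.c: the array name `x86_64_sketch`, the single `ioctl` allowlist entry, the local constants `X32_SYSCALL_BIT` and `X86_64_NR_IOCTL`, and the helper `eval_filter` are all assumptions made for the example. The helper is a tiny userspace evaluator for the opcodes used here, so the jump offsets can be checked without installing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define X32_SYSCALL_BIT 0x40000000U /* value quoted above for __X32_SYSCALL_BIT */
#define X86_64_NR_IOCTL 16          /* ioctl on x86_64; stand-in allowlist entry */

/* Hypothetical x86_64 filter: arch check, x32 guard, one allowed syscall. */
const struct sock_filter x86_64_sketch[] = {
    /* [0] A = seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] arch == x86_64 ? skip the kill at [2] : fall through to it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] A = seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] nr >= x32 bit (unsigned) ? fall through to kill : skip to [6] */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    /* [5] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [6] nr == ioctl ? allow at [7] : default kill at [8] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, X86_64_NR_IOCTL, 0, 1),
    /* [7] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [8] default action */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
#define X86_64_SKETCH_LEN (sizeof(x86_64_sketch) / sizeof(x86_64_sketch[0]))

/* Userspace evaluator for the four opcodes used above (test aid only). */
uint32_t eval_filter(const struct sock_filter *f, size_t len,
                     const struct seccomp_data *d)
{
    uint32_t a = 0;

    assert(f != NULL && d != NULL);
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *i = &f[pc];

        switch (i->code) {
        case BPF_LD | BPF_W | BPF_ABS:  /* load 32-bit word at offset k */
            memcpy(&a, (const char *)d + i->k, sizeof(a));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K: /* relative jump on equality */
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case BPF_JMP | BPF_JGE | BPF_K: /* relative jump on unsigned >= */
            pc += (a >= i->k) ? i->jt : i->jf;
            break;
        case BPF_RET | BPF_K:           /* return the action */
            return i->k;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;    /* ran off the end: deny */
}
```

Evaluating the sketch with `nr = 16` under `AUDIT_ARCH_X86_64` yields `SECCOMP_RET_ALLOW`, while the same number with the x32 bit set, or under a different arch value, yields `SECCOMP_RET_KILL_PROCESS`.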

Install uses seccomp(2) directly with SECCOMP_FILTER_FLAG_TSYNC. The serial worker thread is already running by the time seccomp_apply() is called (spawned in vm_arch_init_platform_device during vm_init); plain prctl(PR_SET_SECCOMP) installs only on the calling thread, leaving an attacker a path through the pre-existing worker. TSYNC's return is three-way: 0 success, -1 errno error, positive TID for partial-sync failure -- a naive < 0 check would silently treat partial-sync as success and leave the process unfiltered, so any non-zero return is reported as failure and surfaces the offending TID.
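A minimal sketch of the three-way return handling described above, assuming a caller that already holds a `struct sock_fprog`. The names `classify_tsync`, `install_filter_tsync`, and the `tsync_result` enum are hypothetical and are not taken from src/seccomp.c; only `prctl(PR_SET_NO_NEW_PRIVS)`, `seccomp(2)` via `syscall(SYS_seccomp, ...)`, and `SECCOMP_FILTER_FLAG_TSYNC` are the real kernel interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical classification of the raw seccomp(2)+TSYNC return value. */
enum tsync_result {
    TSYNC_OK,      /* 0: filter attached to every thread */
    TSYNC_ERRNO,   /* -1: call failed, errno is set */
    TSYNC_PARTIAL, /* >0: TID of a thread that could not be synchronized */
};

enum tsync_result classify_tsync(long ret)
{
    if (ret == 0)
        return TSYNC_OK;
    if (ret < 0)
        return TSYNC_ERRNO;
    return TSYNC_PARTIAL; /* a bare "< 0" check would miss this case */
}

/* Install prog on all threads; returns 0 on success, -1 on any failure. */
int install_filter_tsync(const struct sock_fprog *prog)
{
    assert(prog != NULL);

    /* Required for an unprivileged process to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }

    long ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                       SECCOMP_FILTER_FLAG_TSYNC, prog);

    switch (classify_tsync(ret)) {
    case TSYNC_OK:
        return 0;
    case TSYNC_ERRNO:
        fprintf(stderr, "seccomp: %s\n", strerror(errno));
        return -1;
    case TSYNC_PARTIAL:
        fprintf(stderr, "seccomp: TSYNC could not sync thread %ld\n", ret);
        return -1;
    }
    return -1;
}
```

Keeping the classification in its own function makes the positive-TID branch testable without actually installing a filter in the test process.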

The flag is opt-in via --seccomp so existing test and development workflows are unaffected. CI gains a second "boot test (seccomp)" step on host-x64 that reuses .ci/autorun.sh with KVM_HOST_FLAGS=--seccomp; reaching the "Linux version " banner exercises prctl(PR_SET_NO_NEW_PRIVS), seccomp(2)+TSYNC over the already-running serial worker, and the early KVM_RUN dispatch under the filter, so a regression that drops a steady-state syscall from the allowlist surfaces here as a SIGSYS before the banner.

Boot-tested on x86_64 and aarch64: --seccomp boots Linux to the busybox console prompt with virtio-blk mounting ext4 r/w (exercises pread/pwrite/fdatasync) and virtio-net probed; the lazy virtio-blk/virtio-net worker spawn paths inside vm_run rely on clone/clone3 + set_robust_list + rseq + sigaltstack being allowlisted.
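One way to keep such an allowlist maintainable is to generate the BPF comparisons from a table of syscall numbers. The sketch below is an assumption about structure, not the layout used in src/seccomp.c: `spawn_path_syscalls` and `emit_allowlist` are hypothetical names, the table lists only the syscalls named above, and the arch check that must precede these instructions in a real filter is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical table: syscalls named above for worker spawn and block I/O. */
const int spawn_path_syscalls[] = {
    SYS_clone,
#ifdef SYS_clone3
    SYS_clone3,          /* absent from older kernel headers */
#endif
    SYS_set_robust_list,
#ifdef SYS_rseq
    SYS_rseq,            /* absent from older kernel headers */
#endif
    SYS_sigaltstack,
    SYS_pread64,
    SYS_pwrite64,
    SYS_fdatasync,
};

/*
 * Emit "load nr; (JEQ nr -> ALLOW) per entry; default KILL_PROCESS" into out.
 * Returns the number of instructions written, or 0 if cap is too small.
 */
size_t emit_allowlist(struct sock_filter *out, size_t cap,
                      const int *nrs, size_t n)
{
    size_t need = 2 * n + 2; /* load + two instructions per entry + default */
    size_t i = 0;

    assert(out != NULL && nrs != NULL);
    if (cap < need)
        return 0;

    /* A = seccomp_data.nr */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));
    for (size_t j = 0; j < n; j++) {
        /* Match: fall through to ALLOW. Mismatch: skip it. */
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                (unsigned int)nrs[j], 0, 1);
        out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                                SECCOMP_RET_ALLOW);
    }
    /* Nothing matched: kill the whole process. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_KILL_PROCESS);
    return i;
}
```

With a table like this, dropping an entry (for example `SYS_rseq`) changes exactly two emitted instructions, which is the kind of regression the seccomp boot test is meant to surface as a SIGSYS.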


Summary by cubic

Adds an opt-in seccomp BPF allowlist behind --seccomp to limit the VMM to required syscalls and reduce escape risk from device-emulation bugs. The filter installs via seccomp(2)+TSYNC, supports x86_64/aarch64, and rejects x32.

New Features

  • --seccomp installs a BPF allowlist with default KILL on deny; synchronized to all threads with TSYNC.
  • x86_64 guard blocks x32 ABI; aarch64 supported.
  • Allowlist covers vCPU loop, virtio workers, and pthread internals.
  • CI adds a “boot test (seccomp)” step; build wires in seccomp.o; docs updated.

Migration

  • Run with: build/kvm-host ... --seccomp
  • Or: make KVM_HOST_FLAGS=--seccomp check

Written for commit 7dcd9fa.


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 6 files

@jserv jserv merged commit 75962e1 into master Apr 30, 2026
11 checks passed
@jserv jserv deleted the seccomp branch April 30, 2026 02:37
