A memory-corruption RCE in device emulation could otherwise pivot to arbitrary host syscalls (open, execve, socket, unlink). The new filter reduces blast radius to the syscalls every device worker, the vcpu loop, and pthread internals actually need.
src/seccomp.c builds an arch-aware BPF allowlist for x86_64 and aarch64. x86_64 explicitly rejects the x32 ABI: x32 shares AUDIT_ARCH_X86_64 but tags syscall numbers with __X32_SYSCALL_BIT (0x40000000), so a naive allowlist that copies an Internet example without this guard would let an attacker pivot to x32 syscall numbers that alias different kernel handlers. Default action is SECCOMP_RET_KILL_PROCESS so a worker thread that takes a denied syscall aborts the whole VMM rather than leaving the device in an unrecoverable state.
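The arch check and x32 guard described above could look roughly like the following classic-BPF prologue. This is a minimal sketch, not the contents of src/seccomp.c: the array name `x86_64_prologue` and the local `X32_SYSCALL_BIT` macro are made up for illustration, and the real allowlist comparisons would follow the last instruction.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Local stand-in with the same value as the kernel's __X32_SYSCALL_BIT. */
#define X32_SYSCALL_BIT 0x40000000u

/* Hypothetical prologue: validate the arch, then reject x32 syscall numbers. */
static const struct sock_filter x86_64_prologue[] = {
    /* A = seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* arch == AUDIT_ARCH_X86_64: skip the kill; otherwise fall into it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* A = seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* nr >= 0x40000000 means the x32 bit is set: fall into the kill;
     * otherwise skip it and continue with the allowlist that would follow */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
```

Comparing with `BPF_JGE` instead of masking the bit keeps the prologue to a single jump, and it also rejects any other number at or above 0x40000000.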
Install uses seccomp(2) directly with SECCOMP_FILTER_FLAG_TSYNC. The serial worker thread is already running by the time seccomp_apply() is called (spawned in vm_arch_init_platform_device during vm_init); plain prctl(PR_SET_SECCOMP) installs only on the calling thread, leaving an attacker a path through the pre-existing worker. TSYNC's return is three-way: 0 success, -1 errno error, positive TID for partial-sync failure -- a naive < 0 check would silently treat partial-sync as success and leave the process unfiltered, so any non-zero return is reported as failure and surfaces the offending TID.
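The three-way return handling can be sketched as below. The function and enum names (`classify_tsync_return`, `install_filter_all_threads`, `tsync_status`) are hypothetical and are not taken from the PR. The system calls and flags are the ones documented in seccomp(2) and prctl(2), and the code assumes the libc headers define SYS_seccomp.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

enum tsync_status { TSYNC_OK, TSYNC_ERRNO, TSYNC_PARTIAL };

/* Map the raw seccomp(2)+TSYNC return value onto its three outcomes. */
static enum tsync_status classify_tsync_return(long ret)
{
    if (ret == 0)
        return TSYNC_OK;      /* filter attached to every thread */
    if (ret < 0)
        return TSYNC_ERRNO;   /* ordinary failure, errno is set */
    return TSYNC_PARTIAL;     /* ret is the TID that could not synchronize */
}

/* Hypothetical installer: returns 0 on success, -1 on any failure. */
static int install_filter_all_threads(const struct sock_fprog *prog)
{
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        fprintf(stderr, "PR_SET_NO_NEW_PRIVS: %s\n", strerror(errno));
        return -1;
    }

    long ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                       SECCOMP_FILTER_FLAG_TSYNC, prog);

    switch (classify_tsync_return(ret)) {
    case TSYNC_OK:
        return 0;
    case TSYNC_ERRNO:
        fprintf(stderr, "seccomp: %s\n", strerror(errno));
        return -1;
    case TSYNC_PARTIAL:
        fprintf(stderr, "seccomp TSYNC: thread %ld could not synchronize\n", ret);
        return -1;
    }
    return -1;
}
```

Keeping the classification in its own function means the "any non-zero return is a failure" rule can be checked without actually installing a filter.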
The flag is opt-in via --seccomp so existing test and development workflows are unaffected. CI gains a second "boot test (seccomp)" step on host-x64 that reuses .ci/autorun.sh with KVM_HOST_FLAGS=--seccomp; reaching the "Linux version " banner exercises prctl(PR_SET_NO_NEW_PRIVS), seccomp(2)+TSYNC over the already-running serial worker, and the early KVM_RUN dispatch under the filter, so a regression that drops a steady-state syscall from the allowlist surfaces here as a SIGSYS before the banner.
Boot-tested on x86_64 and aarch64: --seccomp boots Linux to the busybox console prompt with virtio-blk mounting ext4 r/w (exercises pread/pwrite/fdatasync) and virtio-net probed; the lazy virtio-blk/virtio-net worker spawn paths inside vm_run rely on clone/clone3 + set_robust_list + rseq + sigaltstack being allowlisted.
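As an illustration of what allowlisting the thread-spawn syscalls might look like, here is a hedged sketch of the matching BPF rules. The `ALLOW_SYSCALL` macro and the `thread_spawn_rules` array are hypothetical names, the real filter in src/seccomp.c may be organized differently, and the fragment assumes an earlier prologue has already loaded seccomp_data.nr into the accumulator.

```c
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical macro: if A == nr, return ALLOW; otherwise skip to the next rule. */
#define ALLOW_SYSCALL(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Assumes A already holds seccomp_data.nr. */
static const struct sock_filter thread_spawn_rules[] = {
    ALLOW_SYSCALL(SYS_clone),
#ifdef SYS_clone3
    ALLOW_SYSCALL(SYS_clone3),          /* newer glibc prefers clone3 */
#endif
    ALLOW_SYSCALL(SYS_set_robust_list), /* set up for each new pthread */
#ifdef SYS_rseq
    ALLOW_SYSCALL(SYS_rseq),            /* restartable-sequences registration */
#endif
    ALLOW_SYSCALL(SYS_sigaltstack),
    /* Anything that matched no rule is denied. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
```

The `#ifdef` guards are only there so the sketch compiles against older kernel headers that lack clone3 or rseq.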
Summary by cubic
Adds an opt-in seccomp BPF allowlist behind --seccomp to limit the VMM to required syscalls and reduce escape risk from device-emulation bugs. The filter installs via seccomp(2)+TSYNC, supports x86_64/aarch64, and rejects x32.
New Features
- --seccomp installs a BPF allowlist with default KILL on deny; synchronized to all threads with TSYNC.
- seccomp.o; docs updated.
Migration
- build/kvm-host ... --seccomp
- make KVM_HOST_FLAGS=--seccomp check
Written for commit 7dcd9fa.