refactor(openshell-sandbox): Split sandbox into process and network subcrates.#1650
refactor(openshell-sandbox): Split sandbox into process and network subcrates.#1650rrhubenov wants to merge 66 commits into
sandbox into process and network subcrates.#1650Conversation
Lifts TLS state generation, network namespace setup, proxy startup, bypass monitor spawn, and SSH-side proxy URL / netns FD computation out of run_sandbox into a sibling async fn `run_networking` that returns a Networking struct. The identity cache moves with it (only consumed by the proxy). Entrypoint PID allocation moves just above the call site so it can be passed in. No behavior changes — same OCSF emits, same async order, same RAII lifetimes for the proxy and bypass-monitor handles, now held by the returned Networking value in run_sandbox's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH server, supervisor session, process spawn, OPA probe, policy poll loop, denial aggregator, wait/exit) into a sibling async fn `run_process`. Also moves network namespace creation out of `run_networking` into a new `create_netns_for_proxy` helper invoked from `run_sandbox`, so `run_networking` is purely the proxy component (OPA evaluation, TLS interception, credential injection, inference routing, gRPC control API). The netns is then borrowed into both `run_networking` and `run_process`. No behavior change. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ell-supervisor-process crates Add empty placeholder crates that subsequent commits will populate as the sandbox decomposition proceeds. Both crates compile clean as part of the workspace and are picked up automatically by the existing `members = ["crates/*"]` glob. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side) and the bypass monitor (process-side), and crosses the run_networking -> run_process API boundary. Move it to openshell-core so the eventual supervisor-networking and supervisor-process crates can both reference it without depending on each other. DenialAggregator and the channel/flush helpers stay in openshell-sandbox for now. A thin `pub use openshell_core::DenialEvent;` re-export from denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent` call site resolving without further edits. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the lexical path-normalization helper from openshell-policy to openshell-core::paths so it can be reached from crates that sit below openshell-policy in the dependency graph. openshell-policy keeps its existing public API via a `pub use` re-export, so all current call sites (e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged. This is a prerequisite for lifting openshell-sandbox/src/policy.rs into openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls normalize_path, and lifting it as-is would cycle through openshell-policy. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy, ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode, LandlockCompatibility, plus their Proto* TryFrom/From impls) to openshell-core/src/policy.rs. Both prospective supervisor leaves (networking and process) dispatch on SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it without depending on the other (or on the future orchestrator). The From<ProtoFilesystemPolicy> impl now calls the in-crate openshell_core::paths::normalize_path lifted in the previous commit, which is what made this move cycle-free. Update all crate::policy::* call sites in openshell-sandbox to openshell_core::policy::*. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed by process.rs and ssh.rs. With the orchestrator staying in openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf crates, so process-only modules can land in openshell-supervisor-process directly. Add openshell-supervisor-process as a path dependency of openshell-sandbox. Update process.rs and ssh.rs to import from openshell_supervisor_process::child_env. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the static skills installer (and its embedded resource directory) out of openshell-sandbox into openshell-supervisor-process. The module is process-side only — invoked once during sandbox start to drop agent skill files into the workspace — and has no cross-leaf consumers. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ll-sandbox Move the mechanistic mapper (HTTP method/path → operation classifier that derives policy proposals from connection summaries) out of openshell-sandbox into openshell-supervisor-networking. Single internal caller (run_policy_poll_loop in lib.rs) and only depends on openshell-core + tracing — no cross-leaf entanglement. First population of the openshell-supervisor-networking crate; adds openshell-core and tracing as dependencies. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner resolution, file SHA256 hashing) from openshell-sandbox into openshell-core. The module is consumed cross-leaf — by bypass_monitor on the process side and by identity / proxy on the networking side — so it has to sit below both leaves. Adds tracing, sha2, and hex as dependencies on openshell-core. Updates the three call sites in openshell-sandbox to import from openshell_core::procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move BinaryIdentityCache (path → SHA256 cache used to identify the process behind an outbound connection) from openshell-sandbox into openshell-supervisor-networking. The cache is consumed only by the networking-side proxy and the orchestrator; with procfs already in openshell-core there are no remaining cross-leaf dependencies. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-networking. Adds a Default impl for BinaryIdentityCache to satisfy clippy::new_without_default now that the type is publicly exposed across crates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…l-sandbox Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the test-only ProposalsFlagGuard out of openshell-sandbox into openshell-supervisor-process::proposals. The flag is read only by the process-side policy_local route handler and the orchestrator; lifting it to openshell-core would have made core carry sandbox-owned runtime state without buying anything. The test-only ProposalsFlagGuard is still consumed from networking-side l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands). Expose it via a new optional `test-helpers` feature on openshell-supervisor-process so test crates opt in explicitly without pulling tokio sync primitives into production builds. openshell-sandbox keeps its existing crate-private path (`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports so call sites and tests are unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox.
Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary. Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core. The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver(). Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it. Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup. Sandbox keeps a pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx; re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged. When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away. Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Both prospective leaves (supervisor-networking and supervisor-process) need CachedOpenShellClient, AuthedChannel, and the connect/fetch helpers. Under Shape A the leaves cannot depend on openshell-sandbox, so the type has to live below them. openshell-core already pulls in tonic and miette; this enables tonic's channel/tls features and adds tokio as a direct dep. Updates all crate::grpc_client::* call sites in openshell-sandbox to openshell_core::grpc_client::*. No re-export shim — the call-site count was small enough to update directly. See architecture/plans/sandbox-split-design-choices.md for the full rationale and trade-offs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…l-sandbox DenialAggregator and FlushableDenialSummary belong with the proxy and L7 layer that emit denials. Moves the file into openshell-supervisor-networking; adds tokio as a regular dep there since DenialAggregator uses tokio::sync::mpsc. Drops the pub use openshell_core::DenialEvent re-export inside the moved file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs, and lib.rs to import openshell_core::DenialEvent directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
LogPushLayer is a process-side tracing layer that streams sandbox logs to the gateway via gRPC. Moves into openshell-supervisor-process; adds openshell-core, openshell-ocsf, tokio-stream, tracing, and tracing-subscriber as direct deps there. Updates the only external call site (openshell-sandbox/src/main.rs) to import from openshell_supervisor_process::log_push. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
bypass_monitor reads /dev/kmsg for nftables drop log lines and emits denial events. Pure process-side concern, called only from run_networking which spawns it on the netns. Moves into openshell-supervisor-process; all deps (openshell-core, openshell-ocsf, tokio, tracing) were already declared there. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF context singleton directly instead of going through openshell-sandbox's re-export. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
debug_rpc is the CLI subcommand handler that exercises authenticated gRPC calls (issue-token, refresh-token, get-config, etc.). Pure process-side concern, called only from openshell-sandbox/main.rs. Adds base64, hex, serde_json, sha2, and tonic (with channel/tls features) as direct deps on openshell-supervisor-process. Updates the single call site in main.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…sandbox supervisor_session opens a bidirectional gRPC stream that lets the gateway initiate shells inside the sandbox. Pure process-side concern, called only from run_process. Adds uuid as a direct dep on openshell-supervisor-process. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…shell-sandbox The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children (entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can distinguish them from incidental zombies. Pure process-side concern, moves to openshell_supervisor_process::managed_children with three public fns: register, unregister, is_managed. Updates lib.rs reaper, process.rs, and ssh.rs to call through the new module path. Drops the now-unused HashSet import from lib.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…andbox Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox, prepare/enforce, log_sandbox_readiness, top-level apply, and apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/ to crates/openshell-supervisor-process/src/sandbox/. Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace and its final home is decided when run_networking and run_process are extracted. Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs, process.rs, and ssh.rs to import sandbox from openshell_supervisor_process while keeping the netns import unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process to openshell-core so both eventual leaf crates can read it without depending on each other. The flag is process-wide singleton state initialised once during sandbox startup and read by both the policy.local route (networking-side) and the skills installer (process-side) — same shape as openshell_ocsf::ctx. Move the test-helpers Cargo feature alongside it: openshell-core gains the feature, openshell-supervisor-process loses it, and openshell-sandbox's dev-dependency now enables openshell-core/test-helpers. Update the sandbox re-export shim to point at openshell_core::proposals. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move NetworkNamespace and the nft_ruleset bypass-rule generator from
crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/.
Both eventual leaf crates (supervisor-networking and supervisor-process) read
from NetworkNamespace, so it must live somewhere both can depend on without
violating the Shape A no-leaf-to-leaf rule.
Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx()
calls, matching the pattern used in already-migrated process modules. Update
super::nft_ruleset references inside netns to nft_ruleset since the module
is now a sibling sub-module of netns/mod.rs.
Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and
gate pub mod netns on target_os = "linux" since the implementation uses
netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs,
linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and
process.rs to point at openshell_core::netns.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ll-sandbox Lift the entrypoint process spawn module and the embedded SSH server module into openshell-supervisor-process. openshell-sandbox now re-exports ProcessHandle/ProcessStatus and calls openshell_supervisor_process::ssh::run_ssh_server directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…om openshell-sandbox Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local advisor API into openshell-supervisor-networking. Move accompanying data files (sandbox-policy.rego), test fixtures (testdata/), and integration tests (system_inference, websocket_upgrade). Sandbox lib.rs now references these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr is exposed as pub for the orchestrator call site. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…orchestrator Move the symlink-resolver, policy poll loop, and denial-aggregator flush spawns out of run_process and into run_sandbox so run_process no longer needs OpaEngine, retained_proto, the local policy context, the sandbox name, the gateway endpoint for telemetry, the OCSF flag, or the denial receiver. These long-running orchestrator-owned tasks now live alongside the other sandbox-startup wiring, matching the design log decision in architecture/plans/sandbox-split-design-choices.md (Q5). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Lift the workload supervision entry point (zombie reaper, SSH server spawn, supervisor session, entrypoint child spawn, exit-with-timeout) into its own module in openshell-supervisor-process. The orchestrator in openshell-sandbox now calls openshell_supervisor_process::run::run_process directly. With this move run_process names only types from openshell-core, openshell-ocsf, openshell-supervisor-process itself, std, and tokio — no openshell-supervisor-networking dependency. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…netns run_networking only ever read host_ip from the netns it was passed (the SSH plumbing reads moved to run_process in earlier commits). Replace the NetworkNamespace parameter with a plain Option<IpAddr> the orchestrator extracts. supervisor-network's run module no longer references the netns type for any consumer, only for create_netns_for_proxy (which still lives in this crate; relocates next).
Relocates the NetworkNamespace handle, nft ruleset builder, and create_netns_for_proxy constructor into openshell-supervisor-process. The orchestrator (openshell-sandbox) phantom-owns the RAII handle for the duration of run_sandbox; supervisor-network no longer references the type at all. Drops uuid, libc, nix, openshell-ocsf, and tempfile from core's Linux target deps (all were exclusive to netns). tempfile becomes a Linux runtime dep on supervisor-process for nft ruleset application.
cargo-machete flagged 26 direct dependencies that were carried over from the pre-split monolith and are no longer used by the orchestrator itself: regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, anyhow, base64, bytes, flate2, glob, hex, hmac, nix, rand_core, rustls-pemfile, serde, serde_yml, sha1, sha2, thiserror, tokio-stream, uuid, webpki-roots. These now live (transitively) in openshell-supervisor-network and openshell-supervisor-process where they are actually consumed.
- Drop unused `url` from openshell-supervisor-network. - Mark `prost` and `prost-types` as cargo-machete-ignored in openshell-core: they have no source-level `use`, but the tonic- generated proto code references them via `::prost::Message` etc. - openshell-supervisor-process is already clean.
The OPA symlink-resolution task reads entrypoint_pid once at the top of the spawned closure. Because the spawn happens before run_process publishes the workload PID, the load returns 0, the probe path bakes in as /proc/0/root/, and the loop exhausts its retries against a path that does not exist on Linux. The reload never fires, so policies that whitelist symlinked binaries (e.g. /usr/bin/python3 → python3.11) get silent denials when the workload exec's the realpath. Split the wait into two phases: 5s polling entrypoint_pid for a non-zero value, then the existing 5s window probing /proc/<pid>/root/. Distinct warn messages on each timeout so future debugging can tell "PID never published" apart from "container fs never appeared".
|
@drew Thank you for the review, I've applied changes based on your comments. Notable new things are:
Moving the bypass monitor to the process leaf crate caused the need for both leaf crates I think having the aggregator in the orchestrator is actually a pretty natural fit.
Only nastier thing here is that a subset of procfs had to be copied into the process
This acted as both a cleanup and a prerequisite to moving netns into the process crate.
This was easy to do after the previous steps, however one notable thing is that the There's probably a bunch of nice followup refactors and designs that can be applied, but
Opted for a merge commit since rebasing would have taken ages and would have been error |
|
@rrhubenov thanks for addressing the comments. Do you mind creating a followup ticket just so we don't lose track of the remaining cleanups? Otherwise this looks good to me and seems like a good start towards getting the supervisor running outside the agent container. I'll let @derekwaynecarr give the final 👍. |
|
/ok to test ed12505 |
Reconciles main's anonymous activity-aggregator feature with the supervisor split topology. ActivityEvent / ActivitySender are placed in openshell-core::activity (mirroring the DenialEvent producer/consumer pattern); the orchestrator owns the channel and the aggregator, while both supervisor leaves (proxy in supervisor-network, bypass monitor in supervisor-process) emit as pure producers via cloned senders.
ed12505 to
b9a6a7b
Compare
|
I've created an issue for general refactoring: #1810 I can also implement the cleanups, since I've gained familiarity with the code. I should have the capacity to do it in a timely manner. Feel free to assign me if needed! |
# Conflicts: # crates/openshell-supervisor-network/src/proxy.rs
|
/ok to test e530c1d |
# Conflicts: # crates/openshell-sandbox/src/lib.rs
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> # Conflicts: # crates/openshell-core/Cargo.toml # crates/openshell-sandbox/Cargo.toml # crates/openshell-sandbox/src/lib.rs
|
@rrhubenov if you could rebase, i would like to try and land this early Friday, June 12 AM EST. |
|
For your follow-up, or as part of rebase, it appears there may be a functional change from the AI assisted code review. Behavioral change in baseline enrichment: /proc promotion removed. I am fine cleaning up any GPU regressions real or not in a follow-on so we can get more targeted refactors and smaller diffs. |
ace2a80 to
1ac626e
Compare
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> # Conflicts: # Cargo.lock # crates/openshell-core/Cargo.toml # crates/openshell-core/src/lib.rs # crates/openshell-sandbox/Cargo.toml # crates/openshell-sandbox/src/lib.rs # crates/openshell-supervisor-network/src/l7/token_grant_injection.rs # crates/openshell-supervisor-process/src/process.rs # crates/openshell-supervisor-process/src/ssh.rs
0d8acf3 to
26bc05c
Compare
Signed-off-by: Evan Lezar <elezar@nvidia.com> (cherry picked from commit 5102cb9) Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Good catch! Turns out this was accidentally removed during the first merge conflict resolution. I've added back the functionality and merged the changes from main |
|
/ok to test 73c5f2d |
Upstream renamed the tonic `tls` feature to `tls-native-roots`. The supervisor-process Cargo.toml still referenced the old name, which broke the workspace build after merging upstream. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Upstream's SPIFFE-backed token grant feature landed in crates/openshell-sandbox/src/. After the supervisor split, the L7 enforcement code in supervisor-network calls into token_grant, which would require supervisor-network to depend back on sandbox. Move token_grant.rs and spiffe_endpoint.rs into supervisor-network where the only callers live, add the reqwest and spiffe deps to supervisor-network's Cargo.toml, and drop them from sandbox. Also fix two stale `openshell_core::proto::` self-references in openshell-core (a pre-existing breakage that surfaced once the rest of the merge compiled). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
|
/ok to test 009dc33 |
|
see e2e failures, also think there may be a regression. Critical
|
|
/ok to test 009dc33 |
|
looks like you need to fix up clippy errors from the rust linter |
Summary
This PR is an implementation of #1305.
For full context on why this is required, please read #981 and #1305 in that order.
In summary, this split has been proposed for 2 main reasons:
supervisor(previouslysandbox)Although this change has been mostly evaluated from the perspective of the "split pod" topology, this splitting allows for a more agile topology regardless of environment. If for any reason the "process" and "network" functionality need to work as standalone processes, this split is required.
Review suggestions:
This refactor is hard to review due to a big collection of changes that are mostly movements of files and/or code blocks. For this reason, I suggest that reviewers take the approach of reviewing the changes commit by commit. I have taken care that each commit is as small as possible, moving only one functionality (module, function, etc.) in one commit.
Bear in mind that during the implementation, one very important architectural decision is that the 2 new crates (
supervisor-networkandsupervisor-process) do not have a dependency on one another.Only the "orcherstrator" crate, that being the
openshell-supervisorshould have a dependency on both of them.One other important point was that no behavioural changes should occur as a side-effect from this refactoring.
NOTE:
I haven't renamed the
openshell-sandboxcrate toopenshell-supervisorsince if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.I have also tried to minimise the amount of architectural design decisions in this PR, since it is already big enough.
Related Issue
Closes #1305
Changes
Testing
mise run pre-commitpassesChecklist