Warm-pooled sandboxes: RFC 0005 + install agent-sandbox extensions #1813
rmalani-nv wants to merge 3 commits into
Propose adopting the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim, extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox CR per request. Documents the claim-based create flow, what bakes into the shared template vs. late-binds over the supervisor relay, the one security-sensitive change (re-anchoring sandbox identity to the gateway-created SandboxClaim in auth/k8s_sa.rs), risks, alternatives, and a phased rollout. Drafted from a local spike validated against agent-sandbox v0.4.6.
Signed-off-by: Roshni Malani <rmalani@nvidia.com>

…e2e clusters
Apply extensions.yaml alongside manifest.yaml when bootstrapping the local k3d dev cluster and the e2e kube harness, reusing the pinned AGENT_SANDBOX_VERSION already used for core. This installs the SandboxTemplate / SandboxWarmPool / SandboxClaim CRDs and reconfigures the existing agent-sandbox-controller, so clusters are ready for the warm-pooled sandbox path (RFC 0005). extensions.yaml rolls the controller deployment, so the e2e harness waits for the rollout after both applies and for the new extension CRDs to be Established. No gateway behavior changes yet.
Signed-off-by: Roshni Malani <rmalani@nvidia.com>

The local k3d bootstrap now also applies the agent-sandbox warm-pool extensions; reflect that in the helm-dev-environment skill description.
Signed-off-by: Roshni Malani <rmalani@nvidia.com>
ba13a44 to 9dd7e1a
Security Review

Determination: Legitimate design concern, with no immediate exploitable gateway runtime path in this PR.

Summary

The warm-pool RFC changes the security-critical identity anchor for Kubernetes sandboxes from a gateway-created

PR #1813 mainly installs the upstream extension CRDs in dev/e2e and records the design. I do not see an immediate OpenShell runtime exploit from this PR alone because the gateway does not yet create

The largest design gap is workspace/PVC reuse. OpenShell's Kubernetes sandboxes currently use

Severity Assessment

Attack Scenario

If the phase 2/3 implementation trusts the pod annotation or

A separate data-isolation failure is possible if a claimed pod or its workspace PVC can ever return to the warm pool after user code runs. OpenShell's workspace init flow writes a sentinel and then preserves the PVC, so the next claimant could inherit filesystem state, cached credentials, gateway JWTs, logs, tool caches, source files, or malicious artifacts from the prior sandbox.

Remediation Plan

Additional Notes

The RFC already identifies
In addition to the review above, I would suggest we move the RFC proposal to an issue and not move forward with the agent-sandbox extensions at this time, as no feature on main currently requires them. I think we need to establish the design for how a warm pool will look with respect to how the supervisor sets up the workspace, since that has implications for how that data is utilized. I think we really should avoid sharing that workspace data between agents, so I'm not sure how a warm pool can be implemented at this time with those considerations in place.
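
The isolation concern in this thread can be stated as a single-use rule: once a pod has been claimed and may have run user code, neither the pod nor its workspace PVC ever goes back to the pool. The sketch below is a minimal, hypothetical model of that rule. `WarmPod`, `PodState`, and `ReuseError` are illustrative names, not OpenShell or agent-sandbox types, and a real driver would enforce this through the cluster rather than in-process state.

```python
from enum import Enum
from typing import Optional


class PodState(Enum):
    WARM = "warm"            # pre-started, no user code has run
    CLAIMED = "claimed"      # bound to exactly one sandbox
    DESTROYED = "destroyed"  # pod and workspace PVC are gone


class ReuseError(Exception):
    """Raised when a transition could leak state between sandboxes."""


class WarmPod:
    """Hypothetical model of one pooled pod and its workspace PVC."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = PodState.WARM
        self.sandbox_id: Optional[str] = None
        self.pvc_exists = True

    def claim(self, sandbox_id: str) -> None:
        # Only a never-used pod may be handed out.
        if self.state is not PodState.WARM:
            raise ReuseError(f"{self.name} is {self.state.value}; cannot claim")
        self.state = PodState.CLAIMED
        self.sandbox_id = sandbox_id

    def release(self) -> None:
        # There is deliberately no CLAIMED -> WARM transition:
        # releasing always destroys the pod and its PVC.
        if self.state is not PodState.CLAIMED:
            raise ReuseError(f"{self.name} is {self.state.value}; cannot release")
        self.pvc_exists = False
        self.state = PodState.DESTROYED
```

The design choice worth noting is the absence of any path back to `WARM`; replenishing the pool means creating a fresh pod and a fresh volume, never recycling one.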
Thanks @TaylorMutch, this review is right on both counts, and I agree with moving it to an issue.

Identity re-anchor: agreed, and the fail-closed chain you laid out matches the intended direction; I've captured it as a hard requirement. Confirmed there's no exploit in this PR:

Workspace/PVC: confirmed empirically. I reproduced this on a local k3s + agent-sandbox

So warm pooling is reconcilable with workspace isolation, but only under those invariants, not the upstream defaults.

Action taken: I've moved the proposal to #1879 with your remediation items as hard requirements (fail-closed validation chain, HA-safe Store-backed claim mapping, reserved warm-path metadata, single-use lifecycle + explicit PVC destruction, operator-only trusted pools, and workspace-isolation + adversarial e2e tests), plus the evidence above. Per your suggestion I'm closing this PR and not landing the extensions install; nothing on
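
The exact validation chain is not reproduced in this thread, so the following is only a sketch of what "fail-closed" with a Store-backed claim mapping could look like. Every name here (`ClaimRecord`, `PodView`, `resolve_sandbox_id`) and the specific checks are assumptions for illustration, not the code in auth/k8s_sa.rs. The idea shown is that the sandbox ID is read from the gateway's own record of the claim it created, and the pod annotation is treated as an untrusted hint that can only cause a denial.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClaimRecord:
    """What the gateway stored when it created the claim (hypothetical shape)."""
    claim_uid: str
    sandbox_id: str


@dataclass(frozen=True)
class PodView:
    """What the gateway observes about the calling pod (hypothetical shape)."""
    claim_name: Optional[str]
    claim_uid: Optional[str]
    annotated_sandbox_id: Optional[str]  # untrusted: writable outside the gateway


def resolve_sandbox_id(pod: PodView, store: Mapping[str, ClaimRecord]) -> Optional[str]:
    """Return the sandbox ID for this pod, or None to deny. Any gap denies."""
    if not pod.claim_name or not pod.claim_uid:
        return None  # pod is not bound to a claim
    record = store.get(pod.claim_name)
    if record is None:
        return None  # claim was not created by the gateway
    if record.claim_uid != pod.claim_uid:
        return None  # same name, different object (deleted and recreated)
    if pod.annotated_sandbox_id is not None and pod.annotated_sandbox_id != record.sandbox_id:
        return None  # annotation disagrees with the store: treat as tampering
    return record.sandbox_id  # identity comes from the store, never the pod
```

A caller would mint a gateway token only when this returns a value; a `None` result maps to a rejected request rather than a fallback to the annotation.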
Summary
Groundwork for warm-pooled sandboxes on the Kubernetes compute driver: adds the design as RFC 0005 and installs the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim) into the local k3d dev cluster and the e2e kube harness. No gateway runtime behavior changes yet; this prepares the clusters and records the plan for the follow-up driver work.

Installing the extensions before the gateway consumes them is intentional: it keeps the dev and e2e clusters ready for the phase-2 driver work, completes the existing AGENT_SANDBOX_VERSION "pinned for … extensions" intent already noted in those scripts, and is behavior-preserving, since the extensions only add three CRDs and re-roll the shared agent-sandbox-controller. The install path was validated on a live k3s cluster (idempotent apply, all three CRDs Established, controller rolled out, and the cold-path sandbox lifecycle still works).

Related Issue

N/A. The design is captured in RFC 0005 in this PR. A spike/build issue can follow per the create-spike → build-from-issue workflow.

Changes

- rfc/0005-warm-pooled-sandboxes/README.md: propose claiming pre-warmed pods via the agent-sandbox extension CRDs (extensions.agents.x-k8s.io/v1alpha1). Documents the claim-based create flow, what bakes into the shared SandboxTemplate vs. late-binds over the supervisor relay, the one security-sensitive change (re-anchoring sandbox identity to the gateway-created SandboxClaim in auth/k8s_sa.rs), risks, alternatives, and a phased rollout.
- tasks/scripts/helm-k3s-local.sh, e2e/with-kube-gateway.sh: apply extensions.yaml alongside manifest.yaml, reusing the already-pinned AGENT_SANDBOX_VERSION (v0.4.6). The e2e harness waits for the three new extension CRDs to be Established and for the (re-rolled) agent-sandbox-controller.
- .agents/skills/helm-dev-environment/SKILL.md: note that the dev bootstrap now installs the warm-pool extensions.

Three stacked commits: RFC → extension install → skill doc.
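
The install-and-wait sequence described for the e2e harness can be sketched as a list of kubectl invocations. This is not the content of the modified scripts. The release URL layout, the controller namespace, the plural CRD names, and the timeouts are all assumptions made for illustration; only the two manifest file names, the v0.4.6 pin, and the controller name come from the PR.

```python
from typing import List

AGENT_SANDBOX_VERSION = "v0.4.6"  # pin stated in the PR

# Assumptions, not taken from the scripts: verify against the real harness.
RELEASE_BASE = "https://github.com/kubernetes-sigs/agent-sandbox/releases/download"
CONTROLLER_NAMESPACE = "agent-sandbox-system"
EXTENSION_CRDS = (
    "sandboxtemplates.extensions.agents.x-k8s.io",
    "sandboxwarmpools.extensions.agents.x-k8s.io",
    "sandboxclaims.extensions.agents.x-k8s.io",
)


def install_commands(version: str = AGENT_SANDBOX_VERSION) -> List[List[str]]:
    """Build the kubectl argv lists: apply both manifests, then wait."""
    cmds: List[List[str]] = []
    # Core first, then extensions; both applies are idempotent.
    for manifest in ("manifest.yaml", "extensions.yaml"):
        cmds.append(["kubectl", "apply", "-f", f"{RELEASE_BASE}/{version}/{manifest}"])
    # extensions.yaml re-rolls the controller, so wait after both applies.
    cmds.append([
        "kubectl", "rollout", "status", "deployment/agent-sandbox-controller",
        "-n", CONTROLLER_NAMESPACE, "--timeout=180s",
    ])
    # Each new CRD must be Established before anything creates its objects.
    for crd in EXTENSION_CRDS:
        cmds.append([
            "kubectl", "wait", "--for=condition=Established",
            f"crd/{crd}", "--timeout=60s",
        ])
    return cmds
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order; building the commands separately from running them keeps the sequence easy to inspect.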
Testing
Validated end-to-end on a local k3s (k3d) cluster:
- Installed agent-sandbox core + warm-pool extensions (v0.4.6) and drove a real SandboxTemplate → SandboxWarmPool → SandboxClaim cycle: the claim bound a warm pod in ~0.13s, the claim-injected openshell.io/sandbox-id annotation landed on the pod, and the pool self-replenished.
- Deployed OpenShell via Skaffold and confirmed the cold-path baseline still works: sandbox create → Ready, IssueSandboxToken TokenReview → minted gateway JWT, and an echo executed inside the sandbox over the supervisor relay.
- bash -n passes on both modified scripts.
- mise run pre-commit passes: ran the relevant lint sub-tasks (license:check ✓, markdown:lint ✓) and bash -n on the scripts ✓. Did not run the full ci (Rust compile/tests) locally because no Rust/Python sources changed; CI covers it.
- Unit tests added/updated: N/A (no code changes)
- E2E tests added/updated: the e2e harness now installs the extensions; a warm-pool e2e assertion follows in the driver-path PR (RFC 0005, phase 2)
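
For readers unfamiliar with the three extension kinds, the SandboxTemplate → SandboxWarmPool → SandboxClaim cycle can be pictured as three objects that reference one template. The sketch below builds them as plain dictionaries. The kinds and the API group/version come from the PR; the spec field names (`podTemplate`, `replicas`, `sandboxTemplateRef`), the object names, and the container image are assumptions that should be checked against the v0.4.6 CRD schemas before use.

```python
from typing import Any, Dict, List

API_VERSION = "extensions.agents.x-k8s.io/v1alpha1"  # from the PR


def warm_pool_objects(name: str, replicas: int, image: str) -> List[Dict[str, Any]]:
    """Return template, pool, and claim objects. Spec fields are assumed."""
    template_name = f"{name}-template"

    template = {
        "apiVersion": API_VERSION,
        "kind": "SandboxTemplate",
        "metadata": {"name": template_name},
        # Assumed field: a pod template shared by every pod in the pool.
        "spec": {"podTemplate": {"spec": {"containers": [
            {"name": "sandbox", "image": image},
        ]}}},
    }
    pool = {
        "apiVersion": API_VERSION,
        "kind": "SandboxWarmPool",
        "metadata": {"name": f"{name}-pool"},
        # Assumed fields: how many warm pods to keep, and from which template.
        "spec": {"replicas": replicas, "sandboxTemplateRef": {"name": template_name}},
    }
    claim = {
        "apiVersion": API_VERSION,
        "kind": "SandboxClaim",
        "metadata": {"name": f"{name}-claim"},
        # Assumed field: the claim names a template and binds a warm pod from it.
        "spec": {"sandboxTemplateRef": {"name": template_name}},
    }
    return [template, pool, claim]
```

Because kubectl accepts JSON as well as YAML, each dictionary could be serialized with `json.dumps` and applied in order, template first and claim last.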
Checklist
architecture/ ("how it works today") is unchanged.