Skip to content

Warm-pooled sandboxes: RFC 0005 + install agent-sandbox extensions#1813

Closed
rmalani-nv wants to merge 3 commits into
NVIDIA:mainfrom
rmalani-nv:rmalani-nv/warm-pooled-sandboxes
Closed

Warm-pooled sandboxes: RFC 0005 + install agent-sandbox extensions#1813
rmalani-nv wants to merge 3 commits into
NVIDIA:mainfrom
rmalani-nv:rmalani-nv/warm-pooled-sandboxes

Conversation

@rmalani-nv

@rmalani-nv rmalani-nv commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Groundwork for warm-pooled sandboxes on the Kubernetes compute driver: adds the design as RFC 0005 and installs the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim) into the local k3d dev cluster and the e2e kube harness. No gateway runtime behavior changes yet — this prepares the clusters and records the plan for the follow-up driver work.

Installing the extensions before the gateway consumes them is intentional: it keeps the dev and e2e clusters ready for the phase-2 driver work, completes the existing AGENT_SANDBOX_VERSION "pinned for … extensions" intent already noted in those scripts, and is behavior-preserving — the extensions only add three CRDs and re-roll the shared agent-sandbox-controller. The install path was validated on a live k3s cluster (idempotent apply, all three CRDs Established, controller rolled out, and the cold-path sandbox lifecycle still works).

Related Issue

N/A — the design is captured in RFC 0005 in this PR. A spike/build issue can follow per the create-spikebuild-from-issue workflow.

Changes

  • RFC 0005 (rfc/0005-warm-pooled-sandboxes/README.md): propose claiming pre-warmed pods via the agent-sandbox extension CRDs (extensions.agents.x-k8s.io/v1alpha1). Documents the claim-based create flow, what bakes into the shared SandboxTemplate vs. late-binds over the supervisor relay, the one security-sensitive change (re-anchoring sandbox identity to the gateway-created SandboxClaim in auth/k8s_sa.rs), risks, alternatives, and a phased rollout.
  • Install extensions in dev + e2e (tasks/scripts/helm-k3s-local.sh, e2e/with-kube-gateway.sh): apply extensions.yaml alongside manifest.yaml, reusing the already-pinned AGENT_SANDBOX_VERSION (v0.4.6). The e2e harness waits for the three new extension CRDs to be Established and for the (re-rolled) agent-sandbox-controller.
  • Skill doc (.agents/skills/helm-dev-environment/SKILL.md): note that the dev bootstrap now installs the warm-pool extensions.

Three stacked commits: RFC → extension install → skill doc.

Testing

Validated end-to-end on a local k3s (k3d) cluster:

  • Installed agent-sandbox core + warm-pool extensions (v0.4.6) and drove a real SandboxTemplate → SandboxWarmPool → SandboxClaim cycle: the claim bound a warm pod in ~0.13s, the claim-injected openshell.io/sandbox-id annotation landed on the pod, and the pool self-replenished.

  • Deployed OpenShell via Skaffold and confirmed the cold-path baseline still works: sandbox createReady, IssueSandboxToken TokenReview → minted gateway JWT, and an echo executed inside the sandbox over the supervisor relay.

  • bash -n passes on both modified scripts.

  • mise run pre-commit passes — ran the relevant lint sub-tasks (license:check ✓, markdown:lint ✓) and bash -n on the scripts ✓. Did not run the full ci (Rust compile/tests) locally because no Rust/Python sources changed; CI covers it.

  • Unit tests added/updated — N/A (no code changes)

  • E2E tests added/updated — the e2e harness now installs the extensions; a warm-pool e2e assertion follows in the driver-path PR (RFC 0005, phase 2)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A; the design lives in RFC 0005 and the warm-pool runtime path isn't implemented yet, so architecture/ ("how it works today") is unchanged.

@rmalani-nv rmalani-nv requested a review from a team as a code owner June 8, 2026 18:13
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Propose adopting the upstream agent-sandbox warm-pool extension CRDs
(SandboxTemplate / SandboxWarmPool / SandboxClaim,
extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out
pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox
CR per request.

Documents the claim-based create flow, what bakes into the shared template
vs. late-binds over the supervisor relay, the one security-sensitive change
(re-anchoring sandbox identity to the gateway-created SandboxClaim in
auth/k8s_sa.rs), risks, alternatives, and a phased rollout. Drafted from a
local spike validated against agent-sandbox v0.4.6.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
…e2e clusters

Apply extensions.yaml alongside manifest.yaml when bootstrapping the local
k3d dev cluster and the e2e kube harness, reusing the pinned
AGENT_SANDBOX_VERSION already used for core. This installs the
SandboxTemplate / SandboxWarmPool / SandboxClaim CRDs and reconfigures the
existing agent-sandbox-controller, so clusters are ready for the warm-pooled
sandbox path (RFC 0005).

extensions.yaml rolls the controller deployment, so the e2e harness waits for
the rollout after both applies and for the new extension CRDs to be
Established. No gateway behavior changes yet.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
The local k3d bootstrap now also applies the agent-sandbox warm-pool
extensions; reflect that in the helm-dev-environment skill description.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
@rmalani-nv rmalani-nv force-pushed the rmalani-nv/warm-pooled-sandboxes branch from ba13a44 to 9dd7e1a Compare June 8, 2026 18:28
@TaylorMutch

Copy link
Copy Markdown
Collaborator

🔒 security-review-agent

Security Review

Determination: Legitimate design concern, with no immediate exploitable gateway runtime path in this PR.

Summary

The warm-pool RFC changes the security-critical identity anchor for Kubernetes sandboxes from a gateway-created Sandbox CR label to a gateway-created SandboxClaim record. That direction can preserve the existing invariant, but only if the follow-up implementation treats the claim and the gateway's durable claim record as the sole authority and explicitly prevents pod reuse, user-controlled claim metadata, and stale claim mappings.

PR #1813 mainly installs the upstream extension CRDs in dev/e2e and records the design. I do not see an immediate OpenShell runtime exploit from this PR alone because the gateway does not yet create SandboxClaims, the Helm gateway RBAC does not yet grant extension-resource access, and the current IssueSandboxToken authenticator still fails closed against warm-pool-owned Sandbox CRs that lack the existing OpenShell sandbox-id label.

The largest design gap is workspace/PVC reuse. OpenShell's Kubernetes sandboxes currently use volumeClaimTemplates for a PVC-backed /sandbox workspace, seeded once by an init container and then intentionally preserved for the sandbox lifetime. In a warm-pool design, a pre-created warm Sandbox may already have a workspace PVC before it is claimed. That is acceptable only while the pod/PVC is still pristine and has never run user workload code. Once a claim is handed to a user sandbox, the pod, Sandbox, and workspace PVC must be treated as single-use and must not return to the warm pool.

Severity Assessment

  • Impact if implemented incorrectly: High. A broken claim re-anchor could mint a gateway sandbox JWT for the wrong sandbox ID, leading to cross-sandbox access to provider material, policy state, relay/session scope, logs, or other sandbox-scoped APIs. Reusing a claimed pod/PVC could also expose one sandbox's /sandbox contents to a later claimant.
  • Exploitability today: Low for this PR as submitted. The risky path is in the proposed phase 2/3 implementation, not in currently reachable gateway behavior.
  • Exploitability after warm-pool support: Depends on whether attackers can influence claim metadata, template selection, pool lifecycle, or Kubernetes extension resources. The design should assume user-controlled sandbox templates and multi-replica gateways unless explicitly restricted.
  • Affected components: crates/openshell-driver-kubernetes, crates/openshell-server/src/auth/k8s_sa.rs, Helm gateway RBAC, future gateway Store schema for claim mappings, and Kubernetes warm-pool lifecycle handling.

Attack Scenario

If the phase 2/3 implementation trusts the pod annotation or SandboxClaim.status without a gateway-created, live, UID-checked claim record:

  1. An attacker influences a warm-path claim, claim metadata, or a stale claim mapping so the claimed pod carries openshell.io/sandbox-id for a victim sandbox.
  2. The warm pod's supervisor presents its pod-bound ServiceAccount token to IssueSandboxToken.
  3. The gateway validates the pod token and owner chain but accepts the spoofed or stale sandbox ID.
  4. The gateway mints a sandbox JWT for the victim sandbox ID, allowing the attacker-controlled pod to call sandbox-scoped APIs as that victim.

A separate data-isolation failure is possible if a claimed pod or its workspace PVC can ever return to the warm pool after user code runs. OpenShell's workspace init flow writes a sentinel and then preserves the PVC, so the next claimant could inherit filesystem state, cached credentials, gateway JWTs, logs, tool caches, source files, or malicious artifacts from the prior sandbox.

Remediation Plan

  1. Update RFC 0005 before implementation to define the exact fail-closed validation chain for IssueSandboxToken on the warm path:
    • TokenReview audience, service account, pod name, and pod UID match the live Pod.
    • Pod has exactly one controlling Sandbox ownerReference with matching UID.
    • The owning Sandbox has exactly one controlling SandboxClaim ownerReference with matching name and UID, and any agents.x-k8s.io/claim-uid label agrees.
    • The live SandboxClaim exists, has the expected UID, is bound, and status.sandbox.name equals the owning Sandbox name.
    • The gateway Store has a durable record for (namespace, claim name, claim UID) that was created by the gateway, contains the expected sandbox ID/template/pool, and matches the Pod's openshell.io/sandbox-id annotation.
    • Missing records, deleted claims, UID mismatches, multiple owners, status mismatches, and annotation mismatches all reject.
  2. Make the claim mapping HA-safe. It must be persisted in the shared gateway Store, not process memory, because any gateway replica may handle the supervisor bootstrap. Claim creation and Store insertion should be ordered so a pod cannot successfully bootstrap before the mapping exists.
  3. Reserve warm-path metadata. Users should not be able to set or override SandboxClaim.spec.additionalPodMetadata, SandboxClaim.spec.env, openshell.io/sandbox-id, openshell.ai/sandbox-id, openshell.io/*, agents.x-k8s.io/*, SPIFFE selector labels, or OpenShell managed-by labels. If the public API accepts annotations/labels, the warm path should either reject those requests or prove they cannot affect claim metadata, identity, SPIFFE, NetworkPolicy selection, or provider token grants.
  4. State and enforce single-use claim semantics. A warm Sandbox/Pod/PVC can be pooled only before it has run user workload code. After a SandboxClaim binds it to an OpenShell sandbox, it must be consumed exactly once and then deleted or otherwise destroyed with its workspace PVC. It must not be replenished by returning the same pod, Sandbox, or PVC to the pool.
  5. Make lifecycle cleanup explicit instead of relying on defaults. The design should specify the SandboxClaim lifecycle fields OpenShell will set, how claim deletion cascades to the bound Sandbox, Pod, PVCs, network policies, and ownerReferences/finalizers, and what the gateway does if cleanup is partial or delayed. The upstream SandboxClaim default shutdownPolicy: Retain should be treated as unsafe for OpenShell unless the follow-up design proves that PVC reuse cannot occur.
  6. Add a workspace isolation e2e test. Claim a warm sandbox, write a marker and sensitive-looking file under /sandbox, delete the sandbox, claim again from the same pool, and assert the marker is absent and the backing PVC identity is not reused across claims.
  7. Restrict initial pool scope. Until the threat model is expanded, only operator-declared pools using trusted templates/images should be warm-pooled. User-supplied images or arbitrary per-request templates should fall back to the cold path unless the design adds per-tenant pool isolation and clear cleanup guarantees.
  8. Add adversarial tests with the implementation:
    • Pod annotation does not match gateway claim record.
    • Sandbox owner UID mismatch.
    • SandboxClaim owner UID mismatch or missing owner.
    • Claim status points at the wrong Sandbox.
    • Claim exists but no Store record exists on the serving gateway replica.
    • Stale Store record remains after claim deletion/recreation.
    • User-supplied annotations/labels try to set reserved identity or SPIFFE-related keys.
    • Claimed pod, Sandbox, and workspace PVC are not reused for a later sandbox.

Additional Notes

The RFC already identifies auth/k8s_sa.rs as the security-sensitive change and correctly avoids trusting per-claim environment variables on the warm path. The missing piece is turning those intentions into explicit invariants and tests before phase 2/3 proceeds.

@TaylorMutch

Copy link
Copy Markdown
Collaborator

In addition to the review above, I would suggest we move the RFC proposal to an issue and not move forward with the agent-sandbox extensions at this time as there is currently no feature on main that requires these. I think we need to establish the design for how a warm pool will look with respect to how the supervisor sets up the workspace, since that has implications for how that data is utilized. I think we really should avoid sharing that workspace data between agents, so I'm not sure how a warmpool can be implemented at this time with those considerations in place.

@rmalani-nv

Copy link
Copy Markdown
Contributor Author

Thanks @TaylorMutch — this review is right on both counts, and I agree with moving it to an issue.

Identity re-anchor: agreed, and the fail-closed chain you laid out matches the intended direction — I've captured it as a hard requirement. Confirmed there's no exploit in this PR: validate_sandbox_owner_reference fails closed against warm-pool Sandboxes (they carry agents.x-k8s.io/claim-uid, not the openshell.ai/sandbox-id label), so the current IssueSandboxToken path rejects warm pods today.

Workspace/PVC — confirmed empirically. I reproduced this on a local k3s + agent-sandbox v0.4.6 with a volumeClaimTemplates-backed pool:

  • Each warm Sandbox gets its own PVC (no sharing), and claiming replenishes the pool with a fresh Sandbox + PVC — the claimed one is not recycled.
  • But with the default shutdownPolicy: Retain, I claimed a sandbox, wrote TENANT-A-SECRET to its workspace, then deleted the claim: the Sandbox was deleted but the workspace PVC survived (Bound, marker intact). So you're exactly right — the default lifecycle orphans the workspace, and single-use + explicit PVC destruction (shutdownPolicy: Delete/DeleteForeground, never returning a claimed Pod/Sandbox/PVC to the pool) is mandatory.

So warm pooling is reconcilable with workspace isolation, but only under those invariants — not the upstream defaults.

Action taken: I've moved the proposal to #1879 with your remediation items as hard requirements (fail-closed validation chain, HA-safe Store-backed claim mapping, reserved warm-path metadata, single-use lifecycle + explicit PVC destruction, operator-only trusted pools, and workspace-isolation + adversarial e2e tests), plus the evidence above. Per your suggestion I'm closing this PR and not landing the extensions install — nothing on main consumes them yet, so the install belongs with its phase-2 driver consumer, gated by the design in #1879.

@rmalani-nv

Copy link
Copy Markdown
Contributor Author

Closing in favor of #1879 (issue-first scoping per the RFC process). The extensions install is deferred to land with its phase-2 driver consumer; rationale and the security follow-ups are in the comment above and in #1879.

@rmalani-nv rmalani-nv closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants