feat(llm): add `vertex_cached_content` config for explicit Vertex AI caching by juanmichelini · Pull Request #3583 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-09T03:32:49Z

Why

This is the second of two PRs from the gemini-3.5-flash cost investigation in OpenHands/benchmarks#741. The companion is #3581 (strip Vertex thought signatures from history).

What I found in the data

Pulled the conversation event logs for the swebench-verified slice (10/10 resolved, $28.13 total / $2.81/instance / $11.11 outlier) and aggregated Metrics.usage_to_metrics["default"] across all 10 instances:

metric	total	note
prompt_tokens	19,390,285
cache_read_tokens	5,102,668	26% hit rate
cache_write_tokens	0	across every instance
completion_tokens	659,100
reasoning_tokens	562,315

So 74 % of prompt tokens are billed at the uncached $1.50/M rate.

Why the existing path doesn't write a cache

The SDK already marks the system message with cache_control: ephemeral (and the last user/tool, Anthropic-style) and the model name gemini-3.5-flash matches the "gemini-3" substring in PROMPT_CACHE_MODELS, so is_caching_prompt_active() returns True. LiteLLM's vertex_ai.context_caching.ContextCachingEndpoints.check_and_create_cache does the right thing for vertex_ai/ direct: it splits cached vs non-cached messages, checks if Google already has the resource, otherwise calls Vertex's cachedContents API and returns the resource name to reference.

But the benchmark runs through litellm_proxy/gemini-3.5-flash, not vertex_ai/gemini-3.5-flash. The SDK's local LiteLLM client just sends an OpenAI-style request body to the proxy URL, so whether the cache_control markers ever reach Vertex depends entirely on the proxy's translation layer — and the data shows it isn't translating them today.

What this PR adds

A first-class seam for users running against Vertex (directly or via a proxy that knows how to forward the kwarg) to pre-create a CachedContent resource and reference it by name on every request:

from openhands.sdk import LLM

llm = LLM(
    model="vertex_ai/gemini-3-flash",
    vertex_cached_content="cachedContents/1234567890",
)

The SDK threads it through select_chat_options as a top-level cached_content=... kwarg, which LiteLLM pops from optional_params in vertex_ai/gemini/transformation.py and forwards to the Vertex generateContent body as cachedContent.

Design choices

No CachedContent.create inside the SDK. Cache lifecycle (create, refresh TTL, delete) requires Vertex credentials, project/location config, and an async cache manager. Keeping that out of the SDK avoids pulling google-cloud-aiplatform in as a hard dependency and respects the user's existing auth setup. The user creates the resource (one gcloud call) and pastes its name into the LLM config.
Gemini-only gating. Emitting cached_content for a non-Gemini model would surface as an unknown-kwarg error from OpenAI / Anthropic / etc. We gate on "gemini" in model.lower() — covers vertex_ai/, gemini/, litellm_proxy/gemini-*, and bare gemini-*; everything else gets the silent no-op.
User kwarg wins. A caller-supplied cached_content in user_kwargs overrides the LLM config field, matching the precedence pattern used elsewhere in chat_options.py (extra_headers, etc.).

Limitations (called out so reviewers can shape this)

For the litellm_proxy/ users this kwarg only fires if the proxy itself forwards cached_content to its Vertex backend. The SDK can't fix proxy configuration; this PR just exposes the field so a fixed proxy has something to receive.
The Vertex minimum-cache-size constraint (1024–32K tokens depending on model) still applies — sub-threshold caches won't be created, but a pre-created cache always meets the bar by construction.
This PR does not change the existing cache_control marker placement. That's a separate question (the rolling-tail marker is effectively a no-op on Vertex's first-continuous-block split, but isn't actively harmful).

Files

Modified: openhands-sdk/openhands/sdk/llm/llm.py — new vertex_cached_content: str | None field with full docstring linking to Vertex docs.
Modified: openhands-sdk/openhands/sdk/llm/options/chat_options.py — _model_supports_vertex_cached_content gating + emission block at the bottom of select_chat_options.
Modified: tests/sdk/llm/test_chat_options.py — 7 new tests, plus the field added to the DummyLLM dataclass to match the existing convention.

Test plan

uv run pytest tests/sdk/llm/test_chat_options.py -v          # 22 passed
uv run pytest tests/sdk/llm/ tests/sdk/event/ -q             # 931 passed
uv run ruff format <files> && uv run ruff check <files>      # clean
uv run pyright openhands-sdk/openhands/sdk/llm/options/chat_options.py   # 0 errors

Tests cover:

vertex_ai/, gemini/, and litellm_proxy/gemini-* all receive the kwarg.
gpt-5-mini and claude-sonnet-4-5 never receive it even when the field is set.
Default None produces no kwarg.
A caller-supplied cached_content overrides the LLM config field.

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, following the cost analysis in OpenHands/benchmarks#741. Companion PR: #3581.

@juanmichelini can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6f12ac8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6f12ac8-python \
  ghcr.io/openhands/agent-server:6f12ac8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6f12ac8-golang-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang-amd64
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6f12ac8-golang-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang-arm64
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6f12ac8-java-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java-amd64
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6f12ac8-java-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java-arm64
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6f12ac8-python-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python-amd64
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:6f12ac8-python-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python-arm64
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:6f12ac8-golang
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:6f12ac8-java
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:6f12ac8-python
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 6f12ac8-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 6f12ac8-python-amd64) are also available if needed

…ching Vertex AI Gemini exposes an explicit context-cache API: the caller creates a CachedContent resource (https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache) and references it by name on every subsequent generateContent request. LiteLLM already understands the kwarg (it pops 'cached_content' from optional_params in vertex_ai.gemini.transformation.sync_transform_request_body and forwards it to the API body) but the SDK had no first-class way to plumb it through — users had to fight with raw litellm_extra_body and a proxy that may or may not let it through. This commit adds: * LLM.vertex_cached_content: str | None -- optional resource name field. * select_chat_options() emits 'cached_content=<name>' on the LiteLLM call whenever the field is set AND the model name contains 'gemini' (so vertex_ai/, gemini/, litellm_proxy/gemini-* all route correctly). * The emission is gated by a Gemini-only check so non-Vertex providers (OpenAI, Anthropic, etc.) that reject unknown kwargs stay unaffected. * A caller-supplied 'cached_content' kwarg always wins over the LLM config field, matching the precedence we apply elsewhere. Cache lifecycle (create / refresh TTL / delete) stays with the caller, who has the Vertex credentials and project context. This keeps the SDK free of google-cloud-aiplatform as a hard dependency while still giving users a clean, type-checked seam for explicit caching. Tests cover: * Vertex / Gemini / litellm_proxy positive cases all emit the kwarg. * OpenAI and Claude negative cases never emit it. * Default None is silent. * User kwarg override wins. This is part of an SDK cost-reduction investigation triggered by the gemini-3.5-flash swebench run analysed in OpenHands/benchmarks#741 ($1,912 projected on 500 instances, dominated by uncached prompt tokens at litellm_proxy). PR #3581 covered the thought-signature side of that investigation; this PR gives a path to explicit caching for users running against vertex_ai/ directly. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-09T03:33:15Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Behavioral default changes detected

These public Field(default=...) changes differ from the latest released baseline, but they were already present on the base branch, so this PR was not auto-marked with the release-note-required label:

openhands.sdk.settings.model.OpenHandsAgentSettings.condenser: CondenserSettings → LLMSummarizingCondenserSettings

Action log

github-actions · 2026-06-09T03:33:23Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-09T03:35:41Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/llm
llm.py	757	108	85%	543, 567, 600, 885–886, 889–893, 895, 903–905, 909, 926–927, 931, 933–934, 936–938, 1061, 1184, 1377, 1386–1388, 1487, 1498, 1539, 1551–1553, 1556–1559, 1565, 1623, 1634, 1677, 1690–1692, 1695–1698, 1704, 1876–1881, 1997–1998, 2332–2333, 2342, 2348, 2353, 2393, 2395–2400, 2402–2419, 2422–2426, 2428–2429, 2435–2444, 2501, 2503
TOTAL	29668	8427	71%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(llm): add `vertex_cached_content` config for explicit Vertex AI caching#3583

feat(llm): add `vertex_cached_content` config for explicit Vertex AI caching#3583
juanmichelini wants to merge 1 commit into
mainfrom
feat/vertex-explicit-cached-content-config

juanmichelini commented Jun 9, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

juanmichelini commented Jun 9, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What I found in the data

Why the existing path doesn't write a cache

What this PR adds

Design choices

Limitations (called out so reviewers can shape this)

Files

Test plan

Uh oh!

github-actions Bot commented Jun 9, 2026

Python API breakage checks — ✅ PASSED

Behavioral default changes detected

Uh oh!

github-actions Bot commented Jun 9, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

juanmichelini commented Jun 9, 2026 •

edited by github-actions Bot

Loading