Skip to content

feat(llm): add vertex_cached_content config for explicit Vertex AI caching#3583

Draft
juanmichelini wants to merge 1 commit into
mainfrom
feat/vertex-explicit-cached-content-config
Draft

feat(llm): add vertex_cached_content config for explicit Vertex AI caching#3583
juanmichelini wants to merge 1 commit into
mainfrom
feat/vertex-explicit-cached-content-config

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Why

This is the second of two PRs from the gemini-3.5-flash cost investigation in OpenHands/benchmarks#741. The companion is #3581 (strip Vertex thought signatures from history).

What I found in the data

Pulled the conversation event logs for the swebench-verified slice (10/10 resolved, $28.13 total / $2.81/instance / $11.11 outlier) and aggregated Metrics.usage_to_metrics["default"] across all 10 instances:

metric total note
prompt_tokens 19,390,285
cache_read_tokens 5,102,668 26% hit rate
cache_write_tokens 0 across every instance
completion_tokens 659,100
reasoning_tokens 562,315

So 74 % of prompt tokens are billed at the uncached $1.50/M rate.

Why the existing path doesn't write a cache

The SDK already marks the system message with cache_control: ephemeral (and the last user/tool, Anthropic-style) and the model name gemini-3.5-flash matches the "gemini-3" substring in PROMPT_CACHE_MODELS, so is_caching_prompt_active() returns True. LiteLLM's vertex_ai.context_caching.ContextCachingEndpoints.check_and_create_cache does the right thing for vertex_ai/ direct: it splits cached vs non-cached messages, checks if Google already has the resource, otherwise calls Vertex's cachedContents API and returns the resource name to reference.

But the benchmark runs through litellm_proxy/gemini-3.5-flash, not vertex_ai/gemini-3.5-flash. The SDK's local LiteLLM client just sends an OpenAI-style request body to the proxy URL, so whether the cache_control markers ever reach Vertex depends entirely on the proxy's translation layer — and the data shows it isn't translating them today.

What this PR adds

A first-class seam for users running against Vertex (directly or via a proxy that knows how to forward the kwarg) to pre-create a CachedContent resource and reference it by name on every request:

from openhands.sdk import LLM

llm = LLM(
    model="vertex_ai/gemini-3-flash",
    vertex_cached_content="cachedContents/1234567890",
)

The SDK threads it through select_chat_options as a top-level cached_content=... kwarg, which LiteLLM pops from optional_params in vertex_ai/gemini/transformation.py and forwards to the Vertex generateContent body as cachedContent.

Design choices

  • No CachedContent.create inside the SDK. Cache lifecycle (create, refresh TTL, delete) requires Vertex credentials, project/location config, and an async cache manager. Keeping that out of the SDK avoids pulling google-cloud-aiplatform in as a hard dependency and respects the user's existing auth setup. The user creates the resource (one gcloud call) and pastes its name into the LLM config.
  • Gemini-only gating. Emitting cached_content for a non-Gemini model would surface as an unknown-kwarg error from OpenAI / Anthropic / etc. We gate on "gemini" in model.lower() — covers vertex_ai/, gemini/, litellm_proxy/gemini-*, and bare gemini-*; everything else gets the silent no-op.
  • User kwarg wins. A caller-supplied cached_content in user_kwargs overrides the LLM config field, matching the precedence pattern used elsewhere in chat_options.py (extra_headers, etc.).

Limitations (called out so reviewers can shape this)

  • For the litellm_proxy/ users this kwarg only fires if the proxy itself forwards cached_content to its Vertex backend. The SDK can't fix proxy configuration; this PR just exposes the field so a fixed proxy has something to receive.
  • The Vertex minimum-cache-size constraint (1024–32K tokens depending on model) still applies — sub-threshold caches won't be created, but a pre-created cache always meets the bar by construction.
  • This PR does not change the existing cache_control marker placement. That's a separate question (the rolling-tail marker is effectively a no-op on Vertex's first-continuous-block split, but isn't actively harmful).

Files

  • Modified: openhands-sdk/openhands/sdk/llm/llm.py — new vertex_cached_content: str | None field with full docstring linking to Vertex docs.
  • Modified: openhands-sdk/openhands/sdk/llm/options/chat_options.py_model_supports_vertex_cached_content gating + emission block at the bottom of select_chat_options.
  • Modified: tests/sdk/llm/test_chat_options.py — 7 new tests, plus the field added to the DummyLLM dataclass to match the existing convention.

Test plan

uv run pytest tests/sdk/llm/test_chat_options.py -v          # 22 passed
uv run pytest tests/sdk/llm/ tests/sdk/event/ -q             # 931 passed
uv run ruff format <files> && uv run ruff check <files>      # clean
uv run pyright openhands-sdk/openhands/sdk/llm/options/chat_options.py   # 0 errors

Tests cover:

  • vertex_ai/, gemini/, and litellm_proxy/gemini-* all receive the kwarg.
  • gpt-5-mini and claude-sonnet-4-5 never receive it even when the field is set.
  • Default None produces no kwarg.
  • A caller-supplied cached_content overrides the LLM config field.

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, following the cost analysis in OpenHands/benchmarks#741. Companion PR: #3581.

@juanmichelini can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6f12ac8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6f12ac8-python \
  ghcr.io/openhands/agent-server:6f12ac8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6f12ac8-golang-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang-amd64
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6f12ac8-golang-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang-arm64
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6f12ac8-java-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java-amd64
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6f12ac8-java-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java-arm64
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6f12ac8-python-amd64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python-amd64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python-amd64
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:6f12ac8-python-arm64
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python-arm64
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python-arm64
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:6f12ac8-golang
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-golang
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-golang
ghcr.io/openhands/agent-server:6f12ac8-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:6f12ac8-java
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-java
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-java
ghcr.io/openhands/agent-server:6f12ac8-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:6f12ac8-python
ghcr.io/openhands/agent-server:6f12ac88ea944aeb5875579a1031cbf7a58ef2e5-python
ghcr.io/openhands/agent-server:feat-vertex-explicit-cached-content-config-python
ghcr.io/openhands/agent-server:6f12ac8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 6f12ac8-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 6f12ac8-python-amd64) are also available if needed

…ching

Vertex AI Gemini exposes an explicit context-cache API: the caller creates
a CachedContent resource (https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache)
and references it by name on every subsequent generateContent request.
LiteLLM already understands the kwarg (it pops 'cached_content' from
optional_params in vertex_ai.gemini.transformation.sync_transform_request_body
and forwards it to the API body) but the SDK had no first-class way to
plumb it through — users had to fight with raw litellm_extra_body and a
proxy that may or may not let it through.

This commit adds:

* LLM.vertex_cached_content: str | None  --  optional resource name field.
* select_chat_options() emits 'cached_content=<name>' on the LiteLLM call
  whenever the field is set AND the model name contains 'gemini' (so
  vertex_ai/, gemini/, litellm_proxy/gemini-* all route correctly).
* The emission is gated by a Gemini-only check so non-Vertex providers
  (OpenAI, Anthropic, etc.) that reject unknown kwargs stay unaffected.
* A caller-supplied 'cached_content' kwarg always wins over the LLM
  config field, matching the precedence we apply elsewhere.

Cache lifecycle (create / refresh TTL / delete) stays with the caller,
who has the Vertex credentials and project context. This keeps the SDK
free of google-cloud-aiplatform as a hard dependency while still giving
users a clean, type-checked seam for explicit caching.

Tests cover:
* Vertex / Gemini / litellm_proxy positive cases all emit the kwarg.
* OpenAI and Claude negative cases never emit it.
* Default None is silent.
* User kwarg override wins.

This is part of an SDK cost-reduction investigation triggered by the
gemini-3.5-flash swebench run analysed in OpenHands/benchmarks#741 ($1,912
projected on 500 instances, dominated by uncached prompt tokens at
litellm_proxy). PR #3581 covered the thought-signature side of that
investigation; this PR gives a path to explicit caching for users running
against vertex_ai/ directly.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Behavioral default changes detected

These public Field(default=...) changes differ from the latest released baseline, but they were already present on the base branch, so this PR was not auto-marked with the release-note-required label:

  • openhands.sdk.settings.model.OpenHandsAgentSettings.condenser: CondenserSettingsLLMSummarizingCondenserSettings

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/llm
   llm.py75710885%543, 567, 600, 885–886, 889–893, 895, 903–905, 909, 926–927, 931, 933–934, 936–938, 1061, 1184, 1377, 1386–1388, 1487, 1498, 1539, 1551–1553, 1556–1559, 1565, 1623, 1634, 1677, 1690–1692, 1695–1698, 1704, 1876–1881, 1997–1998, 2332–2333, 2342, 2348, 2353, 2393, 2395–2400, 2402–2419, 2422–2426, 2428–2429, 2435–2444, 2501, 2503
TOTAL29668842771% 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants