feat(llm): add vertex_cached_content config for explicit Vertex AI caching #3583
Draft
juanmichelini wants to merge 1 commit into
…ching

Vertex AI Gemini exposes an explicit context-cache API: the caller creates a CachedContent resource (https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache) and references it by name on every subsequent generateContent request. LiteLLM already understands the kwarg (it pops 'cached_content' from optional_params in vertex_ai.gemini.transformation.sync_transform_request_body and forwards it to the API body), but the SDK had no first-class way to plumb it through; users had to fight with raw litellm_extra_body and a proxy that may or may not let it through.

This commit adds:

* LLM.vertex_cached_content: str | None, an optional resource name field.
* select_chat_options() emits 'cached_content=<name>' on the LiteLLM call whenever the field is set AND the model name contains 'gemini' (so vertex_ai/, gemini/, litellm_proxy/gemini-* all route correctly).
* The emission is gated by a Gemini-only check so non-Vertex providers (OpenAI, Anthropic, etc.) that reject unknown kwargs stay unaffected.
* A caller-supplied 'cached_content' kwarg always wins over the LLM config field, matching the precedence we apply elsewhere.

Cache lifecycle (create / refresh TTL / delete) stays with the caller, who has the Vertex credentials and project context. This keeps the SDK free of google-cloud-aiplatform as a hard dependency while still giving users a clean, type-checked seam for explicit caching.

Tests cover:

* Vertex / Gemini / litellm_proxy positive cases all emit the kwarg.
* OpenAI and Claude negative cases never emit it.
* Default None is silent.
* User kwarg override wins.

This is part of an SDK cost-reduction investigation triggered by the gemini-3.5-flash swebench run analysed in OpenHands/benchmarks#741 ($1,912 projected on 500 instances, dominated by uncached prompt tokens at litellm_proxy). PR #3581 covered the thought-signature side of that investigation; this PR gives a path to explicit caching for users running against vertex_ai/ directly.
Co-authored-by: openhands <openhands@all-hands.dev>
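The gating and precedence rules listed in the commit message can be pictured with a small standalone sketch. It is not the SDK's implementation: SketchLLM and apply_cached_content are hypothetical names invented for illustration, and the only behavior taken from the commit message is the "gemini" substring gate and the rule that a caller-supplied kwarg wins.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchLLM:
    """Hypothetical stand-in for the SDK's LLM config (only two fields shown)."""

    model: str
    vertex_cached_content: str | None = None


def model_supports_cached_content(model: str) -> bool:
    # Substring gate described in the commit message: any name containing "gemini".
    return "gemini" in model.lower()


def apply_cached_content(llm: SketchLLM, user_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return the kwargs that would go to the LiteLLM call (illustrative only)."""
    out = dict(user_kwargs)  # copy so the caller's dict is not mutated
    if (
        llm.vertex_cached_content
        and model_supports_cached_content(llm.model)
        and "cached_content" not in out  # a caller-supplied kwarg always wins
    ):
        out["cached_content"] = llm.vertex_cached_content
    return out
```

Under these assumptions, a config for litellm_proxy/gemini-3.5-flash with the field set yields a cached_content kwarg, while the same field on gpt-5-mini yields none. The resource name itself is whatever the caller created on the Vertex side; the sketch treats it as an opaque string.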
Python API breakage checks: ✅ PASSED. Behavioral default changes detected.
REST API breakage checks (OpenAPI): ✅ PASSED
Why
This is the second of two PRs from the gemini-3.5-flash cost investigation in OpenHands/benchmarks#741. The companion is #3581 (strip Vertex thought signatures from history).
What I found in the data
I pulled the conversation event logs for the swebench-verified slice (10/10 resolved, $28.13 total / $2.81/instance / $11.11 outlier) and aggregated Metrics.usage_to_metrics["default"] across all 10 instances. The upshot is that 74% of prompt tokens are billed at the uncached $1.50/M rate.
Why the existing path doesn't write a cache
The SDK already marks the system message with cache_control: ephemeral (and the last user/tool, Anthropic-style) and the model name gemini-3.5-flash matches the "gemini-3" substring in PROMPT_CACHE_MODELS, so is_caching_prompt_active() returns True. LiteLLM's vertex_ai.context_caching.ContextCachingEndpoints.check_and_create_cache does the right thing for vertex_ai/ direct: it splits cached vs non-cached messages, checks if Google already has the resource, otherwise calls Vertex's cachedContents API and returns the resource name to reference.
But the benchmark runs through litellm_proxy/gemini-3.5-flash, not vertex_ai/gemini-3.5-flash. The SDK's local LiteLLM client just sends an OpenAI-style request body to the proxy URL, so whether the cache_control markers ever reach Vertex depends entirely on the proxy's translation layer, and the data shows it isn't translating them today.
What this PR adds
A first-class seam for users running against Vertex (directly or via a proxy that knows how to forward the kwarg) to pre-create a CachedContent resource and reference it by name on every request.
The SDK threads it through select_chat_options as a top-level cached_content=... kwarg, which LiteLLM pops from optional_params in vertex_ai/gemini/transformation.py and forwards to the Vertex generateContent body as cachedContent.
Design choices
* Cache lifecycle stays with the caller. This avoids pulling google-cloud-aiplatform in as a hard dependency and respects the user's existing auth setup. The user creates the resource (one gcloud call) and pastes its name into the LLM config.
* Gemini-only gating. Sending cached_content for a non-Gemini model would surface as an unknown-kwarg error from OpenAI / Anthropic / etc. We gate on "gemini" in model.lower(), which covers vertex_ai/, gemini/, litellm_proxy/gemini-*, and bare gemini-*; everything else gets the silent no-op.
* User kwargs win. A cached_content in user_kwargs overrides the LLM config field, matching the precedence pattern used elsewhere in chat_options.py (extra_headers, etc.).
Limitations (called out so reviewers can shape this)
* For litellm_proxy/ users this kwarg only fires if the proxy itself forwards cached_content to its Vertex backend. The SDK can't fix proxy configuration; this PR just exposes the field so a fixed proxy has something to receive.
* This PR does not change cache_control marker placement. That's a separate question (the rolling-tail marker is effectively a no-op on Vertex's first-continuous-block split, but isn't actively harmful).
Files
* openhands-sdk/openhands/sdk/llm/llm.py: new vertex_cached_content: str | None field with full docstring linking to Vertex docs.
* openhands-sdk/openhands/sdk/llm/options/chat_options.py: _model_supports_vertex_cached_content gating + emission block at the bottom of select_chat_options.
* tests/sdk/llm/test_chat_options.py: 7 new tests, plus the field added to the DummyLLM dataclass to match the existing convention.
Test plan
Tests cover:
* vertex_ai/, gemini/, and litellm_proxy/gemini-* all receive the kwarg.
* gpt-5-mini and claude-sonnet-4-5 never receive it even when the field is set.
* Default None produces no kwarg.
* A user-supplied cached_content overrides the LLM config field.
This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, following the cost analysis in OpenHands/benchmarks#741. Companion PR: #3581.
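The forwarding step that both the description and the limitations depend on (cached_content popped from the optional params and sent to Vertex as cachedContent) can be sketched as below. This is a simplified illustration, not LiteLLM's transformation code: build_generate_content_body is a hypothetical helper, and every other request parameter is deliberately ignored.

```python
from __future__ import annotations

from typing import Any


def build_generate_content_body(
    contents: list[dict[str, Any]], optional_params: dict[str, Any]
) -> dict[str, Any]:
    """Illustrative only: move cached_content into the body as cachedContent."""
    params = dict(optional_params)  # copy so the caller's dict is untouched
    cached = params.pop("cached_content", None)

    body: dict[str, Any] = {"contents": contents}
    if cached is not None:
        # The resource name is passed through unchanged.
        body["cachedContent"] = cached
    # Mapping of the remaining params is out of scope for this sketch.
    return body
```

A proxy that drops the kwarg before this step would produce a body without cachedContent, which is the situation the first limitation describes.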
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
Base images: eclipse-temurin:17-jdk, nikolaik/python-nodejs:python3.13-nodejs22-slim, golang:1.21-bookworm
Pull (multi-arch manifest)
    # Each variant is a multi-arch manifest supporting both amd64 and arm64
    docker pull ghcr.io/openhands/agent-server:6f12ac8-python
About Multi-Architecture Support
* Each variant tag (e.g. 6f12ac8-python) is a multi-arch manifest supporting both amd64 and arm64
* Architecture-specific tags (e.g. 6f12ac8-python-amd64) are also available if needed