Add per-instance cost cap to swe-bench runner #741
Draft
juanmichelini wants to merge 1 commit into
Some evaluations have a small minority of instances that consume disproportionately large amounts of money. For example, the Gemini-3.5-Flash swe-bench-verified run on PR OpenHands/openhands-index-results#1167 spent $1912 total across 500 instances ($3.82 mean), but 22 instances cost >$10 each and accounted for ~20% of the total spend, with a worst case of $44.24 for a single instance.
Root cause for those 22 instances: they triggered the LLMSummarizingCondenser multiple times (4x for the worst case). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache. Their cache-read ratio averaged 10% versus 27% for typical instances and 45% for cheap ones, so the bulk of their tokens were billed at the full uncached price. Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to ~$44 on the worst instance.
This adds a defence-in-depth measure: a `--max-cost-per-instance` flag (also exposed as `EvalMetadata.max_cost_per_instance`, default None = disabled). When set, a small callback pauses the conversation once the per-instance accumulated_cost exceeds the cap, mirroring the existing behaviour of `max_iteration_per_run`. The patch produced up to that point is still collected and submitted.
This does not fix the underlying condenser cache-invalidation issue (which would need SDK-level changes), but it does cap the blast radius for any single instance across all models. Wired into the swe-bench runner first since that is where the regression surfaced; can be plumbed into the other benchmark runners in a follow-up.
Co-authored-by: openhands <openhands@all-hands.dev>
This was referenced Jun 8, 2026
Why
Triggered by review of OpenHands/openhands-index-results#1167 — the Gemini-3.5-Flash swe-bench-verified run spent $1,912 across 500 instances (mean $3.82), but 22 instances cost more than $10 each and accounted for ~20% of the total spend. The worst single instance cost $44.24.
Root cause for the 22 expensive instances
All 22 had the same fingerprint compared to typical runs: the condenser fired multiple times, and the cache-read ratio averaged 10%, versus 27% for typical instances and 45% for cheap ones.
The worst case (`django__django-16116`, $44.24) fired the `LLMSummarizingCondenser` 4 times during its 1069 events (at events 367, 552, 737, 922). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache, so the bulk of subsequent prompt tokens are billed at the full uncached price (~10× the cached rate). Combined with `reasoning_effort=high` (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to $44.24 for that single instance.
20 of the 22 high-cost instances were resolved, so the agent was making progress; it just took too many iterations and burned too much money doing so.
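To make the cache effect concrete, here is a rough sketch of the blended prompt-token price as a function of the cache-read ratio. It relies only on the ~10× uncached-to-cached price ratio and the cache-read ratios reported for this run (10%, 27%, 45%); the function name is hypothetical, prices are expressed in units of the cached-token price rather than real provider prices, and output and reasoning tokens are ignored.

```python
def relative_prompt_cost(cache_read_ratio: float, uncached_multiplier: float = 10.0) -> float:
    """Average price per prompt token, in units of the cached-token price.

    Assumption: every prompt token is billed either at the cached rate (1.0)
    or at the uncached rate (uncached_multiplier times the cached rate).
    """
    if not 0.0 <= cache_read_ratio <= 1.0:
        raise ValueError("cache_read_ratio must be between 0 and 1")
    cached_share = cache_read_ratio
    uncached_share = 1.0 - cache_read_ratio
    return cached_share * 1.0 + uncached_share * uncached_multiplier


# Cache-read ratios reported for the expensive, typical, and cheap instances.
for label, ratio in [("expensive", 0.10), ("typical", 0.27), ("cheap", 0.45)]:
    price = relative_prompt_cost(ratio)
    print(f"{label:9s} cache-read {ratio:.0%}: {price:.2f}x the cached price per prompt token")
```

Under these assumptions the expensive instances pay about 9.1 units per prompt token against 5.95 for the cheap ones, roughly 1.5× more. The rest of the gap would then have to come from token volume, that is, more iterations and more reasoning tokens, which is consistent with the 300+ iterations noted above.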
What this PR does
Adds a small, opt-in defence-in-depth measure: a `--max-cost-per-instance` CLI flag (also exposed as `EvalMetadata.max_cost_per_instance`, default `None` = disabled). When set, a callback pauses the conversation as soon as its accumulated cost exceeds the cap, mirroring the existing behaviour of `max_iteration_per_run`. The patch produced up to that point is still collected and submitted.
Scope
- Wired into the swe-bench runner (`benchmarks/swebench/run_infer.py`) only, since that's where the regression surfaced.
- The `max_cost_per_instance` field lives on the shared `EvalMetadata`, so plumbing it into the other benchmark runners is a one-line change per runner in a follow-up.
What this does not fix
The underlying condenser cache-invalidation issue. Fixing that properly would need SDK-level changes (e.g. a condenser that keeps a stable cache prefix, or enforcement of `Metrics.max_budget_per_task` inside the run loop). Both are larger changes worth doing separately.
Files
- `benchmarks/utils/cost_cap.py`: `CostCapCallback` class with deferred binding (the callback needs a reference to the conversation, which can only be obtained after construction). Defensive error handling so that a failing metrics lookup or `pause()` call can never take an instance down.
- `tests/test_cost_cap.py`: 9 unit tests using a fake conversation: rejects non-positive caps, no-op below cap, pauses at/above cap, idempotent once triggered, safe before binding, swallows metrics/pause failures.
- `benchmarks/utils/models.py`: add `max_cost_per_instance: float | None` (`gt=0`).
- `benchmarks/utils/args_parser.py`: add `--max-cost-per-instance`.
- `benchmarks/swebench/run_infer.py`: construct the callback before `Conversation`, bind it after.
Test plan
- `pytest tests/test_cost_cap.py -v` → 9 passed.
- Verified that `EvalMetadata(... max_cost_per_instance=0)` is rejected by Pydantic and that the default value is `None`.
- Verified that `--max-cost-per-instance 7.5` parses to `7.5`; absence parses to `None`.
Usage
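The sketch below illustrates the intended flow rather than the PR's actual code: parse the flag, construct the callback, bind it once the conversation exists, and let it pause the run at the cap. Only the flag name, the `CostCapCallback` class name, and `pause()` come from the description above. `FakeConversation`, its `accumulated_cost` attribute, the `bind()` method, and the callback's call signature are assumptions made for illustration, and the real runner and SDK may expose cost differently.

```python
import argparse
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


class CostCapCallback:
    """Hypothetical sketch: pause a conversation once its cost reaches a cap."""

    def __init__(self, max_cost: float) -> None:
        if max_cost <= 0:
            raise ValueError("max_cost must be positive")
        self.max_cost = max_cost
        self.triggered = False
        self._conversation: Optional[Any] = None  # bound after construction

    def bind(self, conversation: Any) -> None:
        """Deferred binding: the conversation only exists after the callback."""
        self._conversation = conversation

    def __call__(self, event: Any = None) -> None:
        # No-op before binding and after the cap has already fired.
        if self.triggered or self._conversation is None:
            return
        try:
            # Assumption: the cost is readable as a plain attribute.
            cost = self._conversation.accumulated_cost
            if cost >= self.max_cost:  # pauses at or above the cap
                self.triggered = True
                self._conversation.pause()
        except Exception:
            # Never let a metrics or pause() failure take the instance down.
            logger.exception("cost cap check failed; continuing")


class FakeConversation:
    """Stand-in for a real conversation object (illustration only)."""

    def __init__(self) -> None:
        self.accumulated_cost = 0.0
        self.paused = False

    def pause(self) -> None:
        self.paused = True


parser = argparse.ArgumentParser()
parser.add_argument("--max-cost-per-instance", type=float, default=None)
args = parser.parse_args(["--max-cost-per-instance", "7.5"])

callback = CostCapCallback(args.max_cost_per_instance)
callback()  # safe: nothing is bound yet

conversation = FakeConversation()
callback.bind(conversation)

conversation.accumulated_cost = 3.0
callback()
below_cap_paused = conversation.paused  # still running below the cap

conversation.accumulated_cost = 7.5
callback()
print(below_cap_paused, conversation.paused)
```

Setting `triggered` before calling `pause()` is one way to keep the callback idempotent: a failing `pause()` is logged once instead of being retried on every subsequent event.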
This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, in response to a review comment on OpenHands/openhands-index-results#1167.