Add per-instance cost cap to swe-bench runner by juanmichelini · Pull Request #741 · OpenHands/benchmarks

juanmichelini · 2026-06-08T18:02:50Z

Why

Triggered by review of OpenHands/openhands-index-results#1167 — the Gemini-3.5-Flash swe-bench-verified run spent $1,912 across 500 instances (mean $3.82), but 22 instances cost more than $10 each and accounted for ~20% of the total spend. The worst single instance cost $44.24.

Root cause for the 22 expensive instances

All 22 had the same fingerprint compared to typical runs:

bucket	n	cache_read / prompt_tokens	mean events	mean prompt_tokens
HIGH (>$10)	22	10.3%	342	11.5M
MID ($1–10)	453	27.5%	195	2.6M
LOW (≤$1)	24	45.5%	122	0.8M

The worst case (django__django-16116, $44.24) fired the LLMSummarizingCondenser 4 times during its 1069 events (at events 367, 552, 737, 922). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache, so the bulk of subsequent prompt tokens are billed at the full uncached price (~10× the cached rate). Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to:

uncached prompt: (32.0M − 3.6M) × $1.50/M = $42.68
cache reads: 3.6M × $0.15/M = $0.53
output (completion + reasoning): 196K × $9.00/M = $1.76
total ≈ $44.97 ✓ (within about 2% of the observed $44.24)
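For anyone who wants to re-run the arithmetic, here is a minimal sketch that recomputes the estimate from the rounded figures quoted above. The function name and constants are illustrative and not part of the PR. With the rounded token counts it prints $44.90 rather than $44.97; the small gap presumably comes from the breakdown having been computed on unrounded counts, which is an assumption.

```python
# Prices and token counts are the rounded figures quoted in the breakdown.
UNCACHED_PER_M = 1.50  # $ per million uncached prompt tokens
CACHED_PER_M = 0.15    # $ per million cache-read tokens
OUTPUT_PER_M = 9.00    # $ per million output (completion + reasoning) tokens


def estimate_cost(prompt_m: float, cache_read_m: float, output_m: float) -> float:
    """Estimate cost in dollars from token counts given in millions."""
    uncached = (prompt_m - cache_read_m) * UNCACHED_PER_M
    cached = cache_read_m * CACHED_PER_M
    output = output_m * OUTPUT_PER_M
    return uncached + cached + output


# django__django-16116: 32.0M prompt tokens, 3.6M cache reads, 196K output.
estimate = estimate_cost(prompt_m=32.0, cache_read_m=3.6, output_m=0.196)
print(f"${estimate:.2f}")
```

Almost all of the estimate is the uncached prompt term, which is why the cache-read ratio in the table tracks cost so closely.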

20 of the 22 high-cost instances were resolved, so the agent was making progress — it just took too many iterations and burned too much money doing so.

What this PR does

Adds a small, opt-in defence-in-depth measure: a --max-cost-per-instance CLI flag (also exposed as EvalMetadata.max_cost_per_instance, default None = disabled). When set, a callback pauses the conversation as soon as its accumulated cost exceeds the cap, mirroring the existing behaviour of max_iteration_per_run. The patch produced up to that point is still collected and submitted.
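A rough sketch of the callback's shape, for reviewers who want the idea without opening the diff. This is not the code in benchmarks/utils/cost_cap.py: the `bind()` method, the `accumulated_cost` attribute and `FakeConversation` are assumptions made for illustration, and the real SDK objects may expose cost and pausing differently. The sketch uses a strict `>` comparison to follow the word "exceeds"; the actual boundary behaviour should be checked against the unit tests.

```python
from typing import Any, Optional


class CostCapCallback:
    """Pause a conversation once its accumulated cost exceeds a cap.

    Hypothetical sketch: `accumulated_cost` and `pause()` are assumed to be
    what the bound conversation object exposes.
    """

    def __init__(self, max_cost: float) -> None:
        if max_cost <= 0:
            raise ValueError("max_cost must be positive")
        self.max_cost = max_cost
        self.triggered = False
        self._conversation: Optional[Any] = None

    def bind(self, conversation: Any) -> None:
        # Deferred binding: the conversation only exists after the callback
        # has already been handed to its constructor.
        self._conversation = conversation

    def __call__(self, event: Any) -> None:
        if self.triggered or self._conversation is None:
            return  # already paused, or not bound yet
        try:
            cost = float(self._conversation.accumulated_cost)
            if cost > self.max_cost:
                self.triggered = True
                self._conversation.pause()
        except Exception:
            # A metrics or pause() failure must not take the instance down.
            pass


class FakeConversation:
    """Stand-in used only to exercise the sketch."""

    def __init__(self) -> None:
        self.accumulated_cost = 0.0
        self.paused = False

    def pause(self) -> None:
        self.paused = True


callback = CostCapCallback(max_cost=10.0)
conversation = FakeConversation()
callback(event=None)            # safe before binding: no-op
callback.bind(conversation)

conversation.accumulated_cost = 3.82
callback(event=None)            # below the cap: no-op
paused_below_cap = conversation.paused

conversation.accumulated_cost = 10.5
callback(event=None)            # above the cap: pauses once
```

Setting `triggered` before calling `pause()` keeps the callback idempotent even if `pause()` itself raises.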

Scope

Wired into the swe-bench runner (benchmarks/swebench/run_infer.py) only, since that's where the regression surfaced.
The new max_cost_per_instance field lives on the shared EvalMetadata, so plumbing it into the other benchmark runners is a one-line change per runner in a follow-up.

What this does not fix

The underlying condenser cache-invalidation issue. Fixing that properly would need SDK-level changes (e.g. a condenser that keeps a stable cache prefix, or enforcement of Metrics.max_budget_per_task inside the run loop). Both are larger changes worth doing separately.
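To illustrate why a stable cache prefix matters, here is a toy model that treats the reusable cache as the run of leading messages shared by two consecutive prompts. It is only a sketch: real providers cache at token or block granularity with their own rules, and the message contents below are invented for the example.

```python
from typing import List


def common_prefix_len(previous: List[str], current: List[str]) -> int:
    """Number of leading messages shared by two consecutive prompts."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared


history = ["system", "task", "step 1", "step 2", "step 3"]

# Appending a new step keeps the whole previous prompt as a reusable prefix.
appended = history + ["step 4"]

# Summarising the early steps rewrites the prompt right after the task message,
# so everything from that point on has to be processed uncached.
condensed = ["system", "task", "summary of steps 1-3", "step 4"]

print(common_prefix_len(history, appended))   # 5 shared messages
print(common_prefix_len(history, condensed))  # 2 shared messages
```

In this model every condensation resets the shared prefix to the first two messages, which is the pattern behind the low cache-read ratio in the HIGH bucket.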

Files

New: benchmarks/utils/cost_cap.py — CostCapCallback class with deferred binding (the callback needs a reference to the conversation, which can only be obtained after construction). Defensive error handling so a misbehaving metrics or pause() call can never take an instance down.
New: tests/test_cost_cap.py — 9 unit tests using a fake conversation: rejects non-positive caps, no-op below cap, pauses at/above cap, idempotent once triggered, safe before binding, swallows metrics/pause failures.
Modified: benchmarks/utils/models.py — add max_cost_per_instance: float | None (gt=0).
Modified: benchmarks/utils/args_parser.py — add --max-cost-per-instance.
Modified: benchmarks/swebench/run_infer.py — construct the callback before Conversation, bind it after.

Test plan

pytest tests/test_cost_cap.py -v → 9 passed.
Smoke-tested that EvalMetadata(... max_cost_per_instance=0) is rejected by Pydantic and that the default value is None.
Smoke-tested the argparse plumbing: --max-cost-per-instance 7.5 parses to 7.5, absence parses to None.
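The two smoke tests can be reproduced in miniature with the standard library alone. Note the substitution: the PR validates the field with Pydantic (gt=0) on EvalMetadata, whereas this sketch uses a dataclass with a `__post_init__` check as a stand-in so that it runs without extra dependencies. `EvalMetadataSketch` and `build_parser` are hypothetical names, not the ones in benchmarks/utils.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalMetadataSketch:
    """Stand-in for the shared EvalMetadata model (only the new field)."""

    max_cost_per_instance: Optional[float] = None

    def __post_init__(self) -> None:
        # Mimics the gt=0 constraint: None disables the cap, otherwise > 0.
        cap = self.max_cost_per_instance
        if cap is not None and cap <= 0:
            raise ValueError("max_cost_per_instance must be greater than 0")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-cost-per-instance",
        type=float,
        default=None,
        help="Pause an instance once its accumulated cost exceeds this many dollars.",
    )
    return parser


# Flag present: the value parses to a float and passes validation.
args = build_parser().parse_args(["--max-cost-per-instance", "7.5"])
metadata = EvalMetadataSketch(max_cost_per_instance=args.max_cost_per_instance)

# Flag absent: the cap stays disabled.
no_cap = build_parser().parse_args([])
```

Keeping the default at None means existing invocations behave exactly as before unless the flag is passed.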

Usage

# Default: no cap, behaviour unchanged.
python -m benchmarks.swebench.run_infer ...

# Cap per-instance cost at $10 (would have saved ~$240 on the
# Gemini-3.5-Flash run referenced above, with no impact on the
# 478 instances that finished under $10).
python -m benchmarks.swebench.run_infer ... --max-cost-per-instance 10

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, in response to a review comment on OpenHands/openhands-index-results#1167.


Some evaluations have a small minority of instances that consume disproportionately large amounts of money. For example, the Gemini-3.5-Flash swe-bench-verified run on PR OpenHands/openhands-index-results#1167 spent $1,912 total across 500 instances ($3.82 mean), but 22 instances cost >$10 each and accounted for ~20% of the total spend, with a worst case of $44.24 for a single instance.

Root cause for those 22 instances: they triggered the LLMSummarizingCondenser multiple times (4x for the worst case). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache. Their cache-read ratio averaged 10% versus 27% for typical instances and 45% for cheap ones, so the bulk of their tokens were billed at the full uncached price. Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to ~$44 on the worst instance.

This adds a defence-in-depth measure: a `--max-cost-per-instance` flag (also exposed as `EvalMetadata.max_cost_per_instance`, default None = disabled). When set, a small callback pauses the conversation once the per-instance accumulated_cost exceeds the cap, mirroring the existing behaviour of `max_iteration_per_run`. The patch produced up to that point is still collected and submitted.

This does not fix the underlying condenser cache-invalidation issue (which would need SDK-level changes), but it does cap the blast radius for any single instance across all models. Wired into the swe-bench runner first since that is where the regression surfaced; can be plumbed into the other benchmark runners in a follow-up.

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini wants to merge 1 commit into main from fix/per-instance-cost-cap
