Add per-instance cost cap to swe-bench runner #741

Draft
juanmichelini wants to merge 1 commit into main from fix/per-instance-cost-cap

@juanmichelini

Collaborator

Why

Triggered by review of OpenHands/openhands-index-results#1167 — the Gemini-3.5-Flash swe-bench-verified run spent $1,912 across 500 instances (mean $3.82), but 22 instances cost more than $10 each and accounted for ~20% of the total spend. The worst single instance cost $44.24.

Root cause for the 22 expensive instances

All 22 had the same fingerprint compared to typical runs:

| bucket | n | cache_read / prompt_tokens | mean events | mean prompt_tokens |
|---|---|---|---|---|
| HIGH (>$10) | 22 | 10.3% | 342 | 11.5M |
| MID ($1–10) | 453 | 27.5% | 195 | 2.6M |
| LOW (≤$1) | 24 | 45.5% | 122 | 0.8M |
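
As a rough illustration of how the two columns that define the fingerprint could be derived per instance, here is a minimal sketch. The function names are hypothetical and not taken from the PR; the bucket boundaries are the ones in the table, and the token counts in the example are the rounded figures quoted for django__django-16116.

```python
def cost_bucket(cost_usd: float) -> str:
    """Classify an instance by total cost, using the table's boundaries."""
    if cost_usd > 10:
        return "HIGH"
    if cost_usd > 1:
        return "MID"
    return "LOW"


def cache_read_ratio(cache_read_tokens: float, prompt_tokens: float) -> float:
    """Fraction of prompt tokens served from the provider's prompt cache."""
    if prompt_tokens == 0:
        return 0.0
    return cache_read_tokens / prompt_tokens


# Worst case from the description: $44.24, 3.6M cache reads of 32.0M prompt tokens.
worst_bucket = cost_bucket(44.24)
worst_ratio = cache_read_ratio(3.6e6, 32.0e6)
print(worst_bucket, f"{worst_ratio:.1%}")
```

A low ratio together with a high event count is the pattern the HIGH bucket shows.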

The worst case (django__django-16116, $44.24) fired the LLMSummarizingCondenser 4 times during its 1069 events (at events 367, 552, 737, 922). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache, so the bulk of subsequent prompt tokens are billed at the full uncached price (~10× the cached rate). Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to:

  • uncached prompt: (32.0M − 3.6M) × $1.50/M = $42.68
  • cache reads: 3.6M × $0.15/M = $0.53
  • output (completion + reasoning): 196K × $9.00/M = $1.76
  • total ≈ $44.97 ✓ (within about 2% of the observed $44.24)
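
The breakdown above can be reproduced with a few lines of arithmetic. This is a sketch using the rounded token counts and per-million prices quoted in the list; the function name is hypothetical. Because the inputs are rounded, the result lands near the itemised figures rather than exactly on them.

```python
# Prices in USD per million tokens, as quoted in the breakdown.
PRICE_UNCACHED_PROMPT = 1.50
PRICE_CACHE_READ = 0.15
PRICE_OUTPUT = 9.00


def estimate_cost(prompt_m: float, cache_read_m: float, output_m: float) -> dict:
    """Estimate cost from token counts given in millions of tokens."""
    uncached = (prompt_m - cache_read_m) * PRICE_UNCACHED_PROMPT
    cached = cache_read_m * PRICE_CACHE_READ
    output = output_m * PRICE_OUTPUT
    return {
        "uncached_prompt": uncached,
        "cache_read": cached,
        "output": output,
        "total": uncached + cached + output,
    }


# Rounded inputs for the worst case: 32.0M prompt, 3.6M cached, 196K output.
breakdown = estimate_cost(prompt_m=32.0, cache_read_m=3.6, output_m=0.196)
for name, usd in breakdown.items():
    print(f"{name}: ${usd:.2f}")
```

The uncached prompt term dominates the total, which is why the low cache-read ratio matters far more than the output tokens.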

20 of the 22 high-cost instances were resolved, so the agent was making progress — it just took too many iterations and burned too much money doing so.

What this PR does

Adds a small, opt-in defence-in-depth measure: a --max-cost-per-instance CLI flag (also exposed as EvalMetadata.max_cost_per_instance; the default, None, disables it). When set, a callback pauses the conversation as soon as its accumulated cost reaches or exceeds the cap, mirroring the existing behaviour of max_iteration_per_run. The patch produced up to that point is still collected and submitted.

Scope

  • Wired into the swe-bench runner (benchmarks/swebench/run_infer.py) only, since that's where the regression surfaced.
  • The new max_cost_per_instance field lives on the shared EvalMetadata, so plumbing it into the other benchmark runners is a one-line change per runner in a follow-up.

What this does not fix

The underlying condenser cache-invalidation issue. Fixing that properly would need SDK-level changes (e.g. a condenser that keeps a stable cache prefix, or enforcement of Metrics.max_budget_per_task inside the run loop). Both are larger changes worth doing separately.

Files

  • New: benchmarks/utils/cost_cap.py: CostCapCallback class with deferred binding (the callback needs a reference to the conversation, which can only be obtained after construction). It handles errors defensively, so a misbehaving metrics or pause() call can never take an instance down.
  • New: tests/test_cost_cap.py: 9 unit tests using a fake conversation. They cover: rejects non-positive caps, no-op below cap, pauses at/above cap, idempotent once triggered, safe before binding, swallows metrics/pause failures.
  • Modified: benchmarks/utils/models.py: add max_cost_per_instance: float | None (gt=0).
  • Modified: benchmarks/utils/args_parser.py: add --max-cost-per-instance.
  • Modified: benchmarks/swebench/run_infer.py: construct the callback before Conversation, bind it after.
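
To make the deferred-binding pattern concrete, here is a minimal sketch of a callback with the behaviours listed above. It is not the code in cost_cap.py: the method names bind() and accumulated_cost() and the FakeConversation class are assumptions made for illustration, and the real SDK may expose cost through a different attribute.

```python
from typing import Any, Optional


class CostCapCallback:
    """Illustrative sketch: pause a conversation once its cost reaches a cap."""

    def __init__(self, max_cost: float) -> None:
        if max_cost <= 0:
            raise ValueError("max_cost must be positive")
        self.max_cost = max_cost
        self.triggered = False
        self._conversation: Optional[Any] = None

    def bind(self, conversation: Any) -> None:
        # Deferred binding: the conversation only exists after the callback
        # has already been passed to its constructor.
        self._conversation = conversation

    def __call__(self, event: Any = None) -> None:
        if self.triggered or self._conversation is None:
            return  # already fired, or not bound yet
        try:
            cost = self._conversation.accumulated_cost()  # assumed accessor
            if cost >= self.max_cost:
                self.triggered = True
                self._conversation.pause()
        except Exception:
            # A metrics or pause() failure must never take the instance down.
            pass


class FakeConversation:
    """Stand-in for a real conversation, for demonstration only."""

    def __init__(self) -> None:
        self.cost = 0.0
        self.pause_calls = 0

    def accumulated_cost(self) -> float:
        return self.cost

    def pause(self) -> None:
        self.pause_calls += 1


callback = CostCapCallback(max_cost=10.0)
callback()                      # safe before binding: does nothing
conversation = FakeConversation()
callback.bind(conversation)
conversation.cost = 9.99
callback()                      # below the cap: no pause
conversation.cost = 10.0
callback()                      # at the cap: pauses once
callback()                      # idempotent once triggered
print(conversation.pause_calls)
```

Setting the triggered flag before calling pause() means a failing pause() is not retried on every later event, which is one way to get the idempotence the tests describe.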

Test plan

  • pytest tests/test_cost_cap.py -v → 9 passed.
  • Smoke-tested that EvalMetadata(... max_cost_per_instance=0) is rejected by Pydantic and that the default value is None.
  • Smoke-tested the argparse plumbing: --max-cost-per-instance 7.5 parses to 7.5, absence parses to None.
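
The argparse smoke test can be sketched in isolation as follows. The build_parser() helper is hypothetical and stands in for the parser defined in args_parser.py; only the flag name, the float type, and the None default come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the real benchmark argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-cost-per-instance",
        type=float,
        default=None,  # None means the cap is disabled
        help="Pause an instance once its accumulated cost reaches this USD cap.",
    )
    return parser


with_cap = build_parser().parse_args(["--max-cost-per-instance", "7.5"])
without_cap = build_parser().parse_args([])
print(with_cap.max_cost_per_instance, without_cap.max_cost_per_instance)
```

In this sketch the parser accepts any float; per the test plan, rejection of non-positive values happens in the Pydantic model rather than at parse time.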

Usage

# Default: no cap, behaviour unchanged.
python -m benchmarks.swebench.run_infer ...

# Cap per-instance cost at $10 (would have saved ~$240 on the
# Gemini-3.5-Flash run referenced above, with no impact on the
# 478 instances that finished under $10).
python -m benchmarks.swebench.run_infer ... --max-cost-per-instance 10

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, in response to a review comment on OpenHands/openhands-index-results#1167.

Some evaluations have a small minority of instances that consume
disproportionately large amounts of money. For example, the Gemini-3.5-Flash
swe-bench-verified run on PR OpenHands/openhands-index-results#1167 spent
$1912 total across 500 instances ($3.82 mean), but 22 instances cost
>$10 each and accounted for ~20% of the total spend, with a worst-case
of $44.24 for a single instance.

Root cause for those 22 instances: they triggered the LLMSummarizingCondenser
multiple times (4x for the worst case). Each condensation rewrites the
prompt prefix and therefore invalidates the provider's prompt cache.
Their cache-read ratio averaged 10% versus 27% for typical instances and
45% for cheap ones, so the bulk of their tokens were billed at the full
uncached price. Combined with reasoning_effort=high (which adds reasoning
tokens to every uncached call) and 300+ iterations, this multiplied out
to ~$44 on the worst instance.

This adds a defence-in-depth measure: a `--max-cost-per-instance` flag
(also exposed as `EvalMetadata.max_cost_per_instance`, default None =
disabled). When set, a small callback pauses the conversation once the
per-instance accumulated_cost reaches or exceeds the cap, mirroring the existing
behaviour of `max_iteration_per_run`. The patch produced up to that
point is still collected and submitted.

This does not fix the underlying condenser cache-invalidation issue
(which would need SDK-level changes), but it does cap the blast radius
for any single instance across all models.

Wired into the swe-bench runner first since that is where the regression
surfaced; can be plumbed into the other benchmark runners in a follow-up.

Co-authored-by: openhands <openhands@all-hands.dev>