feat: per-instance intelligent model routing across all 5 default-agent benchmarks by juanmichelini · Pull Request #742 · OpenHands/benchmarks

juanmichelini · 2026-06-09T03:59:14Z

Summary

Adds opt-in per-instance intelligent routing to all five default-agent benchmarks (swebench, swebenchmultimodal, swtbench, gaia, commit0).

When the user passes an intelligent-router-v0-shaped JSON to --llm-config-path, each instance is classified once via a small classifier LLM and the agent conversation is routed to the matching tier model. A plain LLM config preserves today's behavior byte-for-byte — this is strictly additive.

The default routing table — based on the iter5 classifier prompt and a 3-tier mapping — is:

| Classifier category | Tier model |
| --- | --- |
| Frontend | `kimi-k2.6` |
| Issue Resolution (other) | `minimax-m2.7` |
| Greenfield / Testing / Information Gathering | `gpt-5.5` |

Classifier = minimax-m2.7 (one call per instance, roughly $0.005–$0.02 on a SWE-bench prompt).
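The default table above amounts to a category-to-model lookup. The following is a minimal sketch of that lookup only, not the PR's implementation: `DEFAULT_ROUTES`, `FALLBACK_MODEL`, and `pick_tier_model` are hypothetical names, and the real logic lives in `RouterSpec` / `classify_and_route`, whose signatures are not shown in this description. The category and model names are taken from the table.

```python
# Illustrative sketch; the names below are hypothetical, not the PR's API.
DEFAULT_ROUTES: dict[str, str] = {
    "Frontend": "kimi-k2.6",
    "Issue Resolution (other)": "minimax-m2.7",
    "Greenfield": "gpt-5.5",
    "Testing": "gpt-5.5",
    "Information Gathering": "gpt-5.5",
}

# Assumption: an unrecognized category falls back to the largest tier.
FALLBACK_MODEL = "gpt-5.5"


def pick_tier_model(category: str, routes: dict[str, str] = DEFAULT_ROUTES) -> str:
    """Map a classifier category to the tier model that should run the agent."""
    return routes.get(category, FALLBACK_MODEL)


if __name__ == "__main__":
    for cat in ("Frontend", "Issue Resolution (other)", "Testing"):
        print(f"{cat} -> {pick_tier_model(cat)}")
```

Keeping the mapping as data rather than branching logic is what would let a different routing table be supplied through the config file without touching the benchmark code.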

What's in this PR

| File | Purpose |
| --- | --- |
| `benchmarks/utils/intelligent_routing.py` (new, ~330 LOC) | iter5 classifier prompt, `RouterSpec`, `classify_and_route`, `parse_classifier_output`, per-benchmark task-text extractors, vision-fallback |
| `benchmarks/utils/llm_config.py` | `load_llm_config` now transparently accepts a router config (returns the classifier LLM as the "primary"); adds `maybe_load_router_spec` |
| `benchmarks/utils/models.py` | `EvalMetadata` gains optional `routing: RouterSpec \| None` |
| `benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py` | each gains the same ~25-line block before `Agent(...)`; only the `benchmark="..."` string differs |
| `benchmarks/utils/sample_configs/intelligent_router_3tier.example.json` (new) | reference config; copy and fill in `api_key` / `base_url` to use locally |
| `tests/test_intelligent_routing.py` (new, 33 tests) | classifier-output parsing (incl. chatty / case-insensitive / bare-keyword), router config loading and validation, per-benchmark task extraction, end-to-end dispatch with a stub classifier, classifier-exception fallback, vision-fallback path |

Design choices

- All routing knowledge lives in `intelligent_routing.py`. Per-benchmark `run_infer.py` edits are tiny and uniform: only the `benchmark="..."` string differs across the five files. Adding more routing strategies later only touches that one module.
- `load_llm_config` returns the classifier LLM when given a router config, so ACP agents, condensers, and any path that just needs an LLM continue to work. Routing is gated by `metadata.routing is not None and agent_type == "default"`.
- Per-instance virtual-key cost tracking still works: the helper passes the chosen LLM through `build_eval_llm()` exactly like the existing code path.
- Vision-fallback safety: for swebenchmultimodal, image-bearing instances classified into a text-only tier (e.g. `minimax-m2.7`) are auto-rerouted to a vision-capable tier (`gpt-5.5` by default). The reroute is logged in the per-instance routing line.
- Robust classifier-output parsing: the iter5 prompt asks for an exact category string, but real models hedge with markdown or prose, or drop "(other)". `parse_classifier_output` handles those cases.
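To make the last two points concrete, here is a rough sketch of tolerant category parsing and the vision reroute. It is not the code in `intelligent_routing.py`: `parse_category_sketch`, `apply_vision_fallback`, and `TEXT_ONLY_MODELS` are hypothetical names, and treating only `minimax-m2.7` as text-only is an assumption drawn from the example above.

```python
import re
from typing import Optional

CATEGORIES = (
    "Frontend",
    "Issue Resolution (other)",
    "Greenfield",
    "Testing",
    "Information Gathering",
)

# Assumption: only this tier lacks vision support.
TEXT_ONLY_MODELS = {"minimax-m2.7"}
VISION_FALLBACK = "gpt-5.5"


def parse_category_sketch(raw: str) -> Optional[str]:
    """Find a canonical category in a noisy classifier reply, else None."""
    # Drop markdown emphasis/backticks, lowercase, collapse whitespace.
    text = re.sub(r"[*_`#]", "", raw).lower()
    text = re.sub(r"\s+", " ", text)
    # Simplification: check the longest names first and take the first hit.
    for cat in sorted(CATEGORIES, key=len, reverse=True):
        if cat.lower() in text:
            return cat
    # Bare keyword: the model dropped the "(other)" suffix.
    if "issue resolution" in text:
        return "Issue Resolution (other)"
    return None


def apply_vision_fallback(model: str, has_images: bool) -> tuple[str, bool]:
    """Return (model_to_use, rerouted) for an instance."""
    if has_images and model in TEXT_ONLY_MODELS:
        return VISION_FALLBACK, True
    return model, False


if __name__ == "__main__":
    print(parse_category_sketch("I'd say this is **issue resolution**."))
    print(apply_vision_fallback("minimax-m2.7", has_images=True))
```

Returning `None` for an unparseable reply leaves the fallback policy to the caller, which keeps parsing and routing decisions separate.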

Honest caveats

- commit0 and gaia will mostly degenerate to `gpt-5.5`. The iter5 prompt was tuned on a SWE-bench-style 200-task sample; commit0 tasks classify as Greenfield and GAIA tasks as Information Gathering, and both categories route to `gpt-5.5`. On those benchmarks the router likely won't beat just running `gpt-5.5` directly, but the data will tell you that, and the next iteration can short-circuit per benchmark.
- swtbench is similarly dominated by a single category, Issue Resolution (other), because the classifier reads the issue body, not the "write tests" instruction.
- Classifier cost is paid in minimax tokens per instance. Not free; not bad. Expect ~$0.005–$0.02 per classification depending on prompt length.
- The matching SDK + workflow changes are out of scope for this PR. This PR makes the receiving end ready. To dispatch a router run through `run-eval.yml`, the matching software-agent-sdk PR will need to add a `router-classified-3tier` entry to `resolve_model_config.MODELS`, teach preflight to recurse into the tier sub-models, and patch `evaluation/eval-job/scripts/build_matrix.py` to accept the `router/...` slug. Until then, the router is usable by passing the config file directly via `--llm-config-path` to a benchmark's `run_infer.py`.
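As a back-of-the-envelope check on the classifier-cost caveat, the per-classification range can be scaled to a full run. This is simple arithmetic under the ~$0.005–$0.02 estimate quoted above, not a measurement; `classifier_cost_range` is a hypothetical helper, and 500 is the instance count of SWE-bench Verified.

```python
def classifier_cost_range(
    n_instances: int, low_usd: float = 0.005, high_usd: float = 0.02
) -> tuple[float, float]:
    """Scale the per-classification cost estimate to a whole run."""
    return n_instances * low_usd, n_instances * high_usd


if __name__ == "__main__":
    low, high = classifier_cost_range(500)  # SWE-bench Verified: 500 instances
    print(f"classifier overhead: ${low:.2f} to ${high:.2f}")
```

At that scale the classifier overhead works out to roughly $2.50 to $10 per full run, which then has to be weighed against whatever the routed tiers save.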

How to test locally

```bash
# 1. Fill in api_key / base_url in the sample config
cp benchmarks/utils/sample_configs/intelligent_router_3tier.example.json /tmp/router.json
$EDITOR /tmp/router.json

# 2. Dry-run a single SWE-bench instance
uv run python -m benchmarks.swebench.run_infer \
    --llm-config-path /tmp/router.json \
    --n-limit 1 \
    --dataset princeton-nlp/SWE-bench_Verified

# 3. Look for the per-instance routing log line:
# "intelligent-routing instance=... category=Issue Resolution (other) model=minimax-m2.7 vision_fallback=False raw=..."
```
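To see how instances were distributed across tiers after a run, the routing lines can be tallied. The sketch below assumes the log line keeps the `key=value` layout quoted in step 3; the regex, `tally_routes`, and the sample instance IDs are illustrative and may need adjusting to the real log output.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: fields appear in this order, as in the quoted example line.
ROUTING_LINE = re.compile(
    r"intelligent-routing instance=(?P<instance>\S+) "
    r"category=(?P<category>.+?) "
    r"model=(?P<model>\S+) "
    r"vision_fallback=(?P<vision_fallback>True|False)"
)


def tally_routes(lines: Iterable[str]) -> Counter:
    """Count (category, model) pairs across routing log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = ROUTING_LINE.search(line)
        if match:
            counts[(match["category"], match["model"])] += 1
    return counts


if __name__ == "__main__":
    sample = [  # made-up lines in the quoted format
        "intelligent-routing instance=example-1 category=Issue Resolution (other) "
        "model=minimax-m2.7 vision_fallback=False raw=Issue Resolution (other)",
        "intelligent-routing instance=example-2 category=Frontend "
        "model=kimi-k2.6 vision_fallback=False raw=Frontend",
        "some unrelated log line",
    ]
    for key, count in tally_routes(sample).items():
        print(key, count)
```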

For the 5×4 comparison run described in the design discussion, run each of the five benchmarks four times (one per model_id: minimax-m2.7, kimi-k2.6, gpt-5.5, plus a config pointing at this router) on the same --select instance ID list. The interesting metric is cost-per-resolved-instance.
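The comparison metric named above is total spend divided by the number of resolved instances. A minimal sketch, assuming each run is summarized by its total cost and resolved count; `RunResult` and `cost_per_resolved` are hypothetical names, and the figures in the usage example are placeholders, not results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    config: str            # e.g. "gpt-5.5" or "router"
    total_cost_usd: float  # assumption: includes classifier calls for router runs
    resolved: int          # number of instances resolved


def cost_per_resolved(run: RunResult) -> float:
    """Dollars spent per resolved instance; infinite if nothing was resolved."""
    if run.resolved == 0:
        return float("inf")
    return run.total_cost_usd / run.resolved


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    runs = [
        RunResult("config-a", total_cost_usd=100.0, resolved=40),
        RunResult("config-b", total_cost_usd=60.0, resolved=30),
    ]
    for run in sorted(runs, key=cost_per_resolved):
        print(f"{run.config}: ${cost_per_resolved(run):.2f} per resolved instance")
```

For the router configuration, counting the classifier spend in the total keeps the comparison against single-model runs fair.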

Verification

- `uv run ruff format` / `uv run ruff check`: clean
- `uv run pyright` (strict, the modified + new files): 0 errors, 0 warnings, 0 informations
- `uv run pytest tests/`: 528 existing tests pass + 33 new routing tests pass

This PR was prepared by an AI agent (OpenHands) on behalf of @rajshah4.


@rajshah4

…nt benchmarks

Adds a benchmark-agnostic 'intelligent-router-v0' config shape that classifies each evaluation instance via a single LLM call and routes the agent conversation to the matching tier model. The router is opt-in: passing a plain LLM config to --llm-config-path preserves today's behavior byte-for-byte.

What's new
----------

* New module benchmarks/utils/intelligent_routing.py:
  - Frozen iter5 classifier prompt (5 categories).
  - RouterSpec pydantic model, classify_and_route(), parse_classifier_output().
  - Per-benchmark task-text extractors for swebench, swebenchmultimodal, swtbench, gaia, commit0.
  - Vision-fallback safety: image-bearing instances classified into a text-only tier are auto-rerouted to a vision-capable fallback.
* benchmarks/utils/llm_config.py: load_llm_config still returns an LLM for plain configs; for router configs it returns the classifier LLM so ACP agents and condensers continue to work. New maybe_load_router_spec() returns the parsed RouterSpec or None.
* benchmarks/utils/models.py: EvalMetadata gains an optional 'routing' field. Existing fields and defaults are unchanged.
* benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py: before the agent is constructed, each benchmark calls classify_and_route with its own benchmark name and uses the chosen LLM for both the agent and the condenser. Per-instance routing decisions are logged.
* benchmarks/utils/sample_configs/intelligent_router_3tier.example.json: reference config with Frontend -> kimi-k2.6, Issue Resolution (other) -> minimax-m2.7, all other categories -> gpt-5.5, classifier = minimax-m2.7.

Tests
-----

* tests/test_intelligent_routing.py: 33 tests covering classifier output parsing (including chatty/lowercase/bare-keyword variants), router config loading and validation errors, per-benchmark task-text extraction, end-to-end classify_and_route dispatch with a stub classifier, classifier exception handling, and the vision-capable fallback path.
* Existing 528 tests still pass.

Design notes
------------

* All routing knowledge lives in benchmarks/utils/intelligent_routing.py; per-benchmark run_infer.py changes are ~25 lines and identical except for the benchmark name string passed to classify_and_route.
* No SDK changes required: the router config is loaded and applied entirely on the benchmarks side.
* Out of scope for this PR: SDK-side RouterLLM polymorphism, the matching software-agent-sdk MODELS entry, and preflight tier checks.

This change was prepared by an AI agent (OpenHands) on behalf of @rajshah4.

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini wants to merge 1 commit into main from feat/intelligent-routing-3tier
