feat: per-instance intelligent model routing across all 5 default-agent benchmarks #742

Draft
juanmichelini wants to merge 1 commit into main from feat/intelligent-routing-3tier
Summary

Adds opt-in per-instance intelligent routing to all five default-agent benchmarks (swebench, swebenchmultimodal, swtbench, gaia, commit0).

When the user passes an intelligent-router-v0-shaped JSON to --llm-config-path, each instance is classified once via a small classifier LLM and the agent conversation is routed to the matching tier model. A plain LLM config preserves today's behavior byte-for-byte — this is strictly additive.

The default routing table — based on the iter5 classifier prompt and a 3-tier mapping — is:

| Classifier category | Tier model |
| --- | --- |
| Frontend | kimi-k2.6 |
| Issue Resolution (other) | minimax-m2.7 |
| Greenfield / Testing / Information Gathering | gpt-5.5 |

Classifier = minimax-m2.7 (one call per instance, roughly $0.005–$0.02 on a SWE-bench prompt).
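As a rough illustration of the 3-tier mapping above, the five classifier categories can be held in a plain lookup table. This is a minimal sketch, not the code in intelligent_routing.py: the names `DEFAULT_TIERS` and `tier_for` are hypothetical, and in the PR the table is read from the router config JSON rather than hard-coded.

```python
# Illustrative only: DEFAULT_TIERS and tier_for are hypothetical names.
# The real mapping is loaded from the router config file.
DEFAULT_TIERS: dict[str, str] = {
    "Frontend": "kimi-k2.6",
    "Issue Resolution (other)": "minimax-m2.7",
    "Greenfield": "gpt-5.5",
    "Testing": "gpt-5.5",
    "Information Gathering": "gpt-5.5",
}


def tier_for(category: str) -> str:
    """Return the tier model for an exact category name.

    Raises KeyError for an unknown category, so callers must decide
    explicitly what to do when classification fails.
    """
    return DEFAULT_TIERS[category]
```

Five categories collapse onto three distinct models, which is why the config is described as 3-tier.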

What's in this PR

| File | Purpose |
| --- | --- |
| benchmarks/utils/intelligent_routing.py (new, ~330 LOC) | iter5 classifier prompt, RouterSpec, classify_and_route, parse_classifier_output, per-benchmark task-text extractors, vision-fallback |
| benchmarks/utils/llm_config.py | load_llm_config now transparently accepts a router config (returns the classifier LLM as the "primary"); adds maybe_load_router_spec |
| benchmarks/utils/models.py | EvalMetadata gains optional routing: RouterSpec \| None |
| benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py | each gains the same ~25-line block before Agent(...); only the benchmark="..." string differs |
| benchmarks/utils/sample_configs/intelligent_router_3tier.example.json (new) | reference config; copy and fill in api_key / base_url to use locally |
| tests/test_intelligent_routing.py (new, 33 tests) | classifier-output parsing (incl. chatty / case-insensitive / bare-keyword), router config loading and validation, per-benchmark task extraction, end-to-end dispatch with a stub classifier, classifier-exception fallback, vision-fallback path |
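The llm_config.py behavior in the table (a router config yields the classifier LLM as the "primary", a plain config is untouched) might look roughly like the sketch below. The JSON shape is an assumption made for illustration: the `kind` discriminator key, the `classifier` key, and both function names are hypothetical. The authoritative shape is whatever intelligent_router_3tier.example.json defines.

```python
from typing import Any

# ASSUMPTION: router configs are marked with a top-level "kind" field.
# The real discriminator used by the PR may differ.
ROUTER_KIND = "intelligent-router-v0"


def looks_like_router_config(raw: dict[str, Any]) -> bool:
    """Return True if the parsed JSON appears to be a router config."""
    return raw.get("kind") == ROUTER_KIND


def primary_llm_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Return the config dict that a caller needing 'just an LLM' should use.

    Plain configs are returned unchanged; router configs hand back the
    classifier sub-config (hypothetical "classifier" key).
    """
    if looks_like_router_config(raw):
        return raw["classifier"]
    return raw
```

Returning the plain config object unchanged is what keeps the non-router path identical to the previous behavior.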

Design choices

  • All routing knowledge lives in intelligent_routing.py. Per-benchmark run_infer.py edits are tiny and uniform — only the benchmark="..." string differs across the five files. Adding more routing strategies later only touches that one module.
  • load_llm_config returns the classifier LLM when given a router config, so ACP agents, condensers, and any path that just needs an LLM continue to work. Routing is gated by metadata.routing is not None and agent_type == "default".
  • Per-instance virtual-key cost tracking still works: the helper passes the chosen LLM through build_eval_llm() exactly like the existing code path.
  • Vision-fallback safety: for swebenchmultimodal, image-bearing instances classified into a text-only tier (e.g. minimax-m2.7) are auto-rerouted to a vision-capable tier (gpt-5.5 by default). Logged in the per-instance routing line.
  • Robust classifier-output parsing: the iter5 prompt asks for an exact category string, but real models hedge with markdown, prose, or drop "(other)". parse_classifier_output handles those.
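Two of the design choices above, tolerant parsing and the vision fallback, can be sketched in a few lines. This is a hedged approximation, not the PR's parse_classifier_output: the function names, the earliest-match heuristic, and the `TEXT_ONLY_MODELS` set are all assumptions chosen for illustration.

```python
import re
from typing import Optional

CATEGORIES = (
    "Frontend",
    "Issue Resolution (other)",
    "Greenfield",
    "Testing",
    "Information Gathering",
)

# ASSUMPTION: which tiers lack vision support; only minimax-m2.7 is
# named as text-only in the PR description.
TEXT_ONLY_MODELS = {"minimax-m2.7"}


def parse_category(raw: str) -> Optional[str]:
    """Map chatty classifier output to a canonical category, or None."""
    # Drop common markdown decoration and compare case-insensitively.
    text = re.sub(r"[*_`#>]", "", raw).lower()
    # Prefer the category whose full name appears earliest in the reply.
    hits = [(text.find(c.lower()), c) for c in CATEGORIES if c.lower() in text]
    if hits:
        return min(hits)[1]
    # Bare keyword: the model dropped the "(other)" suffix.
    if "issue resolution" in text:
        return "Issue Resolution (other)"
    return None


def apply_vision_fallback(
    model: str, has_images: bool, fallback: str = "gpt-5.5"
) -> tuple[str, bool]:
    """Return (model_to_use, fallback_applied) for one instance."""
    if has_images and model in TEXT_ONLY_MODELS:
        return fallback, True
    return model, False
```

Returning None for unparseable output, instead of guessing a category, leaves the fallback policy to the caller.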

Honest caveats

  • commit0 and gaia will mostly degenerate to gpt-5.5. The iter5 prompt was tuned on a SWE-bench-style 200-task sample; commit0 tasks classify as Greenfield, GAIA tasks as Information Gathering. Both route to gpt-5.5. On those benchmarks the router likely won't beat just running gpt-5.5 directly — but the data will tell you that, and the next iteration can short-circuit per benchmark.
  • swtbench is similarly lopsided: it is dominated by Issue Resolution (other), and therefore routes mostly to minimax-m2.7, because the classifier reads the issue body, not the "write tests" instruction.
  • Classifier cost is paid in minimax tokens per instance. Not free; not bad. Expect ~$0.005–$0.02 per classification depending on prompt length.
  • The matching SDK + workflow changes are out of scope for this PR. This PR makes the receiving end ready. To dispatch a router run through run-eval.yml, the matching software-agent-sdk PR will need to add a router-classified-3tier entry to resolve_model_config.MODELS, teach preflight to recurse into the tier sub-models, and patch evaluation/eval-job/scripts/build_matrix.py to accept the router/... slug. Until then, the router is usable by passing the config file directly via --llm-config-path to a benchmark's run_infer.py.

How to test locally

# 1. Fill in api_key / base_url in the sample config
cp benchmarks/utils/sample_configs/intelligent_router_3tier.example.json /tmp/router.json
$EDITOR /tmp/router.json

# 2. Dry-run a single SWE-bench instance
uv run python -m benchmarks.swebench.run_infer \
    --llm-config-path /tmp/router.json \
    --n-limit 1 \
    --dataset princeton-nlp/SWE-bench_Verified

# 3. Look for the per-instance routing log line:
# "intelligent-routing instance=... category=Issue Resolution (other) model=minimax-m2.7 vision_fallback=False raw=..."

For the 5×4 comparison run described in the design discussion, run each of the five benchmarks four times (one per model_id: minimax-m2.7, kimi-k2.6, gpt-5.5, plus a config pointing at this router) on the same --select instance ID list. The interesting metric is cost-per-resolved-instance.
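The cost-per-resolved-instance metric mentioned above is simple arithmetic. A minimal sketch follows, assuming each run reports a total dollar cost and a count of resolved instances. The function name and the choice to fold classifier spend into the router run's total are assumptions, not something the PR specifies.

```python
def cost_per_resolved(
    agent_cost_usd: float, resolved: int, classifier_cost_usd: float = 0.0
) -> float:
    """Total spend divided by the number of resolved instances.

    classifier_cost_usd should be non-zero only for the router run, so the
    router is charged for its extra classification calls. Returns infinity
    when nothing was resolved, which keeps comparisons well-defined.
    """
    if resolved <= 0:
        return float("inf")
    return (agent_cost_usd + classifier_cost_usd) / resolved
```

Because all four runs of a benchmark share the same --select instance list, the resulting values can be compared directly per benchmark.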

Verification

  • uv run ruff format / uv run ruff check: clean
  • uv run pyright (strict, the modified + new files): 0 errors, 0 warnings, 0 informations
  • uv run pytest tests/: 528 existing tests pass + 33 new routing tests pass

This PR was prepared by an AI agent (OpenHands) on behalf of @rajshah4.

…nt benchmarks

Adds a benchmark-agnostic 'intelligent-router-v0' config shape that classifies
each evaluation instance via a single LLM call and routes the agent
conversation to the matching tier model. The router is opt-in: passing a
plain LLM config to --llm-config-path preserves today's behavior byte-for-byte.

What's new
----------
* New module benchmarks/utils/intelligent_routing.py:
  - Frozen iter5 classifier prompt (5 categories).
  - RouterSpec pydantic model, classify_and_route(), parse_classifier_output().
  - Per-benchmark task-text extractors for swebench, swebenchmultimodal,
    swtbench, gaia, commit0.
  - Vision-fallback safety: image-bearing instances classified into a
    text-only tier are auto-rerouted to a vision-capable fallback.
* benchmarks/utils/llm_config.py: load_llm_config still returns an LLM for
  plain configs; for router configs it returns the classifier LLM so ACP
  agents and condensers continue to work. New maybe_load_router_spec()
  returns the parsed RouterSpec or None.
* benchmarks/utils/models.py: EvalMetadata gains an optional 'routing'
  field. Existing fields and defaults are unchanged.
* benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py:
  before the agent is constructed, each benchmark calls classify_and_route
  with its own benchmark name and uses the chosen LLM for both the agent
  and the condenser. Per-instance routing decisions are logged.
* benchmarks/utils/sample_configs/intelligent_router_3tier.example.json:
  reference config — Frontend -> kimi-k2.6, Issue Resolution (other) ->
  minimax-m2.7, all other categories -> gpt-5.5, classifier = minimax-m2.7.

Tests
-----
* tests/test_intelligent_routing.py: 33 tests covering classifier output
  parsing (including chatty/lowercase/bare-keyword variants), router config
  loading and validation errors, per-benchmark task-text extraction,
  end-to-end classify_and_route dispatch with a stub classifier, classifier
  exception handling, and the vision-capable fallback path.
* Existing 528 tests still pass.

Design notes
------------
* All routing knowledge lives in benchmarks/utils/intelligent_routing.py;
  per-benchmark run_infer.py changes are ~25 lines and identical except for
  the benchmark name string passed to classify_and_route.
* No SDK changes required — the router config is loaded and applied entirely
  on the benchmarks side.
* Out of scope for this PR: SDK-side RouterLLM polymorphism, the matching
  software-agent-sdk MODELS entry, and preflight tier checks.

This change was prepared by an AI agent (OpenHands) on behalf of @rajshah4.

Co-authored-by: openhands <openhands@all-hands.dev>