feat: per-instance intelligent model routing across all 5 default-agent benchmarks #742
Draft
juanmichelini wants to merge 1 commit into
…nt benchmarks
Adds a benchmark-agnostic 'intelligent-router-v0' config shape that classifies
each evaluation instance via a single LLM call and routes the agent
conversation to the matching tier model. The router is opt-in: passing a
plain LLM config to --llm-config-path preserves today's behavior byte-for-byte.
What's new
----------
* New module benchmarks/utils/intelligent_routing.py:
- Frozen iter5 classifier prompt (5 categories).
- RouterSpec pydantic model, classify_and_route(), parse_classifier_output().
- Per-benchmark task-text extractors for swebench, swebenchmultimodal,
swtbench, gaia, commit0.
- Vision-fallback safety: image-bearing instances classified into a
text-only tier are auto-rerouted to a vision-capable fallback.
* benchmarks/utils/llm_config.py: load_llm_config still returns an LLM for
plain configs; for router configs it returns the classifier LLM so ACP
agents and condensers continue to work. New maybe_load_router_spec()
returns the parsed RouterSpec or None.
* benchmarks/utils/models.py: EvalMetadata gains an optional 'routing'
field. Existing fields and defaults are unchanged.
* benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py:
before the agent is constructed, each benchmark calls classify_and_route
with its own benchmark name and uses the chosen LLM for both the agent
and the condenser. Per-instance routing decisions are logged.
* benchmarks/utils/sample_configs/intelligent_router_3tier.example.json:
reference config — Frontend -> kimi-k2.6, Issue Resolution (other) ->
minimax-m2.7, all other categories -> gpt-5.5, classifier = minimax-m2.7.
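The tier lookup and vision-fallback behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in intelligent_routing.py: the Tier dataclass, the route function, and the supports_vision flag are hypothetical names introduced here. The model names, and the treatment of minimax-m2.7 as text-only and gpt-5.5 as the vision-capable fallback, are taken from this PR's description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical stand-in for one tier entry of a router config."""

    model: str
    supports_vision: bool


def route(category, has_images, tiers, default, vision_fallback):
    """Pick a tier for an already-classified instance.

    Returns (tier, rerouted). rerouted is True when an image-bearing
    instance was moved off a text-only tier.
    """
    tier = tiers.get(category, default)
    if has_images and not tier.supports_vision:
        return vision_fallback, True
    return tier, False


# Flags follow the PR text: minimax-m2.7 is given as an example of a
# text-only tier and gpt-5.5 as the default vision-capable fallback.
minimax = Tier("minimax-m2.7", supports_vision=False)
gpt = Tier("gpt-5.5", supports_vision=True)
tiers = {"Issue Resolution (other)": minimax}

text_only = route("Issue Resolution (other)", False, tiers, gpt, gpt)
with_image = route("Issue Resolution (other)", True, tiers, gpt, gpt)
print(text_only[0].model, text_only[1])
print(with_image[0].model, with_image[1])
```

Returning the reroute flag next to the tier keeps the decision easy to log per instance, which is what the PR says it does.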
Tests
-----
* tests/test_intelligent_routing.py: 33 tests covering classifier output
parsing (including chatty/lowercase/bare-keyword variants), router config
loading and validation errors, per-benchmark task-text extraction,
end-to-end classify_and_route dispatch with a stub classifier, classifier
exception handling, and the vision-capable fallback path.
* Existing 528 tests still pass.
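To make the "chatty/lowercase/bare-keyword" parsing cases concrete, here is a hedged sketch of one tolerant matching strategy. It is an assumption about how such parsing could work, not the actual parse_classifier_output. The function name parse_category is hypothetical, and the category list is incomplete: only the four category names that appear in this PR text are included, while the iter5 prompt has five.

```python
def parse_category(raw, categories):
    """Return the category named in a classifier reply, or None.

    Matching is case-insensitive and tolerates surrounding chatter.
    Longer names are tried first, so a category whose name contains
    another category's name is preferred.
    """
    text = raw.lower()
    for name in sorted(categories, key=len, reverse=True):
        if name.lower() in text:
            return name
    return None


# Only the category names mentioned in this PR; the real list has five.
CATEGORIES = [
    "Frontend",
    "Greenfield",
    "Information Gathering",
    "Issue Resolution (other)",
]

chatty = parse_category("Sure! I'd classify this as FRONTEND work.", CATEGORIES)
bare = parse_category("greenfield", CATEGORIES)
unknown = parse_category("no idea", CATEGORIES)
print(chatty, bare, unknown)
```

Returning None for an unrecognised reply leaves the fallback choice to the caller; how the real module handles that case is not stated here.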
Design notes
------------
* All routing knowledge lives in benchmarks/utils/intelligent_routing.py;
per-benchmark run_infer.py changes are ~25 lines and identical except for
the benchmark name string passed to classify_and_route.
* No SDK changes required — the router config is loaded and applied entirely
on the benchmarks side.
* Out of scope for this PR: SDK-side RouterLLM polymorphism, the matching
software-agent-sdk MODELS entry, and preflight tier checks.
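The uniform call-site pattern can be illustrated with a small sketch. It uses the gating condition stated in the PR description (metadata.routing is not None and agent_type == "default"); everything else is hypothetical: the Metadata stand-in, pick_llm, the dict shape of the router spec, and the stub classifier, which reads a precomputed category instead of making an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for EvalMetadata: an LLM plus optional routing."""

    llm: str
    routing: Optional[dict] = None


def pick_llm(metadata, agent_type, instance, benchmark, classify: Callable):
    """Return the LLM to use for one instance.

    Routing applies only when a router spec is present and the default
    agent is in use; otherwise the configured LLM is returned unchanged.
    """
    if metadata.routing is None or agent_type != "default":
        return metadata.llm
    return classify(instance, metadata.routing, benchmark)


def stub_classify(instance, routing, benchmark):
    # Stub only: a real classifier would make one LLM call per instance.
    return routing["tiers"].get(instance["category"], routing["default"])


plain = Metadata(llm="gpt-5.5")
routed = Metadata(
    llm="minimax-m2.7",  # classifier LLM doubles as the "primary" LLM
    routing={"tiers": {"Frontend": "kimi-k2.6"}, "default": "gpt-5.5"},
)
inst = {"category": "Frontend"}

print(pick_llm(plain, "default", inst, "swebench", stub_classify))
print(pick_llm(routed, "default", inst, "swebench", stub_classify))
print(pick_llm(routed, "acp", inst, "swebench", stub_classify))
```

With a plain config the function is a pass-through, which is the sense in which the router is opt-in.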
This change was prepared by an AI agent (OpenHands) on behalf of @rajshah4.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
-------
Adds opt-in per-instance intelligent routing to all five default-agent benchmarks (swebench, swebenchmultimodal, swtbench, gaia, commit0).

When the user passes an intelligent-router-v0-shaped JSON to --llm-config-path, each instance is classified once via a small classifier LLM and the agent conversation is routed to the matching tier model. A plain LLM config preserves today's behavior byte-for-byte; this is strictly additive.

The default routing table, based on the iter5 classifier prompt and a 3-tier mapping, is:

* Frontend -> kimi-k2.6
* Issue Resolution (other) -> minimax-m2.7
* All other categories -> gpt-5.5

Classifier = minimax-m2.7 (one call per instance, ~5–20 cents on a SWE-bench prompt).

What's in this PR
-----------------
* benchmarks/utils/intelligent_routing.py (new, ~330 LOC): RouterSpec, classify_and_route, parse_classifier_output, per-benchmark task-text extractors, vision fallback.
* benchmarks/utils/llm_config.py: load_llm_config now transparently accepts a router config (returns the classifier LLM as the "primary"); adds maybe_load_router_spec.
* benchmarks/utils/models.py: EvalMetadata gains optional routing: RouterSpec | None.
* benchmarks/{swebench,swebenchmultimodal,swtbench,gaia,commit0}/run_infer.py: calls classify_and_route before Agent(...) is constructed; only the benchmark="..." string differs.
* benchmarks/utils/sample_configs/intelligent_router_3tier.example.json (new): reference config; supply api_key/base_url to use locally.
* tests/test_intelligent_routing.py (new, 33 tests).

Design choices
--------------
* All routing knowledge lives in intelligent_routing.py. Per-benchmark run_infer.py edits are tiny and uniform; only the benchmark="..." string differs across the five files. Adding more routing strategies later only touches that one module.
* load_llm_config returns the classifier LLM when given a router config, so ACP agents, condensers, and any path that just needs an LLM continue to work. Routing is gated by metadata.routing is not None and agent_type == "default".
* … build_eval_llm() exactly like the existing code path.
* On swebenchmultimodal, image-bearing instances classified into a text-only tier (e.g. minimax-m2.7) are auto-rerouted to a vision-capable tier (gpt-5.5 by default). The reroute is logged in the per-instance routing line.
* Classifier replies can be chatty, lowercase, or a bare keyword; parse_classifier_output handles those.

Honest caveats
--------------
* … Greenfield, GAIA tasks as Information Gathering. Both route to gpt-5.5. On those benchmarks the router likely won't beat just running gpt-5.5 directly, but the data will tell you that, and the next iteration can short-circuit per benchmark.
* … Issue Resolution (other) because the classifier reads the issue body, not the "write tests" instruction.
* … run-eval.yml, the matching software-agent-sdk PR will need to add a router-classified-3tier entry to resolve_model_config.MODELS, teach preflight to recurse into the tier sub-models, and patch evaluation/eval-job/scripts/build_matrix.py to accept the router/... slug. Until then, the router is usable by passing the config file directly via --llm-config-path to a benchmark's run_infer.py.

How to test locally
-------------------
For the 5×4 comparison run described in the design discussion, run each of the five benchmarks four times (one per model_id: minimax-m2.7, kimi-k2.6, gpt-5.5, plus a config pointing at this router) on the same --select instance ID list. The interesting metric is cost-per-resolved-instance.

Verification
------------
* uv run ruff format / uv run ruff check: clean
* uv run pyright (strict, the modified + new files): 0 errors, 0 warnings, 0 informations
* uv run pytest tests/: 528 existing tests pass + 33 new routing tests pass

This PR was prepared by an AI agent (OpenHands) on behalf of @rajshah4.