test(eval): cover scorecard, audit, runner, and claude judge #1243
kovtcharov-amd wants to merge 4 commits into
129 tests with zero blocking issues: the eval toolchain's most important modules now have real coverage. One timeout test doesn't actually exercise its stated formula; one dead
Summary
Four previously untested
Issues Found
🟢 Minor —
Good addition: the eval toolchain (
Issues
🟢 Minor —
The mypy lint failure (
itomek left a comment
Good coverage of four previously-untested eval modules, fully isolated (mocked anthropic, monkeypatched module attrs, no real I/O). Head already fixed the earlier nits (the scaling test now asserts == 820, import yaml is module-level, the constants test writes to the real _chat_helpers.py path). Two test-fidelity items inline, both one-line fixes: test_blocked_scenarios_on_low_msg_chars passes via the wrong trigger (the not-tool-results branch), and test_base_timeout_when_no_turns_or_docs has a vacuous >= assertion. Neither blocks merge; #1 is the one giving false confidence in a real code path. Approving.
Generated by Claude Code
test_blocked_scenarios_on_low_msg_chars writes only _MAX_MSG_CHARS = 500 with no agent_steps/role markers, so audit_tool_results_in_history() returns False and the "not tool_results_in_history" trigger fires cross_turn_file_recall — meaning the test passes even if the max_msg_chars path were deleted. Add a tool-result pattern to the fixture so only the msg-chars path can satisfy the assertion.
Generated by Claude Code
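The masking effect described for test_blocked_scenarios_on_low_msg_chars can be reproduced with a small stand-in. This is only a sketch, not the real audit code: `blocked_scenarios`, its parameters, and the 1000-character threshold are hypothetical, chosen to show how two independent triggers for the same scenario let a test pass through the wrong branch.

```python
def blocked_scenarios(max_msg_chars, tool_results_in_history, check_msg_chars=True):
    """Hypothetical stand-in for the audit recommendation logic.

    Two independent triggers block the same scenario. Passing
    check_msg_chars=False simulates deleting the max_msg_chars path.
    """
    blocked = set()
    if not tool_results_in_history:
        blocked.add("cross_turn_file_recall")
    # Hypothetical threshold; the real value is not given in the review.
    if check_msg_chars and max_msg_chars < 1000:
        blocked.add("cross_turn_file_recall")
    return blocked


# Weak fixture: no tool-result pattern, so the first trigger fires.
# The assertion holds even with the msg-chars path removed.
weak = blocked_scenarios(500, tool_results_in_history=False, check_msg_chars=False)
assert "cross_turn_file_recall" in weak

# Stronger fixture: tool results are present, so only the msg-chars path
# can block the scenario, and removing that path changes the result.
strong = blocked_scenarios(500, tool_results_in_history=True)
broken = blocked_scenarios(500, tool_results_in_history=True, check_msg_chars=False)
assert "cross_turn_file_recall" in strong
assert "cross_turn_file_recall" not in broken
```

With the stronger fixture, a regression in the msg-chars path flips the result, which is what the suggested fixture change is meant to achieve.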
test_base_timeout_when_no_turns_or_docs asserts result >= 240, but with base=900 the function returns 900 — the assertion would also pass if it returned 999999. Pin it to == 900 so it actually guards "base timeout wins when no turns/docs."
Generated by Claude Code
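The difference between the vacuous `>=` check and the pinned `==` check can be illustrated with a stand-in timeout function. The sketch below assumes only what the review states (the result is a max() of the base timeout and a scaled value); `effective_timeout`, `buggy_timeout`, and the coefficients are hypothetical and do not reflect the real `_compute_effective_timeout`.

```python
def effective_timeout(base_timeout, num_turns=0, num_docs=0):
    """Hypothetical stand-in: the larger of the base and a scaled estimate."""
    # Hypothetical coefficients; the real formula is not shown in the review.
    scaled = 240 + 60 * num_turns + 30 * num_docs
    return max(base_timeout, scaled)


def buggy_timeout(base_timeout, num_turns=0, num_docs=0):
    """A deliberately wrong implementation that ignores its inputs."""
    return 999999


def passes_loose(fn):
    # Mirrors the original assertion: result >= 240.
    return fn(900) >= 240


def passes_exact(fn):
    # Mirrors the suggested assertion: result == 900.
    return fn(900) == 900


# The loose check cannot tell the two implementations apart.
loose_results = (passes_loose(effective_timeout), passes_loose(buggy_timeout))

# The exact check accepts the correct one and rejects the buggy one.
exact_results = (passes_exact(effective_timeout), passes_exact(buggy_timeout))
```

Here the loose check accepts both functions, while the exact check accepts only the one where the base timeout wins.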
The eval toolchain at src/gaia/eval/ had only 3 test files covering analyze_failures, iterations, and MCP reliability; the four core modules (runner, scorecard, audit, claude judge) had zero unit tests. Adds 91 tests across 4 new files:
- test_scorecard.py: build_scorecard aggregation, write_summary_md, write_junit_xml, status counting, score capping, performance rollup
- test_audit.py: AST constant extraction, agent persistence detection, tool-results-in-history pattern matching, run_audit recommendations
- test_runner.py: validate_scenario schema checks, recompute_turn_score weighting, _aggregate_performance, _compute_effective_timeout, find_scenarios filtering, build_scenario_prompt assembly, compare_scorecards regression detection, AgentEvalRunner init
- test_claude_judge.py: ClaudeClient init validation, cost calculation, get_completion, get_completion_with_usage, count_tokens (all mocked)
Drop unused Path (test_audit), patch (test_claude_judge), textwrap/MagicMock/patch (test_runner), and a dead no-op fixture + unused helper function in test_claude_judge.
- Fix test_scales_with_turns_and_docs: use base_timeout=100 so the scaling formula (820) actually wins the max(), and assert == instead of >= to verify the exact computed value.
- Remove dead _write_helpers call in test_extracts_max_constants that wrote to the wrong path (returned p was unused).
- Move `import yaml` from inside _write_scenario to module-level.
…ernance
Suppress no-any-return in factory.py (dynamic provider instantiation), fix union-attr in openai/claude providers (stream response type), and cast CheckpointStatus in checkpoint_bridge.py.
bb2d370 to cbc5d75
The eval toolchain (src/gaia/eval/) had only 3 test files covering analyze_failures, iterations, and MCP scenario validation. The core modules (scorecard computation, audit trail, eval runner orchestration, and Claude judge client) had zero unit tests. Now they have 129 tests across 4 new files, all with mocked external dependencies (no real inference or network calls).
Test plan
`python -m pytest tests/unit/eval/ -xvs`: all 129 tests pass
Closes #1151