test(eval): cover scorecard, audit, runner, and claude judge by kovtcharov-amd · Pull Request #1243 · amd/gaia

kovtcharov-amd · 2026-05-29T03:13:19Z

The eval toolchain (src/gaia/eval/) had only 3 test files covering analyze_failures, iterations, and MCP scenario validation. The core modules — scorecard computation, audit trail, eval runner orchestration, and Claude judge client — had zero unit tests. Now they have 129 tests across 4 new files, all with mocked external dependencies (no real inference or network calls).

Test plan

python -m pytest tests/unit/eval/ -xvs — all 129 tests pass
No new dependencies required (uses existing pytest-mock)

Closes #1151

github-actions · 2026-05-29T03:18:36Z

129 tests with zero blocking issues — the eval toolchain's most important modules now have real coverage. One timeout test doesn't actually exercise its stated formula; one dead _write_helpers call in test_audit.py should be removed.

Summary

Four previously untested src/gaia/eval/ modules (scorecard, audit, runner, claude) are now covered by 129 unit tests with all external dependencies properly mocked. The tests correctly isolate from Lemonade, Anthropic, and the filesystem; edge cases (empty inputs, unknown statuses, missing dimensions, duplicate turns) are well covered. The PR delivers on its stated goal.

Issues Found

🟢 Minor — `test_scales_with_turns_and_docs` does not exercise the scaling formula (`tests/unit/eval/test_runner.py:726-737`)

_compute_effective_timeout(900, ...) uses max(base_timeout, _STARTUP_OVERHEAD_S + …). With base_timeout=900 and the computed value 240 + 2×90 + 2×200 = 820, max(900, 820) returns 900 — the scaling branch is never taken. The assertion result >= 820 passes trivially (900 ≥ 820) whether or not the formula is wired up.

    def test_scales_with_turns_and_docs(self):
        scenario = {
            "turns": [{"turn": 1}, {"turn": 2}],
            "setup": {"index_documents": [{"path": "a.pdf"}, {"path": "b.pdf"}]},
        }
        # Use a base_timeout smaller than the computed value so the scaling branch wins.
        result = _compute_effective_timeout(100, scenario)
        expected = 240 + 2 * 90 + 2 * 200  # startup + docs + turns
        assert result == expected

🟢 Nit — dead `_write_helpers` call in `test_extracts_max_constants` (`tests/unit/eval/test_audit.py:51-56`)

_write_helpers(tmp_path, src) writes to tmp_path/_chat_helpers.py. audit_chat_helpers() reads from tmp_path/src/gaia/ui/_chat_helpers.py. The returned p is unused; the call does nothing useful and may mislead future readers.

    def test_extracts_max_constants(self, tmp_path, monkeypatch):
        src = """\
            _MAX_HISTORY_PAIRS = 10
            _MAX_MSG_CHARS = 2000
            _OTHER = 42
        """
        monkeypatch.setattr("gaia.eval.audit.GAIA_ROOT", tmp_path)
        target = tmp_path / "src" / "gaia" / "ui"
        target.mkdir(parents=True)
        (target / "_chat_helpers.py").write_text(textwrap.dedent(src), encoding="utf-8")
        result = audit_chat_helpers()
        assert result["_MAX_HISTORY_PAIRS"] == 10
        assert result["_MAX_MSG_CHARS"] == 2000
        assert "_OTHER" not in result

🟢 Nit — `import yaml` inside `_write_scenario` helper (`tests/unit/eval/test_runner.py:752`)

yaml is a top-level dependency of gaia.eval.runner and reliably available, so the deferred import adds noise without benefit. Move it to the module-level imports with the rest.

Strengths

Isolation is clean: monkeypatch.setattr on module-level attributes (gaia.eval.claude.anthropic, gaia.eval.runner.SCENARIOS_DIR, gaia.eval.audit.GAIA_ROOT) correctly intercepts the production paths without any real I/O, network, or subprocess calls.
Edge cases are thorough: unknown statuses, None scores, empty result lists, missing dimensions, duplicate/non-sequential turns, and FileNotFoundError on missing baselines are all exercised.
Scorecard semantics are pinned: test_fail_scores_capped_at_5_99, test_avg_score_excludes_infra, and test_judged_pass_rate directly encode the business rules in scorecard.py and will catch regressions if those rules are accidentally changed.

Verdict

Approve with suggestions. The two nits are one-line fixes; none of the issues block merge. Once the timeout test is tightened it will actually guard the scaling formula rather than check a vacuous bound.

github-actions · 2026-05-29T06:47:06Z

Good addition — the eval toolchain (scorecard, audit, runner, claude) had zero unit coverage and this PR fills that gap with 129 well-isolated tests. One nit on test fidelity for test_base_timeout_when_no_turns_or_docs and a couple of spots where hardcoded literals will require two-file updates on future model bumps.

Issues

🟢 Minor — `test_defaults` pins model/budget/timeout as string literals (`test_runner.py:1000–1006`)

runner.model == "claude-sonnet-4-6" will fail with an unhelpful assertion error the first time DEFAULT_MODEL is bumped, requiring updates in both source and test. Import the constants instead:

    def test_defaults(self):
        from gaia.eval.runner import AgentEvalRunner, DEFAULT_BACKEND, DEFAULT_BUDGET, DEFAULT_MODEL, DEFAULT_TIMEOUT

        runner = AgentEvalRunner()
        assert runner.backend_url == DEFAULT_BACKEND
        assert runner.model == DEFAULT_MODEL
        assert runner.budget == DEFAULT_BUDGET
        assert runner.timeout == DEFAULT_TIMEOUT

Same issue in TestCalculateCost / TestGetCompletion / TestGetCompletionWithUsage / TestCountTokens fixtures in test_claude_judge.py — each constructs ClaudeClient(model="claude-sonnet-4-6"). These fixtures pin a specific model for pricing assertions, so hardcoding is intentional there; just worth noting that test_unknown_model_uses_default comment "Default pricing matches Sonnet: $3/$15" will silently become wrong if the default pricing table in claude.py changes.

🟢 Minor — overly loose timeout assertion (`test_runner.py:714–718`)

result = _compute_effective_timeout(
    900, {"turns": [], "setup": {"index_documents": []}}
)
assert result >= 240  # at most startup overhead

With base_timeout=900 and zero turns/docs, max(900, 240) == 900. The assertion >= 240 passes trivially — it would also pass if the function returned 999999. The test name says "base timeout when no turns or docs"; pin it to the actual expected value:

    def test_base_timeout_when_no_turns_or_docs(self):
        result = _compute_effective_timeout(
            900, {"turns": [], "setup": {"index_documents": []}}
        )
        assert result == 900  # base timeout wins when there are no turns or docs

🟢 Minor — `test_blocked_scenarios_on_low_msg_chars` passes for the wrong reason (`test_runner.py:149–158`)

The intent is to verify that _MAX_MSG_CHARS = 500 < 1000 triggers cross_turn_file_recall. But run_audit() also appends cross_turn_file_recall when not tool_results_in_history — and the test file (_MAX_MSG_CHARS = 500\n) has no agent_steps or role, so audit_tool_results_in_history() returns False and fires the second trigger. The max_msg_chars code path could be deleted entirely and the test would still pass.

Add the patterns that satisfy audit_tool_results_in_history so only the max_msg_chars path fires:

    def test_blocked_scenarios_on_low_msg_chars(self, tmp_path, monkeypatch):
        target = tmp_path / "src" / "gaia" / "ui"
        target.mkdir(parents=True)
        src = textwrap.dedent("""\
            _MAX_MSG_CHARS = 500
            agent_steps = agent.run()
            messages.append({"role": "tool", "content": agent_steps})
        """)
        (target / "_chat_helpers.py").write_text(src, encoding="utf-8")
        monkeypatch.setattr("gaia.eval.audit.GAIA_ROOT", tmp_path)

        result = run_audit()
        blocked = result["architecture_audit"]["blocked_scenarios"]
        assert any(b["scenario"] == "cross_turn_file_recall" for b in blocked)

Strengths

Zero external I/O. Every anthropic call, filesystem write, and subprocess is mocked via monkeypatch or SimpleNamespace — tests run offline and fast.
Class-per-function structure (TestAuditChatHelpers, TestRecomputeTurnScore, etc.) makes the intent of each test group obvious and keeps failure output scoped.
Scoring invariants are explicit. test_fail_scores_capped_at_5_99, test_avg_score_excludes_infra, test_judged_pass_rate each pin a non-obvious business rule that would otherwise only be discoverable from the scorecard source.

Verdict

Approve with suggestions. The Minor items above are all fixable in place — none block the merge, but the test_blocked_scenarios_on_low_msg_chars fix is worth landing because the test currently gives false confidence in that code path.

kovtcharov-amd · 2026-05-29T08:41:25Z

The mypy lint failure (Run Code Quality Checks) is a pre-existing issue in untouched files (shell/prompt.py, mcp/client/transports/base.py, perf_analysis.py, etc.). PR #1253 fixes all these type errors — once it merges and this PR rebases, the failure will resolve.

itomek

Good coverage of four previously-untested eval modules, fully isolated (mocked anthropic, monkeypatched module attrs, no real I/O). Head already fixed the earlier nits (the scaling test now asserts == 820, import yaml is module-level, the constants test writes to the real _chat_helpers.py path). Two test-fidelity items inline, both one-line fixes: test_blocked_scenarios_on_low_msg_chars passes via the wrong trigger (the not-tool-results branch), and test_base_timeout_when_no_turns_or_docs has a vacuous >= assertion. Neither blocks merge; #1 is the one giving false confidence in a real code path. Approving.

Generated by Claude Code

itomek · 2026-05-29T13:32:43Z

test_blocked_scenarios_on_low_msg_chars writes only _MAX_MSG_CHARS = 500 with no agent_steps/role markers, so audit_tool_results_in_history() returns False and the "not tool_results_in_history" trigger fires cross_turn_file_recall — meaning the test passes even if the max_msg_chars path were deleted. Add a tool-result pattern to the fixture so only the msg-chars path can satisfy the assertion.

Generated by Claude Code

itomek · 2026-05-29T13:32:46Z

test_base_timeout_when_no_turns_or_docs asserts result >= 240, but with base=900 the function returns 900 — the assertion would also pass if it returned 999999. Pin it to == 900 so it actually guards "base timeout wins when no turns/docs."

Generated by Claude Code

The eval toolchain at src/gaia/eval/ had only 3 test files covering analyze_failures, iterations, and MCP reliability — the four core modules (runner, scorecard, audit, claude judge) had zero unit tests. Adds 91 tests across 4 new files: - test_scorecard.py — build_scorecard aggregation, write_summary_md, write_junit_xml, status counting, score capping, performance rollup - test_audit.py — AST constant extraction, agent persistence detection, tool-results-in-history pattern matching, run_audit recommendations - test_runner.py — validate_scenario schema checks, recompute_turn_score weighting, _aggregate_performance, _compute_effective_timeout, find_scenarios filtering, build_scenario_prompt assembly, compare_scorecards regression detection, AgentEvalRunner init - test_claude_judge.py — ClaudeClient init validation, cost calculation, get_completion, get_completion_with_usage, count_tokens (all mocked)

Drop unused Path (test_audit), patch (test_claude_judge), textwrap/MagicMock/patch (test_runner), and a dead no-op fixture + unused helper function in test_claude_judge.

- Fix test_scales_with_turns_and_docs: use base_timeout=100 so the scaling formula (820) actually wins the max(), and assert == instead of >= to verify the exact computed value. - Remove dead _write_helpers call in test_extracts_max_constants that wrote to the wrong path (returned p was unused). - Move `import yaml` from inside _write_scenario to module-level.

…ernance Suppress no-any-return in factory.py (dynamic provider instantiation), fix union-attr in openai/claude providers (stream response type), and cast CheckpointStatus in checkpoint_bridge.py.

github-actions Bot added the tests Test changes label May 29, 2026

kovtcharov-amd mentioned this pull request May 29, 2026

test(llm): cover llama.cpp backend integration paths #1248

Open

2 tasks

kovtcharov-amd added the devops DevOps/infrastructure changes label May 29, 2026

kovtcharov-amd self-assigned this May 29, 2026

kovtcharov-amd added the p1 medium priority label May 29, 2026

itomek previously approved these changes May 29, 2026

View reviewed changes

kovtcharov-amd dismissed itomek’s stale review via 706a721 May 29, 2026 15:35

github-actions Bot added mcp MCP integration changes eval Evaluation framework changes performance Performance-critical changes agents llm LLM backend changes labels May 29, 2026

Ovtcharov added 4 commits May 29, 2026 09:27

test(eval): remove unused imports from eval test files

862d755

Drop unused Path (test_audit), patch (test_claude_judge), textwrap/MagicMock/patch (test_runner), and a dead no-op fixture + unused helper function in test_claude_judge.

fix(types): resolve additional mypy errors in factory, providers, gov…

cbc5d75

…ernance Suppress no-any-return in factory.py (dynamic provider instantiation), fix union-attr in openai/claude providers (stream response type), and cast CheckpointStatus in checkpoint_bridge.py.

kovtcharov-amd force-pushed the test/eval-toolchain-1151 branch from bb2d370 to cbc5d75 Compare May 29, 2026 16:27

kovtcharov-amd requested a review from itomek May 29, 2026 17:06

kovtcharov-amd mentioned this pull request May 29, 2026

fix(types): resolve remaining mypy errors missed by #1253 #1296

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(eval): cover scorecard, audit, runner, and claude judge#1243

test(eval): cover scorecard, audit, runner, and claude judge#1243
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/eval-toolchain-1151

kovtcharov-amd commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

itomek left a comment

Uh oh!

itomek May 29, 2026

Uh oh!

itomek May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kovtcharov-amd commented May 29, 2026

Test plan

Uh oh!

github-actions Bot commented May 29, 2026

Summary

Issues Found

🟢 Minor — test_scales_with_turns_and_docs does not exercise the scaling formula (tests/unit/eval/test_runner.py:726-737)

🟢 Nit — dead _write_helpers call in test_extracts_max_constants (tests/unit/eval/test_audit.py:51-56)

🟢 Nit — import yaml inside _write_scenario helper (tests/unit/eval/test_runner.py:752)

Strengths

Verdict

Uh oh!

github-actions Bot commented May 29, 2026

Issues

🟢 Minor — test_defaults pins model/budget/timeout as string literals (test_runner.py:1000–1006)

🟢 Minor — overly loose timeout assertion (test_runner.py:714–718)

🟢 Minor — test_blocked_scenarios_on_low_msg_chars passes for the wrong reason (test_runner.py:149–158)

Strengths

Verdict

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

itomek left a comment

Choose a reason for hiding this comment

Uh oh!

itomek May 29, 2026

Choose a reason for hiding this comment

Uh oh!

itomek May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🟢 Minor — `test_scales_with_turns_and_docs` does not exercise the scaling formula (`tests/unit/eval/test_runner.py:726-737`)

🟢 Nit — dead `_write_helpers` call in `test_extracts_max_constants` (`tests/unit/eval/test_audit.py:51-56`)

🟢 Nit — `import yaml` inside `_write_scenario` helper (`tests/unit/eval/test_runner.py:752`)

🟢 Minor — `test_defaults` pins model/budget/timeout as string literals (`test_runner.py:1000–1006`)

🟢 Minor — overly loose timeout assertion (`test_runner.py:714–718`)

🟢 Minor — `test_blocked_scenarios_on_low_msg_chars` passes for the wrong reason (`test_runner.py:149–158`)