Skip to content

test(eval): cover scorecard, audit, runner, and claude judge#1243

Open
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/eval-toolchain-1151
Open

test(eval): cover scorecard, audit, runner, and claude judge#1243
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/eval-toolchain-1151

Conversation

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

The eval toolchain (src/gaia/eval/) had only 3 test files covering analyze_failures, iterations, and MCP scenario validation. The core modules — scorecard computation, audit trail, eval runner orchestration, and Claude judge client — had zero unit tests. Now they have 129 tests across 4 new files, all with mocked external dependencies (no real inference or network calls).

Test plan

  • python -m pytest tests/unit/eval/ -xvs — all 129 tests pass
  • No new dependencies required (uses existing pytest-mock)

Closes #1151

@github-actions github-actions Bot added the tests Test changes label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

129 tests with zero blocking issues — the eval toolchain's most important modules now have real coverage. One timeout test doesn't actually exercise its stated formula; one dead _write_helpers call in test_audit.py should be removed.

Summary

Four previously untested src/gaia/eval/ modules (scorecard, audit, runner, claude) are now covered by 129 unit tests with all external dependencies properly mocked. The tests correctly isolate from Lemonade, Anthropic, and the filesystem; edge cases (empty inputs, unknown statuses, missing dimensions, duplicate turns) are well covered. The PR delivers on its stated goal.

Issues Found

🟢 Minor — test_scales_with_turns_and_docs does not exercise the scaling formula (tests/unit/eval/test_runner.py:726-737)

_compute_effective_timeout(900, ...) uses max(base_timeout, _STARTUP_OVERHEAD_S + …). With base_timeout=900 and the computed value 240 + 2×90 + 2×200 = 820, max(900, 820) returns 900 — the scaling branch is never taken. The assertion result >= 820 passes trivially (900 ≥ 820) whether or not the formula is wired up.

    def test_scales_with_turns_and_docs(self):
        scenario = {
            "turns": [{"turn": 1}, {"turn": 2}],
            "setup": {"index_documents": [{"path": "a.pdf"}, {"path": "b.pdf"}]},
        }
        # Use a base_timeout smaller than the computed value so the scaling branch wins.
        result = _compute_effective_timeout(100, scenario)
        expected = 240 + 2 * 90 + 2 * 200  # startup + docs + turns
        assert result == expected

🟢 Nit — dead _write_helpers call in test_extracts_max_constants (tests/unit/eval/test_audit.py:51-56)

_write_helpers(tmp_path, src) writes to tmp_path/_chat_helpers.py. audit_chat_helpers() reads from tmp_path/src/gaia/ui/_chat_helpers.py. The returned p is unused; the call does nothing useful and may mislead future readers.

    def test_extracts_max_constants(self, tmp_path, monkeypatch):
        src = """\
            _MAX_HISTORY_PAIRS = 10
            _MAX_MSG_CHARS = 2000
            _OTHER = 42
        """
        monkeypatch.setattr("gaia.eval.audit.GAIA_ROOT", tmp_path)
        target = tmp_path / "src" / "gaia" / "ui"
        target.mkdir(parents=True)
        (target / "_chat_helpers.py").write_text(textwrap.dedent(src), encoding="utf-8")
        result = audit_chat_helpers()
        assert result["_MAX_HISTORY_PAIRS"] == 10
        assert result["_MAX_MSG_CHARS"] == 2000
        assert "_OTHER" not in result

🟢 Nit — import yaml inside _write_scenario helper (tests/unit/eval/test_runner.py:752)

yaml is a top-level dependency of gaia.eval.runner and reliably available, so the deferred import adds noise without benefit. Move it to the module-level imports with the rest.

Strengths

  • Isolation is clean: monkeypatch.setattr on module-level attributes (gaia.eval.claude.anthropic, gaia.eval.runner.SCENARIOS_DIR, gaia.eval.audit.GAIA_ROOT) correctly intercepts the production paths without any real I/O, network, or subprocess calls.
  • Edge cases are thorough: unknown statuses, None scores, empty result lists, missing dimensions, duplicate/non-sequential turns, and FileNotFoundError on missing baselines are all exercised.
  • Scorecard semantics are pinned: test_fail_scores_capped_at_5_99, test_avg_score_excludes_infra, and test_judged_pass_rate directly encode the business rules in scorecard.py and will catch regressions if those rules are accidentally changed.

Verdict

Approve with suggestions. The two nits are one-line fixes; none of the issues block merge. Once the timeout test is tightened it will actually guard the scaling formula rather than check a vacuous bound.

@github-actions
Copy link
Copy Markdown
Contributor

Good addition — the eval toolchain (scorecard, audit, runner, claude) had zero unit coverage and this PR fills that gap with 129 well-isolated tests. One nit on test fidelity for test_base_timeout_when_no_turns_or_docs and a couple of spots where hardcoded literals will require two-file updates on future model bumps.


Issues

🟢 Minor — test_defaults pins model/budget/timeout as string literals (test_runner.py:1000–1006)

runner.model == "claude-sonnet-4-6" will fail with an unhelpful assertion error the first time DEFAULT_MODEL is bumped, requiring updates in both source and test. Import the constants instead:

    def test_defaults(self):
        from gaia.eval.runner import AgentEvalRunner, DEFAULT_BACKEND, DEFAULT_BUDGET, DEFAULT_MODEL, DEFAULT_TIMEOUT

        runner = AgentEvalRunner()
        assert runner.backend_url == DEFAULT_BACKEND
        assert runner.model == DEFAULT_MODEL
        assert runner.budget == DEFAULT_BUDGET
        assert runner.timeout == DEFAULT_TIMEOUT

Same issue in TestCalculateCost / TestGetCompletion / TestGetCompletionWithUsage / TestCountTokens fixtures in test_claude_judge.py — each constructs ClaudeClient(model="claude-sonnet-4-6"). These fixtures pin a specific model for pricing assertions, so hardcoding is intentional there; just worth noting that test_unknown_model_uses_default comment "Default pricing matches Sonnet: $3/$15" will silently become wrong if the default pricing table in claude.py changes.


🟢 Minor — overly loose timeout assertion (test_runner.py:714–718)

result = _compute_effective_timeout(
    900, {"turns": [], "setup": {"index_documents": []}}
)
assert result >= 240  # at most startup overhead

With base_timeout=900 and zero turns/docs, max(900, 240) == 900. The assertion >= 240 passes trivially — it would also pass if the function returned 999999. The test name says "base timeout when no turns or docs"; pin it to the actual expected value:

    def test_base_timeout_when_no_turns_or_docs(self):
        result = _compute_effective_timeout(
            900, {"turns": [], "setup": {"index_documents": []}}
        )
        assert result == 900  # base timeout wins when there are no turns or docs

🟢 Minor — test_blocked_scenarios_on_low_msg_chars passes for the wrong reason (test_runner.py:149–158)

The intent is to verify that _MAX_MSG_CHARS = 500 < 1000 triggers cross_turn_file_recall. But run_audit() also appends cross_turn_file_recall when not tool_results_in_history — and the test file (_MAX_MSG_CHARS = 500\n) has no agent_steps or role, so audit_tool_results_in_history() returns False and fires the second trigger. The max_msg_chars code path could be deleted entirely and the test would still pass.

Add the patterns that satisfy audit_tool_results_in_history so only the max_msg_chars path fires:

    def test_blocked_scenarios_on_low_msg_chars(self, tmp_path, monkeypatch):
        target = tmp_path / "src" / "gaia" / "ui"
        target.mkdir(parents=True)
        src = textwrap.dedent("""\
            _MAX_MSG_CHARS = 500
            agent_steps = agent.run()
            messages.append({"role": "tool", "content": agent_steps})
        """)
        (target / "_chat_helpers.py").write_text(src, encoding="utf-8")
        monkeypatch.setattr("gaia.eval.audit.GAIA_ROOT", tmp_path)

        result = run_audit()
        blocked = result["architecture_audit"]["blocked_scenarios"]
        assert any(b["scenario"] == "cross_turn_file_recall" for b in blocked)

Strengths

  • Zero external I/O. Every anthropic call, filesystem write, and subprocess is mocked via monkeypatch or SimpleNamespace — tests run offline and fast.
  • Class-per-function structure (TestAuditChatHelpers, TestRecomputeTurnScore, etc.) makes the intent of each test group obvious and keeps failure output scoped.
  • Scoring invariants are explicit. test_fail_scores_capped_at_5_99, test_avg_score_excludes_infra, test_judged_pass_rate each pin a non-obvious business rule that would otherwise only be discoverable from the scorecard source.

Verdict

Approve with suggestions. The Minor items above are all fixable in place — none block the merge, but the test_blocked_scenarios_on_low_msg_chars fix is worth landing because the test currently gives false confidence in that code path.

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator Author

The mypy lint failure (Run Code Quality Checks) is a pre-existing issue in untouched files (shell/prompt.py, mcp/client/transports/base.py, perf_analysis.py, etc.). PR #1253 fixes all these type errors — once it merges and this PR rebases, the failure will resolve.

@kovtcharov-amd kovtcharov-amd added the devops DevOps/infrastructure changes label May 29, 2026
@kovtcharov-amd kovtcharov-amd self-assigned this May 29, 2026
@kovtcharov-amd kovtcharov-amd added the p1 medium priority label May 29, 2026
itomek
itomek previously approved these changes May 29, 2026
Copy link
Copy Markdown
Collaborator

@itomek itomek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good coverage of four previously-untested eval modules, fully isolated (mocked anthropic, monkeypatched module attrs, no real I/O). Head already fixed the earlier nits (the scaling test now asserts == 820, import yaml is module-level, the constants test writes to the real _chat_helpers.py path). Two test-fidelity items inline, both one-line fixes: test_blocked_scenarios_on_low_msg_chars passes via the wrong trigger (the not-tool-results branch), and test_base_timeout_when_no_turns_or_docs has a vacuous >= assertion. Neither blocks merge; #1 is the one giving false confidence in a real code path. Approving.


Generated by Claude Code

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_blocked_scenarios_on_low_msg_chars writes only _MAX_MSG_CHARS = 500 with no agent_steps/role markers, so audit_tool_results_in_history() returns False and the "not tool_results_in_history" trigger fires cross_turn_file_recall — meaning the test passes even if the max_msg_chars path were deleted. Add a tool-result pattern to the fixture so only the msg-chars path can satisfy the assertion.


Generated by Claude Code

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_base_timeout_when_no_turns_or_docs asserts result >= 240, but with base=900 the function returns 900 — the assertion would also pass if it returned 999999. Pin it to == 900 so it actually guards "base timeout wins when no turns/docs."


Generated by Claude Code

@github-actions github-actions Bot added mcp MCP integration changes eval Evaluation framework changes performance Performance-critical changes agents llm LLM backend changes labels May 29, 2026
Ovtcharov added 4 commits May 29, 2026 09:27
The eval toolchain at src/gaia/eval/ had only 3 test files covering
analyze_failures, iterations, and MCP reliability — the four core
modules (runner, scorecard, audit, claude judge) had zero unit tests.

Adds 91 tests across 4 new files:
- test_scorecard.py — build_scorecard aggregation, write_summary_md,
  write_junit_xml, status counting, score capping, performance rollup
- test_audit.py — AST constant extraction, agent persistence detection,
  tool-results-in-history pattern matching, run_audit recommendations
- test_runner.py — validate_scenario schema checks, recompute_turn_score
  weighting, _aggregate_performance, _compute_effective_timeout,
  find_scenarios filtering, build_scenario_prompt assembly,
  compare_scorecards regression detection, AgentEvalRunner init
- test_claude_judge.py — ClaudeClient init validation, cost calculation,
  get_completion, get_completion_with_usage, count_tokens (all mocked)
Drop unused Path (test_audit), patch (test_claude_judge),
textwrap/MagicMock/patch (test_runner), and a dead no-op
fixture + unused helper function in test_claude_judge.
- Fix test_scales_with_turns_and_docs: use base_timeout=100 so the
  scaling formula (820) actually wins the max(), and assert == instead
  of >= to verify the exact computed value.
- Remove dead _write_helpers call in test_extracts_max_constants that
  wrote to the wrong path (returned p was unused).
- Move `import yaml` from inside _write_scenario to module-level.
…ernance

Suppress no-any-return in factory.py (dynamic provider instantiation),
fix union-attr in openai/claude providers (stream response type), and
cast CheckpointStatus in checkpoint_bridge.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents devops DevOps/infrastructure changes eval Evaluation framework changes llm LLM backend changes mcp MCP integration changes p1 medium priority performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tests: cover eval toolchain (runner, scorecard, audit, claude judge)

2 participants