test(llm): cover llama.cpp backend integration paths by kovtcharov-amd · Pull Request #1248 · amd/gaia

kovtcharov-amd · 2026-05-29T03:40:06Z

The llama.cpp-specific code paths in LemonadeClient (llamacpp_args construction, ctx_size validation, model-load HTTP payloads, error classification) had zero dedicated test coverage. The #1030 ctx-truncation incident showed what happens when these paths regress silently — unit tests pass but doc-Q&A breaks in production. 46 new mocked unit tests now pin the critical integration seams so a future change that breaks load_model payloads, ctx-size resolution, or the retryable-overflow logic will fail fast in CI.

Test plan

python -m pytest tests/unit/test_llamacpp_backend.py -xvs — 46 tests pass, no Lemonade Server required
No existing tests regress: python -m pytest tests/unit/ -x

Closes #1153

github-actions · 2026-05-29T03:44:38Z

Solid regression-prevention suite that earns its place in CI — 46 tests with no running server required, directly encoding the #1030 contract in executable form. Three nits worth tightening before merge.

Summary

The tests are correctly implemented, well-organized by concern, and mock at the right level. TestModelsRegistry is especially valuable: it will catch future drift in min_ctx_size values or incorrect tool_calling flags on embedding models before anything reaches production. The test_reload_when_loaded_at_undersized_ctx test is the canonical regression guard for #1030 — if anyone ever reverts the undersized-ctx reload logic, this test breaks immediately.

Issues Found

🟢 Minor — `LemonadeProvider` imported inside test methods (inconsistency)

LemonadeContextOverflowError, LemonadeError, and the rest of the error classes are imported at module level (lines 34–41), but LemonadeProvider is imported inside each test method body (lines 685, 716, 751, 776, 799). There's no deferred-import reason for this (the symbol is stable, not behind a try/except), so it just adds noise and inconsistency.

from gaia.llm.providers.lemonade import (
    LemonadeContextOverflowError,
    LemonadeError,
    LemonadeModelNotLoadedError,
    LemonadeNetworkError,
    LemonadeProvider,
    LemonadeUpstreamTimeoutError,
    _classify_lemonade_response,
)

Then remove the five from gaia.llm.providers.lemonade import LemonadeProvider lines inside the test methods.

🟢 Minor — `test_missing_recipe_options_triggers_reload` doesn't assert the reload arguments (`tests/unit/test_llamacpp_backend.py:313`)

Every other test in TestEnsureModelLoadedCtxResolution uses assert_called_once_with(...) to verify the exact ctx_size passed. This one uses bare assert_called_once(), leaving the ctx_size unverified. Given that the whole class is pinning the ctx_size resolution logic, the omission is inconsistent.

        mock_load.assert_called_once_with(
            "Qwen3-0.6B-GGUF", auto_download=True, prompt=False, ctx_size=4096
        )

🟢 Nit — Docstring says "n_ctx < 32K" but code threshold is 65536 (64K) — tests expose this

The new tests correctly use 65536 as the retryable threshold (see test_context_overflow_retryable_when_n_ctx_small and test_context_overflow_not_retryable_at_full_ctx). But src/gaia/llm/providers/lemonade.py:66 still says:

retryable = False  # default; set True dynamically when n_ctx < 32K

The actual guard in _classify_lemonade_response is 0 < n_ctx_reported < 65536. The comment is stale from before Gemma 4 E4B's ctx was bumped. Worth fixing in this PR since the tests are the source of truth here:

    retryable = False  # default; set True dynamically when n_ctx < 65536

Strengths

TestModelsRegistry is a CI gatekeeper: tests like test_all_entries_have_min_ctx_size and test_embedding_model_has_tool_calling_false will catch future registry corruption immediately. These are the kind of structural invariant tests that save debugging time months later.
@patch.object over string paths: consistent with the lemonade-client-patterns convention; avoids the deferred-import target mistake documented in the skill file.
test_reload_when_loaded_at_undersized_ctx names the exact failure mode from [Bug]: 1.4MB file indexed no response to queries in the UI #1030 in its docstring and encodes the contract: ctx=4096 → must reload at 65536. This will catch regressions automatically.

Verdict

Approve with suggestions — no blocking issues. The nit on test_missing_recipe_options_triggers_reload:313 is the one worth applying before merge since it's a gap in an otherwise thorough assertion suite; the other two are polish.

github-actions · 2026-05-29T04:11:29Z

46 new mocked unit tests covering the llama.cpp-specific paths that caused the #1030 production regression — a solid, focused addition. All 46 tests are structurally correct against the current source; the public behavior contracts (payload construction, ctx-size resolution, retryable logic) are pinned accurately. Two minor issues worth fixing before merge, one nit.

Issues

🟡 `TestEnsureModelLoadedCtxResolution` — 8 tests make implicit network calls (`test_llamacpp_backend.py:166–312`)

All 8 tests in TestEnsureModelLoadedCtxResolution patch get_status and load_model but not list_models(). _ensure_model_loaded unconditionally calls self.list_models() (wrapped in its own except Exception) before every /load call, so each test silently fires a connection attempt to localhost:13305. The inner try/except swallows the refusal, so the tests pass — but they're not fully isolated unit tests and will be measurably slower on machines where port 13305 is actually in use (the call can succeed or fail in unexpected ways).

Fix: add @patch.object(LemonadeClient, "list_models", return_value={"data": []}) to each test in the class, or factor it into a class-level autouse fixture:

class TestEnsureModelLoadedCtxResolution:
    """Verify _ensure_model_loaded resolves ctx_size correctly from the
    MODELS registry and falls back to DEFAULT_CONTEXT_SIZE for unknowns.
    """

    @pytest.fixture(autouse=True)
    def _stub_list_models(self):
        with patch.object(LemonadeClient, "list_models", return_value={"data": []}):
            yield

    @patch.object(LemonadeClient, "get_status")
    @patch.object(LemonadeClient, "load_model")
    def test_gemma_4_e4b_loads_at_65536(self, mock_load, mock_status):

Apply the fixture to all 8 tests in the class. The _stub_list_models autouse fixture means no individual test needs to change its own @patch decorators.

🟢 Redundant `patch.object(LemonadeClient, "init")` in 5 provider tests (`test_llamacpp_backend.py:692, 720, 757, 782, 806`)

All five LemonadeProvider tests create the provider via LemonadeProvider.__new__(LemonadeProvider), which bypasses __init__ entirely. The with patch.object(LemonadeClient, "__init__", return_value=None): context manager is therefore never triggered and misleads a reader into thinking there's a LemonadeClient() call being suppressed. Remove it in each test method; the provider._backend = MagicMock(spec=LemonadeClient) assignment already provides all the isolation needed.

Example for test_chat_sets_repeat_penalty_defaults:

    def test_chat_sets_repeat_penalty_defaults(self):
        """chat() should inject repeat_penalty and repeat_last_n defaults."""
        from gaia.llm.providers.lemonade import LemonadeProvider

        provider = LemonadeProvider.__new__(LemonadeProvider)
        provider._backend = MagicMock(spec=LemonadeClient)
        provider._model = "Gemma-4-E4B-it-GGUF"
        provider._system_prompt = None

Same change for test_chat_allows_override_of_repeat_penalty, test_context_overflow_raises_typed_error, test_model_not_loaded_raises_typed_error, and test_unexpected_response_shape_raises_generic.

🟢 `TestLaunchServerCtxSize` daemon-thread noise (`test_llamacpp_backend.py:318–350`)

Both tests use background="none", which spawns a print_output daemon thread reading stdout.readline() on the MagicMock Popen return. MagicMock.readline() returns a truthy MagicMock on every call, so the thread busy-loops until the test process exits — producing noisy debug log output in pytest -v runs. background="none" isn't what's being tested here (the assertion is only about the command array); switching to background="silent" removes the thread entirely and keeps the test focused.

    def test_ctx_size_appended_to_command(
        self, _sleep, mock_conn, mock_popen, mock_kill
    ):
        mock_conn.return_value = MagicMock()
        mock_popen.return_value = MagicMock(pid=12345)
        client = LemonadeClient(host="localhost", port=13305)

        client.launch_server(ctx_size=65536, background="silent")

Apply the same change to test_no_ctx_size_flag_when_none. You'll also need to patch builtins.open (or the specific open("lemonade.log", ...) call) since the silent path writes to a log file.

Strengths

The test names are dense with signal — test_reload_when_loaded_at_undersized_ctx and test_no_reload_when_loaded_above_expected_ctx capture a subtle >= vs > boundary that has already caused a production incident; any future regression on that comparison will fail immediately.
TestContextOverflowClassification covers 7 distinct scenarios (top-level type, nested llama-server error, message-string fallback, n_ctx=0 edge case) — this is exactly the depth needed for error-classification code whose bugs are invisible until RAG silently fails.
TestModelsRegistry locks in structural invariants (min_ctx_size > 0, embedding models with tool_calling=False, Gemma 4 E4B's load-bearing 65536 constant) so a misconfigured registry entry fails at CI time rather than in production.
@patch.object(LemonadeClient, "_send_request") — correct deferred-import patch target per the lemonade-client-patterns skill; avoids the "patch the wrong module" mistake.

Verdict

Approve with suggestions. No blocking correctness issues — all 46 tests are valid assertions against the current implementation. The 🟡 list_models isolation issue is worth fixing before merge (it makes the tests slower and slightly environment-sensitive), but it won't cause CI failures on a clean machine. The two 🟢 items are low-lift cleanup that'll pay off next time someone reads the test file.

github-actions · 2026-05-29T06:51:23Z

Solid follow-up to the #1030 regression — 46 targeted unit tests now pin the exact integration seams (payload construction, ctx-size resolution, error classification) that can silently break doc-Q&A in production. The docstring update in lemonade.py is also correct.

Summary

The test suite is well-structured and the mocking strategy is consistent throughout: @patch.object for instance methods, string paths for module-level names (gaia.llm.lemonade_client.OpenAI, subprocess.Popen). The _stub_list_models autouse fixture is the right approach for class-wide stubbing. The TestModelsRegistry invariant tests are particularly valuable — they'll catch any future registry edit that breaks the ctx-size assumptions the rest of the code relies on.

One minor docstring inaccuracy; no blocking issues.

Issues Found

🟢 Minor — misleading docstring in `test_context_overflow_by_message_string` (`tests/unit/test_llamacpp_backend.py:547`)

The docstring says "even without type field" but the payload does have a type field — it's just "unknown_error", not "exceed_context_size". The test exercises message-substring fallback when the type doesn't match, not when the type is absent.

        """Detection works via message substring even when type is not 'exceed_context_size'."""

Strengths

TestModelsRegistry encodes load-bearing constants as assertions. The test_gemma_4_e4b_ctx_is_65536 and test_qwen3_0_6b_is_smallest tests will catch any future registry edit that silently drops a min_ctx_size, which is exactly how the [Bug]: 1.4MB file indexed no response to queries in the UI #1030 class of bugs gets in.
test_reload_when_loaded_at_undersized_ctx reproduces the exact [Bug]: 1.4MB file indexed no response to queries in the UI #1030 failure mode (Gemma at 4096 after embedder warm-up) and asserts the correct reload with ctx_size=65536. Future regressions on this path will fail fast in CI.
test_full_payload_with_all_params uses exact dict equality (assert payload == {...}) rather than per-key assertions, so any unexpected extra keys in the /load payload will also fail the test.

Verdict

Approve. One tiny docstring nit; everything else is clean and directly addressing the stated coverage gap.

kovtcharov-amd · 2026-05-29T08:41:30Z

The mypy lint failure (Run Code Quality Checks) is the same pre-existing type error issue as PR #1243 — untouched files failing mypy. PR #1253 fixes all these type errors — once it merges and this PR rebases, the failure will resolve.

kovtcharov-amd · 2026-05-29T10:02:04Z

The Example Agents Integration Tests (stx) failure is not a test failure — all 7 tests passed (3 skipped). The job failed at the cleanup step Upload Lemonade logs, which is a CI infrastructure issue (log upload failing), not a code regression. The actual test results are clean.

itomek

High-value regression suite — the mocked tests pin exactly the seams that broke in #1030 (load_model payload, _ensure_model_loaded ctx resolution, undersized-ctx reload, retryable-overflow classification) with no Lemonade Server needed. Verified the autouse list-models stub (real isolation, no implicit localhost calls), the ctx_size=4096 assertion on the missing-recipe-options test, and background=silent on the launch_server tests. Approving. Tiny optional nit: test_context_overflow_by_message_string's docstring says "even without type field" but the payload sets type=unknown_error — it actually tests message-substring fallback when the type doesn't match.

Generated by Claude Code

LemonadeClient's llama.cpp-specific code paths (llamacpp_args, ctx_size parameter construction, model loading, context-size validation, error classification) had no dedicated test coverage. A regression in any of these paths could silently break model loading or produce confusing errors — the #1030 ctx-truncation incident showed what that looks like. 46 new unit tests covering: - load_model HTTP payload construction (llamacpp_args, ctx_size, save_options) - _ensure_model_loaded ctx_size resolution from MODELS registry vs fallback - Under-sized ctx reload detection (the #1030 follow-up regression path) - launch_server --ctx-size CLI flag forwarding - OpenAI-standard vs llama.cpp-native parameter separation (extra_body) - Context-overflow error classification with dynamic retryable logic - Nested llama-server error envelopes (details.response.error) - model_not_loaded and corrupt-download detection - MODELS registry structural invariants - LemonadeProvider repetition-penalty defaults and error raising All tests are fully mocked — no running Lemonade Server required. Closes #1153

- Move LemonadeProvider to module-level import (was repeated in 5 methods) - Add autouse fixture to stub list_models() in TestEnsureModelLoadedCtxResolution so tests don't make implicit network calls to localhost:13305 - Strengthen test_missing_recipe_options_triggers_reload to assert exact ctx_size=4096 argument (was bare assert_called_once) - Remove redundant patch.object(LemonadeClient, "__init__") from 5 provider tests — __new__ already bypasses __init__ - Switch TestLaunchServerCtxSize to background="silent" to avoid daemon- thread busy-loop noise in pytest output - Fix stale docstring in LemonadeContextOverflowError: "n_ctx < 32K" → 65536 to match the actual guard in _classify_lemonade_response

…ernance Cherry-picked from test/eval-toolchain-1151 to unblock CI.

github-actions Bot added the tests Test changes label May 29, 2026

github-actions Bot added llm LLM backend changes performance Performance-critical changes labels May 29, 2026

kovtcharov-amd added devops DevOps/infrastructure changes p1 medium priority labels May 29, 2026

kovtcharov-amd self-assigned this May 29, 2026

itomek previously approved these changes May 29, 2026

View reviewed changes

kovtcharov-amd dismissed itomek’s stale review via e8092f6 May 29, 2026 15:35

github-actions Bot added mcp MCP integration changes eval Evaluation framework changes agents labels May 29, 2026

Ovtcharov added 4 commits May 29, 2026 09:28

fix(test): remove unused AgentProfile import (flake8 F401)

b18274a

fix(types): resolve additional mypy errors in factory, providers, gov…

14dc920

…ernance Cherry-picked from test/eval-toolchain-1151 to unblock CI.

kovtcharov-amd force-pushed the test/llamacpp-backend-1153 branch from 1bcc882 to 14dc920 Compare May 29, 2026 16:28

kovtcharov-amd mentioned this pull request May 29, 2026

fix(types): resolve remaining mypy errors missed by #1253 #1296

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(llm): cover llama.cpp backend integration paths#1248

test(llm): cover llama.cpp backend integration paths#1248
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/llamacpp-backend-1153

kovtcharov-amd commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

itomek left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kovtcharov-amd commented May 29, 2026

Test plan

Uh oh!

github-actions Bot commented May 29, 2026

Summary

Issues Found

🟢 Minor — LemonadeProvider imported inside test methods (inconsistency)

🟢 Minor — test_missing_recipe_options_triggers_reload doesn't assert the reload arguments (tests/unit/test_llamacpp_backend.py:313)

🟢 Nit — Docstring says "n_ctx < 32K" but code threshold is 65536 (64K) — tests expose this

Strengths

Verdict

Uh oh!

github-actions Bot commented May 29, 2026

Issues

🟡 TestEnsureModelLoadedCtxResolution — 8 tests make implicit network calls (test_llamacpp_backend.py:166–312)

🟢 Redundant patch.object(LemonadeClient, "__init__") in 5 provider tests (test_llamacpp_backend.py:692, 720, 757, 782, 806)

🟢 TestLaunchServerCtxSize daemon-thread noise (test_llamacpp_backend.py:318–350)

Strengths

Verdict

Uh oh!

github-actions Bot commented May 29, 2026

Summary

Issues Found

🟢 Minor — misleading docstring in test_context_overflow_by_message_string (tests/unit/test_llamacpp_backend.py:547)

Strengths

Verdict

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

itomek left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🟢 Minor — `LemonadeProvider` imported inside test methods (inconsistency)

🟢 Minor — `test_missing_recipe_options_triggers_reload` doesn't assert the reload arguments (`tests/unit/test_llamacpp_backend.py:313`)

🟡 `TestEnsureModelLoadedCtxResolution` — 8 tests make implicit network calls (`test_llamacpp_backend.py:166–312`)

🟢 Redundant `patch.object(LemonadeClient, "init")` in 5 provider tests (`test_llamacpp_backend.py:692, 720, 757, 782, 806`)

🟢 `TestLaunchServerCtxSize` daemon-thread noise (`test_llamacpp_backend.py:318–350`)

🟢 Minor — misleading docstring in `test_context_overflow_by_message_string` (`tests/unit/test_llamacpp_backend.py:547`)