Skip to content

test(llm): cover llama.cpp backend integration paths#1248

Open
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/llamacpp-backend-1153
Open

test(llm): cover llama.cpp backend integration paths#1248
kovtcharov-amd wants to merge 4 commits into
mainfrom
test/llamacpp-backend-1153

Conversation

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

The llama.cpp-specific code paths in LemonadeClient (llamacpp_args construction, ctx_size validation, model-load HTTP payloads, error classification) had zero dedicated test coverage. The #1030 ctx-truncation incident showed what happens when these paths regress silently — unit tests pass but doc-Q&A breaks in production. 46 new mocked unit tests now pin the critical integration seams so a future change that breaks load_model payloads, ctx-size resolution, or the retryable-overflow logic will fail fast in CI.

Test plan

  • python -m pytest tests/unit/test_llamacpp_backend.py -xvs — 46 tests pass, no Lemonade Server required
  • No existing tests regress: python -m pytest tests/unit/ -x

Closes #1153

@github-actions github-actions Bot added the tests Test changes label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Solid regression-prevention suite that earns its place in CI — 46 tests with no running server required, directly encoding the #1030 contract in executable form. Three nits worth tightening before merge.


Summary

The tests are correctly implemented, well-organized by concern, and mock at the right level. TestModelsRegistry is especially valuable: it will catch future drift in min_ctx_size values or incorrect tool_calling flags on embedding models before anything reaches production. The test_reload_when_loaded_at_undersized_ctx test is the canonical regression guard for #1030 — if anyone ever reverts the undersized-ctx reload logic, this test breaks immediately.


Issues Found

🟢 Minor — LemonadeProvider imported inside test methods (inconsistency)

LemonadeContextOverflowError, LemonadeError, and the rest of the error classes are imported at module level (lines 34–41), but LemonadeProvider is imported inside each test method body (lines 685, 716, 751, 776, 799). There's no deferred-import reason for this (the symbol is stable, not behind a try/except), so it just adds noise and inconsistency.

from gaia.llm.providers.lemonade import (
    LemonadeContextOverflowError,
    LemonadeError,
    LemonadeModelNotLoadedError,
    LemonadeNetworkError,
    LemonadeProvider,
    LemonadeUpstreamTimeoutError,
    _classify_lemonade_response,
)

Then remove the five from gaia.llm.providers.lemonade import LemonadeProvider lines inside the test methods.


🟢 Minor — test_missing_recipe_options_triggers_reload doesn't assert the reload arguments (tests/unit/test_llamacpp_backend.py:313)

Every other test in TestEnsureModelLoadedCtxResolution uses assert_called_once_with(...) to verify the exact ctx_size passed. This one uses bare assert_called_once(), leaving the ctx_size unverified. Given that the whole class is pinning the ctx_size resolution logic, the omission is inconsistent.

        mock_load.assert_called_once_with(
            "Qwen3-0.6B-GGUF", auto_download=True, prompt=False, ctx_size=4096
        )

🟢 Nit — Docstring says "n_ctx < 32K" but code threshold is 65536 (64K) — tests expose this

The new tests correctly use 65536 as the retryable threshold (see test_context_overflow_retryable_when_n_ctx_small and test_context_overflow_not_retryable_at_full_ctx). But src/gaia/llm/providers/lemonade.py:66 still says:

retryable = False  # default; set True dynamically when n_ctx < 32K

The actual guard in _classify_lemonade_response is 0 < n_ctx_reported < 65536. The comment is stale from before Gemma 4 E4B's ctx was bumped. Worth fixing in this PR since the tests are the source of truth here:

    retryable = False  # default; set True dynamically when n_ctx < 65536

Strengths

  • TestModelsRegistry is a CI gatekeeper: tests like test_all_entries_have_min_ctx_size and test_embedding_model_has_tool_calling_false will catch future registry corruption immediately. These are the kind of structural invariant tests that save debugging time months later.
  • @patch.object over string paths: consistent with the lemonade-client-patterns convention; avoids the deferred-import target mistake documented in the skill file.
  • test_reload_when_loaded_at_undersized_ctx names the exact failure mode from [Bug]: 1.4MB file indexed no response to queries in the UI #1030 in its docstring and encodes the contract: ctx=4096 → must reload at 65536. This will catch regressions automatically.

Verdict

Approve with suggestions — no blocking issues. The nit on test_missing_recipe_options_triggers_reload:313 is the one worth applying before merge since it's a gap in an otherwise thorough assertion suite; the other two are polish.

@github-actions
Copy link
Copy Markdown
Contributor

46 new mocked unit tests covering the llama.cpp-specific paths that caused the #1030 production regression — a solid, focused addition. All 46 tests are structurally correct against the current source; the public behavior contracts (payload construction, ctx-size resolution, retryable logic) are pinned accurately. Two minor issues worth fixing before merge, one nit.


Issues

🟡 TestEnsureModelLoadedCtxResolution — 8 tests make implicit network calls (test_llamacpp_backend.py:166–312)

All 8 tests in TestEnsureModelLoadedCtxResolution patch get_status and load_model but not list_models(). _ensure_model_loaded unconditionally calls self.list_models() (wrapped in its own except Exception) before every /load call, so each test silently fires a connection attempt to localhost:13305. The inner try/except swallows the refusal, so the tests pass — but they're not fully isolated unit tests and will be measurably slower on machines where port 13305 is actually in use (the call can succeed or fail in unexpected ways).

Fix: add @patch.object(LemonadeClient, "list_models", return_value={"data": []}) to each test in the class, or factor it into a class-level autouse fixture:

class TestEnsureModelLoadedCtxResolution:
    """Verify _ensure_model_loaded resolves ctx_size correctly from the
    MODELS registry and falls back to DEFAULT_CONTEXT_SIZE for unknowns.
    """

    @pytest.fixture(autouse=True)
    def _stub_list_models(self):
        with patch.object(LemonadeClient, "list_models", return_value={"data": []}):
            yield

    @patch.object(LemonadeClient, "get_status")
    @patch.object(LemonadeClient, "load_model")
    def test_gemma_4_e4b_loads_at_65536(self, mock_load, mock_status):

Apply the fixture to all 8 tests in the class. The _stub_list_models autouse fixture means no individual test needs to change its own @patch decorators.


🟢 Redundant patch.object(LemonadeClient, "__init__") in 5 provider tests (test_llamacpp_backend.py:692, 720, 757, 782, 806)

All five LemonadeProvider tests create the provider via LemonadeProvider.__new__(LemonadeProvider), which bypasses __init__ entirely. The with patch.object(LemonadeClient, "__init__", return_value=None): context manager is therefore never triggered and misleads a reader into thinking there's a LemonadeClient() call being suppressed. Remove it in each test method; the provider._backend = MagicMock(spec=LemonadeClient) assignment already provides all the isolation needed.

Example for test_chat_sets_repeat_penalty_defaults:

    def test_chat_sets_repeat_penalty_defaults(self):
        """chat() should inject repeat_penalty and repeat_last_n defaults."""
        from gaia.llm.providers.lemonade import LemonadeProvider

        provider = LemonadeProvider.__new__(LemonadeProvider)
        provider._backend = MagicMock(spec=LemonadeClient)
        provider._model = "Gemma-4-E4B-it-GGUF"
        provider._system_prompt = None

Same change for test_chat_allows_override_of_repeat_penalty, test_context_overflow_raises_typed_error, test_model_not_loaded_raises_typed_error, and test_unexpected_response_shape_raises_generic.


🟢 TestLaunchServerCtxSize daemon-thread noise (test_llamacpp_backend.py:318–350)

Both tests use background="none", which spawns a print_output daemon thread reading stdout.readline() on the MagicMock Popen return. MagicMock.readline() returns a truthy MagicMock on every call, so the thread busy-loops until the test process exits — producing noisy debug log output in pytest -v runs. background="none" isn't what's being tested here (the assertion is only about the command array); switching to background="silent" removes the thread entirely and keeps the test focused.

    def test_ctx_size_appended_to_command(
        self, _sleep, mock_conn, mock_popen, mock_kill
    ):
        mock_conn.return_value = MagicMock()
        mock_popen.return_value = MagicMock(pid=12345)
        client = LemonadeClient(host="localhost", port=13305)

        client.launch_server(ctx_size=65536, background="silent")

Apply the same change to test_no_ctx_size_flag_when_none. You'll also need to patch builtins.open (or the specific open("lemonade.log", ...) call) since the silent path writes to a log file.


Strengths

  • The test names are dense with signal — test_reload_when_loaded_at_undersized_ctx and test_no_reload_when_loaded_above_expected_ctx capture a subtle >= vs > boundary that has already caused a production incident; any future regression on that comparison will fail immediately.
  • TestContextOverflowClassification covers 7 distinct scenarios (top-level type, nested llama-server error, message-string fallback, n_ctx=0 edge case) — this is exactly the depth needed for error-classification code whose bugs are invisible until RAG silently fails.
  • TestModelsRegistry locks in structural invariants (min_ctx_size > 0, embedding models with tool_calling=False, Gemma 4 E4B's load-bearing 65536 constant) so a misconfigured registry entry fails at CI time rather than in production.
  • @patch.object(LemonadeClient, "_send_request") — correct deferred-import patch target per the lemonade-client-patterns skill; avoids the "patch the wrong module" mistake.

Verdict

Approve with suggestions. No blocking correctness issues — all 46 tests are valid assertions against the current implementation. The 🟡 list_models isolation issue is worth fixing before merge (it makes the tests slower and slightly environment-sensitive), but it won't cause CI failures on a clean machine. The two 🟢 items are low-lift cleanup that'll pay off next time someone reads the test file.

@github-actions github-actions Bot added llm LLM backend changes performance Performance-critical changes labels May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Solid follow-up to the #1030 regression — 46 targeted unit tests now pin the exact integration seams (payload construction, ctx-size resolution, error classification) that can silently break doc-Q&A in production. The docstring update in lemonade.py is also correct.

Summary

The test suite is well-structured and the mocking strategy is consistent throughout: @patch.object for instance methods, string paths for module-level names (gaia.llm.lemonade_client.OpenAI, subprocess.Popen). The _stub_list_models autouse fixture is the right approach for class-wide stubbing. The TestModelsRegistry invariant tests are particularly valuable — they'll catch any future registry edit that breaks the ctx-size assumptions the rest of the code relies on.

One minor docstring inaccuracy; no blocking issues.

Issues Found

🟢 Minor — misleading docstring in test_context_overflow_by_message_string (tests/unit/test_llamacpp_backend.py:547)

The docstring says "even without type field" but the payload does have a type field — it's just "unknown_error", not "exceed_context_size". The test exercises message-substring fallback when the type doesn't match, not when the type is absent.

        """Detection works via message substring even when type is not 'exceed_context_size'."""

Strengths

  • TestModelsRegistry encodes load-bearing constants as assertions. The test_gemma_4_e4b_ctx_is_65536 and test_qwen3_0_6b_is_smallest tests will catch any future registry edit that silently drops a min_ctx_size, which is exactly how the [Bug]: 1.4MB file indexed no response to queries in the UI #1030 class of bugs gets in.
  • test_reload_when_loaded_at_undersized_ctx reproduces the exact [Bug]: 1.4MB file indexed no response to queries in the UI #1030 failure mode (Gemma at 4096 after embedder warm-up) and asserts the correct reload with ctx_size=65536. Future regressions on this path will fail fast in CI.
  • test_full_payload_with_all_params uses exact dict equality (assert payload == {...}) rather than per-key assertions, so any unexpected extra keys in the /load payload will also fail the test.

Verdict

Approve. One tiny docstring nit; everything else is clean and directly addressing the stated coverage gap.

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator Author

The mypy lint failure (Run Code Quality Checks) is the same pre-existing type error issue as PR #1243 — untouched files failing mypy. PR #1253 fixes all these type errors — once it merges and this PR rebases, the failure will resolve.

@kovtcharov-amd kovtcharov-amd added devops DevOps/infrastructure changes p1 medium priority labels May 29, 2026
@kovtcharov-amd kovtcharov-amd self-assigned this May 29, 2026
@kovtcharov-amd
Copy link
Copy Markdown
Collaborator Author

The Example Agents Integration Tests (stx) failure is not a test failure — all 7 tests passed (3 skipped). The job failed at the cleanup step Upload Lemonade logs, which is a CI infrastructure issue (log upload failing), not a code regression. The actual test results are clean.

itomek
itomek previously approved these changes May 29, 2026
Copy link
Copy Markdown
Collaborator

@itomek itomek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-value regression suite — the mocked tests pin exactly the seams that broke in #1030 (load_model payload, _ensure_model_loaded ctx resolution, undersized-ctx reload, retryable-overflow classification) with no Lemonade Server needed. Verified the autouse list-models stub (real isolation, no implicit localhost calls), the ctx_size=4096 assertion on the missing-recipe-options test, and background=silent on the launch_server tests. Approving. Tiny optional nit: test_context_overflow_by_message_string's docstring says "even without type field" but the payload sets type=unknown_error — it actually tests message-substring fallback when the type doesn't match.


Generated by Claude Code

@github-actions github-actions Bot added mcp MCP integration changes eval Evaluation framework changes agents labels May 29, 2026
Ovtcharov added 4 commits May 29, 2026 09:28
LemonadeClient's llama.cpp-specific code paths (llamacpp_args, ctx_size
parameter construction, model loading, context-size validation, error
classification) had no dedicated test coverage. A regression in any of
these paths could silently break model loading or produce confusing
errors — the #1030 ctx-truncation incident showed what that looks like.

46 new unit tests covering:
- load_model HTTP payload construction (llamacpp_args, ctx_size, save_options)
- _ensure_model_loaded ctx_size resolution from MODELS registry vs fallback
- Under-sized ctx reload detection (the #1030 follow-up regression path)
- launch_server --ctx-size CLI flag forwarding
- OpenAI-standard vs llama.cpp-native parameter separation (extra_body)
- Context-overflow error classification with dynamic retryable logic
- Nested llama-server error envelopes (details.response.error)
- model_not_loaded and corrupt-download detection
- MODELS registry structural invariants
- LemonadeProvider repetition-penalty defaults and error raising

All tests are fully mocked — no running Lemonade Server required.

Closes #1153
- Move LemonadeProvider to module-level import (was repeated in 5 methods)
- Add autouse fixture to stub list_models() in TestEnsureModelLoadedCtxResolution
  so tests don't make implicit network calls to localhost:13305
- Strengthen test_missing_recipe_options_triggers_reload to assert exact
  ctx_size=4096 argument (was bare assert_called_once)
- Remove redundant patch.object(LemonadeClient, "__init__") from 5 provider
  tests — __new__ already bypasses __init__
- Switch TestLaunchServerCtxSize to background="silent" to avoid daemon-
  thread busy-loop noise in pytest output
- Fix stale docstring in LemonadeContextOverflowError: "n_ctx < 32K" → 65536
  to match the actual guard in _classify_lemonade_response
…ernance

Cherry-picked from test/eval-toolchain-1151 to unblock CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents devops DevOps/infrastructure changes eval Evaluation framework changes llm LLM backend changes mcp MCP integration changes p1 medium priority performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tests: cover llama.cpp backend integration

2 participants