fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting #1302
itomek wants to merge 5 commits into
…orrupt download

`_is_corrupt_download_error` matched the generic string "llama-server failed to start" as proof of a corrupt/incomplete model download. Lemonade raises that string for many non-corruption failures (resource limits, ctx_size, GPU/backend startup, port conflicts), so an ordinary load failure was routed into a destructive delete + re-download of the model (default ~25GB), dead-ending first-boot.

Keep the five specific corruption phrases as unconditional signals; "llama-server failed to start" now only counts as corruption when one of those phrases also corroborates it. A bare load failure falls through to load_model's non-corrupt branch, which raises an actionable LemonadeClientError without entering the repair path.

Closes #1294
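The classification rule in that commit message can be sketched roughly as follows. This is a minimal illustration, not the project's code: the excerpt does not list the five corruption phrases, so `SPECIFIC_CORRUPTION_PHRASES` holds placeholder strings, and `is_corrupt_download_error` is a hypothetical stand-in for `_is_corrupt_download_error`.

```python
# Placeholder phrases (assumption): the PR keeps five specific corruption
# phrases, but this excerpt does not name them.
SPECIFIC_CORRUPTION_PHRASES = (
    "download validation failed",
    "files are incomplete",
    "checksum mismatch",
    "unexpected end of file",
    "incomplete download",
)
GENERIC_START_FAILURE = "llama-server failed to start"


def is_corrupt_download_error(message: str) -> bool:
    """Hypothetical stand-in for the classifier described above."""
    text = message.lower()
    has_specific = any(phrase in text for phrase in SPECIFIC_CORRUPTION_PHRASES)
    has_generic = GENERIC_START_FAILURE in text

    if has_specific:
        # A specific phrase is an unconditional signal, with or without
        # the generic start-failure string next to it.
        return True
    if has_generic:
        # Generic failure with nothing corroborating it: not treated as
        # corruption, so the caller takes its ordinary load-failure branch.
        return False
    return False
```

Under this rule a bare "llama-server failed to start" never triggers the destructive repair path; only a message that also carries one of the specific phrases does.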
… validator

Commit 905036c introduced a timestamped backup naming convention via the security path validator, but two assertions in test_code_agent.py still expected the old hardcoded .bak suffix. Use result["backup_path"] instead.
On a fresh Agent UI install, first boot loaded the default model from the FastAPI lifespan threadpool (no TTY) with prompt=False. The corrupt-download repair path ignored that flag and fired interactive [y/N] prompts: input() raised EOFError (or hung on an idle pipe), dead-ending boot-init.

Two fixes in load_model's corrupt-download branch:

- _prompt_user_for_delete now has the same stdin/stdout isatty guard as its siblings _prompt_user_for_download / _prompt_user_for_repair, returning the proceed-default under a non-interactive environment instead of calling input().
- The repair/delete branch honors the prompt argument: with prompt=False it skips both prompts and auto-proceeds through the bounded recovery (resume, then a single delete + re-download).

Recovery progress is logged at INFO (percent from the pull stream) so the backend log the UI tails shows movement and boot does not look frozen; the corrupt-detected detail logs at DEBUG.

If recovery still fails after the one delete + re-download, it raises a single actionable LemonadeClientError naming the recovery affordance (UI Force-redownload / manual delete) and the Lemonade server.log -- no EOFError, no hang, no silent swallow. A real TTY with prompt=True still prompts as before.
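The isatty guard described in the first fix can be sketched as below. This is an assumption-laden illustration: `prompt_user_for_delete`, its `default` argument, and the injectable `stdin`/`stdout` parameters are hypothetical and only mirror the behaviour the commit message describes (return the proceed-default instead of calling `input()` when there is no terminal).

```python
import io
import sys


def is_interactive(stdin=None, stdout=None) -> bool:
    """True only when both streams are attached to a terminal."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


def prompt_user_for_delete(model_name, default=True, stdin=None, stdout=None) -> bool:
    """Hypothetical helper: ask before deleting, but never block without a TTY."""
    if not is_interactive(stdin, stdout):
        # Service, pipe, or threadpool context: input() would raise EOFError
        # or hang, so return the proceed-default instead.
        return default
    answer = input(f"Delete and re-download {model_name}? [y/N] ")
    return answer.strip().lower() in ("y", "yes")


# Simulate a non-interactive boot: StringIO streams report isatty() == False.
decision = prompt_user_for_delete(
    "example-model", stdin=io.StringIO(), stdout=io.StringIO()
)
```

Passing the streams as parameters is a choice made here for illustration only; it lets the non-interactive branch be exercised without touching the real `sys.stdin`.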
@claude review this PR.
Solid fix: the diagnosis is right and the implementation is clean. The two root causes (the missing

One thing worth a look (🟢 minor, and pre-existing, not introduced here):

Asymmetric handling when a pull stream ends without

That contradicts AC#4 ("bounded to ONE delete+redownload") only in the narrow case of a silently-truncated resume stream, which is unusual in practice. The behavior matches the pre-PR code, so it's not a regression. If you want the recovery to be robust to that case, mirroring the fresh-download treatment (treat resume-incomplete as a failure that escalates to delete+redownload) would close it. Optional for this PR.

Couldn't run the suite here (no venv in this environment), but the new tests are fully mocked and the assertions look correct; your Strix Halo run covers the real execution. Nothing blocking from my side.
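A rough sketch of the bounded recovery the review refers to, including the optional suggestion to treat an incomplete resume as a failure that escalates. Everything here is hypothetical: `recover_model`, the `pull`/`delete`/`load` callables, and `ModelRecoveryError` (a stand-in for `LemonadeClientError`) are illustrative names, and `pull` is assumed to return `True` only when its stream completed.

```python
import logging
from typing import Callable

logger = logging.getLogger("recovery_sketch")


class ModelRecoveryError(RuntimeError):
    """Stand-in for the actionable LemonadeClientError described in the PR."""


def recover_model(
    model: str,
    pull: Callable[[str], bool],    # assumption: True only if the stream completed
    delete: Callable[[str], None],  # assumption: removes the cached model files
    load: Callable[[str], None],    # assumption: raises on load failure
) -> None:
    """Resume once, then at most one delete + re-download, then fail loudly."""
    # Step 1: try to resume the existing download.
    try:
        if pull(model):
            load(model)
            return
        logger.info("Resume of %s ended early; escalating", model)
    except Exception as exc:
        logger.info("Resume of %s failed (%s); escalating", model, exc)

    # Step 2: the single delete + re-download. There is no loop, so the
    # recovery cannot repeat this step.
    try:
        delete(model)
        if not pull(model):
            raise ModelRecoveryError("re-download stream ended before completion")
        load(model)
    except Exception as exc:
        raise ModelRecoveryError(
            f"Could not recover {model}: {exc}. Use Force-redownload in the UI "
            "or delete the model manually, then check the Lemonade server.log."
        ) from exc
```

Because an incomplete resume and a failed resume take the same escalation path, a silently truncated resume stream still ends in exactly one delete + re-download rather than being reported as success.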
Closes #1293
Stacked on #1300 — review the classifier fix there first; this PR's diff is only the auto-heal behaviour on top of it.
Why this matters
Before: when the Agent UI backend detected a corrupt/incomplete model on first boot, it called `input("[y/N]")` inside the FastAPI lifespan threadpool, which has no TTY. `input()` raised `EOFError`, the boot-init job failed, and users were left with a broken UI requiring a manual force-redownload.

After: the corrupt-download repair path is fully non-interactive. When `prompt=False` (boot context), all three prompt helpers auto-proceed without ever calling `input()`, recovery attempts a single delete+redownload with INFO-level progress, and only surfaces a loud actionable error if recovery genuinely fails.

What changed
- `_prompt_user_for_delete`: added the same `sys.stdin.isatty() / sys.stdout.isatty()` guard its two siblings already had. This closes the EOFError gap.
- `load_model` corrupt-download branch: the `prompt` argument now gates both `_prompt_user_for_repair` and `_prompt_user_for_delete` calls (`if prompt and <isatty>`), so `prompt=False` callers (the UI server and `LemonadeManager._try_preload_with_ctx`) never reach `input()`.
- Unrecoverable failure: raises a single actionable `LemonadeClientError` naming what failed, what to do (UI Force-redownload / manual delete), and where to look (`server.log`), with no EOFError and no hang.
- Interactive path (`prompt=True` + real terminal) is preserved unchanged.

Test plan
- `tests/unit/test_lemonade_model_loading.py` (17 total, 10 pre-existing), each AC covered by a named test: `_prompt_user_for_delete` non-TTY returns without EOFError; `prompt=False` never reaches either prompt helper; recovery bounded to 1 delete+redownload; progress logged at INFO; unrecoverable raises actionable error; interactive TTY still reaches the prompt.
- `test_lemonade_model_loading.py` + `test_lemonade_error_classification.py` + `test_lemonade_manager_preload.py` (includes all of #1300's tests; #1300 is fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download).
- `_prompt_user_for_*` helpers return cleanly under non-TTY stdin (no EOFError); `prompt=False` on a corrupt-classified load confirms neither repair nor delete prompt is called, boot log shows `📥 Resuming download…` then a clean actionable error, no hang.
- `prompt=False` path both verified on hardware (pytest not installed in that box's venv so unit suite run on Strix Halo only).
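For readers unfamiliar with how such fully mocked tests are typically shaped, here is a minimal sketch using `unittest.mock`. It does not reproduce the PR's tests: `prompt_for_delete` is a hypothetical stand-in for the helper under test, and the test names are invented for illustration.

```python
import io
import sys
from unittest import mock


def prompt_for_delete(default: bool = True) -> bool:
    """Stand-in for the helper under test (not the project's real code)."""
    if not (sys.stdin.isatty() and sys.stdout.isatty()):
        return default
    return input("Delete model? [y/N] ").strip().lower() == "y"


def test_non_tty_never_calls_input():
    # StringIO reports isatty() == False, like a service or threadpool context.
    with mock.patch.object(sys, "stdin", io.StringIO()), \
         mock.patch("builtins.input", side_effect=EOFError) as fake_input:
        assert prompt_for_delete() is True
        fake_input.assert_not_called()


def test_tty_still_prompts():
    # A Mock whose isatty() returns True simulates a real terminal.
    tty = mock.Mock()
    tty.isatty.return_value = True
    with mock.patch.object(sys, "stdin", tty), \
         mock.patch.object(sys, "stdout", tty), \
         mock.patch("builtins.input", return_value="y") as fake_input:
        assert prompt_for_delete() is True
        fake_input.assert_called_once()
```

Giving the patched `input` a `side_effect` of `EOFError` means the first test would fail loudly if the guard ever regressed and reached the prompt.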