feat(email): LLM-assisted triage classification for low-confidence messages (#1107) #1307
The heuristic fast path commits a category only when confident and never classifies urgent vs actionable (those need the body), so previously those messages were just flagged and the LLM follow-up was never wired — triage accuracy was capped at the heuristic's ceiling. This wires LLM classification into triage_inbox_impl: when the heuristic is not confident (or force_llm), the HTML-stripped body is sent to the local LLM for a structured {category, confidence, reasoning} decision, recorded with confident=True and source=llm.
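The routing described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `TriageResult`, `triage_one`, and the classifier's call signature are hypothetical names chosen for the example. Only `force_llm`, `classifier=None`, `confident=True`, and `source=llm` come from the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TriageResult:
    """Hypothetical per-message triage record (illustrative only)."""
    category: str
    confident: bool
    source: str  # "heuristic" or "llm"


# Hypothetical classifier signature: HTML-stripped body text in, category out.
Classifier = Callable[[str], str]


def triage_one(
    heuristic: TriageResult,
    body_text: str,
    classifier: Optional[Classifier] = None,
    force_llm: bool = False,
) -> TriageResult:
    """Decide whether to keep the heuristic result or ask the LLM."""
    # With no classifier, the heuristic-only path is returned untouched.
    if classifier is None:
        return heuristic
    # A confident heuristic result is kept unless the caller forces the LLM.
    if heuristic.confident and not force_llm:
        return heuristic
    # Otherwise the LLM decides; any exception it raises propagates.
    category = classifier(body_text)
    return TriageResult(category=category, confident=True, source="llm")
```

Note that the sketch does not catch exceptions from `classifier`, which matches the fail-loud behaviour the PR describes: a failed LLM call surfaces to the caller instead of falling back to the heuristic guess.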
Fail-loud (the #1107 AC): on LLM failure — unreachable, unparseable output, or an out-of-taxonomy category — the classifier raises LLMTriageError naming the message and the triage loop propagates it; it never silently defaults to informational. The body is fenced in the agent's untrusted-input delimiters so a crafted body cannot steer the classifier. With classifier=None the heuristic-only path is byte-for-byte unchanged.
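A hedged sketch of what a fail-loud classifier wrapper might look like is shown below. Several parts are assumptions made for illustration: the taxonomy set, the `parse_decision`, `fence_body`, and `classify` helpers, the nonce-based delimiter suffix, the closing-delimiter format, and the use of `OSError` as a stand-in for transport failures. Only the `LLMTriageError` name, the three failure modes, and the `UNTRUSTED_EMAIL_BODY_` delimiter prefix come from the PR text.

```python
import json
import secrets
from typing import Callable, Optional

# Assumed taxonomy for illustration; the real set lives in the email agent.
TAXONOMY = {"urgent", "actionable", "informational", "spam"}


class LLMTriageError(RuntimeError):
    """Raised when the LLM cannot produce a usable triage decision."""


def fence_body(body: str, nonce: Optional[str] = None) -> str:
    """Wrap untrusted text in delimiters (suffix and END form are assumed)."""
    # A random suffix makes it hard for a crafted body to forge the closer.
    tag = f"UNTRUSTED_EMAIL_BODY_{nonce or secrets.token_hex(8)}"
    return f"<<<{tag}>>>\n{body}\n<<<END_{tag}>>>"


def parse_decision(message_id: str, raw: str) -> dict:
    """Parse the LLM reply, raising instead of defaulting on any problem."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise LLMTriageError(f"{message_id}: unparseable LLM output") from exc
    if not isinstance(decision, dict):
        raise LLMTriageError(f"{message_id}: expected a JSON object")
    category = decision.get("category")
    if not isinstance(category, str) or category not in TAXONOMY:
        raise LLMTriageError(f"{message_id}: out-of-taxonomy {category!r}")
    return decision


def classify(message_id: str, body: str, call_llm: Callable[[str], str]) -> dict:
    """Send the fenced body to the LLM and validate the structured reply."""
    try:
        raw = call_llm(fence_body(body))
    except OSError as exc:  # stand-in for "LLM unreachable"
        raise LLMTriageError(f"{message_id}: LLM unreachable") from exc
    return parse_decision(message_id, raw)
```

Every error message carries the message identifier, so a failure in the triage loop points at the email that caused it.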
On the synthetic stub corpus (Qwen3.5-35B-A3B) LLM follow-up lifts category accuracy from 0.56 (heuristic-only) to 0.67, clearing the baseline-minus-tolerance floor (0.65); the integration test's conditional xfail is now a hard gate.
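The floor arithmetic can be checked with a short sketch. The function names here are hypothetical; the values `category_accuracy = 0.70` and `tolerance_pp = 5` are the ones the PR description attributes to `baseline_accuracy.json`. With 9 scorable messages, the reported 0.56 and 0.67 are consistent with 5 and 6 correct classifications, although the source does not state those counts directly.

```python
def accuracy_floor(baseline: float, tolerance_pp: float) -> float:
    """Baseline accuracy minus a tolerance given in percentage points."""
    return baseline - tolerance_pp / 100.0


def category_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth messages whose predicted category matches."""
    correct = sum(1 for mid, cat in truth.items() if predicted.get(mid) == cat)
    return correct / len(truth)


def meets_floor(accuracy: float, baseline: float, tolerance_pp: float) -> bool:
    """Gate check; a small epsilon absorbs floating-point rounding."""
    return accuracy + 1e-9 >= accuracy_floor(baseline, tolerance_pp)


# Inferred counts on the 9-message stub corpus (an assumption, see above).
heuristic_only = 5 / 9  # rounds to 0.56
with_llm = 6 / 9        # rounds to 0.67

print(meets_floor(heuristic_only, 0.70, 5))  # below the 0.65 floor
print(meets_floor(with_llm, 0.70, 5))        # clears the 0.65 floor
```

On a 9-message corpus a single message moves accuracy by about 11 percentage points, which is why the margin over the floor is thin.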
Summary
This PR delivers the last missing piece from #1107: LLM classification is now wired into
The one design smell worth noting is the deferred local import in
Issues Found
🟢 Minor: Greedy regex may swallow trailing
The implementation is solid: fail-loud contract is correctly enforced end-to-end, the
Issues
🟡 Important:
Why this matters
The email agent's heuristic fast path commits a category only when it's confident, and it deliberately never decides urgent-vs-actionable (that needs the body). Until now those low-confidence messages were just flagged and the LLM follow-up was never wired, so triage accuracy was capped at the heuristic's ceiling. This wires LLM classification into `triage_inbox_impl`: when the heuristic isn't confident (or `force_llm` is set), the HTML-stripped body is sent to the local LLM for a structured `{category, confidence, reasoning}` decision. On the synthetic corpus this lifts category accuracy from 0.56 → 0.67.
Closes #1107.
Fail-loud (the AC's crux): on LLM failure (unreachable, unparseable output, or an out-of-taxonomy category) the classifier raises `LLMTriageError` naming the message, and the triage loop propagates it. It never silently defaults to `informational` (a quiet wrong answer is worse than a loud, fixable failure). The body is fenced in the agent's existing `<<<UNTRUSTED_EMAIL_BODY_*>>>` delimiters so a crafted body can't steer the classifier. With `classifier=None` the heuristic-only path is byte-for-byte unchanged (pre-scan and other callers are unaffected).
How this was verified (test + baseline, for frame of reference)
The integration test `test_triage_meets_baseline_minus_tolerance` (`tests/integration/test_email_agent_triage.py`, `require_lemonade`-gated) triages the committed synthetic stub inbox (9 scorable messages in `tests/fixtures/email/`) through heuristic + LLM follow-up, then compares per-message categories to `ground_truth.json`.
`baseline_accuracy.json` records `category_accuracy = 0.70` with `tolerance_pp = 5`, so the floor is 0.65. That file self-describes as a placeholder ("stub baseline (no real measurement yet)"); a real measured baseline arrives with the labelled corpus in #1230 (feat(eval): generate + commit labelled email-triage corpus + gemma4 baseline).
`temperature=0.0` makes the run deterministic.
The conditional `pytest.xfail` is now a hard `assert`.
[...] `informational`) that never reach the LLM, which is a heuristic-precision follow-up out of #1107's scope, and 1 is a borderline urgent-vs-actionable LLM call. The thin margin resolves when #1230 replaces the stub with a real corpus + measured baseline.
Test plan
- `python -m pytest tests/unit/agents/test_email_llm_triage.py -v`: 13 offline unit tests, no Lemonade. Covers all fail-loud branches (transport / no-JSON / malformed / out-of-taxonomy), the `classifier=None` heuristic-only path, `force_llm` routes-all, failure-propagates, and that the body is delimiter-fenced.
- With Qwen3.5-35B-A3B-GGUF, `python -m pytest tests/integration/test_email_agent_triage.py -v -s` hard-gates category accuracy ≥ 0.65 and spam == perfect.
- `python util/lint.py --black --flake8` clean on the changed files.
Follow-ups (separate issues, not this PR)