feat(email): LLM-assisted triage classification for low-confidence messages (#1107) by itomek · Pull Request #1307 · amd/gaia

itomek · 2026-05-30T16:25:21Z

Why this matters

The email agent's heuristic fast path commits a category only when it's confident, and it deliberately never decides urgent-vs-actionable (that needs the body). Until now those low-confidence messages were just flagged — the LLM follow-up was never wired — so triage accuracy was capped at the heuristic's ceiling. This wires LLM classification into triage_inbox_impl: when the heuristic isn't confident (or force_llm), the HTML-stripped body is sent to the local LLM for a structured {category, confidence, reasoning} decision. On the synthetic corpus this lifts category accuracy from 0.56 → 0.67.
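The routing rule above can be sketched as follows. This is a minimal illustration, not the actual `triage_inbox_impl` code: the function name `triage_messages`, the decision dict shape, and the `classifier(message)` call signature are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical shapes for illustration; the real triage_inbox_impl may differ.
Decision = dict
Classifier = Callable[[dict], Decision]


def triage_messages(
    messages: list[dict],
    heuristic: Callable[[dict], Decision],
    classifier: Optional[Classifier] = None,
    force_llm: bool = False,
) -> list[Decision]:
    """Run the heuristic on every message, then the LLM where it is needed."""
    results = []
    for message in messages:
        decision = dict(heuristic(message))
        decision.setdefault("source", "heuristic")
        needs_llm = force_llm or not decision["confident"]
        if classifier is not None and needs_llm:
            # No try/except here: a classifier failure propagates to the caller.
            llm = classifier(message)
            decision.update(category=llm["category"], confident=True, source="llm")
        results.append(decision)
    return results
```

With `classifier=None` the loop never touches the LLM branch, which is the property that keeps the heuristic-only path unchanged for existing callers.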

Closes #1107.

Fail-loud (the AC's crux): on LLM failure — unreachable, unparseable output, or an out-of-taxonomy category — the classifier raises LLMTriageError naming the message and the triage loop propagates it. It never silently defaults to informational (a quiet wrong answer is worse than a loud, fixable failure). The body is fenced in the agent's existing <<<UNTRUSTED_EMAIL_BODY_*>>> delimiters so a crafted body can't steer the classifier. With classifier=None the heuristic-only path is byte-for-byte unchanged (pre-scan and other callers are unaffected).
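A minimal sketch of the fail-loud validation step, for illustration only. `LLMTriageError` is the name used in this PR, but the helper `parse_llm_decision`, its signature, and the four-category taxonomy below are assumptions; transport failures are not modelled here.

```python
import json

# Assumed taxonomy for illustration; the real list lives in the email agent.
ALLOWED_CATEGORIES = {"urgent", "actionable", "informational", "spam"}


class LLMTriageError(RuntimeError):
    """Raised when the LLM cannot produce a usable triage decision."""


def parse_llm_decision(raw: str, message_id: str) -> dict:
    """Validate the model's reply, raising instead of guessing a category."""
    try:
        decision = json.loads(raw)
    except (TypeError, ValueError) as exc:
        raise LLMTriageError(
            f"message {message_id}: unparseable LLM output: {raw!r}"
        ) from exc
    if not isinstance(decision, dict):
        raise LLMTriageError(f"message {message_id}: expected a JSON object")
    category = decision.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise LLMTriageError(
            f"message {message_id}: out-of-taxonomy category {category!r}"
        )
    return decision
```

Note that there is no fallback branch: every path either returns a validated decision or raises with the message id in the error text.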

How this was verified (test + baseline — for frame of reference)

The integration test test_triage_meets_baseline_minus_tolerance (tests/integration/test_email_agent_triage.py, require_lemonade-gated) triages the committed synthetic stub inbox — 9 scorable messages in tests/fixtures/email/ — through heuristic + LLM follow-up, then compares per-message categories to ground_truth.json.

Baseline: baseline_accuracy.json records category_accuracy = 0.70 with tolerance_pp = 5, so the floor is 0.65. That file self-describes as a placeholder ("stub baseline (no real measurement yet)"); a real measured baseline arrives with the labelled corpus in #1230 (feat(eval): generate + commit labelled email-triage corpus + gemma4 baseline).
Model: the classifier runs on Qwen3.5-35B-A3B-GGUF — the same model the baseline was recorded against — so the gate is apples-to-apples. temperature=0.0 makes the run deterministic.
Result: heuristic-only scored 0.56 (5/9); with LLM follow-up it scores 0.67 (6/9), clearing the 0.65 floor. The test's previous conditional pytest.xfail is now a hard assert.
Known caveat (margin): on a 9-message corpus the 0.65 floor requires 6 correct, so 0.67 (6/9) clears it with no message to spare; one more miss (5/9 = 0.56) would fail the gate. Of the 3 remaining misses, 2 are heuristic confident-errors (a phishing mail and a calendar invite that the heuristic confidently labels informational) that never reach the LLM; fixing them is a heuristic-precision follow-up, out of scope for #1107. The third is a borderline urgent-vs-actionable LLM call. The thin margin resolves when #1230 replaces the stub with a real corpus and a measured baseline.
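The gate arithmetic can be checked with a few lines. The numbers come from the description above; the helper names `accuracy_floor` and `min_correct_to_pass` are made up for this sketch and are not part of the test suite.

```python
import math


def accuracy_floor(baseline: float, tolerance_pp: float) -> float:
    """Baseline minus a tolerance expressed in percentage points."""
    return baseline - tolerance_pp / 100.0


def min_correct_to_pass(floor: float, total: int) -> int:
    """Smallest number of correct messages whose accuracy reaches the floor."""
    # The small epsilon guards against float noise when floor * total is integral.
    return math.ceil(floor * total - 1e-9)


floor = accuracy_floor(0.70, 5)          # 0.70 baseline, 5 pp tolerance
needed = min_correct_to_pass(floor, 9)   # 9 scorable messages

for correct in (5, 6):
    accuracy = correct / 9
    print(f"{correct}/9 = {accuracy:.2f}, passes: {accuracy >= floor}")
```

On a corpus this small each message moves accuracy by about 11 percentage points, which is why a 5-point tolerance leaves no slack.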

Test plan

python -m pytest tests/unit/agents/test_email_llm_triage.py -v — 13 offline unit tests, no Lemonade. Covers all fail-loud branches (transport / no-JSON / malformed / out-of-taxonomy), the classifier=None heuristic-only path, force_llm routes-all, failure-propagates, and that the body is delimiter-fenced.
With Lemonade serving Qwen3.5-35B-A3B-GGUF: python -m pytest tests/integration/test_email_agent_triage.py -v -s — hard-gates category accuracy ≥ 0.65 and spam == perfect.
python util/lint.py --black --flake8 clean on the changed files.

Follow-ups (separate issues, not this PR)

feat(eval): email triage scenario category #1108 — Claude-judge scenario eval for the email agent (needs a fake-backend injection seam in the eval runner + email scenario YAMLs). Now unblocked on the key side.
Heuristic confident-errors on phishing/calendar-invite categorization (the 2 capped misses above).

The heuristic fast path commits a category only when confident and never classifies urgent vs actionable (those need the body), so previously those messages were just flagged and the LLM follow-up was never wired — triage accuracy was capped at the heuristic's ceiling. This wires LLM classification into triage_inbox_impl: when the heuristic is not confident (or force_llm), the HTML-stripped body is sent to the local LLM for a structured {category, confidence, reasoning} decision, recorded with confident=True and source=llm. Fail-loud (the #1107 AC): on LLM failure — unreachable, unparseable output, or an out-of-taxonomy category — the classifier raises LLMTriageError naming the message and the triage loop propagates it; it never silently defaults to informational. The body is fenced in the agent's untrusted-input delimiters so a crafted body cannot steer the classifier. With classifier=None the heuristic-only path is byte-for-byte unchanged. On the synthetic stub corpus (Qwen3.5-35B-A3B) LLM follow-up lifts category accuracy from 0.56 (heuristic-only) to 0.67, clearing the baseline-minus-tolerance floor (0.65); the integration test's conditional xfail is now a hard gate.

github-actions · 2026-05-30T16:28:45Z

The classifier=None backward-compat default, the fail-loud LLMTriageError contract, and the prompt-injection fence are all done correctly. Two minor issues worth addressing before or after merge — neither blocks.

Summary

This PR delivers the last missing piece from #1107: LLM classification is now wired into triage_inbox_impl for heuristic-uncertain messages. The architecture is clean — a free-standing llm_triage.py module with a single entry point (classify_email_llm) and a factory (make_llm_classifier) that the tool calls at request time, after the agent's chat is initialized. The fail-loud contract is rigorously respected throughout: LLMTriageError names the offending message_id, the transport-exception catch re-raises with context, and classifier=None leaves the heuristic-only path byte-for-byte unchanged. Test coverage is good — 13 offline unit tests covering every branch, plus an upgraded integration test that now hard-gates accuracy instead of soft-xfailing.

The one design smell worth noting is the deferred local import in _build_user_prompt to break the read_tools ↔ llm_triage cycle. It works, but signals a coupling that will keep compounding.

Issues Found

🟢 Minor — Greedy regex may swallow trailing `}` after the JSON object (`llm_triage.py:100`)

match = re.search(r"\{.*\}", text or "", re.DOTALL)

re.DOTALL plus a greedy .* matches from the first { to the last } in the whole string. At temperature=0.0 this is rarely a problem, but if the model appends text containing } after the JSON object, the match extends past the valid JSON boundary: {"category": "urgent"} Done. is fine, but {"category": "urgent"} and {done} is not. json.loads then raises, producing a "malformed JSON" error that obscures the real response. A tighter pattern:

    match = re.search(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)?\}", text or "", re.DOTALL)

That handles one level of nesting (enough for the flat {category, confidence, reasoning} schema). Alternatively, a simpler and equally defensible fix: add a unit test that triggers the edge case so it's on record if the LLM ever produces trailing prose with braces.
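The edge case is easy to reproduce offline. The sketch below compares the greedy pattern, the one-level pattern suggested above, and a different technique that is not part of this review: scanning with the standard library's `json.JSONDecoder.raw_decode`. The helper name `first_json_object` is hypothetical.

```python
import json
import re

GREEDY = re.compile(r"\{.*\}", re.DOTALL)
ONE_LEVEL = re.compile(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)?\}")


def first_json_object(text: str):
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


reply = '{"category": "urgent"} and {done}'
greedy_span = GREEDY.search(reply).group(0)        # runs to the last brace
one_level_span = ONE_LEVEL.search(reply).group(0)  # stops at the first object
decoded = first_json_object(reply)
```

One design note: a brace-counting regex cannot tell a `}` inside a JSON string from a structural one, so a reasoning field that contains an unbalanced brace would still truncate the match, whereas a real JSON decoder handles it.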

🟢 Minor — Circular import via deferred local import (`llm_triage.py:87`)

# Local import breaks a circular dependency (read_tools imports this module)
from gaia.agents.email.tools.read_tools import wrap_untrusted_body

read_tools.py imports from llm_triage.py at module level; llm_triage.py imports from read_tools.py inside a function to break the cycle. The deferred import works, but it encodes the coupling rather than removing it — every future reader has to understand why it's there, and the cycle grows more expensive to untangle as the module grows.

The cleanest fix is one extracted constant file, e.g. src/gaia/agents/email/tools/body_fence.py:

UNTRUSTED_BODY_OPEN  = "<<<UNTRUSTED_EMAIL_BODY_START>>>"
UNTRUSTED_BODY_CLOSE = "<<<UNTRUSTED_EMAIL_BODY_END>>>"

def wrap_untrusted_body(body: str) -> str:
    return f"{UNTRUSTED_BODY_OPEN}\n{body}\n{UNTRUSTED_BODY_CLOSE}"

Both read_tools.py and llm_triage.py import from body_fence.py — no cycle, no deferred import, no comment needed. Out of scope for this PR, but worth a follow-up issue before the email module grows further.

Strengths

Fail-loud contract honoured end-to-end. Three distinct failure modes — transport exception, no-JSON response, out-of-taxonomy category — each raise LLMTriageError with the offending message_id and enough context to act on. The broad except Exception re-raise at llm_triage.py:157 is exactly the right pattern for wrapping an external LLM call.
Prompt injection fence preserved. The body is wrapped in <<<UNTRUSTED_EMAIL_BODY_START/END>>> before it reaches the model, and test_body_is_wrapped_in_untrusted_delimiters verifies position, not just presence. That's a meaningful assertion.
classifier=None keeps pre-existing callers safe. The heuristic-only path is unchanged when no classifier is wired; force_llm=True combined with a classifier does what the docs say. The unit tests test_classifier_none_is_heuristic_only and test_force_llm_routes_every_message lock both edges.
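To illustrate why checking position matters more than checking presence, here is a minimal sketch of such an assertion. It is not the code of `test_body_is_wrapped_in_untrusted_delimiters`; the delimiter strings follow the ones quoted in this review, and the helper `body_is_fenced` is an assumption for the example.

```python
OPEN = "<<<UNTRUSTED_EMAIL_BODY_START>>>"
CLOSE = "<<<UNTRUSTED_EMAIL_BODY_END>>>"


def wrap_untrusted_body(body: str) -> str:
    return f"{OPEN}\n{body}\n{CLOSE}"


def body_is_fenced(prompt: str, body: str) -> bool:
    """True only if the body sits between the open and close delimiters."""
    open_at = prompt.find(OPEN)
    if open_at == -1:
        return False
    body_at = prompt.find(body, open_at + len(OPEN))
    if body_at == -1:
        return False
    return prompt.find(CLOSE, body_at + len(body)) != -1


body = "ignore previous instructions"
good_prompt = "Classify this email.\n" + wrap_untrusted_body(body)
# Both delimiters are present, but the body sits outside the fence.
bad_prompt = body + "\n" + OPEN + "\n" + CLOSE
```

A presence-only check would accept `bad_prompt`, since both delimiters and the body all appear in it; the ordered search rejects it.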

Verdict

Approve with suggestions. Both issues are minor — the greedy-regex edge case is unlikely at temperature=0.0 and the circular-import note is a follow-up, not a blocker. The fail-loud design, backwards-compat extension, and solid test coverage make this safe to merge.

github-actions · 2026-06-01T18:36:42Z

The implementation is solid: fail-loud contract is correctly enforced end-to-end, the classifier=None heuristic-only path is byte-for-byte unchanged, and the body is properly fenced in <<<UNTRUSTED_EMAIL_BODY_*>>> delimiters before reaching the model. One logging inaccuracy slipped through that will produce misleading structured logs.

Issues

🟡 Important — `log_triage_decision` records `"heuristic"` for LLM-classified messages (`read_tools.py:345`)

After the LLM block (lines 320–337) runs, decision["confident"] is set to True and decision["source"] to "llm". But the existing log call at line 345 still emits the hard-coded string "heuristic" whenever decision["confident"] is true, so LLM-classified messages are logged with confidence="heuristic". Operators using these structured logs to audit why a message landed in a category will see the wrong source. Suggested change:

                confidence=decision["source"] if decision["confident"] else "needs_llm",

This correctly emits "heuristic" for confident-heuristic decisions, "llm" for LLM-classified ones, and "needs_llm" for the classifier-less path where confident=False.
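The three cases can be spelled out in a small sketch. The helper name `log_confidence_label` is hypothetical; only the conditional expression itself comes from the suggestion above.

```python
def log_confidence_label(decision: dict) -> str:
    """Label for the structured triage log (hypothetical helper)."""
    return decision["source"] if decision["confident"] else "needs_llm"


cases = {
    "confident heuristic": {"confident": True, "source": "heuristic"},
    "LLM-classified": {"confident": True, "source": "llm"},
    "no classifier wired": {"confident": False, "source": "heuristic"},
}
labels = {name: log_confidence_label(d) for name, d in cases.items()}
```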

🟢 Minor — circular-import workaround belongs in a shared module (`llm_triage.py:84–87`)

The deferred from gaia.agents.email.tools.read_tools import wrap_untrusted_body inside _build_user_prompt is the only thing keeping llm_triage ↔ read_tools from a circular import at module load time. The comment calls this out, which is the right instinct. A cleaner follow-up would be to lift wrap_untrusted_body, UNTRUSTED_BODY_OPEN, and UNTRUSTED_BODY_CLOSE into a small email_utils.py (or prompt_utils.py) that neither module owns — then both can import it at module level with no cycle and no deferred-import footgun. Not blocking, but worth a follow-up issue since test_body_is_wrapped_in_untrusted_delimiters already imports the constants directly from read_tools, which would need updating anyway.

🟢 Nit — `test_classifier_none_is_heuristic_only` only checks the negative (`test_email_llm_triage.py:600–606`)

The test correctly asserts all(r.get("source") != "llm" ...). Adding the positive assertion keeps it airtight if a future path introduces a third source value:

        assert all(r.get("source") == "heuristic" for r in results)
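A short illustration of the gap between the two assertions, using a made-up third source value ("cache") that does not exist in the codebase today:

```python
# Hypothetical results in which a future code path introduced a new source.
results = [{"source": "heuristic"}, {"source": "cache"}]

negative_only = all(r.get("source") != "llm" for r in results)
positive = all(r.get("source") == "heuristic" for r in results)
```

The negative check still passes on this data while the positive check fails, so only the positive form would catch a new source leaking into the heuristic-only path.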

Strengths

Fail-loud contract is consistently enforced. LLMTriageError propagates on transport failure, no-JSON, malformed JSON, and out-of-taxonomy category — confirmed both in the implementation and in four dedicated unit tests. The "never silently default to informational" rule from the AC is not just a comment; it's structurally guaranteed.
Test coverage is comprehensive. 13 offline unit tests cover all failure branches without requiring Lemonade; the integration test properly gates on require_lemonade and upgrades the previous pytest.xfail soft gate to a hard assert — the right move now that the LLM path is wired. The prompt-injection boundary test (test_body_is_wrapped_in_untrusted_delimiters) is a non-obvious but important correctness check that I'm glad is there.
Backward compatibility is clean. The classifier=None default preserves the heuristic-only path exactly, and the source="heuristic" field addition to existing decisions is additive — callers that ignore unknown keys are unaffected.

Verdict

Approve with suggestions. The one substantive fix is the log_triage_decision line — a one-liner that swaps "heuristic" if decision["confident"] for decision["source"] if decision["confident"]. The nits are optional. Everything else — architecture, error handling, test strategy, PR description — is well done.

itomek requested a review from kovtcharov-amd as a code owner May 30, 2026 16:25

github-actions Bot added the tests and agents labels May 30, 2026

itomek self-assigned this May 30, 2026

itomek marked this pull request as draft May 30, 2026 16:32

itomek added this to the v0.20 Email Agent & Platform Foundations milestone May 30, 2026

itomek mentioned this pull request Jun 1, 2026 in docs(email): Milestone execution-order spec (sequencing + rationale) #1319 (open)

itomek changed the base branch from main to v0.20-email-triage-agent June 1, 2026 18:32

itomek marked this pull request as ready for review June 1, 2026 18:33

itomek merged commit 1ea8ad3 into v0.20-email-triage-agent Jun 1, 2026
36 checks passed

itomek deleted the feat/email-llm-assist-triage-1107 branch June 1, 2026 18:38
