feat(eval): email-triage throughput benchmark + reusable eval stats/metrics (#1233) by itomek · Pull Request #1301 · amd/gaia

itomek · 2026-05-29T18:55:22Z

Why this matters

Before, GAIA had no way to measure on-device email-triage throughput, and the eval framework carried no reusable statistics or performance-extraction primitives — so "is the agent fast enough on Ryzen AI?" couldn't be answered with a number. Now gaia eval benchmark direct-drives the email agent over the committed synthetic corpus and reports TTFT, tokens/sec, and pipeline latency through the existing scorecard. On Gemma-4-E4B it measures ~62–69 tok/s — well above the committed ≥10 tok/s bar (non-gating for the demo; ~30 tok/s stretch).

Closes #1233.

The benchmark rides on four dependency-light eval modules that are reusable beyond email (they're the foundation for #1277 perf-metrics and #1278 quality-metrics next):

statistics.py — variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI (stdlib-only); the rigor layer for cross-run/cross-model comparison.
performance.py — domain-free per-step TTFT/throughput/token extraction from the agent conversation; emits the scorecard performance_summary shape so aggregation/rendering is reused, not reinvented.
quality_metrics.py — category accuracy + spam/phishing/needs-attention confusion (FP/FN) + token cost via MODEL_PRICING (local models are correctly free).
benchmark.py — the orchestrator; malformed tool-result envelopes now raise loudly instead of being silently swallowed (the upstream pattern this was lifted from used except: pass), while genuine non-envelope tool output is still skipped.
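For readers unfamiliar with the rigor layer, here is a minimal stdlib-only sketch of two of the listed statistics: Cliff's delta and a percentile bootstrap CI for the mean. The function names, signatures, and defaults are assumptions for illustration and are not taken from statistics.py in this PR.

```python
import random
import statistics
from typing import Sequence, Tuple


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size in [-1, 1]: P(a > b) - P(a < b) over all pairs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def bootstrap_ci(
    sample: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    if not sample:
        raise ValueError("sample must be non-empty")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Illustrative, made-up tok/s samples from two hypothetical runs.
run_a = [64.0, 66.0, 63.0, 68.0, 65.0]
run_b = [58.0, 60.0, 57.0, 61.0, 59.0]
delta = cliffs_delta(run_a, run_b)
low, high = bootstrap_ci(run_a)
```

Cliff's delta makes no distributional assumption, which suits small per-run throughput samples; the fixed seed keeps the interval reproducible across invocations.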

Test plan

python -m pytest tests/unit/eval/ -k "statistics or performance_extractor or quality_metrics or benchmark" -v — 52 unit tests, no Lemonade required (includes fail-loud pytest.raises coverage).
With Lemonade serving Gemma-4-E4B-it-GGUF: python -m pytest tests/integration/test_email_bench_throughput.py -v -s — end-to-end; asserts perf is harvested (>0 tok/s) and xfails (visible, non-gating) on a sub-bar number.
gaia eval benchmark --model Gemma-4-E4B-it-GGUF --limit 50 — prints a throughput number ≥10 tok/s plus a scorecard Performance section.
python util/lint.py --black --flake8 clean on the changed files.
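The "assert harvested, xfail on sub-bar" split in the integration test can be sketched as a small decision helper. This is an assumption about the shape of the logic, not the code in tests/integration/test_email_bench_throughput.py; `Verdict` and `judge_throughput` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"    # throughput meets the committed bar
    XFAIL = "xfail"  # harvested, but below the bar (visible, non-gating)


def judge_throughput(tokens_per_sec: float, bar: float = 10.0) -> Verdict:
    """Hard-fail when nothing was harvested; otherwise compare to the bar."""
    if tokens_per_sec <= 0:
        # A missing or zero number means perf extraction itself is broken.
        raise AssertionError("no throughput harvested (expected > 0 tok/s)")
    return Verdict.PASS if tokens_per_sec >= bar else Verdict.XFAIL
```

In a pytest test, an `XFAIL` verdict would typically be surfaced with `pytest.xfail("below bar")`, so a slow machine shows up in the report without turning the suite red.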

#1233) GAIA had no way to measure on-device email-triage throughput, and the eval framework lacked reusable statistics + perf-extraction primitives. This adds `gaia eval benchmark`, which direct-drives the email agent over the committed synthetic corpus and harvests TTFT / tokens-per-sec / pipeline latency from the agent's per-step stats, rendering them through the existing scorecard. On Gemma-4-E4B it measures ~62-69 tok/s, well above the committed >=10 tok/s bar (non-gating for the demo).

Built on four dependency-light, reusable eval modules:
statistics.py (variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI; stdlib-only)
performance.py (domain-free per-step perf extraction matching the scorecard performance_summary contract)
quality_metrics.py (category accuracy + spam/phishing/needs-attention confusion + token cost via MODEL_PRICING)
benchmark.py (orchestrator)

Tool-result envelopes that fail to parse now raise loudly instead of being silently swallowed. Reuses the existing scorecard aggregation, MODEL_PRICING, and the tests/fixtures/email corpus rather than duplicating them.
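The fail-loud behaviour for tool-result envelopes can be illustrated with a short sketch. It assumes an envelope is a JSON object and that any output starting with `{` is intended as one; both the heuristic and the names `MalformedEnvelopeError` and `parse_tool_result` are hypothetical, not taken from benchmark.py.

```python
import json
from typing import Any, Dict, Optional


class MalformedEnvelopeError(ValueError):
    """Raised when output looks like an envelope but cannot be parsed."""


def parse_tool_result(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed envelope, None for non-envelope output, or raise."""
    text = raw.strip()
    if not text.startswith("{"):
        # Genuine non-envelope tool output (plain text): skipped, not an error.
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Previously this would have been `except: pass`; now it is loud.
        raise MalformedEnvelopeError(f"unparseable tool envelope: {exc}") from exc
```

Raising here means a truncated or corrupted envelope stops the run with a clear error instead of silently dropping a sample from the throughput numbers.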

…mark dispatch (#1233) Pylint W0404 (Path was reimported as _Path while already module-level) and W1514 (Path.read_text without an explicit encoding) failed the Code Quality check. Use the module-level Path, a plain local tempfile import, and encoding=utf-8 on the --compare read.
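As a sketch of the W1514 fix described above: passing an explicit encoding to `Path.read_text` avoids depending on the platform default. The helper name `read_compare_text` is hypothetical, and the format of the `--compare` file is not assumed here.

```python
from pathlib import Path


def read_compare_text(path: Path) -> str:
    """Read a file as UTF-8 regardless of the platform's default encoding."""
    # Without encoding=..., read_text() falls back to the locale encoding,
    # which is what Pylint W1514 (unspecified-encoding) flags.
    return path.read_text(encoding="utf-8")
```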

…mark-1233

itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 18:55

github-actions bot added labels documentation (Documentation changes), cli (CLI changes), eval (Evaluation framework changes), tests (Test changes), performance (Performance-critical changes) May 29, 2026

itomek modified the milestones: v0.19 — Test & CI Hardening [OSS], v0.20 Email Agent & Platform Foundations May 29, 2026

itomek marked this pull request as draft May 29, 2026 20:04

itomek added 2 commits May 30, 2026 12:38

Merge remote-tracking branch 'origin/main' into feat/eval-email-bench…

31881ee

…mark-1233

itomek wants to merge 3 commits into main from feat/eval-email-benchmark-1233
