
feat(eval): email-triage throughput benchmark + reusable eval stats/metrics (#1233) #1301

Draft: itomek wants to merge 3 commits into main from feat/eval-email-benchmark-1233


@itomek (Collaborator) commented May 29, 2026

Why this matters

Before, GAIA had no way to measure on-device email-triage throughput, and the eval framework carried no reusable statistics or performance-extraction primitives — so "is the agent fast enough on Ryzen AI?" couldn't be answered with a number. Now gaia eval benchmark direct-drives the email agent over the committed synthetic corpus and reports TTFT, tokens/sec, and pipeline latency through the existing scorecard. On Gemma-4-E4B it measures ~62–69 tok/s — well above the committed ≥10 tok/s bar (non-gating for the demo; ~30 tok/s stretch).

Closes #1233.

The benchmark rides on four dependency-light eval modules that are reusable beyond email (they're the foundation for #1277 perf-metrics and #1278 quality-metrics next):

  • statistics.py — variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI (stdlib-only); the rigor layer for cross-run/cross-model comparison.
  • performance.py — domain-free per-step TTFT/throughput/token extraction from the agent conversation; emits the scorecard performance_summary shape so aggregation/rendering is reused, not reinvented.
  • quality_metrics.py — category accuracy + spam/phishing/needs-attention confusion (FP/FN) + token cost via MODEL_PRICING (local models are correctly free).
  • benchmark.py — the orchestrator; malformed tool-result envelopes now raise loudly instead of being silently swallowed (the upstream pattern this was lifted from used except: pass), while genuine non-envelope tool output is still skipped.
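The fail-loud behavior described for benchmark.py can be illustrated with a minimal sketch. It is not the PR's code: the envelope shape is an assumption (a tool result whose text is a JSON object), and `parse_tool_envelope` and `MalformedEnvelopeError` are hypothetical names. Output that does not look like an envelope is skipped; output that looks like one but fails to parse raises.

```python
import json
from typing import Any, Dict, Optional


class MalformedEnvelopeError(ValueError):
    """Hypothetical error type: the text looked like an envelope but did not parse."""


def parse_tool_envelope(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed envelope, or None for genuine non-envelope output.

    Assumption (not from the PR): an envelope is a JSON object, so any text
    that does not start with "{" is treated as ordinary tool output.
    """
    text = raw.strip()
    if not text.startswith("{"):
        # Plain tool output: skip it quietly, as before.
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # The old pattern here was `except: pass`; raising keeps the
        # failure visible instead of dropping the measurement.
        raise MalformedEnvelopeError(f"unparseable tool envelope: {text[:60]!r}") from exc
```

Raising a dedicated exception type keeps the two cases distinguishable in tests, which is what the `pytest.raises` coverage in the test plan relies on.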

Test plan

  • python -m pytest tests/unit/eval/ -k "statistics or performance_extractor or quality_metrics or benchmark" -v — 52 unit tests, no Lemonade required (includes fail-loud pytest.raises coverage).
  • With Lemonade serving Gemma-4-E4B-it-GGUF: python -m pytest tests/integration/test_email_bench_throughput.py -v -s — end-to-end; asserts perf is harvested (>0 tok/s) and xfails (visible, non-gating) on a sub-bar number.
  • gaia eval benchmark --model Gemma-4-E4B-it-GGUF --limit 50 — prints a throughput number ≥10 tok/s plus a scorecard Performance section.
  • python util/lint.py --black --flake8 clean on the changed files.

#1233)

GAIA had no way to measure on-device email-triage throughput, and the eval framework lacked reusable statistics + perf-extraction primitives. This adds `gaia eval benchmark`, which direct-drives the email agent over the committed synthetic corpus and harvests TTFT / tokens-per-sec / pipeline latency from the agent's per-step stats, rendering them through the existing scorecard. On Gemma-4-E4B it measures ~62-69 tok/s, well above the committed ≥10 tok/s bar (non-gating for the demo).
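As a rough illustration of harvesting throughput from per-step stats, the sketch below aggregates a list of step records into a mean TTFT and an overall tokens-per-second figure. The field names (`ttft_s`, `output_tokens`, `decode_s`) and the returned keys are hypothetical and do not reflect the actual scorecard `performance_summary` contract.

```python
import statistics
from typing import Dict, List


def summarize_steps(steps: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-step stats (hypothetical field names) into one summary."""
    if not steps:
        raise ValueError("no per-step stats to summarize")
    total_tokens = sum(step["output_tokens"] for step in steps)
    total_decode_s = sum(step["decode_s"] for step in steps)
    if total_decode_s <= 0:
        raise ValueError("total decode time must be positive")
    return {
        # Mean time-to-first-token across steps, in seconds.
        "mean_ttft_s": statistics.fmean(step["ttft_s"] for step in steps),
        # Token-weighted throughput: total tokens over total decode time,
        # so long steps are not outvoted by short ones.
        "tokens_per_second": total_tokens / total_decode_s,
    }
```

Dividing total tokens by total decode time, rather than averaging per-step rates, is one reasonable choice; the PR may aggregate differently.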

Built on four dependency-light, reusable eval modules: statistics.py (variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI; stdlib-only), performance.py (domain-free per-step perf extraction matching the scorecard performance_summary contract), quality_metrics.py (category accuracy + spam/phishing/needs-attention confusion + token cost via MODEL_PRICING), and benchmark.py (orchestrator). Tool-result envelopes that fail to parse now raise loudly instead of being silently swallowed. Reuses the existing scorecard aggregation, MODEL_PRICING, and the tests/fixtures/email corpus rather than duplicating them.
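For readers unfamiliar with the statistics named above, here is a stdlib-only sketch of two of them: Cliff's delta and a percentile bootstrap confidence interval for the mean. The function names and defaults are assumptions for illustration and are not taken from statistics.py.

```python
import random
import statistics
from typing import Sequence, Tuple


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Effect size in [-1, 1]: P(x > y) - P(x < y) over all pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean, seeded for reproducibility."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

A usage example with made-up throughput samples: `cliffs_delta([62, 64, 69], [10, 12, 11])` returns `1.0`, because every value in the first run exceeds every value in the second.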
@itomek itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 18:55
@github-actions (bot) added labels on May 29, 2026: documentation (Documentation changes), cli (CLI changes), eval (Evaluation framework changes), tests (Test changes), performance (Performance-critical changes)
@itomek itomek marked this pull request as draft May 29, 2026 20:04
itomek added 2 commits May 30, 2026 12:38
…mark dispatch (#1233)

Pylint W0404 (Path was reimported as _Path while already imported at module level) and W1514 (Path.read_text without an explicit encoding) failed the Code Quality check. Use the module-level Path, a plain local tempfile import, and encoding="utf-8" on the --compare read.
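The W1514 fix amounts to passing an explicit encoding when reading the file. A minimal sketch follows, assuming the `--compare` input is a JSON file; `load_comparison` is a hypothetical helper name, not the one in the PR.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_comparison(path: Path) -> Dict[str, Any]:
    """Read a comparison file (assumed JSON) with an explicit encoding.

    Without encoding=..., Path.read_text falls back to the locale default,
    which Pylint flags as W1514 because it can differ between machines.
    """
    return json.loads(path.read_text(encoding="utf-8"))
```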

Labels: cli, documentation, eval, performance, tests


Development

Successfully merging this pull request may close these issues.

perf(email): gemma4-it-e2b throughput benchmark (stretch: 30 tok/s)
