feat(eval): email-triage throughput benchmark + reusable eval stats/metrics (#1233) #1301
Draft
itomek wants to merge 3 commits into
#1233) GAIA had no way to measure on-device email-triage throughput, and the eval framework lacked reusable statistics + perf-extraction primitives. This adds `gaia eval benchmark`, which direct-drives the email agent over the committed synthetic corpus and harvests TTFT / tokens-per-sec / pipeline latency from the agent's per-step stats, rendering them through the existing scorecard. On Gemma-4-E4B it measures ~62-69 tok/s, well above the committed >10 tok/s bar (non-gating for the demo).

Built on four dependency-light, reusable eval modules: statistics.py (variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI; stdlib-only), performance.py (domain-free per-step perf extraction matching the scorecard performance_summary contract), quality_metrics.py (category accuracy + spam/phishing/needs-attention confusion + token cost via MODEL_PRICING), and benchmark.py (orchestrator).

Tool-result envelopes that fail to parse now raise loudly instead of being silently swallowed. Reuses the existing scorecard aggregation, MODEL_PRICING, and the tests/fixtures/email corpus rather than duplicating them.
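The fail-loud behaviour described in this commit can be illustrated with a minimal sketch. It is not the code from `benchmark.py`: the function name `parse_tool_result`, the `EnvelopeParseError` class, and the "starts with `{`" heuristic for recognising an envelope are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, Optional


class EnvelopeParseError(ValueError):
    """Hypothetical error: text looked like an envelope but did not parse."""


def parse_tool_result(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed envelope, None for non-envelope output, or raise."""
    text = raw.strip()
    # Assumed heuristic: only text that opens like a JSON object is an envelope.
    if not text.startswith("{"):
        return None  # genuine non-envelope tool output is skipped, not an error
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Fail loudly instead of `except: pass`, so a broken envelope cannot
        # silently drop a step from the benchmark results.
        raise EnvelopeParseError(f"malformed tool-result envelope: {exc}") from exc
```

The point of the split is that "not an envelope" and "a broken envelope" are different outcomes: the first is expected and skipped, the second would otherwise distort the measurement without anyone noticing.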
…mark dispatch (#1233) Pylint W0404 (Path was reimported as _Path while already module-level) and W1514 (Path.read_text without an explicit encoding) failed the Code Quality check. Use the module-level Path, a plain local tempfile import, and encoding=utf-8 on the --compare read.
Why this matters
Before, GAIA had no way to measure on-device email-triage throughput, and the eval framework carried no reusable statistics or performance-extraction primitives, so "is the agent fast enough on Ryzen AI?" couldn't be answered with a number. Now `gaia eval benchmark` direct-drives the email agent over the committed synthetic corpus and reports TTFT, tokens/sec, and pipeline latency through the existing scorecard. On `Gemma-4-E4B` it measures ~62–69 tok/s, well above the committed ≥10 tok/s bar (non-gating for the demo; ~30 tok/s stretch).

Closes #1233.
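As a rough illustration of how per-step stats could roll up into the numbers quoted above, here is a minimal sketch. The per-step keys (`time_to_first_token`, `tokens_per_second`, `output_tokens`), the output keys, and the function name are assumptions, not the actual `performance_summary` contract, and the sample values are made up.

```python
from statistics import mean
from typing import Dict, List

THROUGHPUT_BAR_TOK_S = 10.0  # the >=10 tok/s bar quoted in this PR


def summarize_steps(steps: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate hypothetical per-step stats into one summary dict."""
    # Keep only steps that actually reported throughput.
    timed = [s for s in steps if s.get("tokens_per_second", 0) > 0]
    if not timed:
        raise ValueError("no per-step perf stats were harvested")
    return {
        "avg_ttft_s": mean(s["time_to_first_token"] for s in timed),
        "avg_tokens_per_second": mean(s["tokens_per_second"] for s in timed),
        "total_output_tokens": sum(s["output_tokens"] for s in timed),
    }


# Illustrative inputs only, not measured data.
example_steps = [
    {"time_to_first_token": 0.4, "tokens_per_second": 62.0, "output_tokens": 120},
    {"time_to_first_token": 0.6, "tokens_per_second": 69.0, "output_tokens": 80},
]
summary = summarize_steps(example_steps)
meets_bar = summary["avg_tokens_per_second"] >= THROUGHPUT_BAR_TOK_S
```

Raising when no step reports throughput mirrors the integration test's intent of asserting that perf is harvested at all, before any comparison against the bar.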
The benchmark rides on four dependency-light eval modules that are reusable beyond email (they're the foundation for #1277 perf-metrics and #1278 quality-metrics next):

- `statistics.py`: variance/CV/percentiles, Mann-Whitney U, Cliff's delta, bootstrap CI (stdlib-only); the rigor layer for cross-run/cross-model comparison.
- `performance.py`: domain-free per-step TTFT/throughput/token extraction from the agent conversation; emits the scorecard `performance_summary` shape so aggregation/rendering is reused, not reinvented.
- `quality_metrics.py`: category accuracy + spam/phishing/needs-attention confusion (FP/FN) + token cost via `MODEL_PRICING` (local models are correctly free).
- `benchmark.py`: the orchestrator; malformed tool-result envelopes now raise loudly instead of being silently swallowed (the upstream pattern this was lifted from used `except: pass`), while genuine non-envelope tool output is still skipped.

Test plan

- `python -m pytest tests/unit/eval/ -k "statistics or performance_extractor or quality_metrics or benchmark" -v`: 52 unit tests, no Lemonade required (includes fail-loud `pytest.raises` coverage).
- `Gemma-4-E4B-it-GGUF`: `python -m pytest tests/integration/test_email_bench_throughput.py -v -s` runs end-to-end; asserts perf is harvested (>0 tok/s) and xfails (visible, non-gating) on a sub-bar number.
- `gaia eval benchmark --model Gemma-4-E4B-it-GGUF --limit 50`: prints a throughput number ≥10 tok/s plus a scorecard Performance section.
- `python util/lint.py --black --flake8` clean on the changed files.
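For readers unfamiliar with the comparison statistics that `statistics.py` is described as providing, here is a stdlib-only sketch of two of them: Cliff's delta and a percentile bootstrap confidence interval for the mean. The function names, signatures, and defaults are assumptions for illustration and may differ from the module in this PR.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross pairs, in [-1, 1]."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def bootstrap_mean_ci(
    sample: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the sample mean."""
    if not sample:
        raise ValueError("sample must be non-empty")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(sample)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper
```

A delta near +1 means almost every value in the first sample exceeds every value in the second, which makes it a convenient effect-size companion to a Mann-Whitney U test when comparing throughput across runs or models.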