fix(cron): classify agent job errors into actionable user messages (#2279)#2340
Conversation
…inyhumansai#2279) Cron job failures previously surfaced a generic `AGENT_JOB_USER_FAILURE_MESSAGE` in user notifications, discarding the structured `last_agent_error`. Add a classifier that maps known `AgentError` variants to canned, user-actionable strings (budget, rate-limit, timeout, misconfigured-model, …) while falling back to the existing generic message for unknown variants. Classifier output is canned-string-only — never interpolates error contents — guarded by a leak-canary test that proves no error-derived substring reaches the user-facing message. Bug B of tinyhumansai#2279 (generic error). Bug A (markup leak) is a separate PR. Refs tinyhumansai#2279. Co-Authored-By: Claude <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe PR introduces error-to-user-message classification for agent cron-job failures. The scheduler now maps typed ChangesAgent Error Classification for Cron Notifications
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
Nice work — the &'static str return type is an elegant structural guarantee against data leakage, and the leak-canary fuzz tests are thorough. No critical or major issues; a couple of minor UX observations below.
| File | +/− | Summary |
|---|---|---|
scheduler.rs |
+58 −5 | Two classifier fns (agent_error_to_user_message, classify_agent_anyhow_for_user), wired at the error site |
scheduler_tests.rs |
+290 | 14 tests: per-variant mapping, residual fallback, anyhow wrapper, length cap, leak-canary fuzz |
| "The model provider is temporarily unavailable. The next run will retry automatically." | ||
| } | ||
| AgentError::ProviderError { retryable: false, .. } => { | ||
| "The model provider rejected the request. Check your provider credentials in Settings \u{2192} AI \u{2192} LLM." |
There was a problem hiding this comment.
[minor] "Start a new session" is solid advice in an interactive context, but cron jobs start a fresh session on every run automatically — a user reading this notification can't really act on it. Consider rewording to something like:
"The conversation grew too long for the model's context window. The next run will start fresh. If this recurs, pick a model with a larger context window in Settings → AI → LLM."
This also gives the user a concrete Settings path, matching the other messages.
| "The model provider rejected the request. Check your provider credentials in Settings \u{2192} AI \u{2192} LLM." | ||
| } | ||
| AgentError::ContextLimitExceeded { .. } => { | ||
| "The conversation grew too long for the model. Start a new session or pick a model with a larger context window." |
There was a problem hiding this comment.
[minor] "daily cost budget" assumes the budget window is always daily. The variant fields are spent_microdollars / budget_microdollars with no period attached. If the budget window is ever configurable (or already isn't daily), this message will mislead. A safer phrasing: "You've reached the cost budget for this agent" (drop "daily").
…inyhumansai#2279) (tinyhumansai#2340) Co-authored-by: sanil-23 <sanil@alphahuman.xyz> Co-authored-by: Claude <noreply@anthropic.com>
Summary
AGENT_JOB_USER_FAILURE_MESSAGEboilerplate on cron job failures with actionable, canned, user-facing messages keyed offAgentErrorvariants.ToolExecutionError,Other), so behaviour is bug-compatible withmainin the residual.Problem
Bug B of #2279. When the morning-briefing (or any agent) cron job fails,
cron/scheduler.rs:333returned only the generic"Something went wrong. Please try again. This error has been reported…"boilerplate, discarding the structuredlast_agent_error. Users saw repeated identical generic-failure cards in the notifications panel with no indication of cause (budget, rate-limit, misconfiguration, permission).(Bug A — markup leak in the renderer — is a separate PR on
fix/2279-notification-link-markup.)Solution
Two private fns in
cron/scheduler.rs:fn agent_error_to_user_message(err: &AgentError) -> &'static str— matches the 8AgentErrorvariants:ProviderError {retryable: true}ProviderError {retryable: false}ContextLimitExceededCostBudgetExceededMaxIterationsExceededCompactionFailedPermissionDeniedToolExecutionError,OtherAGENT_JOB_USER_FAILURE_MESSAGEAll canned strings ≤120 chars; Settings sub-paths verified against
useSettingsNavigation.ts:163, 187, 195breadcrumbs.fn classify_agent_anyhow_for_user(err: &anyhow::Error) -> &'static str— downcasts toAgentError(mirroringruntime.rs::sanitize_event_error_message) and routes through (1); falls back to the generic message when the downcast fails.Wired at
scheduler.rs:397— the liveanyhow::Errorreturned byagent.run_singleis classified before any.to_string(). The raw error continues intolast_agent_errorfor the observability pipeline (report_error), where stack traces and provider URLs are appropriate.AGENT_JOB_USER_FAILURE_MESSAGEis preserved as the residual fallback — not removed, not renamed.Submission Checklist
scheduler_tests.rs: 7 per-variant mapping, 2 residual fallback, 2anyhowwrapper, 1 length cap (≤120 chars), 2 leak-canary fuzz.cargo test --lib cron::scheduler→ 47 passed, 0 failed. Every classifier arm is exercised.N/A: behaviour-only fix to existing cron error surface.N/A(no matrix row).N/A: not a release-cut surface; covered by Rust unit tests.Closes #NNN— see Related (this PR closes Bug B only; Morning briefing notifications show generic failure errors #2279 stays open until Bug A's PR also merges).Airtight guarantee: classifier never leaks error content
The classifier output is a
&'static str. The leak-canary test (classifier_does_not_leak_error_content+classify_agent_anyhow_does_not_leak_when_downcast_succeeds) injects 10 distinctLEAK_CANARY_<n>_<hex>markers across everyString/wrapped-source field of every variant (includingOther(anyhow::anyhow!(canary).context(canary))), runs each through the classifier, and asserts no marker appears in the output and that the residual constant itself is canary-free. This guards against future refactors accidentally introducingformat!("{}", err)orerr.to_string()interpolation.Impact
notifications/store.rs::insert_if_not_recent) is intentionally not touched. Today two failures with the same body collapse within 60s; after this change, two failures with different classified reasons no longer collapse. Acceptable — arguably informative — and easily revisited if noisy in practice.AGENT_JOB_USER_FAILURE_MESSAGEare unaffected.Related
fix(notifications): render <openhuman-link> tags in notification bodies (#2279)(separate branch).AI Authored PR Metadata
Linear Issue
Commit & Branch
fix/2279-cron-error-classifier69b4c86eValidation Run
pnpm --filter openhuman-app format:check—N/A: no frontend changes.pnpm typecheck—N/A: no frontend changes.cargo test --manifest-path Cargo.toml --lib cron::scheduler→ 47 passed, 0 failed.cargo fmt --checkclean;cargo check --libclean (only pre-existing unrelated warnings).N/A: shell not touched.Validation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
AgentErrorvariant now surface a specific user-facing message instead of the generic "Something went wrong" boilerplate.Parity Contract
AGENT_JOB_USER_FAILURE_MESSAGEbyte-identical to today. The constant itself is unchanged. Dedup behaviour is untouched.🤖 Generated with Claude Code