fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N)#3350
Conversation
…I-RUST-N) After the user's OpenHuman backend JWT lapses, the inference layer's `api_error` already publishes `DomainEvent::SessionExpired` for the 401 and intentionally skips its own `report_error` call. But the cron retry loop in `execute_job_with_retry` doesn't consult that signal — it sleeps with exponential backoff, retries the same job N times (every attempt hitting the same global 401), then unconditionally calls `report_error` with `failure=retries_exhausted`. That generated TAURI-RUST-N: 7,038 events / 5 users, all `domain=cron operation=agent_job` from a `morning_briefing` agent grinding through retries after one lapse. Fix mirrors PR tinyhumansai#3334's halt-on-first-occurrence pattern, but at the cron retry layer instead of the agent tool loop: - New `is_session_expired_failure` predicate consults the existing `core::observability::is_session_expired_message` classifier (the `OpenHuman API error (401` + `"error":"Invalid token"` conjunction was already added for OPENHUMAN-TAURI-4P0). Matches on `last_agent_error` first (carries the raw provider wire chain), falls back to `last_output` for defense-in-depth. - `execute_job_with_retry` breaks out of the loop on the first occurrence — no further attempts, no `report_error` call. The inference layer's `SessionExpired` publish already drives the credentials/scheduler-gate handshake; retries can't recover until the user re-auths. - Restricted to `JobType::Agent`: shell jobs that happen to echo a 401-shaped string keep their existing retry semantics (no `SessionExpired` publish from shell stdout, no reason to flip the gate). Five focused unit tests pin the predicate behaviour: the 401 wire shape trips the halt via `last_agent_error` or `last_output`; the canned user message, ordinary provider errors (incl. third-party BYO-key 401s), and shell-job invocations all stay on the retry path. Sentry-Issue: TAURI-RUST-N
📝 WalkthroughWalkthroughThe scheduler now detects when an agent job fails due to session expiration (a 401 "Invalid token" backend response) and immediately halts the retry loop instead of continuing exponential backoff. Retry-exhaustion error reporting is suppressed for these expected authentication states. Comprehensive tests verify the classifier matches only the specific wire shape and respects job-type boundaries. ChangesSession-Expired Detection and Early Halt
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/openhuman/cron/scheduler.rs (1)
277-289: 💤 Low valueConsider adding a debug log when halting on session-expired.
A brief
tracing::debug!here would help operators diagnose why a cron job stopped retrying, and aligns with the coding guideline requiring logs at state transitions.🔧 Suggested trace-level log
if is_session_expired_failure( &job.job_type, last_agent_error.as_deref(), last_output.as_str(), ) { + tracing::debug!( + job_id = %job.id, + attempt = attempt, + "[cron] halting retry loop — backend session expired" + ); // Halt on the first occurrence — the inference layer already // published `SessionExpired`, retries cannot recover until the // user re-auths, and the classifier considers this expected // user state (TAURI-RUST-N). See `is_session_expired_failure` // for the full rationale. session_expired = true; break; }As per coding guidelines: "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with stable grep-friendly prefixes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/cron/scheduler.rs` around lines 277 - 289, Add a debug log immediately before breaking when is_session_expired_failure(...) returns true: log the job identifier (use job.job_type or job id if available), the last_agent_error and last_output values, and a stable grep-friendly prefix like "cron:session-expired". Update the block around is_session_expired_failure to call tracing::debug! with those fields right after setting session_expired = true and before break so operators can see why the cron stopped retrying.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/cron/scheduler.rs`:
- Around line 277-289: Add a debug log immediately before breaking when
is_session_expired_failure(...) returns true: log the job identifier (use
job.job_type or job id if available), the last_agent_error and last_output
values, and a stable grep-friendly prefix like "cron:session-expired". Update
the block around is_session_expired_failure to call tracing::debug! with those
fields right after setting session_expired = true and before break so operators
can see why the cron stopped retrying.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cb08cc99-36e9-402b-96b2-ddceb898aa51
📒 Files selected for processing (2)
src/openhuman/cron/scheduler.rssrc/openhuman/cron/scheduler_tests.rs
Summary
openhuman@0.57.13(and earlier). Tags:domain=cron,operation=agent_job,agent_id=morning_briefing,failure=retries_exhausted. Title:OpenHuman API error (401 Unauthorized): {"success":false,"error":"Invalid token"}.inference::provider::ops::api_erroralready publishesDomainEvent::SessionExpiredfor the 401 and intentionally skips its ownreport_error. But the cron retry loop incron::scheduler::execute_job_with_retrydoesn't consult that signal — it sleeps with exponential backoff, retries the same job N times (every attempt hitting the same global 401), then unconditionally callsreport_errorwithfailure=retries_exhausted.BACKEND_USER_STATE_MARKERinagent::harness::tool_loop) but at the cron retry layer: detect the session-expired wire shape via the existingcore::observability::is_session_expired_messageclassifier, halt the loop on the first failed attempt, and skip theretries_exhaustedSentry capture (the classifier already considers this expected user state, theSessionExpiredevent already drives the credentials + scheduler-gate handshake).Problem
inference/provider/ops.rs::api_errortreats backend 401/403 as expected session-lapse user state — it publishesDomainEvent::SessionExpiredviapublish_backend_session_expiredso the credentials subscriber clears the stored session and the scheduler-gatesigned_outoverride halts downstream LLM work — and deliberately does not callreport_errorat that site.That handshake leaves a gap in
cron/scheduler.rs::execute_job_with_retry:The retry can't recover until the user re-auths, but the loop runs to exhaustion and the final
report_errorcaptures a 401 wire shape the existing classifier already knows is expected user state. The 7,038-event volume is one Windows user's cron-firedmorning_briefingagent grinding through retries every poll for ~16 days.The classifier
core::observability::is_session_expired_messagewas already extended for OPENHUMAN-TAURI-4P0 (observability.rs:615):so the predicate to consult is already there — the cron retry just never asks.
Solution
Two coordinated changes in
src/openhuman/cron/scheduler.rs:New
is_session_expired_failurepredicatelast_agent_errorfirst becauserun_agent_jobroutes the rawanyhow::Error::to_string()chain there (containing the provider's wire message), whilelast_outputonly carries the canned user-facing notification (AGENT_JOB_USER_FAILURE_MESSAGE/ per-variant copy).last_outputas defense-in-depth so a future code path that surfaces the raw error there isn't a silent miss.JobType::Agentbecause theSessionExpiredpublish + scheduler-gate handshake only fires from the inference layer; halting a shell job because its stdout happens to echo a 401-shaped string would skip retries the operator may want.Halt on first occurrence in
execute_job_with_retryWhen
session_expired == truewe skip theretries_exhaustedcapture entirely, since the classifier already considers it expected user state.Why halt on first (not the standard
REPEAT_FAILURE_THRESHOLD = 3)Same reasoning as #3334's
BACKEND_USER_STATE_MARKER: the condition is global — every paid LLM call will hit the same 401 until the user re-auths. Pivoting query / model / args can't help, so the standard retry budget is wasted attempts. Halting on first matches thetool_loopprecedent and stops the backoff-sleep storm too.Why not just swap
report_error→report_error_or_expectedThat would quiet Sentry but leave the agent burning N×(300ms..30s exponential backoff) per cron tick on a 401 that can never recover. Per the project's
feedback-sentry-skip-vs-real-fixconvention, fix the bug (the wasted retry) rather than just demote the symptom.Tests
Five focused unit tests pin the predicate behaviour:
..._matches_openhuman_backend_401_in_agent_errorlast_agent_errortrips the halt..._matches_when_only_output_carries_signallast_agent_error == None..._does_not_match_canned_user_messageAGENT_JOB_USER_FAILURE_MESSAGEdoesn't false-positive (non-401 agent failures keep normal retry semantics)..._does_not_match_ordinary_provider_errorOpenAI API error (401 ...) Invalid API key) stay on the retry path (those are actionable)..._does_not_halt_shell_jobsJobType::Shellalways returnsfalseregardless of stdout contentSubmission Checklist
last_agent_errorand vialast_outputfallback), both negative arms (canned message, ordinary provider error / BYO-key 401), and theJobType::Shellscope guard. The 3-line call-site change inexecute_job_with_retryis exercised through the predicate (no separate behaviour beyondbreak).N/A: behaviour-only change(no feature added/removed/renamed; tightens an existing safety net inside the cron retry loop).## Related—N/A: behaviour-only change.N/A: internal cron error routing, no release-cut surface touched.Closes #NNNin the## Relatedsection —N/A: Sentry-tracked issue (TAURI-RUST-N), no GitHub issue.Impact
src/openhuman/cron/scheduler.rs). No UI, no schema, no migration.failure=retries_exhaustedcapture stream from cron-fired agent jobs hitting backend 401. The inference layer already publishesSessionExpiredso the credentials/scheduler-gate handshake continues to drive re-auth.REPEAT_FAILURE_THRESHOLD = scheduler_retriesretry budget and still surfacefailure=retries_exhaustedto Sentry — that's what the BYO-key negative-guard test pins.Related
team/billingauthed_json401s) — covers a different code path (RPC ops), no overlap with cron retries.AgentError::ProviderSessionExpiredvariant could give the user-visible notification more actionable copy than the current generic "Something went wrong" — strictly orthogonal, the user-visible behaviour today is already governed byclassify_agent_anyhow_for_user.Sentry-Issue: TAURI-RUST-N
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/cron-401-session-expired-haltValidation Run
N/A: Rust-only change—pnpm --filter openhuman-app format:checkN/A: Rust-only change—pnpm typecheckcargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler::tests::is_session_expired_failure→ 5 passed;cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler→ 55 passed (no regression)cargo fmt --manifest-path Cargo.toml -- --checkclean;cargo check --manifest-path Cargo.toml --testsexit 0N/A: Tauri shell untouched— Tauri fmt/checkValidation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
failure=retries_exhaustedSentry capture, instead of running throughscheduler_retriesattempts before reporting.Parity Contract
JobType::Agentjob keeps existing retry semantics; agent jobs whose failure does not matchis_session_expired_messagekeep the fullscheduler_retriesbudget and still emitfailure=retries_exhaustedto Sentry. Shell jobs always return early from the predicate (no behavioural change to the shell branch)...._does_not_match_canned_user_message,..._does_not_match_ordinary_provider_error,..._does_not_halt_shell_jobs.Duplicate / Superseded PR Handling
cron 401/scheduler retry session/TAURI-RUST-Nkeywords orexecute_job_with_retrysurface)Summary by CodeRabbit
Bug Fixes
Tests