fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N) by CodeGhost21 · Pull Request #3350 · tinyhumansai/openhuman

CodeGhost21 · 2026-06-04T09:50:32Z

Summary

Sentry TAURI-RUST-N — 7,038 events / 5 users on openhuman@0.57.13 (and earlier). Tags: domain=cron, operation=agent_job, agent_id=morning_briefing, failure=retries_exhausted. Title: OpenHuman API error (401 Unauthorized): {"success":false,"error":"Invalid token"}.
After the user's OpenHuman backend JWT lapses, inference::provider::ops::api_error already publishes DomainEvent::SessionExpired for the 401 and intentionally skips its own report_error. But the cron retry loop in cron::scheduler::execute_job_with_retry doesn't consult that signal — it sleeps with exponential backoff, retries the same job N times (every attempt hitting the same global 401), then unconditionally calls report_error with failure=retries_exhausted.
Fix mirrors #3334's halt-on-first-occurrence pattern (BACKEND_USER_STATE_MARKER in agent::harness::tool_loop) but at the cron retry layer: detect the session-expired wire shape via the existing core::observability::is_session_expired_message classifier, halt the loop on the first failed attempt, and skip the retries_exhausted Sentry capture (the classifier already considers this expected user state, the SessionExpired event already drives the credentials + scheduler-gate handshake).

Problem

inference/provider/ops.rs::api_error treats backend 401/403 as expected session-lapse user state — it publishes DomainEvent::SessionExpired via publish_backend_session_expired so the credentials subscriber clears the stored session and the scheduler-gate signed_out override halts downstream LLM work — and deliberately does not call report_error at that site.

That handshake leaves a gap in cron/scheduler.rs::execute_job_with_retry:

for attempt in 0..=retries {
    let (success, output, agent_error) = ...;     // hits 401, fails
    last_output = output;
    if agent_error.is_some() { last_agent_error = agent_error; }
    if success { return ...; }
    if attempt < retries {
        time::sleep(...).await;                   // backoff and retry the same global-401 call
        backoff_ms = (backoff_ms * 2).min(30_000);
    }
}
if matches!(job.job_type, JobType::Agent) {
    crate::core::observability::report_error(    // → Sentry as failure=retries_exhausted
        report_message, "cron", "agent_job", &[("failure", "retries_exhausted"), ...],
    );
}

The retry can't recover until the user re-auths, but the loop runs to exhaustion and the final report_error captures a 401 wire shape the existing classifier already knows is expected user state. The 7,038-event volume is one Windows user's cron-fired morning_briefing agent grinding through retries every poll for ~16 days.

The classifier core::observability::is_session_expired_message was already extended for OPENHUMAN-TAURI-4P0 (observability.rs:615):

|| (msg.contains("OpenHuman API error (401")
    && msg.contains("\"error\":\"Invalid token\""))

so the predicate to consult is already there — the cron retry just never asks.

Solution

Two coordinated changes in src/openhuman/cron/scheduler.rs:

New `is_session_expired_failure` predicate

fn is_session_expired_failure(
    job_type: &JobType,
    last_agent_error: Option<&str>,
    last_output: &str,
) -> bool {
    if !matches!(job_type, JobType::Agent) {
        return false;
    }
    let signal = last_agent_error.unwrap_or(last_output);
    crate::core::observability::is_session_expired_message(signal)
}

Matches on last_agent_error first because run_agent_job routes the raw anyhow::Error::to_string() chain there (containing the provider's wire message), while last_output only carries the canned user-facing notification (AGENT_JOB_USER_FAILURE_MESSAGE / per-variant copy).
Falls back to last_output as defense-in-depth so a future code path that surfaces the raw error there isn't a silent miss.
Restricted to JobType::Agent because the SessionExpired publish + scheduler-gate handshake only fires from the inference layer; halting a shell job because its stdout happens to echo a 401-shaped string would skip retries the operator may want.

Halt on first occurrence in `execute_job_with_retry`

if is_session_expired_failure(
    &job.job_type,
    last_agent_error.as_deref(),
    last_output.as_str(),
) {
    session_expired = true;
    break;
}
// ...
if matches!(job.job_type, JobType::Agent) && !session_expired {
    crate::core::observability::report_error(...);  // unchanged
}

When session_expired == true we skip the retries_exhausted capture entirely, since the classifier already considers it expected user state.

Why halt on first (not the standard `REPEAT_FAILURE_THRESHOLD = 3`)

Same reasoning as #3334's BACKEND_USER_STATE_MARKER: the condition is global — every paid LLM call will hit the same 401 until the user re-auths. Pivoting query / model / args can't help, so the standard retry budget is wasted attempts. Halting on first matches the tool_loop precedent and stops the backoff-sleep storm too.

Why not just swap `report_error` → `report_error_or_expected`

That would quiet Sentry but leave the agent burning N×(300ms..30s exponential backoff) per cron tick on a 401 that can never recover. Per the project's feedback-sentry-skip-vs-real-fix convention, fix the bug (the wasted retry) rather than just demote the symptom.

Tests

Five focused unit tests pin the predicate behaviour:

Test	Pins
`..._matches_openhuman_backend_401_in_agent_error`	401 wire shape in `last_agent_error` trips the halt
`..._matches_when_only_output_carries_signal`	defense-in-depth fallback when `last_agent_error == None`
`..._does_not_match_canned_user_message`	`AGENT_JOB_USER_FAILURE_MESSAGE` doesn't false-positive (non-401 agent failures keep normal retry semantics)
`..._does_not_match_ordinary_provider_error`	500s and third-party BYO-key 401s (`OpenAI API error (401 ...) Invalid API key`) stay on the retry path (those are actionable)
`..._does_not_halt_shell_jobs`	scope guard — `JobType::Shell` always returns `false` regardless of stdout content

$ cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler::tests::is_session_expired_failure
running 5 tests
test ..._does_not_halt_shell_jobs ... ok
test ..._matches_openhuman_backend_401_in_agent_error ... ok
test ..._matches_when_only_output_carries_signal ... ok
test ..._does_not_match_ordinary_provider_error ... ok
test ..._does_not_match_canned_user_message ... ok
test result: ok. 5 passed; 0 failed; 0 ignored

$ cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler
test result: ok. 55 passed; 0 failed; 0 ignored

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
Diff coverage ≥ 80% — 5 new tests cover the new predicate's match arm (positive via last_agent_error and via last_output fallback), both negative arms (canned message, ordinary provider error / BYO-key 401), and the JobType::Shell scope guard. The 3-line call-site change in execute_job_with_retry is exercised through the predicate (no separate behaviour beyond break).
Coverage matrix updated — N/A: behaviour-only change (no feature added/removed/renamed; tightens an existing safety net inside the cron retry loop).
All affected feature IDs from the matrix are listed under ## Related — N/A: behaviour-only change.
No new external network dependencies introduced (mock backend used per Testing Strategy)
Manual smoke checklist updated if this touches release-cut surfaces — N/A: internal cron error routing, no release-cut surface touched.
Linked issue closed via Closes #NNN in the ## Related section — N/A: Sentry-tracked issue (TAURI-RUST-N), no GitHub issue.

Impact

Desktop Rust core only (src/openhuman/cron/scheduler.rs). No UI, no schema, no migration.
Sentry: stops the failure=retries_exhausted capture stream from cron-fired agent jobs hitting backend 401. The inference layer already publishes SessionExpired so the credentials/scheduler-gate handshake continues to drive re-auth.
Resource: agent cron jobs no longer waste N×(300ms..30s exponential backoff) per tick on a 401 they can't recover. The morning_briefing user case dropped from ~5 attempts per poll to 1.
No change to genuine failures: non-401 agent failures (provider down, tool error, network blip, third-party BYO-key 401) keep their full REPEAT_FAILURE_THRESHOLD = scheduler_retries retry budget and still surface failure=retries_exhausted to Sentry — that's what the BYO-key negative-guard test pins.
Security: no secrets logged; the predicate is a string scan with conjunctive anchors already audited for cross-provider false-positives in OPENHUMAN-TAURI-4P0.

Closes: N/A (Sentry TAURI-RUST-N, no linked GitHub issue)
Pattern peer: #3334 (halt-on-first user-state failure in the agent tool loop) — same convention, different layer.
Adjacent surface: #3336 (typed-error flatten for team/billing authed_json 401s) — covers a different code path (RPC ops), no overlap with cron retries.
Follow-up PR(s)/TODOs: none. A AgentError::ProviderSessionExpired variant could give the user-visible notification more actionable copy than the current generic "Something went wrong" — strictly orthogonal, the user-visible behaviour today is already governed by classify_agent_anyhow_for_user.

Sentry-Issue: TAURI-RUST-N

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A (Sentry TAURI-RUST-N)

Commit & Branch

Branch: fix/cron-401-session-expired-halt
Commit SHA: see PR head

Validation Run

N/A: Rust-only change — pnpm --filter openhuman-app format:check
N/A: Rust-only change — pnpm typecheck
Focused tests: cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler::tests::is_session_expired_failure → 5 passed; cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler → 55 passed (no regression)
Rust fmt/check (if changed): cargo fmt --manifest-path Cargo.toml -- --check clean; cargo check --manifest-path Cargo.toml --tests exit 0
N/A: Tauri shell untouched — Tauri fmt/check

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: cron-fired agent jobs that hit a backend session-expired 401 now halt the retry loop on the first occurrence and skip the failure=retries_exhausted Sentry capture, instead of running through scheduler_retries attempts before reporting.
User-visible effect: same canned failure notification surfaces to the user, but immediately (after ~one attempt) rather than after N×backoff sleeps. Other failure shapes unchanged.

Parity Contract

Legacy behavior preserved: every non-JobType::Agent job keeps existing retry semantics; agent jobs whose failure does not match is_session_expired_message keep the full scheduler_retries budget and still emit failure=retries_exhausted to Sentry. Shell jobs always return early from the predicate (no behavioural change to the shell branch).
Guard/fallback/dispatch parity checks: covered by ..._does_not_match_canned_user_message, ..._does_not_match_ordinary_provider_error, ..._does_not_halt_shell_jobs.

Duplicate / Superseded PR Handling

Duplicate PR(s): none (verified — no open PR on cron 401/scheduler retry session/TAURI-RUST-N keywords or execute_job_with_retry surface)
Canonical PR: this
Resolution (closed/superseded/updated): N/A

Summary by CodeRabbit

Bug Fixes
- Session expiration handling improved—when authentication tokens become invalid, the system now immediately halts retry attempts instead of exhausting the retry queue, preventing misleading "retries exhausted" error messages.
Tests
- Added comprehensive test coverage for session expiration detection and retry halt behavior.

…I-RUST-N) After the user's OpenHuman backend JWT lapses, the inference layer's `api_error` already publishes `DomainEvent::SessionExpired` for the 401 and intentionally skips its own `report_error` call. But the cron retry loop in `execute_job_with_retry` doesn't consult that signal — it sleeps with exponential backoff, retries the same job N times (every attempt hitting the same global 401), then unconditionally calls `report_error` with `failure=retries_exhausted`. That generated TAURI-RUST-N: 7,038 events / 5 users, all `domain=cron operation=agent_job` from a `morning_briefing` agent grinding through retries after one lapse. Fix mirrors PR tinyhumansai#3334's halt-on-first-occurrence pattern, but at the cron retry layer instead of the agent tool loop: - New `is_session_expired_failure` predicate consults the existing `core::observability::is_session_expired_message` classifier (the `OpenHuman API error (401` + `"error":"Invalid token"` conjunction was already added for OPENHUMAN-TAURI-4P0). Matches on `last_agent_error` first (carries the raw provider wire chain), falls back to `last_output` for defense-in-depth. - `execute_job_with_retry` breaks out of the loop on the first occurrence — no further attempts, no `report_error` call. The inference layer's `SessionExpired` publish already drives the credentials/scheduler-gate handshake; retries can't recover until the user re-auths. - Restricted to `JobType::Agent`: shell jobs that happen to echo a 401-shaped string keep their existing retry semantics (no `SessionExpired` publish from shell stdout, no reason to flip the gate). Five focused unit tests pin the predicate behaviour: the 401 wire shape trips the halt via `last_agent_error` or `last_output`; the canned user message, ordinary provider errors (incl. third-party BYO-key 401s), and shell-job invocations all stay on the retry path. Sentry-Issue: TAURI-RUST-N

coderabbitai · 2026-06-04T09:50:49Z

📝 Walkthrough

Walkthrough

The scheduler now detects when an agent job fails due to session expiration (a 401 "Invalid token" backend response) and immediately halts the retry loop instead of continuing exponential backoff. Retry-exhaustion error reporting is suppressed for these expected authentication states. Comprehensive tests verify the classifier matches only the specific wire shape and respects job-type boundaries.

Changes

Session-Expired Detection and Early Halt

Layer / File(s)	Summary
Session-expired classifier and retry-loop integration `src/openhuman/cron/scheduler.rs`	`is_session_expired_failure` helper detects agent-job 401 "Invalid token" failures by routing `last_agent_error` or `last_output` through the core observability session-expired classifier. The retry loop introduces a `session_expired` flag, breaks on first match, and suppresses "retries exhausted" reporting for agent jobs where the flag is set.
Classifier and halt behavior test coverage `src/openhuman/cron/scheduler_tests.rs`	Five new test functions establish classifier boundaries: positive match on OpenHuman 401 wire, fallback matching via `last_output` only, non-matching of generic agent-failure messages and third-party 401s, and scope enforcement that only `JobType::Agent` (not `JobType::Shell`) halts on session expiration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

tinyhumansai/openhuman#2188: Pins the core is_session_expired_message matcher via sentinel-string unit tests, which the main PR now uses to classify agent-job 401s.
tinyhumansai/openhuman#2924: Extends the is_session_expired_message classifier to recognize a billing 401 "session token rejected" string, directly affecting when the scheduler will halt and suppress reporting.
tinyhumansai/openhuman#2200: Similar fail-fast pattern applied to a different retry loop (ReliableProvider) using the same is_session_expired_message classifier.

Suggested labels

rust-core, bug, agent, sentry-traced-bug

Suggested reviewers

oxoxDev
graycyrus
senamakel

Poem

🐰 A token expires, the agent cries out,
No more backoff loops, we pivot about!
One 401 check, then halt and report,
No false "retries spent" at the court,
Swift fail-fast logic in Rust, without doubt! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically identifies the main change: halting agent-job retries when a backend session-expired condition (401 JWT lapse) is detected, with a reference ticket.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

src/openhuman/cron/scheduler.rs (1)

277-289: 💤 Low value

Consider adding a debug log when halting on session-expired.

A brief tracing::debug! here would help operators diagnose why a cron job stopped retrying, and aligns with the coding guideline requiring logs at state transitions.

🔧 Suggested trace-level log

         if is_session_expired_failure(
             &job.job_type,
             last_agent_error.as_deref(),
             last_output.as_str(),
         ) {
+            tracing::debug!(
+                job_id = %job.id,
+                attempt = attempt,
+                "[cron] halting retry loop — backend session expired"
+            );
             // Halt on the first occurrence — the inference layer already
             // published `SessionExpired`, retries cannot recover until the
             // user re-auths, and the classifier considers this expected
             // user state (TAURI-RUST-N). See `is_session_expired_failure`
             // for the full rationale.
             session_expired = true;
             break;
         }

As per coding guidelines: "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with stable grep-friendly prefixes."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/cron/scheduler.rs` around lines 277 - 289, Add a debug log
immediately before breaking when is_session_expired_failure(...) returns true:
log the job identifier (use job.job_type or job id if available), the
last_agent_error and last_output values, and a stable grep-friendly prefix like
"cron:session-expired". Update the block around is_session_expired_failure to
call tracing::debug! with those fields right after setting session_expired =
true and before break so operators can see why the cron stopped retrying.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/cron/scheduler.rs`:
- Around line 277-289: Add a debug log immediately before breaking when
is_session_expired_failure(...) returns true: log the job identifier (use
job.job_type or job id if available), the last_agent_error and last_output
values, and a stable grep-friendly prefix like "cron:session-expired". Update
the block around is_session_expired_failure to call tracing::debug! with those
fields right after setting session_expired = true and before break so operators
can see why the cron stopped retrying.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb08cc99-36e9-402b-96b2-ddceb898aa51

📥 Commits

Reviewing files that changed from the base of the PR and between 87a91ae and b9dd2d4.

📒 Files selected for processing (2)

src/openhuman/cron/scheduler.rs
src/openhuman/cron/scheduler_tests.rs

CodeGhost21 requested a review from a team June 4, 2026 09:50

coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. sentry-traced-bug Bug identified via Sentry triage bug labels Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 4, 2026

View reviewed changes

senamakel merged commit 2e16d98 into tinyhumansai:main Jun 4, 2026
38 of 42 checks passed

coderabbitai Bot mentioned this pull request Jun 4, 2026

fix(credentials): reject expired session locally + gate usage polling — RCA for backend 401 flood (#3297) #3384

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N)#3350

fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N)#3350
senamakel merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/cron-401-session-expired-halt

CodeGhost21 commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CodeGhost21 commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

New is_session_expired_failure predicate

Halt on first occurrence in execute_job_with_retry

Why halt on first (not the standard REPEAT_FAILURE_THRESHOLD = 3)

Why not just swap report_error → report_error_or_expected

Tests

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CodeGhost21 commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

New `is_session_expired_failure` predicate

Halt on first occurrence in `execute_job_with_retry`

Why halt on first (not the standard `REPEAT_FAILURE_THRESHOLD = 3`)

Why not just swap `report_error` → `report_error_or_expected`

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading