Skip to content

fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N)#3350

Merged
senamakel merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/cron-401-session-expired-halt
Jun 4, 2026
Merged

fix(cron): halt agent-job retry on backend session-expired (TAURI-RUST-N)#3350
senamakel merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/cron-401-session-expired-halt

Conversation

@CodeGhost21
Copy link
Copy Markdown
Contributor

@CodeGhost21 CodeGhost21 commented Jun 4, 2026

Summary

  • Sentry TAURI-RUST-N — 7,038 events / 5 users on openhuman@0.57.13 (and earlier). Tags: domain=cron, operation=agent_job, agent_id=morning_briefing, failure=retries_exhausted. Title: OpenHuman API error (401 Unauthorized): {"success":false,"error":"Invalid token"}.
  • After the user's OpenHuman backend JWT lapses, inference::provider::ops::api_error already publishes DomainEvent::SessionExpired for the 401 and intentionally skips its own report_error. But the cron retry loop in cron::scheduler::execute_job_with_retry doesn't consult that signal — it sleeps with exponential backoff, retries the same job N times (every attempt hitting the same global 401), then unconditionally calls report_error with failure=retries_exhausted.
  • Fix mirrors #3334's halt-on-first-occurrence pattern (BACKEND_USER_STATE_MARKER in agent::harness::tool_loop) but at the cron retry layer: detect the session-expired wire shape via the existing core::observability::is_session_expired_message classifier, halt the loop on the first failed attempt, and skip the retries_exhausted Sentry capture (the classifier already considers this expected user state, the SessionExpired event already drives the credentials + scheduler-gate handshake).

Problem

inference/provider/ops.rs::api_error treats backend 401/403 as expected session-lapse user state — it publishes DomainEvent::SessionExpired via publish_backend_session_expired so the credentials subscriber clears the stored session and the scheduler-gate signed_out override halts downstream LLM work — and deliberately does not call report_error at that site.

That handshake leaves a gap in cron/scheduler.rs::execute_job_with_retry:

for attempt in 0..=retries {
    let (success, output, agent_error) = ...;     // hits 401, fails
    last_output = output;
    if agent_error.is_some() { last_agent_error = agent_error; }
    if success { return ...; }
    if attempt < retries {
        time::sleep(...).await;                   // backoff and retry the same global-401 call
        backoff_ms = (backoff_ms * 2).min(30_000);
    }
}
if matches!(job.job_type, JobType::Agent) {
    crate::core::observability::report_error(    // → Sentry as failure=retries_exhausted
        report_message, "cron", "agent_job", &[("failure", "retries_exhausted"), ...],
    );
}

The retry can't recover until the user re-auths, but the loop runs to exhaustion and the final report_error captures a 401 wire shape the existing classifier already knows is expected user state. The 7,038-event volume is one Windows user's cron-fired morning_briefing agent grinding through retries every poll for ~16 days.

The classifier core::observability::is_session_expired_message was already extended for OPENHUMAN-TAURI-4P0 (observability.rs:615):

|| (msg.contains("OpenHuman API error (401")
    && msg.contains("\"error\":\"Invalid token\""))

so the predicate to consult is already there — the cron retry just never asks.

Solution

Two coordinated changes in src/openhuman/cron/scheduler.rs:

New is_session_expired_failure predicate

fn is_session_expired_failure(
    job_type: &JobType,
    last_agent_error: Option<&str>,
    last_output: &str,
) -> bool {
    if !matches!(job_type, JobType::Agent) {
        return false;
    }
    let signal = last_agent_error.unwrap_or(last_output);
    crate::core::observability::is_session_expired_message(signal)
}
  • Matches on last_agent_error first because run_agent_job routes the raw anyhow::Error::to_string() chain there (containing the provider's wire message), while last_output only carries the canned user-facing notification (AGENT_JOB_USER_FAILURE_MESSAGE / per-variant copy).
  • Falls back to last_output as defense-in-depth so a future code path that surfaces the raw error there isn't a silent miss.
  • Restricted to JobType::Agent because the SessionExpired publish + scheduler-gate handshake only fires from the inference layer; halting a shell job because its stdout happens to echo a 401-shaped string would skip retries the operator may want.

Halt on first occurrence in execute_job_with_retry

if is_session_expired_failure(
    &job.job_type,
    last_agent_error.as_deref(),
    last_output.as_str(),
) {
    session_expired = true;
    break;
}
// ...
if matches!(job.job_type, JobType::Agent) && !session_expired {
    crate::core::observability::report_error(...);  // unchanged
}

When session_expired == true we skip the retries_exhausted capture entirely, since the classifier already considers it expected user state.

Why halt on first (not the standard REPEAT_FAILURE_THRESHOLD = 3)

Same reasoning as #3334's BACKEND_USER_STATE_MARKER: the condition is global — every paid LLM call will hit the same 401 until the user re-auths. Pivoting query / model / args can't help, so the standard retry budget is wasted attempts. Halting on first matches the tool_loop precedent and stops the backoff-sleep storm too.

Why not just swap report_errorreport_error_or_expected

That would quiet Sentry but leave the agent burning N×(300ms..30s exponential backoff) per cron tick on a 401 that can never recover. Per the project's feedback-sentry-skip-vs-real-fix convention, fix the bug (the wasted retry) rather than just demote the symptom.

Tests

Five focused unit tests pin the predicate behaviour:

Test Pins
..._matches_openhuman_backend_401_in_agent_error 401 wire shape in last_agent_error trips the halt
..._matches_when_only_output_carries_signal defense-in-depth fallback when last_agent_error == None
..._does_not_match_canned_user_message AGENT_JOB_USER_FAILURE_MESSAGE doesn't false-positive (non-401 agent failures keep normal retry semantics)
..._does_not_match_ordinary_provider_error 500s and third-party BYO-key 401s (OpenAI API error (401 ...) Invalid API key) stay on the retry path (those are actionable)
..._does_not_halt_shell_jobs scope guard — JobType::Shell always returns false regardless of stdout content
$ cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler::tests::is_session_expired_failure
running 5 tests
test ..._does_not_halt_shell_jobs ... ok
test ..._matches_openhuman_backend_401_in_agent_error ... ok
test ..._matches_when_only_output_carries_signal ... ok
test ..._does_not_match_ordinary_provider_error ... ok
test ..._does_not_match_canned_user_message ... ok
test result: ok. 5 passed; 0 failed; 0 ignored

$ cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler
test result: ok. 55 passed; 0 failed; 0 ignored

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — 5 new tests cover the new predicate's match arm (positive via last_agent_error and via last_output fallback), both negative arms (canned message, ordinary provider error / BYO-key 401), and the JobType::Shell scope guard. The 3-line call-site change in execute_job_with_retry is exercised through the predicate (no separate behaviour beyond break).
  • Coverage matrix updated — N/A: behaviour-only change (no feature added/removed/renamed; tightens an existing safety net inside the cron retry loop).
  • All affected feature IDs from the matrix are listed under ## RelatedN/A: behaviour-only change.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: internal cron error routing, no release-cut surface touched.
  • Linked issue closed via Closes #NNN in the ## Related section — N/A: Sentry-tracked issue (TAURI-RUST-N), no GitHub issue.

Impact

  • Desktop Rust core only (src/openhuman/cron/scheduler.rs). No UI, no schema, no migration.
  • Sentry: stops the failure=retries_exhausted capture stream from cron-fired agent jobs hitting backend 401. The inference layer already publishes SessionExpired so the credentials/scheduler-gate handshake continues to drive re-auth.
  • Resource: agent cron jobs no longer waste N×(300ms..30s exponential backoff) per tick on a 401 they can't recover. The morning_briefing user case dropped from ~5 attempts per poll to 1.
  • No change to genuine failures: non-401 agent failures (provider down, tool error, network blip, third-party BYO-key 401) keep their full REPEAT_FAILURE_THRESHOLD = scheduler_retries retry budget and still surface failure=retries_exhausted to Sentry — that's what the BYO-key negative-guard test pins.
  • Security: no secrets logged; the predicate is a string scan with conjunctive anchors already audited for cross-provider false-positives in OPENHUMAN-TAURI-4P0.

Related

  • Closes: N/A (Sentry TAURI-RUST-N, no linked GitHub issue)
  • Pattern peer: #3334 (halt-on-first user-state failure in the agent tool loop) — same convention, different layer.
  • Adjacent surface: #3336 (typed-error flatten for team/billing authed_json 401s) — covers a different code path (RPC ops), no overlap with cron retries.
  • Follow-up PR(s)/TODOs: none. A AgentError::ProviderSessionExpired variant could give the user-visible notification more actionable copy than the current generic "Something went wrong" — strictly orthogonal, the user-visible behaviour today is already governed by classify_agent_anyhow_for_user.

Sentry-Issue: TAURI-RUST-N


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A (Sentry TAURI-RUST-N)

Commit & Branch

  • Branch: fix/cron-401-session-expired-halt
  • Commit SHA: see PR head

Validation Run

  • N/A: Rust-only changepnpm --filter openhuman-app format:check
  • N/A: Rust-only changepnpm typecheck
  • Focused tests: cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler::tests::is_session_expired_failure → 5 passed; cargo test --manifest-path Cargo.toml --lib -- openhuman::cron::scheduler → 55 passed (no regression)
  • Rust fmt/check (if changed): cargo fmt --manifest-path Cargo.toml -- --check clean; cargo check --manifest-path Cargo.toml --tests exit 0
  • N/A: Tauri shell untouched — Tauri fmt/check

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: cron-fired agent jobs that hit a backend session-expired 401 now halt the retry loop on the first occurrence and skip the failure=retries_exhausted Sentry capture, instead of running through scheduler_retries attempts before reporting.
  • User-visible effect: same canned failure notification surfaces to the user, but immediately (after ~one attempt) rather than after N×backoff sleeps. Other failure shapes unchanged.

Parity Contract

  • Legacy behavior preserved: every non-JobType::Agent job keeps existing retry semantics; agent jobs whose failure does not match is_session_expired_message keep the full scheduler_retries budget and still emit failure=retries_exhausted to Sentry. Shell jobs always return early from the predicate (no behavioural change to the shell branch).
  • Guard/fallback/dispatch parity checks: covered by ..._does_not_match_canned_user_message, ..._does_not_match_ordinary_provider_error, ..._does_not_halt_shell_jobs.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none (verified — no open PR on cron 401/scheduler retry session/TAURI-RUST-N keywords or execute_job_with_retry surface)
  • Canonical PR: this
  • Resolution (closed/superseded/updated): N/A

Summary by CodeRabbit

  • Bug Fixes

    • Session expiration handling improved—when authentication tokens become invalid, the system now immediately halts retry attempts instead of exhausting the retry queue, preventing misleading "retries exhausted" error messages.
  • Tests

    • Added comprehensive test coverage for session expiration detection and retry halt behavior.

…I-RUST-N)

After the user's OpenHuman backend JWT lapses, the inference layer's
`api_error` already publishes `DomainEvent::SessionExpired` for the 401
and intentionally skips its own `report_error` call. But the cron retry
loop in `execute_job_with_retry` doesn't consult that signal — it sleeps
with exponential backoff, retries the same job N times (every attempt
hitting the same global 401), then unconditionally calls `report_error`
with `failure=retries_exhausted`. That generated TAURI-RUST-N: 7,038
events / 5 users, all `domain=cron operation=agent_job` from a
`morning_briefing` agent grinding through retries after one lapse.

Fix mirrors PR tinyhumansai#3334's halt-on-first-occurrence pattern, but at the
cron retry layer instead of the agent tool loop:

- New `is_session_expired_failure` predicate consults the existing
  `core::observability::is_session_expired_message` classifier (the
  `OpenHuman API error (401` + `"error":"Invalid token"` conjunction
  was already added for OPENHUMAN-TAURI-4P0). Matches on
  `last_agent_error` first (carries the raw provider wire chain),
  falls back to `last_output` for defense-in-depth.
- `execute_job_with_retry` breaks out of the loop on the first
  occurrence — no further attempts, no `report_error` call. The
  inference layer's `SessionExpired` publish already drives the
  credentials/scheduler-gate handshake; retries can't recover until
  the user re-auths.
- Restricted to `JobType::Agent`: shell jobs that happen to echo a
  401-shaped string keep their existing retry semantics (no
  `SessionExpired` publish from shell stdout, no reason to flip the
  gate).

Five focused unit tests pin the predicate behaviour: the 401 wire
shape trips the halt via `last_agent_error` or `last_output`; the
canned user message, ordinary provider errors (incl. third-party
BYO-key 401s), and shell-job invocations all stay on the retry path.

Sentry-Issue: TAURI-RUST-N
@CodeGhost21 CodeGhost21 requested a review from a team June 4, 2026 09:50
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The scheduler now detects when an agent job fails due to session expiration (a 401 "Invalid token" backend response) and immediately halts the retry loop instead of continuing exponential backoff. Retry-exhaustion error reporting is suppressed for these expected authentication states. Comprehensive tests verify the classifier matches only the specific wire shape and respects job-type boundaries.

Changes

Session-Expired Detection and Early Halt

Layer / File(s) Summary
Session-expired classifier and retry-loop integration
src/openhuman/cron/scheduler.rs
is_session_expired_failure helper detects agent-job 401 "Invalid token" failures by routing last_agent_error or last_output through the core observability session-expired classifier. The retry loop introduces a session_expired flag, breaks on first match, and suppresses "retries exhausted" reporting for agent jobs where the flag is set.
Classifier and halt behavior test coverage
src/openhuman/cron/scheduler_tests.rs
Five new test functions establish classifier boundaries: positive match on OpenHuman 401 wire, fallback matching via last_output only, non-matching of generic agent-failure messages and third-party 401s, and scope enforcement that only JobType::Agent (not JobType::Shell) halts on session expiration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2188: Pins the core is_session_expired_message matcher via sentinel-string unit tests, which the main PR now uses to classify agent-job 401s.
  • tinyhumansai/openhuman#2924: Extends the is_session_expired_message classifier to recognize a billing 401 "session token rejected" string, directly affecting when the scheduler will halt and suppress reporting.
  • tinyhumansai/openhuman#2200: Similar fail-fast pattern applied to a different retry loop (ReliableProvider) using the same is_session_expired_message classifier.

Suggested labels

rust-core, bug, agent, sentry-traced-bug

Suggested reviewers

  • oxoxDev
  • graycyrus
  • senamakel

Poem

🐰 A token expires, the agent cries out,
No more backoff loops, we pivot about!
One 401 check, then halt and report,
No false "retries spent" at the court,
Swift fail-fast logic in Rust, without doubt! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: halting agent-job retries when a backend session-expired condition (401 JWT lapse) is detected, with a reference ticket.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. sentry-traced-bug Bug identified via Sentry triage bug labels Jun 4, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/openhuman/cron/scheduler.rs (1)

277-289: 💤 Low value

Consider adding a debug log when halting on session-expired.

A brief tracing::debug! here would help operators diagnose why a cron job stopped retrying, and aligns with the coding guideline requiring logs at state transitions.

🔧 Suggested trace-level log
         if is_session_expired_failure(
             &job.job_type,
             last_agent_error.as_deref(),
             last_output.as_str(),
         ) {
+            tracing::debug!(
+                job_id = %job.id,
+                attempt = attempt,
+                "[cron] halting retry loop — backend session expired"
+            );
             // Halt on the first occurrence — the inference layer already
             // published `SessionExpired`, retries cannot recover until the
             // user re-auths, and the classifier considers this expected
             // user state (TAURI-RUST-N). See `is_session_expired_failure`
             // for the full rationale.
             session_expired = true;
             break;
         }

As per coding guidelines: "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with stable grep-friendly prefixes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/cron/scheduler.rs` around lines 277 - 289, Add a debug log
immediately before breaking when is_session_expired_failure(...) returns true:
log the job identifier (use job.job_type or job id if available), the
last_agent_error and last_output values, and a stable grep-friendly prefix like
"cron:session-expired". Update the block around is_session_expired_failure to
call tracing::debug! with those fields right after setting session_expired =
true and before break so operators can see why the cron stopped retrying.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/cron/scheduler.rs`:
- Around line 277-289: Add a debug log immediately before breaking when
is_session_expired_failure(...) returns true: log the job identifier (use
job.job_type or job id if available), the last_agent_error and last_output
values, and a stable grep-friendly prefix like "cron:session-expired". Update
the block around is_session_expired_failure to call tracing::debug! with those
fields right after setting session_expired = true and before break so operators
can see why the cron stopped retrying.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb08cc99-36e9-402b-96b2-ddceb898aa51

📥 Commits

Reviewing files that changed from the base of the PR and between 87a91ae and b9dd2d4.

📒 Files selected for processing (2)
  • src/openhuman/cron/scheduler.rs
  • src/openhuman/cron/scheduler_tests.rs

@senamakel senamakel merged commit 2e16d98 into tinyhumansai:main Jun 4, 2026
38 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. sentry-traced-bug Bug identified via Sentry triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants