Guard the cancel and signal-handler job-status writes too (fast-follow on #1338) #1342

Merged: mihow merged 4 commits into main from fix/1337-terminal-transition-chokepoint on Jun 22, 2026

Conversation

@mihow

@mihow mihow commented Jun 19, 2026

Collaborator

Summary

A focused fast-follow on #1338 (now merged). #1338 made the result handler's terminal status write safe under concurrency, but the job's status is written from several places. This PR routes the other lock-free terminal writers — cancel() and the two Celery task-signal handlers — through one shared guarded transition, so a stale writer can't resurrect a job another writer already finished.

The motivation is the same lost-update race as #1337: a plain save() of the whole Job row from an out-of-date snapshot can overwrite a status another worker just set. #1338 stopped the result handler; this stops cancel() and the signal handlers doing it from the other direction (e.g. a late task-success signal flipping a just-cancelled job back to SUCCESS).
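The lost-update race described above can be reproduced in miniature. The following is only an illustrative sketch using the standard-library sqlite3 module, not the project's Django code; the table layout and the helper names load and full_row_save are assumptions made up for the example.

```python
import sqlite3

# In-memory table standing in for the Job table (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, progress TEXT)")
conn.execute("INSERT INTO job VALUES (1, 'STARTED', '50%')")


def load(job_id):
    """Read a full snapshot of the row, like fetching a model instance."""
    status, progress = conn.execute(
        "SELECT status, progress FROM job WHERE id = ?", (job_id,)
    ).fetchone()
    return {"id": job_id, "status": status, "progress": progress}


def full_row_save(snapshot):
    """Write every column back from the snapshot, like an unguarded save()."""
    conn.execute(
        "UPDATE job SET status = ?, progress = ? WHERE id = ?",
        (snapshot["status"], snapshot["progress"], snapshot["id"]),
    )


# Writer A (a late success signal) loads the job while it is still running.
stale = load(1)

# Writer B (cancel) finishes the job in the meantime.
conn.execute("UPDATE job SET status = 'REVOKED' WHERE id = 1")

# Writer A saves from its stale snapshot and overwrites the terminal status.
stale["status"] = "SUCCESS"
full_row_save(stale)

final_status = load(1)["status"]  # the REVOKED verdict has been lost
```

Nothing in the unconditional UPDATE checks what the row currently holds, which is why the later writer always wins regardless of how stale its snapshot is.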

On main (no longer stacked — #1338 is merged).

List of Changes

  1. One guarded helper for the lock-free terminal writers. Added Job._guarded_status_update(to_status, from_statuses, *, set_finished=False): a statement-scope UPDATE ... WHERE status IN (from_statuses) that holds no row lock (so it doesn't reintroduce the contention that #1261, "fix(jobs): fixes for concurrent ML processing jobs", removed) and advances the in-memory instance only when a row actually changes. The default from-set is JobState.finalizable_states().
  2. cancel() no longer clobbers a finished job, and async cancel no longer SIGTERMs the bootstrap. CANCELING/REVOKED now go through the guarded helper, so cancelling an already-finished job leaves its terminal status intact. Additionally, folding in the one useful change from #1324 ("Improve celery task dispatch and cancellation to prevent stuck jobs"), an ASYNC_API cancel now revokes the local run_job without terminate: that task only queues images and has usually finished, and the remote ADC work is stopped by the NATS/Redis teardown, not by killing the bootstrap. Sync/internal jobs still terminate (that task is the work). Teardown (cleanup_async_job_if_needed) runs unconditionally.
  3. The task-completion (task_postrun) and task-failure (task_failure) signal handlers can no longer resurrect a terminal job. The terminal SUCCESS write in update_job_status and the FAILURE write in update_job_failure go through the guarded helper; their existing pre-checks are unchanged. Minor behaviour change: a FAILURE set via the task-failure signal now also records finished_at, matching _fail_job and the result handler.
  4. Tests. TestTerminalTransitionChokepoint covers the guard for each writer; TestCancelCompletionRace reproduces the real concurrent interleaving between a cancel and a completing result batch in both directions (mirroring TestConcurrentStatusRace from #1338, "Stop a finished job from being pulled back to running by a slower worker"). 107 tests pass. Also validated on a dev deployment: cancelling a job mid-flight leaves it REVOKED with no resurrection and full NATS/Redis teardown.
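To make the shape of the guarded transition in item 1 concrete, here is a minimal sketch of the idea using sqlite3 instead of the Django ORM. It is not the code in ami/jobs/models.py: the Job class, the table columns, and the FINALIZABLE tuple (a stand-in for JobState.finalizable_states(), whose real contents are not shown in this PR) are all assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical stand-in for JobState.finalizable_states().
FINALIZABLE = ("CREATED", "PENDING", "STARTED")


class Job:
    """Toy in-memory instance backed by one row in a sqlite3 table."""

    def __init__(self, conn, job_id, status):
        self.conn, self.id, self.status, self.finished_at = conn, job_id, status, None

    def guarded_status_update(self, to_status, from_statuses=FINALIZABLE, *, set_finished=False):
        """One conditional UPDATE; returns True only if a row actually changed."""
        placeholders = ", ".join("?" for _ in from_statuses)
        finished_at = datetime.now(timezone.utc).isoformat() if set_finished else None
        cur = self.conn.execute(
            "UPDATE job SET status = ?, finished_at = COALESCE(?, finished_at) "
            f"WHERE id = ? AND status IN ({placeholders})",
            (to_status, finished_at, self.id, *from_statuses),
        )
        fired = cur.rowcount == 1
        if fired:
            # Advance the in-memory copy only when the database row moved.
            self.status = to_status
            if set_finished:
                self.finished_at = finished_at
        return fired


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, finished_at TEXT)")
conn.execute("INSERT INTO job (id, status) VALUES (1, 'STARTED')")

canceller = Job(conn, 1, "STARTED")
late_signal = Job(conn, 1, "STARTED")  # stale copy of the same row

first = canceller.guarded_status_update("REVOKED", set_finished=True)
second = late_signal.guarded_status_update("SUCCESS", set_finished=True)
db_status = conn.execute("SELECT status FROM job WHERE id = 1").fetchone()[0]
```

In this sketch the second writer's WHERE clause no longer matches, so it changes nothing and its in-memory status stays untouched; callers can branch on the boolean to decide whether any follow-up save is warranted.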

Detailed Description

There are six terminal-status writers (the earlier "five" count missed the reaper). One, the result handler, was already guarded by #1338. This PR brings the three remaining lock-free ones onto the guarded transition; the two lock-based ones (_fail_job and the reaper) are already safe, and the reaper intentionally keeps a broader from-set:

| Writer | Discipline |
| --- | --- |
| _update_job_progress (result handler) | guarded conditional UPDATE (#1338) |
| Job.cancel() | guarded conditional UPDATE (this PR) |
| update_job_status (task_postrun) | guarded conditional UPDATE (this PR) |
| update_job_failure (task_failure) | guarded conditional UPDATE (this PR) |
| _fail_job | select_for_update + terminal/CANCELING precondition (already safe) |
| check_stale_jobs (reaper) | select_for_update; intentionally keeps a broader from-set so it can force a genuinely stuck CANCELING/UNKNOWN job terminal as last resort |

So _guarded_status_update is the chokepoint for the lock-free writers — not a literal "single chokepoint" for all status writes; the two lock-based writers enforce the same no-resurrect invariant under their row lock. (An earlier docstring overclaimed this; corrected here.)

cancel()'s REVOKED transition includes CANCELING in its from-set, since cancel sets CANCELING itself and must complete that progression.
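The cancel progression can be sketched as two guarded steps with different from-sets. This is a simplified model under stated assumptions, not the actual Job.cancel(): the dict-based store, the guarded_update function, the FINALIZABLE set, the "INTERNAL" dispatch-mode name, and the log list that records revoke and teardown calls are all hypothetical.

```python
# Hypothetical stand-in for JobState.finalizable_states().
FINALIZABLE = frozenset({"CREATED", "PENDING", "STARTED"})

store = {}  # job_id -> status, standing in for the job table


def guarded_update(job_id, to_status, from_statuses):
    """Compare-and-set: change the status only if the current one is allowed."""
    if store[job_id] in from_statuses:
        store[job_id] = to_status
        return True
    return False


def cancel(job_id, dispatch_mode, log):
    """Sketch of the cancel flow; `log` records the side effects that would run."""
    guarded_update(job_id, "CANCELING", FINALIZABLE)
    # ASYNC_API: the local task only queued work, so revoke it without terminate.
    log.append(("revoke", {"terminate": dispatch_mode != "ASYNC_API"}))
    # REVOKED must be reachable from CANCELING, which cancel() itself just set.
    fired = guarded_update(job_id, "REVOKED", FINALIZABLE | {"CANCELING"})
    # Teardown always runs: a finished job may still hold NATS/Redis resources.
    log.append(("teardown", {}))
    return fired


store.update({1: "STARTED", 2: "SUCCESS"})
log_running, log_done = [], []
running_result = cancel(1, "ASYNC_API", log_running)
finished_result = cancel(2, "INTERNAL", log_done)
```

Without CANCELING in the second from-set, the first step would lock the job out of its own REVOKED transition; with it, a running job ends REVOKED while an already-SUCCESS job keeps its status and still gets its teardown.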

Supersedes the cancel() rewrite in #1324 (its skip-terminate idea is folded in here). #1324's other parts — a no-op CELERY_WORKER_POOL_OPTIMIZATION="fair" setting and a consumer_timeout-sensitive acks_late change — are tracked separately; see the note on #1324.

How to Test the Changes

  • pytest ami/jobs/tests/test_tasks.py::TestTerminalTransitionChokepoint
  • pytest ami/jobs/tests/test_tasks.py::TestCancelCompletionRace
  • Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (107 passed).
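For readers who want a feel for what a cancel-versus-completion race test asserts, here is a rough standalone sketch. It is not TestCancelCompletionRace itself: it uses plain threads and an in-memory StatusRow class (a hypothetical stand-in whose lock models the atomicity of a single UPDATE statement) rather than the database, and it collapses cancel into a single step.

```python
import threading


class StatusRow:
    """Stand-in for one job row; the lock models one atomic UPDATE statement."""

    def __init__(self, status):
        self.status = status
        self._lock = threading.Lock()

    def guarded_update(self, to_status, from_statuses):
        with self._lock:
            if self.status in from_statuses:
                self.status = to_status
                return True
            return False


def race_once():
    """Release a cancel and a completion at the same moment; report who won."""
    row = StatusRow("STARTED")
    barrier = threading.Barrier(2)
    results = {}

    def writer(name, to_status):
        barrier.wait()  # both writers start together
        results[name] = row.guarded_update(to_status, {"STARTED"})

    threads = [
        threading.Thread(target=writer, args=("cancel", "REVOKED")),
        threading.Thread(target=writer, args=("complete", "SUCCESS")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return row.status, results


outcomes = [race_once() for _ in range(200)]
```

Which writer wins varies from run to run, so the useful assertions are invariants: exactly one transition fires per race, and the final status always belongs to the writer that fired.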

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Refs #1337. Supersedes the cancel rewrite in #1324.

Summary by CodeRabbit

  • Bug Fixes

    • Hardened job status transitions so late completion/failure updates can’t overwrite already-terminal outcomes.
    • Improved job cancellation reliability, including safer revocation behavior for async jobs and ensuring cleanup runs as expected.
    • Completion/failure handlers now only finalize jobs when they’re in a valid pre-terminal state.
  • Tests

    • Added regression tests covering terminal transition race conditions and cancellation/completion interleavings to verify terminal states can’t be resurrected.

@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit a8bce8d
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a349762e518b20007c62536

@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 51b8ec7
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a39c5905fc148000894025c

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16719398-9696-4c22-aae9-cad5967a4931

📥 Commits

Reviewing files that changed from the base of the PR and between f3cf90a and 51b8ec7.

📒 Files selected for processing (1)
  • ami/jobs/models.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ami/jobs/models.py

Walkthrough

Adds Job._guarded_status_update(), a filter-based ORM helper that transitions job status only from an allowed source set and returns whether the transition fired. Job.cancel() and the Celery task_postrun/task_failure signal handlers are rewired to use this helper, preventing concurrent terminal state resurrection or clobbering. New tests cover both static terminal-state protection and synchronized cancel/completion race interleavings.

Changes

Terminal Transition Hardening

| Layer / File(s) | Summary |
| --- | --- |
| Guarded status update helper (ami/jobs/models.py) | Adds _guarded_status_update() which performs a conditional ORM filter-based UPDATE guarded by status__in=from_statuses, updates in-memory status, optionally finished_at, and progress.summary.status, and returns the row count (0 or 1) to signal whether the transition fired. |
| Job.cancel() with guarded transitions (ami/jobs/models.py) | Rewires Job.cancel() to advance through CANCELING then REVOKED only via guarded transitions, uses termination-free revocation for ASYNC_API jobs, and guarantees cleanup_async_job_if_needed teardown after guarded status updates, preventing resurrection of already-terminal states. |
| Celery signal handlers with guarded terminal transitions (ami/jobs/tasks.py) | Routes terminal SUCCESS in update_job_status and terminal FAILURE in update_job_failure through _guarded_status_update constrained to finalizable_states, with conditional field saves only when the guarded update indicates a transition, replacing unconditional update_status + save calls. |
| Regression and race-condition tests (ami/jobs/tests/test_tasks.py) | Adds TestTerminalTransitionChokepoint for static scenarios and TestCancelCompletionRace for synchronized thread interleavings, asserting first-writer-wins semantics and that no terminal state can be resurrected or clobbered by a late cancel or completion signal. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • RolnickLab/antenna#1062: Modifies the same update_job_status and update_job_failure Celery signal handlers in ami/jobs/tasks.py that this PR now hardens with guarded transitions.
  • RolnickLab/antenna#1114: Adds an earlier guard in update_job_status for premature SUCCESS, directly related to the same terminal SUCCESS handling surface this PR now locks via _guarded_status_update.
  • RolnickLab/antenna#1338: Changes the same files (ami/jobs/tasks.py, ami/jobs/tests/test_tasks.py) with the same goal of preventing terminal job state resurrection via conditional update guards.

Suggested labels

PSv2

🐇 Hoppity hop, the states won't collide,
A guarded UPDATE keeps the winners inside.
No REVOKED resurrected by a late SUCCESS ghost,
The first terminal writer wins — that's what matters most!
🌿 Two races now tested, synchronized with care,
The rabbit says: concurrent bugs, beware! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the PR's main change: guarding job-status writes in cancel() and signal handlers as a fast-follow to #1338. |
| Description check | ✅ Passed | The description covers all required sections: summary, list of changes, detailed explanation, testing instructions, and completed checklist items matching the repository template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Base automatically changed from fix/1337-conditional-status-transition to main June 19, 2026 21:18
@mihow mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21
@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 51b8ec7
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a39c5902003a100083ef2dc

mihow added a commit that referenced this pull request Jun 19, 2026
…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mihow added a commit that referenced this pull request Jun 19, 2026
…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every
in-flight result that arrived after a job finished — but _fail_job no-ops on a
terminal job, so after a cancel (which deletes the Redis state) this fired once
per in-flight batch and misdescribed normal cleanup as a failure. Now: a
terminal job logs at info ('ignoring in-flight result for already-terminal
job'); only a NON-terminal job with missing state logs the warning, which is the
case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes —
ticket numbers and race-theory in the runtime message. Move the rationale and
the issue reference into code comments and make the log lines plain operational
statements an operator can act on without chasing a ticket. Also drop the
redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a
  cancel-in-flight result logs the benign info line instead of the misleading
  'still running / marking it failed' warning (matches _fail_job's no-op set).
  Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and
  attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…point

Issue #1337 is a lost-update race on the job status column. PR #1338 fixed
one writer — `_update_job_progress` — by splitting the terminal status write
out of the progress-blob save and performing it as a guarded, statement-scope
UPDATE that only fires from a pre-terminal status. The other four terminal
writers still did an unguarded full-row `save()` and could clobber a terminal
status from the opposite direction: a cancel could overwrite a just-committed
SUCCESS with REVOKED, and a stale `task_postrun` SUCCESS or `task_failure`
FAILURE could resurrect a job another writer had already revoked.

This change adds a single `Job._guarded_status_update(to_status, from_statuses,
*, set_finished=False)` helper that performs the guarded UPDATE (no row lock,
so it does not reintroduce the contention #1261 removed) and advances the
in-memory instance only when the transition actually fires. The remaining
terminal writers are routed through it:

- `Job.cancel()`: CANCELING and REVOKED are now guarded UPDATEs. The
  `task.revoke()` and `cleanup_async_job_if_needed()` calls still run
  regardless of whether the guard fired, since a job may already be terminal
  but still need its NATS/Redis resources released.
- `update_job_status` (task_postrun): only the terminal SUCCESS path is
  guarded; non-terminal celery states still flow through the dual-use
  `update_status()` unchanged.
- `update_job_failure` (task_failure): the terminal FAILURE write is guarded,
  keeping the existing in-flight-async deferral guard intact.

`_update_job_progress` and `_fail_job` are left as-is: the former is already
guarded by #1338, and the latter is already safe via `select_for_update` plus a
status precondition.

After a guarded transition, callers persist `progress.summary.status` into the
JSONB with a narrow `save(update_fields=["progress", ...])` rather than a full
save, matching #1338 and avoiding clobbering other columns. The save only
happens when the guard fired, so an already-terminal job keeps both its status
column and its summary.status.

One intentional behavior change: `update_job_failure` now sets `finished_at`
when it marks FAILURE (it previously left it unset), making a failed terminal
job consistent with `_fail_job` and the result handler.

Adds sequential regression tests (postrun/failure cannot resurrect a REVOKED
job; cancel of an already-SUCCESS job no-ops on status but still cleans up) and
two real-concurrency tests that interleave cancel against a completing result
batch in both directions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not
  routed through _guarded_status_update. Correct the docstring's false 'single
  chokepoint' claim: this helper is the chokepoint for the lock-free writers;
  _fail_job and the reaper enforce the same no-resurrect invariant under
  select_for_update (the reaper deliberately keeps a broader from-set so it can
  still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes
  the local run_job WITHOUT terminate. That task only queues images and has
  usually finished; the remote ADC work is stopped by the NATS/Redis teardown,
  not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mihow mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from aaa5365 to f112e6d Compare June 20, 2026 01:01
@mihow mihow marked this pull request as ready for review June 22, 2026 17:19
Copilot AI review requested due to automatic review settings June 22, 2026 17:19

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens terminal job-status writes against lost-update races by routing previously lock-free terminal writers (job cancel path and Celery task signal handlers) through a shared guarded, conditional UPDATE ... WHERE status IN (...) transition so stale writers can’t resurrect already-finished jobs.

Changes:

  • Added Job._guarded_status_update(...) and updated Job.cancel() to use guarded transitions for CANCELING/REVOKED while still performing async resource teardown.
  • Updated Celery task_postrun (SUCCESS) and task_failure (FAILURE) handlers to use the guarded transition and set finished_at on FAILURE when the transition fires.
  • Added regression and concurrency-interleaving tests covering the guarded chokepoint and cancel-vs-complete races.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ami/jobs/models.py | Introduces _guarded_status_update and updates cancel() to prevent terminal-status clobbering while preserving teardown behavior. |
| ami/jobs/tasks.py | Routes terminal SUCCESS/FAILURE writes in Celery signal handlers through the guarded transition to prevent resurrection/clobber. |
| ami/jobs/tests/test_tasks.py | Adds regression tests for the guarded terminal-writer chokepoint and reproduces the real cancel/completion race via threaded interleavings. |

Comment thread ami/jobs/tasks.py Outdated
Comment thread ami/jobs/tasks.py Outdated
Fix the double '#' in the issue reference and reword the task_postrun /
task_failure comments in plainer, team-readable terms (drop session
shorthand): describe the guarded write as only transitioning a job still
in a non-terminal state.
…g effect

Rewrite the helper's docstring to open with what it prevents (concurrent,
stale status writes flipping a finished job back to running or a
cancelled/failed job to SUCCESS, plus the knock-on bugs since job status
drives claimability, cleanup, and the UI) before the implementation, and
in plainer prose.
@mihow mihow merged commit c7adbab into main Jun 22, 2026
7 checks passed
@mihow mihow deleted the fix/1337-terminal-transition-chokepoint branch June 22, 2026 23:37

2 participants