Guard the cancel and signal-handler job-status writes too (fast-follow on #1338) by mihow · Pull Request #1342 · RolnickLab/antenna

mihow · 2026-06-19T01:11:59Z

Summary

A focused fast-follow on #1338 (now merged). #1338 made the result handler's terminal status write safe under concurrency, but the job's status is written from several places. This PR routes the other lock-free terminal writers — cancel() and the two Celery task-signal handlers — through one shared guarded transition, so a stale writer can't resurrect a job another writer already finished.

The motivation is the same lost-update race as #1337: a plain save() of the whole Job row from an out-of-date snapshot can overwrite a status another worker just set. #1338 stopped the result handler; this stops cancel() and the signal handlers doing it from the other direction (e.g. a late task-success signal flipping a just-cancelled job back to SUCCESS).
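To make the race concrete, here is a minimal sketch that uses Python's built-in `sqlite3` instead of the project's Django models. The table layout and the helper names (`load_snapshot`, `full_row_save`, `demo_lost_update`) are assumptions for illustration only; the point is just that writing every column back from an old snapshot silently undoes a status another writer committed in the meantime.

```python
import sqlite3


def make_db():
    # Hypothetical single-row job table; not the project's real schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, progress TEXT)")
    conn.execute("INSERT INTO job VALUES (1, 'STARTED', '50%')")
    conn.commit()
    return conn


def load_snapshot(conn, job_id):
    """Read the whole row into memory, like loading a model instance."""
    row = conn.execute(
        "SELECT id, status, progress FROM job WHERE id = ?", (job_id,)
    ).fetchone()
    return {"id": row[0], "status": row[1], "progress": row[2]}


def full_row_save(conn, snapshot):
    """Write every column back from the snapshot, like an unguarded save()."""
    conn.execute(
        "UPDATE job SET status = ?, progress = ? WHERE id = ?",
        (snapshot["status"], snapshot["progress"], snapshot["id"]),
    )
    conn.commit()


def demo_lost_update():
    conn = make_db()
    # A slow success handler loads its snapshot while the job is still STARTED.
    stale = load_snapshot(conn, 1)
    # Meanwhile a cancel commits REVOKED.
    conn.execute("UPDATE job SET status = 'REVOKED' WHERE id = 1")
    conn.commit()
    # The late handler marks its stale copy SUCCESS and saves the whole row.
    stale["status"] = "SUCCESS"
    full_row_save(conn, stale)
    return conn.execute("SELECT status FROM job WHERE id = 1").fetchone()[0]
```

In this sketch `demo_lost_update()` ends with the row reading `SUCCESS`, even though the cancel had already committed `REVOKED`.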

On main (no longer stacked — #1338 is merged).

List of Changes

One guarded helper for the lock-free terminal writers. Added Job._guarded_status_update(to_status, from_statuses, *, set_finished=False): a statement-scope UPDATE ... WHERE status IN (from_statuses) that holds no row lock (so it doesn't reintroduce the contention #1261 removed) and advances the in-memory instance only when a row actually changes. The default from-set is JobState.finalizable_states().
cancel() no longer clobbers a finished job, and async cancel no longer SIGTERMs the bootstrap. CANCELING/REVOKED now go through the guarded helper, so cancelling an already-finished job leaves its terminal status intact. Additionally, folding in the one useful change from #1324, an ASYNC_API cancel now revokes the local run_job without terminate: that task only queues images and has usually finished, and the remote ADC work is stopped by the NATS/Redis teardown, not by killing the bootstrap. Sync/internal jobs still terminate (that task is the work). Teardown (cleanup_async_job_if_needed) runs unconditionally.
The task-success (task_postrun) and task-failure signal handlers can no longer resurrect a terminal job. The terminal SUCCESS write in update_job_status and the FAILURE write in update_job_failure go through the guarded helper; their existing pre-checks are unchanged. Minor behaviour change: a FAILURE set via the task-failure signal now also records finished_at, matching _fail_job and the result handler.
Tests. TestTerminalTransitionChokepoint covers the guard for each writer; TestCancelCompletionRace reproduces the real concurrent interleave between a cancel and a completing result batch in both directions (mirrors #1338's TestConcurrentStatusRace). 107 tests pass. Also validated on a dev deployment: cancelling a job mid-flight leaves it REVOKED with no resurrection and full NATS/Redis teardown.
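The guarded transition described above can be sketched outside Django with the standard-library `sqlite3` module. This is not the PR's implementation: the table, the `FINALIZABLE` tuple (a stand-in for `JobState.finalizable_states()`), and the function names are assumptions. It only illustrates the shape of a conditional `UPDATE ... WHERE status IN (...)` that reports whether a row changed.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed stand-in for JobState.finalizable_states(); the real set lives in the project.
FINALIZABLE = ("CREATED", "PENDING", "STARTED")


def make_db(initial_status="STARTED"):
    # Hypothetical single-row job table; not the project's real schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, finished_at TEXT)")
    conn.execute("INSERT INTO job (id, status) VALUES (1, ?)", (initial_status,))
    conn.commit()
    return conn


def guarded_status_update(conn, job_id, to_status, from_statuses=FINALIZABLE, *, set_finished=False):
    """Move the job to to_status only if its current status is in from_statuses.

    Returns True when a row changed, False when the guard rejected the write.
    """
    placeholders = ", ".join("?" for _ in from_statuses)
    finished_at = datetime.now(timezone.utc).isoformat() if set_finished else None
    cur = conn.execute(
        # COALESCE keeps the stored finished_at when no new timestamp is supplied.
        f"UPDATE job SET status = ?, finished_at = COALESCE(?, finished_at) "
        f"WHERE id = ? AND status IN ({placeholders})",
        (to_status, finished_at, job_id, *from_statuses),
    )
    conn.commit()
    return cur.rowcount == 1


def status_of(conn, job_id):
    return conn.execute("SELECT status FROM job WHERE id = ?", (job_id,)).fetchone()[0]
```

With a job that starts in `STARTED`, a first call moving it to `REVOKED` changes the row and returns `True`; a later call trying to set `SUCCESS` matches no row, returns `False`, and leaves `REVOKED` in place.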

Detailed Description

There are six terminal-status writers (the earlier "five" count missed the reaper). This PR brings the three remaining lock-free ones onto the guarded transition (the result handler was already guarded by #1338); the two lock-based ones, _fail_job and the reaper, are already safe, and the reaper intentionally keeps a broader from-set:

| Writer | Discipline |
| --- | --- |
| `_update_job_progress` (result handler) | guarded conditional UPDATE (#1338) |
| `Job.cancel()` | guarded conditional UPDATE (this PR) |
| `update_job_status` (task_postrun) | guarded conditional UPDATE (this PR) |
| `update_job_failure` (task_failure) | guarded conditional UPDATE (this PR) |
| `_fail_job` | `select_for_update` + terminal/CANCELING precondition (already safe) |
| `check_stale_jobs` (reaper) | `select_for_update`; intentionally keeps a broader from-set so it can force a genuinely stuck CANCELING/UNKNOWN job terminal as last resort |

So _guarded_status_update is the chokepoint for the lock-free writers — not a literal "single chokepoint" for all status writes; the two lock-based writers enforce the same no-resurrect invariant under their row lock. (An earlier docstring overclaimed this; corrected here.)

cancel()'s REVOKED transition includes CANCELING in its from-set, since cancel sets CANCELING itself and must complete that progression.
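A small in-memory sketch shows why the REVOKED step needs CANCELING in its from-set. The `FakeJob` class, the status names in `FINALIZABLE`, and the `cancel` function are hypothetical stand-ins, and the check here is not atomic the way the real single-statement UPDATE is; the sketch only illustrates the choice of from-sets.

```python
# Assumed stand-in for JobState.finalizable_states(); CANCELING is assumed not to be in it.
FINALIZABLE = frozenset({"CREATED", "PENDING", "STARTED"})


class FakeJob:
    """Hypothetical in-memory job with a guarded transition."""

    def __init__(self, status):
        self.status = status

    def guarded_update(self, to_status, from_statuses):
        # Only transition when the current status is an allowed source.
        if self.status in from_statuses:
            self.status = to_status
            return True
        return False


def cancel(job, revoked_from=FINALIZABLE | {"CANCELING"}):
    # Step 1: mark the job as being cancelled, unless it already finished.
    job.guarded_update("CANCELING", FINALIZABLE)
    # Step 2: finish the cancel. CANCELING must be an allowed source here,
    # because step 1 may have just set it.
    job.guarded_update("REVOKED", revoked_from)
    return job.status
```

Under these assumptions, `cancel(FakeJob("STARTED"))` ends in `REVOKED`, `cancel(FakeJob("SUCCESS"))` leaves `SUCCESS` untouched, and passing `revoked_from=FINALIZABLE` leaves a running job stuck in `CANCELING`.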

Supersedes the cancel() rewrite in #1324 (its skip-terminate idea is folded in here). #1324's other parts — a no-op CELERY_WORKER_POOL_OPTIMIZATION="fair" setting and a consumer_timeout-sensitive acks_late change — are tracked separately; see the note on #1324.

How to Test the Changes

pytest ami/jobs/tests/test_tasks.py::TestTerminalTransitionChokepoint
pytest ami/jobs/tests/test_tasks.py::TestCancelCompletionRace
Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (107 passed).
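For readers who want a feel for what a cancel-versus-completion interleaving test checks, here is a minimal sketch using only `threading`. It is not the project's `TestCancelCompletionRace`: the `StatusCell` class is a hypothetical in-memory stand-in for the status column, and its lock plays the role of the database applying each conditional UPDATE atomically.

```python
import threading


class StatusCell:
    """Hypothetical stand-in for the job's status column."""

    def __init__(self, status):
        self.status = status
        self._lock = threading.Lock()

    def guarded_update(self, to_status, from_statuses):
        # The lock makes check-and-set atomic, like a single UPDATE statement.
        with self._lock:
            if self.status in from_statuses:
                self.status = to_status
                return True
            return False


def race_once():
    cell = StatusCell("STARTED")
    barrier = threading.Barrier(2)  # release both writers at the same moment
    results = {}

    def writer(name, to_status):
        barrier.wait()
        results[name] = cell.guarded_update(to_status, {"STARTED"})

    threads = [
        threading.Thread(target=writer, args=("cancel", "REVOKED")),
        threading.Thread(target=writer, args=("complete", "SUCCESS")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.status, results
```

Whichever thread runs first wins; the other sees a non-matching status and reports `False`. The invariant worth asserting is that exactly one writer succeeds and the final status is the winner's.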

Checklist

I have tested these changes appropriately.
I have added and/or modified relevant tests.
I updated relevant documentation or comments.
I have verified that this PR follows the project's coding standards.
Any dependent changes have already been merged to main.

Refs #1337. Supersedes the cancel rewrite in #1324.

Summary by CodeRabbit

Bug Fixes
- Hardened job status transitions so late completion/failure updates can’t overwrite already-terminal outcomes.
- Improved job cancellation reliability, including safer revocation behavior for async jobs and ensuring cleanup runs as expected.
- Completion/failure handlers now only finalize jobs when they’re in a valid pre-terminal state.
Tests
- Added regression tests covering terminal transition race conditions and cancellation/completion interleavings to verify terminal states can’t be resurrected.

netlify · 2026-06-19T01:12:05Z

✅ Deploy Preview for antenna-preview canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `a8bce8d` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/6a349762e518b20007c62536 |

netlify · 2026-06-19T01:12:06Z

✅ Deploy Preview for antenna-preview canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `51b8ec7` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/6a39c5905fc148000894025c |

coderabbitai · 2026-06-19T01:12:07Z

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between f3cf90a and 51b8ec7.

📒 Files selected for processing (1)

ami/jobs/models.py

🚧 Files skipped from review as they are similar to previous changes (1)

ami/jobs/models.py

Walkthrough

Adds Job._guarded_status_update(), a filter-based ORM helper that transitions job status only from an allowed source set and returns whether the transition fired. Job.cancel() and the Celery task_postrun/task_failure signal handlers are rewired to use this helper, preventing concurrent terminal state resurrection or clobbering. New tests cover both static terminal-state protection and synchronized cancel/completion race interleavings.

Changes

Terminal Transition Hardening

| Layer / File(s) | Summary |
| --- | --- |
| Guarded status update helper `ami/jobs/models.py` | Adds `_guarded_status_update()` which performs a conditional ORM filter-based `UPDATE` guarded by `status__in=from_statuses`, updates in-memory `status`, optionally `finished_at`, and `progress.summary.status`, and returns the row count (0 or 1) to signal whether the transition fired. |
| `Job.cancel()` with guarded transitions `ami/jobs/models.py` | Rewires `Job.cancel()` to advance through `CANCELING` then `REVOKED` only via guarded transitions, uses termination-free revocation for `ASYNC_API` jobs, and guarantees `cleanup_async_job_if_needed` teardown after guarded status updates, preventing resurrection of already-terminal states. |
| Celery signal handlers with guarded terminal transitions `ami/jobs/tasks.py` | Routes terminal `SUCCESS` in `update_job_status` and terminal `FAILURE` in `update_job_failure` through `_guarded_status_update` constrained to `finalizable_states`, with conditional field saves only when the guarded update indicates a transition, replacing unconditional `update_status` + `save` calls. |
| Regression and race-condition tests `ami/jobs/tests/test_tasks.py` | Adds `TestTerminalTransitionChokepoint` for static scenarios and `TestCancelCompletionRace` for synchronized thread interleavings, asserting first-writer-wins semantics and that no terminal state can be resurrected or clobbered by a late cancel or completion signal. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337: This PR directly implements the atomic filter-based status__in guarded update pattern proposed in that issue to prevent concurrent terminal state clobbering across cancel, SUCCESS, and FAILURE paths.

Possibly related PRs

RolnickLab/antenna#1062: Modifies the same update_job_status and update_job_failure Celery signal handlers in ami/jobs/tasks.py that this PR now hardens with guarded transitions.
RolnickLab/antenna#1114: Adds an earlier guard in update_job_status for premature SUCCESS, directly related to the same terminal SUCCESS handling surface this PR now locks via _guarded_status_update.
RolnickLab/antenna#1338: Changes the same files (ami/jobs/tasks.py, ami/jobs/tests/test_tasks.py) with the same goal of preventing terminal job state resurrection via conditional update guards.

Suggested labels

PSv2

🐇 Hoppity hop, the states won't collide,
A guarded UPDATE keeps the winners inside.
No REVOKED resurrected by a late SUCCESS ghost,
The first terminal writer wins — that's what matters most!
🌿 Two races now tested, synchronized with care,
The rabbit says: concurrent bugs, beware! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the PR's main change: guarding job-status writes in cancel() and signal handlers as a fast-follow to `#1338`. |
| Description check | ✅ Passed | The description covers all required sections: summary, list of changes, detailed explanation, testing instructions, and completed checklist items matching the repository template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

netlify · 2026-06-19T21:21:06Z

✅ Deploy Preview for antenna-ssec canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `51b8ec7` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/6a39c5902003a100083ef2dc |

…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup.

No behaviour change: both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every in-flight result that arrived after a job finished, but _fail_job no-ops on a terminal job, so after a cancel (which deletes the Redis state) this fired once per in-flight batch and misdescribed normal cleanup as a failure. Now: a terminal job logs at info ('ignoring in-flight result for already-terminal job'); only a NON-terminal job with missing state logs the warning, which is the case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes: ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
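The info-versus-warning split these commits describe for a result whose Redis state is missing can be sketched as a small classification function. The status names in `TERMINAL_LIKE`, the function name, and the warning wording are assumptions paraphrased from the commit messages, not the code merged in #1343.

```python
import logging

# Assumed status names; CANCELING is treated as terminal-like, per the commit message.
TERMINAL_LIKE = frozenset({"SUCCESS", "FAILURE", "REVOKED", "CANCELING"})


def classify_missing_state(status):
    """Return (log level, message) for a result that arrives with no Redis state."""
    if status in TERMINAL_LIKE:
        # Normal cleanup after a finish or cancel: not worth a warning.
        return logging.INFO, "ignoring in-flight result for already-terminal job"
    # State vanished while the job is still live: the case worth investigating.
    return logging.WARNING, "job state is missing while the job is still running; marking it failed"
```

A caller would pass the returned level and message to its logger, for example `logger.log(*classify_missing_state(job_status))`.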

…point

Issue #1337 is a lost-update race on the job status column. PR #1338 fixed one writer, `_update_job_progress`, by splitting the terminal status write out of the progress-blob save and performing it as a guarded, statement-scope UPDATE that only fires from a pre-terminal status. The other four terminal writers still did an unguarded full-row `save()` and could clobber a terminal status from the opposite direction: a cancel could overwrite a just-committed SUCCESS with REVOKED, and a stale `task_postrun` SUCCESS or `task_failure` FAILURE could resurrect a job another writer had already revoked.

This change adds a single `Job._guarded_status_update(to_status, from_statuses, *, set_finished=False)` helper that performs the guarded UPDATE (no row lock, so it does not reintroduce the contention #1261 removed) and advances the in-memory instance only when the transition actually fires. The remaining terminal writers are routed through it:

- `Job.cancel()`: CANCELING and REVOKED are now guarded UPDATEs. The `task.revoke()` and `cleanup_async_job_if_needed()` calls still run regardless of whether the guard fired, since a job may already be terminal but still need its NATS/Redis resources released.
- `update_job_status` (task_postrun): only the terminal SUCCESS path is guarded; non-terminal celery states still flow through the dual-use `update_status()` unchanged.
- `update_job_failure` (task_failure): the terminal FAILURE write is guarded, keeping the existing in-flight-async deferral guard intact.

`_update_job_progress` and `_fail_job` are left as-is: the former is already guarded by #1338, and the latter is already safe via `select_for_update` plus a status precondition.

After a guarded transition, callers persist `progress.summary.status` into the JSONB with a narrow `save(update_fields=["progress", ...])` rather than a full save, matching #1338 and avoiding clobbering other columns. The save only happens when the guard fired, so an already-terminal job keeps both its status column and its summary.status.

One intentional behavior change: `update_job_failure` now sets `finished_at` when it marks FAILURE (it previously left it unset), making a failed terminal job consistent with `_fail_job` and the result handler.

Adds sequential regression tests (postrun/failure cannot resurrect a REVOKED job; cancel of an already-SUCCESS job no-ops on status but still cleans up) and two real-concurrency tests that interleave cancel against a completing result batch in both directions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
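The "narrow save only when the guard fired" rule from this commit message can be illustrated with a small in-memory model. Everything here is hypothetical: `FakeJob` mimics a model instance holding a snapshot of a row, a plain dict stands in for the stored row, and a flat `summary_status` field stands in for `progress.summary.status`. The check is also not atomic, unlike the real single-statement UPDATE.

```python
class FakeJob:
    """In-memory snapshot of a row stored in `db` (a dict standing in for the table row)."""

    def __init__(self, db):
        self.db = db
        self.fields = dict(db)  # snapshot taken at load time; may go stale

    def save(self, update_fields=None):
        # A full save writes every snapshot field; a narrow save writes only the named ones.
        names = list(self.fields) if update_fields is None else update_fields
        for name in names:
            self.db[name] = self.fields[name]

    def guarded_status_update(self, to_status, from_statuses):
        # Checks the stored status, not the snapshot.
        if self.db["status"] not in from_statuses:
            return False
        self.db["status"] = to_status
        self.fields["status"] = to_status
        return True


def finalize(job, to_status, from_statuses):
    if not job.guarded_status_update(to_status, from_statuses):
        return False  # already terminal: leave status and summary alone
    job.fields["summary_status"] = to_status
    job.save(update_fields=["summary_status"])  # narrow write, other columns untouched
    return True
```

If another writer changes an unrelated column (say `logs`) after the snapshot was taken, `finalize` still leaves that column as stored, and on an already-terminal row it writes nothing at all.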

…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not routed through _guarded_status_update. Correct the docstring's false 'single chokepoint' claim: this helper is the chokepoint for the lock-free writers; _fail_job and the reaper enforce the same no-resurrect invariant under select_for_update (the reaper deliberately keeps a broader from-set so it can still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes the local run_job WITHOUT terminate. That task only queues images and has usually finished; the remote ADC work is stopped by the NATS/Redis teardown, not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR hardens terminal job-status writes against lost-update races by routing previously lock-free terminal writers (job cancel path and Celery task signal handlers) through a shared guarded, conditional UPDATE ... WHERE status IN (...) transition so stale writers can’t resurrect already-finished jobs.

Changes:

Added Job._guarded_status_update(...) and updated Job.cancel() to use guarded transitions for CANCELING/REVOKED while still performing async resource teardown.
Updated Celery task_postrun (SUCCESS) and task_failure (FAILURE) handlers to use the guarded transition and set finished_at on FAILURE when the transition fires.
Added regression and concurrency-interleaving tests covering the guarded chokepoint and cancel-vs-complete races.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `ami/jobs/models.py` | Introduces `_guarded_status_update` and updates `cancel()` to prevent terminal-status clobbering while preserving teardown behavior. |
| `ami/jobs/tasks.py` | Routes terminal SUCCESS/FAILURE writes in Celery signal handlers through the guarded transition to prevent resurrection/clobber. |
| `ami/jobs/tests/test_tasks.py` | Adds regression tests for the guarded terminal-writer chokepoint and reproduces the real cancel/completion race via threaded interleavings. |

Fix the double '#' in the issue reference and reword the task_postrun / task_failure comments in plainer, team-readable terms (drop session shorthand): describe the guarded write as only transitioning a job still in a non-terminal state.

…g effect Rewrite the helper's docstring to open with what it prevents (concurrent, stale status writes flipping a finished job back to running or a cancelled/failed job to SUCCESS, plus the knock-on bugs since job status drives claimability, cleanup, and the UI) before the implementation, and in plainer prose.

mihow mentioned this pull request Jun 19, 2026

Log premature-terminal and missing-state job failures (observability for #1337) #1343

Merged

5 tasks

Base automatically changed from fix/1337-conditional-status-transition to main June 19, 2026 21:18

mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21

This was referenced Jun 20, 2026

Improve celery task dispatch and cancellation to prevent stuck jobs #1324

Closed

[Draft] Don't overwrite logs & status in concurrent background tasks #1026

Closed

mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from aaa5365 to f112e6d Compare June 20, 2026 01:01

mihow marked this pull request as ready for review June 22, 2026 17:19

Copilot AI review requested due to automatic review settings June 22, 2026 17:19

Copilot started reviewing on behalf of mihow June 22, 2026 17:20 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

Comment thread ami/jobs/tasks.py Outdated

Comment thread ami/jobs/tasks.py Outdated

mihow mentioned this pull request Jun 22, 2026

fix(jobs): fix dangling jobs from going to revoked #1276

Closed

5 tasks

mihow merged commit c7adbab into main Jun 22, 2026
7 checks passed

mihow deleted the fix/1337-terminal-transition-chokepoint branch June 22, 2026 23:37

mihow mentioned this pull request Jun 27, 2026

CANCELLED jobs leak through /next filter, starve newer async_api jobs #1282

Closed

mihow merged 4 commits into main from fix/1337-terminal-transition-chokepoint
