Guard the cancel and signal-handler job-status writes too (fast-follow on #1338)#1342
Conversation
✅ Deploy Preview for antenna-preview canceled.
|
✅ Deploy Preview for antenna-preview canceled.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds ChangesTerminal Transition Hardening
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a8bce8d to
aaa5365
Compare
✅ Deploy Preview for antenna-ssec canceled.
|
…gnosis Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently: 1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict. 2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup. No behaviour change — both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up). Refs #1337, #1219, #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…for #1337) (#1343) * feat(jobs): log premature-terminal and missing-state failures for diagnosis Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently: 1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict. 2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup. No behaviour change — both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up). Refs #1337, #1219, #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(jobs): downgrade missing-state log to info when the job is already terminal The missing-state diagnostic logged a WARNING saying 'Failing job' for every in-flight result that arrived after a job finished — but _fail_job no-ops on a terminal job, so after a cancel (which deletes the Redis state) this fired once per in-flight batch and misdescribed normal cleanup as a failure. Now: a terminal job logs at info ('ignoring in-flight result for already-terminal job'); only a NON-terminal job with missing state logs the warning, which is the case actually worth investigating. Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(jobs): make the diagnostic log lines operator-readable and leaner The missing-state and completed-after-terminal logs read like insider notes — ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query. Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(jobs): address review on missing-state diagnostics - Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot. - Rename the values() dict from 'row' to 'job_values' (per review). - Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot). Refs #1337. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…point Issue #1337 is a lost-update race on the job status column. PR #1338 fixed one writer — `_update_job_progress` — by splitting the terminal status write out of the progress-blob save and performing it as a guarded, statement-scope UPDATE that only fires from a pre-terminal status. The other four terminal writers still did an unguarded full-row `save()` and could clobber a terminal status from the opposite direction: a cancel could overwrite a just-committed SUCCESS with REVOKED, and a stale `task_postrun` SUCCESS or `task_failure` FAILURE could resurrect a job another writer had already revoked. This change adds a single `Job._guarded_status_update(to_status, from_statuses, *, set_finished=False)` helper that performs the guarded UPDATE (no row lock, so it does not reintroduce the contention #1261 removed) and advances the in-memory instance only when the transition actually fires. The remaining terminal writers are routed through it: - `Job.cancel()`: CANCELING and REVOKED are now guarded UPDATEs. The `task.revoke()` and `cleanup_async_job_if_needed()` calls still run regardless of whether the guard fired, since a job may already be terminal but still need its NATS/Redis resources released. - `update_job_status` (task_postrun): only the terminal SUCCESS path is guarded; non-terminal celery states still flow through the dual-use `update_status()` unchanged. - `update_job_failure` (task_failure): the terminal FAILURE write is guarded, keeping the existing in-flight-async deferral guard intact. `_update_job_progress` and `_fail_job` are left as-is: the former is already guarded by #1338, and the latter is already safe via `select_for_update` plus a status precondition. After a guarded transition, callers persist `progress.summary.status` into the JSONB with a narrow `save(update_fields=["progress", ...])` rather than a full save, matching #1338 and avoiding clobbering other columns. The save only happens when the guard fired, so an already-terminal job keeps both its status column and its summary.status. One intentional behavior change: `update_job_failure` now sets `finished_at` when it marks FAILURE (it previously left it unset), making a failed terminal job consistent with `_fail_job` and the result handler. Adds sequential regression tests (postrun/failure cannot resurrect a REVOKED job; cancel of an already-SUCCESS job no-ops on status but still cleans up) and two real-concurrency tests that interleave cancel against a completing result batch in both directions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e into cancel - The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not routed through _guarded_status_update. Correct the docstring's false 'single chokepoint' claim: this helper is the chokepoint for the lock-free writers; _fail_job and the reaper enforce the same no-resurrect invariant under select_for_update (the reaper deliberately keeps a broader from-set so it can still force a stuck CANCELING/UNKNOWN job terminal as last resort). - Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes the local run_job WITHOUT terminate. That task only queues images and has usually finished; the remote ADC work is stopped by the NATS/Redis teardown, not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate. Refs #1337. Supersedes the cancel rewrite in #1324. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aaa5365 to
f112e6d
Compare
There was a problem hiding this comment.
Pull request overview
This PR hardens terminal job-status writes against lost-update races by routing previously lock-free terminal writers (job cancel path and Celery task signal handlers) through a shared guarded, conditional UPDATE ... WHERE status IN (...) transition so stale writers can’t resurrect already-finished jobs.
Changes:
- Added
Job._guarded_status_update(...)and updatedJob.cancel()to use guarded transitions forCANCELING/REVOKEDwhile still performing async resource teardown. - Updated Celery
task_postrun(SUCCESS) andtask_failure(FAILURE) handlers to use the guarded transition and setfinished_aton FAILURE when the transition fires. - Added regression and concurrency-interleaving tests covering the guarded chokepoint and cancel-vs-complete races.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ami/jobs/models.py |
Introduces _guarded_status_update and updates cancel() to prevent terminal-status clobbering while preserving teardown behavior. |
ami/jobs/tasks.py |
Routes terminal SUCCESS/FAILURE writes in Celery signal handlers through the guarded transition to prevent resurrection/clobber. |
ami/jobs/tests/test_tasks.py |
Adds regression tests for the guarded terminal-writer chokepoint and reproduces the real cancel/completion race via threaded interleavings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Fix the double '#' in the issue reference and reword the task_postrun / task_failure comments in plainer, team-readable terms (drop session shorthand): describe the guarded write as only transitioning a job still in a non-terminal state.
…g effect Rewrite the helper's docstring to open with what it prevents (concurrent, stale status writes flipping a finished job back to running or a cancelled/failed job to SUCCESS, plus the knock-on bugs since job status drives claimability, cleanup, and the UI) before the implementation, and in plainer prose.
Summary
A focused fast-follow on #1338 (now merged). #1338 made the result handler's terminal status write safe under concurrency, but the job's status is written from several places. This PR routes the other lock-free terminal writers —
cancel()and the two Celery task-signal handlers — through one shared guarded transition, so a stale writer can't resurrect a job another writer already finished.The motivation is the same lost-update race as #1337: a plain
save()of the wholeJobrow from an out-of-date snapshot can overwrite a status another worker just set. #1338 stopped the result handler; this stopscancel()and the signal handlers doing it from the other direction (e.g. a late task-success signal flipping a just-cancelled job back to SUCCESS).On
main(no longer stacked — #1338 is merged).List of Changes
Job._guarded_status_update(to_status, from_statuses, *, set_finished=False): a statement-scopeUPDATE ... WHERE status IN (from_statuses)that holds no row lock (so it doesn't reintroduce the contention fix(jobs): fixes for concurrent ML processing jobs #1261 removed) and advances the in-memory instance only when a row actually changes. Default from-set isJobState.finalizable_states().cancel()no longer clobbers a finished job, and async cancel no longer SIGTERMs the bootstrap. CANCELING/REVOKED now go through the guarded helper, so cancelling an already-finished job leaves its terminal status intact. Additionally — folding the one useful change from Improve celery task dispatch and cancellation to prevent stuck jobs #1324 — an ASYNC_API cancel now revokes the localrun_jobwithoutterminate: that task only queues images and has usually finished, and the remote ADC work is stopped by the NATS/Redis teardown, not by killing the bootstrap. Sync/internal jobs still terminate (that task is the work). Teardown (cleanup_async_job_if_needed) runs unconditionally.update_job_statusand the FAILURE write inupdate_job_failurego through the guarded helper; their existing pre-checks are unchanged. Minor behaviour change: a FAILURE set via the task-failure signal now also recordsfinished_at, matching_fail_joband the result handler.TestTerminalTransitionChokepointcovers the guard for each writer;TestCancelCompletionRacereproduces the real concurrent interleave between a cancel and a completing result batch in both directions (mirrors Stop a finished job from being pulled back to running by a slower worker #1338'sTestConcurrentStatusRace). 107 pass. Also validated on a dev deployment: cancelling a job mid-flight leaves it REVOKED with no resurrection and full NATS/Redis teardown.Detailed Description
There are six terminal-status writers (the earlier "five" count missed the reaper). This PR brings the three lock-free ones onto the guarded transition; the two lock-based ones are already safe; the reaper is intentionally left broader:
_update_job_progress(result handler)Job.cancel()update_job_status(task_postrun)update_job_failure(task_failure)_fail_jobselect_for_update+ terminal/CANCELING precondition (already safe)check_stale_jobs(reaper)select_for_update; intentionally keeps a broader from-set so it can force a genuinely stuck CANCELING/UNKNOWN job terminal as last resortSo
_guarded_status_updateis the chokepoint for the lock-free writers — not a literal "single chokepoint" for all status writes; the two lock-based writers enforce the same no-resurrect invariant under their row lock. (An earlier docstring overclaimed this; corrected here.)cancel()'s REVOKED transition includes CANCELING in its from-set, since cancel sets CANCELING itself and must complete that progression.Supersedes the
cancel()rewrite in #1324 (its skip-terminate idea is folded in here). #1324's other parts — a no-opCELERY_WORKER_POOL_OPTIMIZATION="fair"setting and aconsumer_timeout-sensitiveacks_latechange — are tracked separately; see the note on #1324.How to Test the Changes
pytest ami/jobs/tests/test_tasks.py::TestTerminalTransitionChokepointpytest ami/jobs/tests/test_tasks.py::TestCancelCompletionRaceami/jobs/tests/test_tasks.pyandami/jobs/tests/test_jobs.pypass locally (107 passed).Checklist
Refs #1337. Supersedes the cancel rewrite in #1324.
Summary by CodeRabbit
Bug Fixes
Tests