fix(tasks): complete finalization when reconciler loses race to runner report#3964

Draft
cursor[bot] wants to merge 1 commit into develop from cursor/critical-bug-investigation-95da

cursor[bot] commented Jun 15, 2026

Bug and impact

In HA deployments, a runner can report a terminal task status on one node while the runner reconciler on another node concurrently calls failTaskRunnerLost, and the reconciler can win the cluster finalize lock. If the DB already had a terminal status at that point, failTaskRunnerLost returned early without releasing pool/Redis state (running set, active-by-project, claims), setting End, running autorun children, or progressing workflows.

This leaves tasks showing as finished in the DB while still occupying concurrency slots until restart or manual cleanup.

Root cause

failTaskRunnerLost treated an already-finished DB row as a no-op after winning TryFinalize, but FinalizeRemoteTask on the other node then failed to acquire the lock and also did nothing. The e0ef3d05 backstop in finalizeRemoteTaskLocked only ran when FinalizeRemoteTask won the lock.
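
The race can be modeled in a few lines of Go. This is a minimal sketch and not the project's code: the types task and pool, the status strings, and the functions tryFinalize, finalizeLocked, failRunnerLost, finalizeRemote, and race are hypothetical stand-ins for the real structures, and an in-memory map stands in for the cluster finalize lock. The delegate flag switches between the old early return and the delegating behavior proposed in this PR.

```go
package main

import (
	"sync"
	"time"
)

// Hypothetical stand-ins; none of these names come from the real code base.
type task struct {
	Status string
	End    *time.Time
}

func (t *task) finished() bool { return t.Status == "success" || t.Status == "error" }

type pool struct {
	mu        sync.Mutex
	finalized map[int]bool // stand-in for the cluster finalize lock
	running   map[int]bool // stand-in for the running set (concurrency slots)
}

// tryFinalize lets exactly one caller per task win the right to finalize.
func (p *pool) tryFinalize(id int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.finalized[id] {
		return false
	}
	p.finalized[id] = true
	return true
}

// finalizeLocked stamps End and frees the slot; it is a no-op once End is set.
func (p *pool) finalizeLocked(id int, t *task) {
	if t.End != nil {
		return
	}
	now := time.Now()
	t.End = &now
	p.mu.Lock()
	delete(p.running, id)
	p.mu.Unlock()
}

// failRunnerLost is the reconciler path. With delegate=false it reproduces the
// early return on an already-finished row; with delegate=true it finalizes.
func (p *pool) failRunnerLost(id int, t *task, delegate bool) {
	if !p.tryFinalize(id) {
		return
	}
	if t.finished() {
		if delegate {
			p.finalizeLocked(id, t)
		}
		return
	}
	t.Status = "error"
	p.finalizeLocked(id, t)
}

// finalizeRemote is the runner-report path; it does nothing if it loses the lock.
func (p *pool) finalizeRemote(id int, t *task) {
	if !p.tryFinalize(id) {
		return
	}
	p.finalizeLocked(id, t)
}

// race replays the losing interleaving: the terminal status is already in the
// "DB", the reconciler wins the lock, and the runner-report path loses it.
func race(delegate bool) (slotHeld, endSet bool) {
	p := &pool{finalized: map[int]bool{}, running: map[int]bool{1: true}}
	t := &task{Status: "success"}
	p.failRunnerLost(1, t, delegate)
	p.finalizeRemote(1, t)
	return p.running[1], t.End != nil
}
```

In this model, race(false) leaves the slot held with End unset, which matches the leak described above, while race(true) frees the slot and sets End. The point of the sketch is that whichever caller wins the lock has to complete finalization, because the loser never will.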

Secondary issues in requeueTaskRunnerOffline:

  • A TOCTOU window between the DB refresh and the persist could requeue a task that another node had already marked running
  • A failed UpdateTask left the in-memory runner_id cleared while the DB still had it, causing subsequent reconciles to mis-route the task as undispatched

Fix

  • When failTaskRunnerLost observes a finished status after DB refresh, delegate to finalizeRemoteTaskLocked instead of returning
  • Make finalizeRemoteTaskLocked skip re-finalization when End is already set (not only in HA mode)
  • Re-check DB immediately before requeue mutation; roll back in-memory state on persist failure
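
The requeue hardening in the last bullet could look roughly like the sketch below. It is an illustration under assumptions, not the actual requeueTaskRunnerOffline: the store interface, the taskRow and memTask types, the status strings, errSkipped, and fakeStore are all hypothetical names introduced here.

```go
package main

import "errors"

// Hypothetical persistence shape; the real store API may differ.
type taskRow struct {
	Status   string
	RunnerID int
}

type store interface {
	GetTask(id int) (taskRow, error)
	UpdateTask(id int, row taskRow) error
}

// memTask is a stand-in for the in-memory task held by the pool.
type memTask struct {
	ID       int
	Status   string
	RunnerID int
}

var errSkipped = errors.New("task no longer eligible for requeue")

// requeueRunnerOffline re-reads the row right before mutating, and restores
// the in-memory state if the write fails so the next reconcile sees the same
// runner assignment as the DB.
func requeueRunnerOffline(s store, t *memTask, offlineRunner int) error {
	fresh, err := s.GetTask(t.ID)
	if err != nil {
		return err
	}
	// Another node may have started the task or reassigned it since the
	// earlier refresh; in that case leave it alone.
	if fresh.Status == "running" || fresh.RunnerID != offlineRunner {
		return errSkipped
	}
	prev := *t // snapshot for rollback
	t.Status = "waiting"
	t.RunnerID = 0
	if err := s.UpdateTask(t.ID, taskRow{Status: t.Status, RunnerID: t.RunnerID}); err != nil {
		*t = prev // persist failed: keep memory consistent with the DB
		return err
	}
	return nil
}

// fakeStore is a tiny in-memory store used only to exercise the sketch.
type fakeStore struct {
	row        taskRow
	failUpdate bool
}

func (f *fakeStore) GetTask(int) (taskRow, error) { return f.row, nil }

func (f *fakeStore) UpdateTask(_ int, row taskRow) error {
	if f.failUpdate {
		return errors.New("update failed")
	}
	f.row = row
	return nil
}
```

Re-reading immediately before the write narrows the window but does not remove it on its own; a conditional update that checks the expected status and runner as part of the write would be one way to close it fully, if the store supports that.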

Validation

  • Added/updated unit tests in runner_reconciler_test.go for HA finalize race, requeue TOCTOU skip, and persist-error rollback
  • go test ./services/tasks/... passes

…r report

When failTaskRunnerLost wins the cluster finalize lock over
FinalizeRemoteTask but the DB already has a terminal status, the early
return leaked running/active pool state and skipped End, autorun, and
workflow progression.

Also harden requeueTaskRunnerOffline: re-check DB before mutating to
avoid requeueing a concurrently running task, and roll back in-memory
state when persist fails so the next reconcile retries correctly.

Make finalizeRemoteTaskLocked idempotent when End is already set in
non-HA mode as well.

Co-authored-by: Denis Gukov <fiftin@outlook.com>
