fix(tasks): complete finalization when reconciler loses race to runner report #3964
Draft
cursor[bot] wants to merge 1 commit into
…r report

When failTaskRunnerLost wins the cluster finalize lock over FinalizeRemoteTask but the DB already has a terminal status, the early return leaked running/active pool state and skipped End, autorun, and workflow progression.

Also harden requeueTaskRunnerOffline: re-check DB before mutating to avoid requeueing a concurrently running task, and roll back in-memory state when persist fails so the next reconcile retries correctly.

Make finalizeRemoteTaskLocked idempotent when End is already set in non-HA mode as well.

Co-authored-by: Denis Gukov <fiftin@outlook.com>
Bug and impact
In HA deployments, when a runner reports a terminal task status on one node while the runner reconciler on another node concurrently calls `failTaskRunnerLost`, the reconciler can win the cluster finalize lock. If the DB already has a terminal status, `failTaskRunnerLost` returned early without releasing pool/Redis state (`running` set, active-by-project, claims) and without setting `End`, running autorun children, or progressing workflows.

This leaves tasks showing as finished in the DB while still occupying concurrency slots until restart or manual cleanup.
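
A minimal sketch of the leak described above, under simplified assumptions. Every name here (`task`, `pool`, `finalizeLocked`, `failRunnerLostBuggy`, `failRunnerLostFixed`, the status strings) is hypothetical and only mirrors the roles named in this PR. It is not the project's actual code, and the cluster lock and Redis state are reduced to a single in-memory map.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a task row.
type task struct {
	ID     int
	Status string // e.g. "running", "success", "error" (illustrative values)
	Ended  bool   // stands in for the End timestamp being set
}

// pool is a hypothetical stand-in for the running set / concurrency slots.
type pool struct {
	running map[int]bool
}

func isTerminal(status string) bool {
	return status == "success" || status == "error" || status == "stopped"
}

// finalizeLocked releases the slot and marks the task ended.
// It is idempotent: if the task is already ended it does nothing
// and reports false.
func (p *pool) finalizeLocked(t *task) bool {
	if t.Ended {
		return false
	}
	delete(p.running, t.ID)
	t.Ended = true
	return true
}

// failRunnerLostBuggy models the early return: the caller holds the
// finalize lock, sees a terminal status, and leaves without cleanup.
func failRunnerLostBuggy(p *pool, t *task) {
	if isTerminal(t.Status) {
		return // slot stays occupied, End is never set
	}
	t.Status = "error"
	p.finalizeLocked(t)
}

// failRunnerLostFixed models delegating to the finalizer even when
// the status is already terminal.
func failRunnerLostFixed(p *pool, t *task) {
	if !isTerminal(t.Status) {
		t.Status = "error"
	}
	p.finalizeLocked(t)
}

func main() {
	for _, fixed := range []bool{false, true} {
		p := &pool{running: map[int]bool{1: true}}
		tk := &task{ID: 1, Status: "success"} // runner already reported success
		if fixed {
			failRunnerLostFixed(p, tk)
		} else {
			failRunnerLostBuggy(p, tk)
		}
		fmt.Printf("fixed=%v slotHeld=%v ended=%v\n", fixed, p.running[1], tk.Ended)
	}
}
```

In this sketch the buggy path prints `slotHeld=true ended=false` and the fixed path prints `slotHeld=false ended=true`. Because `finalizeLocked` returns early once `Ended` is set, it is safe to call from whichever side wins the lock.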
Root cause
`failTaskRunnerLost` treated an already-finished DB row as a no-op after winning `TryFinalize`, but `FinalizeRemoteTask` on the other node then failed to acquire the lock and also did nothing. The `e0ef3d05` backstop in `finalizeRemoteTaskLocked` only ran when `FinalizeRemoteTask` won the lock.

Secondary issues in `requeueTaskRunnerOffline`:
- TOCTOU: in-memory state was mutated without re-checking the DB, so a task that was concurrently `running` could be requeued
- A failed `UpdateTask` left in-memory `runner_id` cleared while DB still had it, causing subsequent reconciles to mis-route the task as undispatched

Fix
- When `failTaskRunnerLost` observes a finished status after DB refresh, delegate to `finalizeRemoteTaskLocked` instead of returning
- Make `finalizeRemoteTaskLocked` skip re-finalization when `End` is already set (not only in HA mode)

Validation
- Tests in `runner_reconciler_test.go` for HA finalize race, requeue TOCTOU skip, and persist-error rollback
- `go test ./services/tasks/...` passes
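
The requeue hardening (re-check the DB before mutating, roll back in-memory state when the persist fails) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `taskRow`, `store`, `requeueOffline`, and the status strings are hypothetical, and the real `UpdateTask` signature is not shown in this description.

```go
package main

import (
	"errors"
	"fmt"
)

// taskRow is a hypothetical task record (status plus assigned runner).
type taskRow struct {
	Status   string
	RunnerID int
}

// store is a hypothetical stand-in for the DB layer.
type store struct {
	row        taskRow
	failUpdate bool // simulate a persist error
}

func (s *store) Get() taskRow { return s.row }

func (s *store) Update(r taskRow) error {
	if s.failUpdate {
		return errors.New("persist failed")
	}
	s.row = r
	return nil
}

// requeueOffline tries to put a task back in the queue after its
// runner went offline. It reports whether the task was requeued.
func requeueOffline(s *store, mem *taskRow) (bool, error) {
	// Re-check the DB first: the runner may have started the task
	// since the in-memory copy was read.
	if fresh := s.Get(); fresh.Status == "running" {
		return false, nil // skip, the task is live
	}

	prev := *mem // snapshot for rollback
	mem.Status = "waiting"
	mem.RunnerID = 0

	if err := s.Update(*mem); err != nil {
		*mem = prev // keep memory consistent with the DB so the next reconcile retries
		return false, err
	}
	return true, nil
}

func main() {
	mem := &taskRow{Status: "starting", RunnerID: 7}

	s := &store{row: taskRow{Status: "running", RunnerID: 7}}
	ok, err := requeueOffline(s, mem)
	fmt.Println("concurrently running:", ok, err, *mem)

	s = &store{row: taskRow{Status: "starting", RunnerID: 7}, failUpdate: true}
	ok, err = requeueOffline(s, mem)
	fmt.Println("persist error:", ok, err, *mem)

	s.failUpdate = false
	ok, err = requeueOffline(s, mem)
	fmt.Println("retry:", ok, err, *mem)
}
```

Snapshotting the in-memory row before mutating keeps the rollback to a single assignment. After a failed persist the in-memory `RunnerID` still matches the DB, so a later reconcile sees the task as dispatched to the offline runner and retries the requeue.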