
Fix Windows orchestrator test race condition #311

Open
circleci-app[bot] wants to merge 1 commit into main from chunk/fix-windows-orchestrator-test-race

@circleci-app circleci-app Bot commented May 20, 2026

Summary

This PR fixes a race condition in the orchestrator tests that was causing failures on Windows, specifically in the test case "error: task agent encountered fatal error".

Problem

The test was failing on Windows with the following error:

=== FAIL: task TestOrchestrator/error:_task_agent_encountered_fatal_error (48.50s)
    orchestrator_test.go:327: assertion failed: 
        --- ←
        +++ tt.wantTaskEvents
          []fakerunnerapi.TaskEvent(
        -       nil,
        +       {
        +               {
        +                       Allocation:     "testalloc",
        +                       TimestampMilli: 1779310915952,
        +                       Message:        []uint8("error while executing task agent: task agent command exited with"...),
        +               },
        +       },
          )

The orchestrator was unable to send the failure event to the fake runner API, with connection refused errors occurring across 16 retry attempts over 48 seconds.

Root Cause

On Windows, there's a race condition between:

  1. The orchestrator's asynchronous error handling (which sends failure events to the runner API)
  2. The test's synchronous assertions (which immediately check for those events)

Windows cleans up processes through Job Objects, and that cleanup is synchronous, which leaves only a tight timing window around process exit. As a result, the test asserted before the orchestrator had finished sending the failure event.
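The ordering problem described above can be sketched in isolation. This is a minimal illustration, not the orchestrator's code: `eventLog` and `observe` are hypothetical names, and channels are used to force the unlucky ordering deterministically, whereas in the real test the ordering depends on timing.

```go
package main

import (
	"fmt"
	"sync"
)

// eventLog is a hypothetical stand-in for the fake runner API's event store.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) add(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
}

func (l *eventLog) count() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.events)
}

// observe returns the event count seen by an immediate check and the count
// seen after the asynchronous sender has finished.
func observe() (immediate, afterSend int) {
	el := &eventLog{}
	release := make(chan struct{})
	done := make(chan struct{})

	// Asynchronous sender: it cannot record the event until release is
	// closed, which models the failure event still being in flight.
	go func() {
		defer close(done)
		<-release
		el.add("task agent failed")
	}()

	immediate = el.count() // what an eager assertion sees
	close(release)
	<-done
	afterSend = el.count() // what a patient assertion sees
	return immediate, afterSend
}

func main() {
	immediate, afterSend := observe()
	fmt.Println(immediate, afterSend) // prints "0 1"
}
```

The eager check sees no events, which matches the `nil` on the "got" side of the diff in the failure output.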

Solution

This PR adds polling logic to wait for the expected number of task events and unclaims (with a 30-second timeout) before performing the assertions. This approach:

  • Eliminates the race condition by waiting for async operations to complete
  • Still catches genuine failures (via the timeout)
  • Doesn't slow down tests that complete quickly (polling succeeds immediately when data is available)
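A poll-before-assert helper along these lines might look like the sketch below. It is not the code from this PR: `eventRecorder` and `waitForCount` are hypothetical names, the 10 ms polling interval is an assumption, and treating "at least `want`" as success is also an assumption about how the expected count is compared.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// eventRecorder is a hypothetical stand-in for the fake runner API: it
// collects events that arrive from another goroutine.
type eventRecorder struct {
	mu     sync.Mutex
	events []string
}

func (r *eventRecorder) add(e string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, e)
}

func (r *eventRecorder) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.events)
}

// waitForCount polls count until it reports at least want items or the
// timeout elapses. It checks before sleeping, so it returns immediately
// when the data is already there.
func waitForCount(count func() int, want int, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if count() >= want {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	rec := &eventRecorder{}

	// Simulate the orchestrator reporting the failure event slightly late.
	go func() {
		time.Sleep(50 * time.Millisecond)
		rec.add("error while executing task agent")
	}()

	// Wait (up to 30 seconds) before asserting, instead of checking at once.
	ok := waitForCount(rec.count, 1, 30*time.Second, 10*time.Millisecond)
	fmt.Println("event arrived:", ok)
}
```

In a test, a `false` return would be reported as a failure before the detailed assertions run, so an event that never arrives still fails the test once the timeout expires.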

Testing

  • Test compiles successfully
  • Solution mirrors the fix from commit 7b927ad, which addressed the same issue

Original failing job
Agent run

Poll for expected task unclaims/events before asserting, to prevent
the test from racing with the orchestrator's final updates on Windows.

On Windows, when the task agent process exits with a fatal error, there
is a race condition between the orchestrator sending the failure event
to the runner API and the test assertions checking for those events.
The orchestrator's error handling happens asynchronously, and on Windows
the process cleanup is synchronous via Job Objects, which can cause
tight timing windows.

This fix adds polling with a 30-second timeout to wait for the expected
number of task events and unclaims before performing the assertions.
This ensures the test doesn't fail due to timing issues while still
catching genuine failures.

Fixes the "error: task agent encountered fatal error" test failure
on Windows where the test expected a TaskEvent but received nil due
to the race condition.

@circleci-app circleci-app Bot requested a review from a team as a code owner May 20, 2026 21:14