Fix Windows orchestrator test race condition by circleci-app[bot] · Pull Request #311 · circleci/runner-init

circleci-app · 2026-05-20T21:14:03Z

Summary

This PR fixes a race condition in the orchestrator tests that was causing failures on Windows, specifically in the test case "error: task agent encountered fatal error".

Problem

The test was failing on Windows with the following error:

=== FAIL: task TestOrchestrator/error:_task_agent_encountered_fatal_error (48.50s)
    orchestrator_test.go:327: assertion failed: 
        --- ←
        +++ tt.wantTaskEvents
          []fakerunnerapi.TaskEvent(
        -       nil,
        +       {
        +               {
        +                       Allocation:     "testalloc",
        +                       TimestampMilli: 1779310915952,
        +                       Message:        []uint8("error while executing task agent: task agent command exited with"...),
        +               },
        +       },
          )

The orchestrator was unable to send the failure event to the fake runner API, with connection refused errors occurring across 16 retry attempts over 48 seconds.

Root Cause

On Windows, there's a race condition between:

- The orchestrator's asynchronous error handling (which sends failure events to the runner API)
- The test's synchronous assertions (which immediately check for those events)

Windows uses Job Objects for process cleanup, which is synchronous and creates tight timing windows. The test was asserting before the orchestrator had completed sending the failure event.

Solution

This PR adds polling logic to wait for the expected number of task events and unclaims (with a 30-second timeout) before performing the assertions. This approach:

- Eliminates the race condition by waiting for async operations to complete
- Still catches genuine failures (via the timeout)
- Doesn't slow down tests that complete quickly (polling succeeds immediately when data is available)

Testing

- Test compiles successfully
- Solution mirrors the fix from commit 7b927ad, which addressed the same issue

Original failing job
Agent run

Poll for expected task unclaims/events before asserting, to prevent the test from racing with the orchestrator's final updates on Windows.

On Windows, when the task agent process exits with a fatal error, there is a race condition between the orchestrator sending the failure event to the runner API and the test assertions checking for those events. The orchestrator's error handling happens asynchronously, and on Windows the process cleanup is synchronous via Job Objects, which can cause tight timing windows.

This fix adds polling with a 30-second timeout to wait for the expected number of task events and unclaims before performing the assertions. This ensures the test doesn't fail due to timing issues while still catching genuine failures.

Fixes the "error: task agent encountered fatal error" test failure on Windows where the test expected a TaskEvent but received nil due to the race condition.

circleci-app Bot requested a review from a team as a code owner May 20, 2026 21:14

circleci-app[bot] wants to merge 1 commit into main from chunk/fix-windows-orchestrator-test-race
