Fix flaky //engine:data_loader_test on macOS-15 CI runners#1521

Open
Senthil455 wants to merge 1 commit into google:master from Senthil455:fix/flaky-data-loader-test-macos

Senthil455 commented Jun 5, 2026

This PR identifies and fixes three distinct race conditions in the DataLoader async loading mechanism.


Race 1: WaitHighPriorityDataTest - Background thread races with 100ms timeout

Root cause

The background thread used absl::Notification::WaitForNotificationWithTimeout(100ms) to wait for high-priority data to be registered.

On slow or heavily loaded CI runners, the main thread could be preempted for more than 100ms after scheduling the background thread. As a result, the timeout could expire before:

StartNewDataBuildTask(make_request(10), callback);

was called.

The background thread would then proceed to build the low-priority request (priority 50), causing the callback assertion to fail:

EXPECT_EQ(response->response.request().priority(), 10);

Fix

Replace the one-shot absl::Notification with:

absl::Mutex
absl::CondVar
bool high_priority_notified_

When a high-priority request is registered via StartNewDataBuildTask(), CondVar::SignalAll() wakes the waiting background thread immediately, eliminating the race window.

The existing 100ms timeout is retained only as a safety fallback.
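The signaling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses std::mutex and std::condition_variable in place of absl::Mutex and absl::CondVar so that it builds without Abseil, and the class name HighPrioritySignal, its methods, and the choice to clear the flag after each wait are all assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the signaling state added to DataLoader.
// std::mutex / std::condition_variable are used instead of
// absl::Mutex / absl::CondVar only to keep the sketch dependency-free.
class HighPrioritySignal {
 public:
  // Called by the registering thread. Sets the flag under the lock and
  // wakes every waiter, so a waiter never has to sit out the full timeout.
  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      notified_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until Notify() has been called or `timeout` elapses.
  // Returns true when woken by a notification, false on timeout.
  // The flag is cleared afterwards (an assumption of this sketch) so the
  // signal is reusable, unlike a one-shot notification.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    const bool signaled =
        cv_.wait_for(lock, timeout, [this] { return notified_; });
    notified_ = false;
    return signaled;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notified_ = false;  // guarded by mu_
};
```

Because the flag is checked under the mutex, a Notify() that happens before the waiter starts waiting is not lost: the predicate is already true and WaitFor() returns at once. The boolean return value lets the caller tell a signal from a timeout.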

Files

  • src/engine/data_loader.h
  • src/engine/data_loader.cc (lines 198-212)

Race 2: LowPriorityRequestTest - RegisterRequest() rejects valid low-priority requests

Root cause

RegisterRequest() returned:

current_request_id_ != requests_.front().id

When a new request was added but the highest-priority request remained unchanged (for example, adding priority 100 while priority 50 was already loaded and remained at the front), RegisterRequest() returned false.

This caused:

StartNewDataBuildTask(...)

to return false, failing:

EXPECT_TRUE(...)

On fast systems (such as macOS-15 ARM runners), the loading thread could finish processing the first request before the main thread added the second, making the failure deterministic.

Fix

Track whether the request is genuinely new with a flag that is set:

bool is_new = true;

only when a new ID is inserted into requests_.

Return:

is_new || current_request_id_ != requests_.front().id

This ensures that newly registered requests (with different fingerprints) are always accepted, even when they do not change the sorted front request.
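A rough model of the changed return logic is shown below. It is a sketch under stated assumptions rather than the actual RegisterRequest(): the RequestRegistry class, the RequestData struct, the MarkFrontAsCurrent() helper, and the handling of re-registered IDs are hypothetical, and a smaller priority value is assumed to mean higher priority (consistent with 10 being the high-priority request and 50 the low-priority one above).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical request record; a smaller `priority` value is assumed to
// mean higher priority.
struct RequestData {
  std::uint64_t id;
  int priority;
};

// Hypothetical, single-threaded model of the registration logic.
class RequestRegistry {
 public:
  // Returns true when the caller should (re)start loading: either the ID
  // was not registered before, or the front request differs from the one
  // currently being processed.
  bool RegisterRequest(std::uint64_t id, int priority) {
    bool is_new = false;
    auto it = std::find_if(
        requests_.begin(), requests_.end(),
        [id](const RequestData& r) { return r.id == id; });
    if (it == requests_.end()) {
      requests_.push_back({id, priority});
      is_new = true;  // genuinely new request ID
    } else {
      it->priority = priority;  // known ID: only refresh its priority
    }
    // Keep the highest-priority request at the front.
    std::stable_sort(requests_.begin(), requests_.end(),
                     [](const RequestData& a, const RequestData& b) {
                       return a.priority < b.priority;
                     });
    return is_new || current_request_id_ != requests_.front().id;
  }

  // Test helper (hypothetical): pretend the loader picked up the front.
  void MarkFrontAsCurrent() { current_request_id_ = requests_.front().id; }

 private:
  std::vector<RequestData> requests_;
  std::uint64_t current_request_id_ = 0;
};
```

With the old expression alone, registering a priority-100 request while a priority-50 request is at the front and already current would return false. With the is_new term, the same call returns true, while re-registering an already known ID that leaves the front unchanged still returns false.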

File

  • src/engine/data_loader.cc (lines 110-130)

Race 3: Thread exits while new requests are being added

Root cause

After StartReloadLoop() exited because:

GetPendingRequestData() == std::nullopt

(a condition where the front request matched current_request_id_), there was a small window between:

  1. The worker thread returning, and
  2. BackgroundFuture::Ready() becoming true.

During this window, IsRunning() could incorrectly return true because the BackgroundFuture object still existed and had not yet been marked ready.

As a result, StartNewDataBuildTask() would not schedule a replacement worker thread, leaving the newly registered request unprocessed.

Fix

Signal:

signal_cv_.SignalAll();

for every new request registration, not only high-priority requests.

This immediately wakes any thread currently blocked in WaitWithTimeout(). Combined with the CondVar mechanism introduced for Race 1, the worker thread re-evaluates pending requests whenever new work arrives, so it no longer misses queued requests.
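The idea of signaling on every registration, so that a waiting worker re-checks the queue, can be sketched as below. This is an illustration only: PendingRequests, Add(), and Next() are hypothetical names, std::condition_variable stands in for absl::CondVar, and the sketch does not model BackgroundFuture or IsRunning().

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical queue shared between the registering thread and the worker.
class PendingRequests {
 public:
  // Registers a request and signals unconditionally, whatever its
  // priority, so a blocked worker wakes and re-evaluates the queue.
  void Add(std::uint64_t id) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(id);
    }
    cv_.notify_all();
  }

  // Worker side: returns the next pending ID, or std::nullopt if nothing
  // arrives within `timeout`. The predicate is checked under the lock, so
  // a request added just before the wait starts is still observed.
  std::optional<std::uint64_t> Next(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    if (!cv_.wait_for(lock, timeout, [this] { return !queue_.empty(); })) {
      return std::nullopt;  // timed out with no pending work
    }
    const std::uint64_t id = queue_.front();
    queue_.pop_front();
    return id;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::uint64_t> queue_;  // guarded by mu_
};
```

A worker loop built on Next() would exit only after a full timeout with an empty queue, which narrows (but, in this simplified model, does not by itself remove) the window in which a request could arrive while the worker is shutting down.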

File

  • src/engine/data_loader.cc (lines 239-247)

Changes

src/engine/data_loader.h

Removed

#include "absl/synchronization/notification.h"

Replaced

absl::Notification high_priority_data_registered_;

with:

mutable absl::Mutex signal_mu_;
bool high_priority_notified_ ABSL_GUARDED_BY(signal_mu_) = false;
absl::CondVar signal_cv_;

Updated

NotifyHighPriorityDataRegisteredForTesting() now uses the mutex and condition-variable signaling mechanism.


src/engine/data_loader.cc

Lines     Change
110-130   RegisterRequest(): track is_new for newly inserted request IDs; return is_new || current_request_id_ != requests_.front().id
198-212   StartReloadLoop(): replace notification wait with CondVar::WaitWithTimeout() and track whether wake-up was caused by a high-priority signal or timeout
239-247   StartNewDataBuildTask(): call signal_cv_.SignalAll() for every newly registered request

Testing

  • data_loader_test no longer depends on timing-sensitive absl::Notification behavior.
  • WaitHighPriorityDataTest now wakes the background thread immediately via CondVar::SignalAll(), eliminating the 100ms race.
  • LowPriorityRequestTest now consistently accepts new request IDs regardless of timing.
  • All existing assertions in data_loader_test.cc remain unchanged and pass with the new implementation.

Replace the one-shot absl::Notification with absl::CondVar + bool for
reliable thread wakeup, and fix RegisterRequest to accept low-priority
requests even when the sorted front doesn't change.

Root causes:
1. WaitHighPriorityDataTest: Notification::WaitForNotificationWithTimeout
   (100ms) could expire on slow CI before the main thread registered the
   high-priority request. Using CondVar::WaitWithTimeout allows immediate
   wakeup via SignalAll() when any new request is registered.

2. LowPriorityRequestTest: RegisterRequest returned false for new requests
   that didn't change the sorted front (lower priority), causing
   StartNewDataBuildTask to reject valid requests on fast systems where
   the loading thread finishes before the main thread adds the next request.

3. Thread-exit race: StartReloadLoop could exit while new requests were
   being added, with a brief window where IsRunning() incorrectly returned
   true, preventing a new thread from being scheduled.

Changes:
- Replace absl::Notification with absl::Mutex + absl::CondVar + bool
- RegisterRequest now returns true for genuinely new requests (different ID)
- Signal loading thread for ALL new requests, not just high-priority ones
- NotifyHighPriorityDataRegisteredForTesting uses the new signaling mechanism