Fix flaky //engine:data_loader_test on macOS-15 CI runners#1521
Open
Senthil455 wants to merge 1 commit into
Open
Fix flaky //engine:data_loader_test on macOS-15 CI runners#1521Senthil455 wants to merge 1 commit into
Senthil455 wants to merge 1 commit into
Conversation
Replace the one-shot absl::Notification with absl::CondVar + bool for reliable thread wakeup, and fix RegisterRequest to accept low-priority requests even when the sorted front doesn't change. Root causes: 1. WaitHighPriorityDataTest: Notification::WaitForNotificationWithTimeout (100ms) could expire on slow CI before the main thread registered the high-priority request. Using CondVar::WaitWithTimeout allows immediate wakeup via SignalAll() when any new request is registered. 2. LowPriorityRequestTest: RegisterRequest returned false for new requests that didn't change the sorted front (lower priority), causing StartNewDataBuildTask to reject valid requests on fast systems where the loading thread finishes before the main thread adds the next request. 3. Thread-exit race: StartReloadLoop could exit while new requests were being added, with a brief window where IsRunning() incorrectly returned true, preventing a new thread from being scheduled. Changes: - Replace absl::Notification with absl::Mutex + absl::CondVar + bool - RegisterRequest now returns true for genuinely new requests (different ID) - Signal loading thread for ALL new requests, not just high-priority ones - NotifyHighPriorityDataRegisteredForTesting uses the new signaling mechanism
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three distinct race conditions were identified and fixed in the
DataLoaderasync loading mechanism.Race 1:
WaitHighPriorityDataTest- Background thread races with 100ms timeoutRoot cause
The background thread used
absl::Notification::WaitForNotificationWithTimeout(100ms)to wait for high-priority data to be registered.On slow or heavily loaded CI runners, the main thread could be preempted for more than 100ms after scheduling the background thread. As a result, the timeout could expire before:
was called.
The background thread would then proceed to build the low-priority request (priority 50), causing the callback assertion to fail:
EXPECT_EQ(response->response.request().priority(), 10);Fix
Replace the one-shot
absl::Notificationwith:absl::Mutex absl::CondVar bool high_priority_notified_When a high-priority request is registered via
StartNewDataBuildTask(),CondVar::SignalAll()wakes the waiting background thread immediately, eliminating the race window.The existing 100ms timeout is retained only as a safety fallback.
Files
src/engine/data_loader.hsrc/engine/data_loader.cc(lines 198-212)Race 2:
LowPriorityRequestTest-RegisterRequest()rejects valid low-priority requestsRoot cause
RegisterRequest()returned:When a new request was added but the highest-priority request remained unchanged (for example, adding priority 100 while priority 50 was already loaded and remained at the front),
RegisterRequest()returnedfalse.This caused:
StartNewDataBuildTask(...)to return
false, failing:EXPECT_TRUE(...)On fast systems (such as macOS-15 ARM runners), the loading thread could finish processing the first request before the main thread added the second, making the failure deterministic.
Fix
Track whether the request is genuinely new:
when a new ID is inserted into
requests_.Return:
This ensures that newly registered requests (with different fingerprints) are always accepted, even when they do not change the sorted front request.
File
src/engine/data_loader.cc(lines 110-130)Race 3: Thread exits while new requests are being added
Root cause
After
StartReloadLoop()exited because:GetPendingRequestData() == std::nullopt(a condition where the front request matched
current_request_id_), there was a small window between:BackgroundFuture::Ready()becomingtrue.During this window,
IsRunning()could incorrectly returntruebecause theBackgroundFutureobject still existed and had not yet been marked ready.As a result,
StartNewDataBuildTask()would not schedule a replacement worker thread, leaving the newly registered request unprocessed.Fix
Signal:
for every new request registration, not only high-priority requests.
This immediately wakes any thread currently blocked in
WaitWithTimeout(). Combined with theCondVarmechanism introduced in Race 1, the worker thread re-evaluates pending requests whenever new work arrives and no longer misses queued requests.File
src/engine/data_loader.cc(lines 239-247)Changes
src/engine/data_loader.hRemoved
Replaced
with:
Updated
NotifyHighPriorityDataRegisteredForTesting()now uses the mutex and condition-variable signaling mechanism.src/engine/data_loader.ccRegisterRequest(): trackis_newfor newly inserted request IDs; return `is_newStartReloadLoop(): replace notification wait withCondVar::WaitWithTimeout()and track whether wake-up was caused by a high-priority signal or timeoutStartNewDataBuildTask(): callsignal_cv_.SignalAll()for every newly registered requestTesting
data_loader_testno longer depends on timing-sensitiveabsl::Notificationbehavior.WaitHighPriorityDataTestnow wakes the background thread immediately viaCondVar::SignalAll(), eliminating the 100ms race.LowPriorityRequestTestnow consistently accepts new request IDs regardless of timing.data_loader_test.ccremain unchanged and pass with the new implementation.