Fix flaky //engine:data_loader_test on macOS-15 CI runners#1521
Fix flaky //engine:data_loader_test on macOS-15 CI runners#1521Senthil455 wants to merge 1 commit into
Conversation
Replace the one-shot absl::Notification with absl::CondVar + bool for reliable thread wakeup, and fix RegisterRequest to accept low-priority requests even when the sorted front doesn't change. Root causes: 1. WaitHighPriorityDataTest: Notification::WaitForNotificationWithTimeout (100ms) could expire on slow CI before the main thread registered the high-priority request. Using CondVar::WaitWithTimeout allows immediate wakeup via SignalAll() when any new request is registered. 2. LowPriorityRequestTest: RegisterRequest returned false for new requests that didn't change the sorted front (lower priority), causing StartNewDataBuildTask to reject valid requests on fast systems where the loading thread finishes before the main thread adds the next request. 3. Thread-exit race: StartReloadLoop could exit while new requests were being added, with a brief window where IsRunning() incorrectly returned true, preventing a new thread from being scheduled. Changes: - Replace absl::Notification with absl::Mutex + absl::CondVar + bool - RegisterRequest now returns true for genuinely new requests (different ID) - Signal loading thread for ALL new requests, not just high-priority ones - NotifyHighPriorityDataRegisteredForTesting uses the new signaling mechanism
|
Thanks for digging into this, and for the detailed writeup — a couple of the underlying diagnoses are genuinely correct (the CI evidenceI built the PR branch and ran
Failures by test case (true
So ~18% of runs now fail, the original Suggested path forwardMy understanding is that the original two flaky tests (
Both of these are already implemented in my PR #1522 (test-only, +11/-5), which passes 300/300 (3 macOS-15 shards × 100 runs) with production code unchanged. Given that, I'm going to close this PR in favor of #1522 — but thank you, the timing-fragility diagnosis here directly led to finding and fixing both flakes. |
Three distinct race conditions were identified and fixed in the
DataLoaderasync loading mechanism.Race 1:
WaitHighPriorityDataTest- Background thread races with 100ms timeoutRoot cause
The background thread used
absl::Notification::WaitForNotificationWithTimeout(100ms)to wait for high-priority data to be registered.On slow or heavily loaded CI runners, the main thread could be preempted for more than 100ms after scheduling the background thread. As a result, the timeout could expire before:
was called.
The background thread would then proceed to build the low-priority request (priority 50), causing the callback assertion to fail:
EXPECT_EQ(response->response.request().priority(), 10);Fix
Replace the one-shot
absl::Notificationwith:absl::Mutex absl::CondVar bool high_priority_notified_When a high-priority request is registered via
StartNewDataBuildTask(),CondVar::SignalAll()wakes the waiting background thread immediately, eliminating the race window.The existing 100ms timeout is retained only as a safety fallback.
Files
src/engine/data_loader.hsrc/engine/data_loader.cc(lines 198-212)Race 2:
LowPriorityRequestTest-RegisterRequest()rejects valid low-priority requestsRoot cause
RegisterRequest()returned:When a new request was added but the highest-priority request remained unchanged (for example, adding priority 100 while priority 50 was already loaded and remained at the front),
RegisterRequest()returnedfalse.This caused:
StartNewDataBuildTask(...)to return
false, failing:EXPECT_TRUE(...)On fast systems (such as macOS-15 ARM runners), the loading thread could finish processing the first request before the main thread added the second, making the failure deterministic.
Fix
Track whether the request is genuinely new:
when a new ID is inserted into
requests_.Return:
This ensures that newly registered requests (with different fingerprints) are always accepted, even when they do not change the sorted front request.
File
src/engine/data_loader.cc(lines 110-130)Race 3: Thread exits while new requests are being added
Root cause
After
StartReloadLoop()exited because:GetPendingRequestData() == std::nullopt(a condition where the front request matched
current_request_id_), there was a small window between:BackgroundFuture::Ready()becomingtrue.During this window,
IsRunning()could incorrectly returntruebecause theBackgroundFutureobject still existed and had not yet been marked ready.As a result,
StartNewDataBuildTask()would not schedule a replacement worker thread, leaving the newly registered request unprocessed.Fix
Signal:
for every new request registration, not only high-priority requests.
This immediately wakes any thread currently blocked in
WaitWithTimeout(). Combined with theCondVarmechanism introduced in Race 1, the worker thread re-evaluates pending requests whenever new work arrives and no longer misses queued requests.File
src/engine/data_loader.cc(lines 239-247)Changes
src/engine/data_loader.hRemoved
Replaced
with:
Updated
NotifyHighPriorityDataRegisteredForTesting()now uses the mutex and condition-variable signaling mechanism.src/engine/data_loader.ccRegisterRequest(): trackis_newfor newly inserted request IDs; return `is_newStartReloadLoop(): replace notification wait withCondVar::WaitWithTimeout()and track whether wake-up was caused by a high-priority signal or timeoutStartNewDataBuildTask(): callsignal_cv_.SignalAll()for every newly registered requestTesting
data_loader_testno longer depends on timing-sensitiveabsl::Notificationbehavior.WaitHighPriorityDataTestnow wakes the background thread immediately viaCondVar::SignalAll(), eliminating the 100ms race.LowPriorityRequestTestnow consistently accepts new request IDs regardless of timing.data_loader_test.ccremain unchanged and pass with the new implementation.