Skip empty old log router ranges#13163
Conversation
tclinkenbeard-oai
left a comment
There was a problem hiding this comment.
Generated by Codex.
What is it trying to do?
Skip old log-router generations whose effective range is already empty before constructing a SetPeekCursor, so recovery does not hand log routers a cursor that is born exhausted and can then stall in repeated slow-peek retries.
Is it correct?
Yes, by inspection this looks correct. The patch factors the effective old-generation end into oldEnd, skips when begin >= oldEnd, and reuses the same bound when constructing the cursor. That matches the existing range semantics in this path: the first old generation is capped at recoveredAt + 1 when present, later old generations use epochEnd, and an empty half-open range should not produce a cursor at all.
I inspected the changed branch in peekLogRouter, the SetPeekCursor behavior it feeds, and the log-router consumer loop that retries slow peeks. I did not run builds or tests. GitHub currently reports no status checks on the branch, so there are no failing checks to weigh, but there is also no CI signal yet.
Are there bugs?
I did not find any correctness bugs in this change.
Are there omissions?
There is no dedicated regression test added for the empty-old-range case. The PR body documents a deterministic simulation replay that exercised the failure and the fix, so I do not think that omission blocks this small change, but a permanent test would make the edge case harder to reintroduce later.
Are there better ways of doing things?
No materially better implementation stands out. Computing the effective old-generation end once and using it both for the skip decision and the returned cursor is the cleanest version of this fix.
Should this CL be LGTMd?
Yes, LGTM. The change is narrow, the range check is placed before the problematic cursor construction, and it preserves the existing first-old-generation boundary rule. The main remaining risk is only the lack of a committed regression test; I do not see a correctness reason to hold the patch for that.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
|
@sbodagala can you look at this PR |
| Version oldEnd = firstOld && recoveredAt.present() ? recoveredAt.get() + 1 : old.epochEnd; | ||
| if (begin >= oldEnd) { | ||
| firstOld = false; | ||
| continue; |
There was a problem hiding this comment.
Can you add a trace event here? Something like:
TraceEvent("TLogPeekLogRouterSkipEmptyOldRange", dbgid)
.detail("Tag", tag.toString())
.detail("Begin", begin)
.detail("OldEnd", oldEnd)
.detail("FirstOld", firstOld);
…-router-empty-old-range
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
|
@tclinkenbeard-oai lgtm, we generally try to post 100K joshua summaries in the PR. If you've one, can you add it to the PR description? Thanks. |
Done, thanks! |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Summary
Skip old log-router generations when the requested range is already empty.
Why
A recovery path could return a
SetPeekCursorfor an old log generation even whenbegin >= end. That cursor is born exhausted, so downstream log routers wait forever, repeatedly emitLogRouterSlowPeek/NoPrimaryPeekLocationForLR, and can eventually fail the test withTracedTooManyLines.Fix
Compute the effective old-generation end once and skip generations whose range is already empty before constructing a cursor.
Validation
tests/fast/Sideband.toml3182378863TracedTooManyLines0LogRouterSlowPeek0NoPrimaryPeekLocationForLR0empty old-range peeks