Skip empty old log router ranges #13163

Open
tclinkenbeard-oai wants to merge 4 commits into apple:main from
tclinkenbeard-oai:dev/tclinkenbeard/log-router-empty-old-range
@tclinkenbeard-oai
Collaborator

Summary

Skip old log-router generations when the requested range is already empty.

Why

A recovery path could return a SetPeekCursor for an old log generation even when begin >= end. That cursor is born exhausted, so downstream log routers wait forever, repeatedly emit LogRouterSlowPeek / NoPrimaryPeekLocationForLR, and can eventually fail the test with TracedTooManyLines.
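
To make "born exhausted" concrete, here is a minimal sketch of a cursor over a half-open version range. `RangeCursor` and `drain` are hypothetical names used only for illustration; this is not the real SetPeekCursor, and it models nothing beyond the range bookkeeping.

```cpp
#include <cassert>
#include <cstdint>

using Version = std::int64_t;

// Hypothetical stand-in for a peek cursor over the half-open range [begin, end).
// It is NOT the real SetPeekCursor; it only tracks the next version to read.
struct RangeCursor {
    Version next;
    Version end;

    RangeCursor(Version begin, Version end) : next(begin), end(end) {}

    // True while at least one version in [next, end) remains unread.
    bool hasMessage() const { return next < end; }

    // Advancing an exhausted cursor would be a caller bug.
    void advance() {
        assert(hasMessage());
        ++next;
    }
};

// Counts how many versions a cursor yields before it is exhausted.
int drain(RangeCursor cursor) {
    int count = 0;
    while (cursor.hasMessage()) {
        cursor.advance();
        ++count;
    }
    return count;
}
```

With `begin >= end`, `hasMessage()` is false from the moment of construction, so a consumer that waits for this cursor to produce data never makes progress.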

Fix

Compute the effective old-generation end once and skip generations whose range is already empty before constructing a cursor.

Validation

  • Reproduced the failure with:
    • tests/fast/Sideband.toml
    • seed 3182378863
    • buggify enabled
  • Before the fix: deterministic failure with repeated slow-peek warnings and TracedTooManyLines
  • After the fix:
    • same replay passes
    • 0 LogRouterSlowPeek
    • 0 NoPrimaryPeekLocationForLR
    • 0 empty old-range peeks

Collaborator Author

@tclinkenbeard-oai tclinkenbeard-oai left a comment

Generated by Codex.

What is it trying to do?

Skip old log-router generations whose effective range is already empty before constructing a SetPeekCursor, so recovery does not hand log routers a cursor that is born exhausted and can then stall in repeated slow-peek retries.

Is it correct?

Yes, by inspection this looks correct. The patch factors the effective old-generation end into oldEnd, skips when begin >= oldEnd, and reuses the same bound when constructing the cursor. That matches the existing range semantics in this path: the first old generation is capped at recoveredAt + 1 when present, later old generations use epochEnd, and an empty half-open range should not produce a cursor at all.
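
The range rule described above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the actual peekLogRouter code: `OldGeneration`, `effectiveOldEnd`, and `firstNonEmptyOldRange` are hypothetical names, and the real function constructs a SetPeekCursor rather than returning a range.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Version = std::int64_t;

// Hypothetical, simplified stand-in for one old log generation.
struct OldGeneration {
    Version epochEnd;
};

// Effective exclusive end for one old generation: the first old generation is
// capped at recoveredAt + 1 when recoveredAt is present; later ones use epochEnd.
Version effectiveOldEnd(const OldGeneration& old, bool firstOld, const std::optional<Version>& recoveredAt) {
    return (firstOld && recoveredAt.has_value()) ? *recoveredAt + 1 : old.epochEnd;
}

// Returns the first non-empty half-open range [begin, oldEnd) among the old
// generations, or nullopt when every candidate range is empty.
std::optional<std::pair<Version, Version>> firstNonEmptyOldRange(const std::vector<OldGeneration>& oldGenerations,
                                                                 Version begin,
                                                                 const std::optional<Version>& recoveredAt) {
    bool firstOld = true;
    for (const OldGeneration& old : oldGenerations) {
        const Version oldEnd = effectiveOldEnd(old, firstOld, recoveredAt);
        firstOld = false;
        if (begin >= oldEnd) {
            continue; // empty range: skip instead of building a cursor
        }
        assert(begin < oldEnd); // the range handed back is never empty
        return std::make_pair(begin, oldEnd);
    }
    return std::nullopt;
}
```

The bound is computed once per generation and used for both the skip decision and the returned range, which mirrors the structure of the patch.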

I inspected the changed branch in peekLogRouter, the SetPeekCursor behavior it feeds, and the log-router consumer loop that retries slow peeks. I did not run builds or tests. GitHub currently reports no status checks on the branch, so there are no failing checks to weigh, but there is also no CI signal yet.

Are there bugs?

I did not find any correctness bugs in this change.

Are there omissions?

There is no dedicated regression test added for the empty-old-range case. The PR body documents a deterministic simulation replay that exercised the failure and the fix, so I do not think that omission blocks this small change, but a permanent test would make the edge case harder to reintroduce later.

Are there better ways of doing things?

No materially better implementation stands out. Computing the effective old-generation end once and using it both for the skip decision and the returned cursor is the cleanest version of this fix.

Should this CL be LGTMd?

Yes, LGTM. The change is narrow, the range check is placed before the problematic cursor construction, and it preserves the existing first-old-generation boundary rule. The main remaining risk is only the lack of a committed regression test; I do not see a correctness reason to hold the patch for that.

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: c4fcecc
  • Duration 0:28:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c4fcecc
  • Duration 0:45:04
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c4fcecc
  • Duration 0:47:05
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c4fcecc
  • Duration 0:49:24
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c4fcecc
  • Duration 1:13:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: c4fcecc
  • Duration 1:54:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} rm -rf foundationdb. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: c4fcecc
  • Duration 1:55:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Collaborator

gxglass commented May 6, 2026

@sbodagala can you look at this PR

@spraza spraza self-requested a review May 7, 2026 00:12
  Version oldEnd = firstOld && recoveredAt.present() ? recoveredAt.get() + 1 : old.epochEnd;
  if (begin >= oldEnd) {
      firstOld = false;
      continue;
Collaborator

Can you add a trace event here? Something like:

  TraceEvent("TLogPeekLogRouterSkipEmptyOldRange", dbgid)
      .detail("Tag", tag.toString())
      .detail("Begin", begin)
      .detail("OldEnd", oldEnd)
      .detail("FirstOld", firstOld);
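
One placement detail may be worth checking: the skip branch in the diff resets `firstOld` to false, so if the event is meant to record the `firstOld` value that was used to compute `oldEnd`, it would have to be emitted before that reset. The sketch below illustrates the ordering with a plain record type. `SkipRecord` and `skipIfEmpty` are hypothetical stand-ins, not FoundationDB APIs, and the vector stands in for the trace log.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Version = std::int64_t;

// Hypothetical stand-in for the fields the suggested trace event would carry.
struct SkipRecord {
    std::string tag;
    Version begin;
    Version oldEnd;
    bool firstOld;
};

// Handles one old generation. Returns true when [begin, oldEnd) is empty and the
// generation should be skipped. The record is appended before firstOld is reset,
// so it captures the value that was in effect when oldEnd was computed.
bool skipIfEmpty(const std::string& tag,
                 Version begin,
                 Version oldEnd,
                 bool& firstOld,
                 std::vector<SkipRecord>& log) {
    if (begin < oldEnd) {
        return false; // non-empty range: nothing to skip, nothing to log
    }
    assert(begin >= oldEnd); // empty half-open range
    log.push_back(SkipRecord{ tag, begin, oldEnd, firstOld });
    firstOld = false;
    return true;
}
```

If the record were appended after the reset instead, the `firstOld` field would always read false and carry no information.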

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 04f688a
  • Duration 0:38:41
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 04f688a
  • Duration 0:46:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 04f688a
  • Duration 1:11:07
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 04f688a
  • Duration 1:11:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 04f688a
  • Duration 1:15:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 04f688a
  • Duration 2:02:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 04f688a
  • Duration 2:24:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza
Collaborator

spraza commented May 7, 2026

@tclinkenbeard-oai LGTM. We generally try to post 100K Joshua summaries in the PR. If you have one, can you add it to the PR description? Thanks.

@sbodagala sbodagala self-requested a review May 7, 2026 19:49
@sbodagala
Contributor

> @sbodagala can you look at this PR

Done, thanks!

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 5a63d11
  • Duration 0:24:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 5a63d11
  • Duration 0:35:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 5a63d11
  • Duration 0:46:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 5a63d11
  • Duration 0:46:08
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 5a63d11
  • Duration 0:46:22
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 5a63d11
  • Duration 1:02:29
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 5a63d11
  • Duration 1:04:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

5 participants