fix(l1): final mempool re-drain in build_payload_loop#6596
Conversation
Lines of code reportTotal lines added: Detailed view |
🤖 Kimi Code ReviewReview of The fix for the race condition between mempool insertion and 1. Performance: Unconditional clone of // Only rebuild if mempool has new content since last iteration
if mempool.has_new_transactions(since_timestamp) {
// spawn rebuild
}2. Missing timeout on blocking operation (lines 448-452) match tokio::time::timeout(Duration::from_millis(500),
tokio::task::spawn_blocking(...)
).await { ... }3. 4. Error handling consistency warn!(%err, payload_id = %payload.id, "Final payload rebuild failed");5. Cancellation safety Nit: The variable name Overall, the logic correctly addresses the race condition, but the lack of timeout is a reliability concern for production validator operations. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Claude Code ReviewNow I have a full picture. Here is my review: PR Review: fix(l1): final mempool re-drain in
|
Greptile SummaryThis PR fixes a race condition in
Confidence Score: 4/5The change is safe to merge; the extra rebuild after a natural slot-timeout is wasteful but not incorrect. The fix is small and well-contained: one extra crates/blockchain/payload.rs — specifically the guard condition for the final rebuild in
|
| Filename | Overview |
|---|---|
| crates/blockchain/payload.rs | Adds a final synchronous build_payload call after build_payload_loop exits to close the race window between cancellation and the last rebuild; error handling correctly downgrades failures to warnings and preserves the previous result. |
Sequence Diagram
sequenceDiagram
participant CL as Consensus Layer
participant BPL as build_payload_loop
participant MP as Mempool
participant BP as build_payload
CL->>BPL: engine_forkchoiceUpdated (start loop)
BPL->>BP: initial build → res (empty, no txs yet)
MP-->>BPL: tx_added.notify (eth_sendRawTransaction)
BPL->>BP: spawn_blocking (pass 1, would include tx)
CL->>BPL: engine_getPayload → cancel_token.cancel()
Note over BPL: run_until_cancelled returns None, pass 1 result dropped
Note over BPL,BP: FIX: final synchronous rebuild
BPL->>BP: spawn_blocking (final pass, reads fresh mempool)
BP-->>BPL: "res = payload with tx"
BPL-->>CL: return res (non-empty)
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
crates/blockchain/payload.rs:450-454
**Final rebuild runs on every loop exit, including slot-timeout**
The final `build_payload` runs unconditionally regardless of whether the loop exited because of cancellation (the intended race fix) or because `SECONDS_PER_SLOT` elapsed naturally. In the timeout path the loop has already been rebuilding continuously for 12 seconds and the last successful rebuild is always fresh, so this extra pass adds latency without any benefit. Since `engine_getPayload` awaits the task to completion before returning to the CL, the surplus rebuild is on the block-proposal critical path every time. A simple guard like `if cancel_token.is_cancelled()` before the final `spawn_blocking` would limit the extra work to the cases where the race can actually occur.
Reviews (1): Last reviewed commit: "fix(l1): final mempool re-drain in build..." | Re-trigger Greptile
| match tokio::task::spawn_blocking(move || final_self.build_payload(payload)).await { | ||
| Ok(Ok(final_res)) => res = final_res, | ||
| Ok(Err(err)) => warn!(%err, "Final payload rebuild failed; returning previous result"), | ||
| Err(err) => warn!(%err, "Final payload rebuild task panicked"), | ||
| } |
There was a problem hiding this comment.
Final rebuild runs on every loop exit, including slot-timeout
The final build_payload runs unconditionally regardless of whether the loop exited because of cancellation (the intended race fix) or because SECONDS_PER_SLOT elapsed naturally. In the timeout path the loop has already been rebuilding continuously for 12 seconds and the last successful rebuild is always fresh, so this extra pass adds latency without any benefit. Since engine_getPayload awaits the task to completion before returning to the CL, the surplus rebuild is on the block-proposal critical path every time. A simple guard like if cancel_token.is_cancelled() before the final spawn_blocking would limit the extra work to the cases where the race can actually occur.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/payload.rs
Line: 450-454
Comment:
**Final rebuild runs on every loop exit, including slot-timeout**
The final `build_payload` runs unconditionally regardless of whether the loop exited because of cancellation (the intended race fix) or because `SECONDS_PER_SLOT` elapsed naturally. In the timeout path the loop has already been rebuilding continuously for 12 seconds and the last successful rebuild is always fresh, so this extra pass adds latency without any benefit. Since `engine_getPayload` awaits the task to completion before returning to the CL, the surplus rebuild is on the block-proposal critical path every time. A simple guard like `if cancel_token.is_cancelled()` before the final `spawn_blocking` would limit the extra work to the cases where the race can actually occur.
How can I resolve this? If you propose a fix, please make it concise.
🤖 Codex Code Review
I don’t see a direct EVM/gas/RLP/consensus correctness bug in the diff itself, but I would not merge this as-is because of the proposer-path latency regression above. I couldn’t run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
9bfb9d4 to
b028dd2
Compare
`build_payload_loop` rebuilds the payload on every `tx_added` notification, but when `engine_getPayload` cancels the loop mid-rebuild the in-progress build is dropped (`run_until_cancelled` returns `None`) and the returned `res` falls back to the previous, possibly empty, payload. Race timeline: T0 FCU-with-attrs -> build pass 0, res = empty T1 tx arrives -> notify_waiters fires T2 loop wakes -> spawns pass 1 (would include the tx) T3 getPayload calls cancel.cancel() T4 run_until_cancelled returns None, pass 1 result dropped T5 loop exits, returns res = empty Add one final synchronous `build_payload` call after the loop exits. The mempool is read fresh, so any tx that landed during the race window is included. Cost: one extra block-build pass per `engine_getPayload` (few ms for typical mempool sizes; geth/reth pay the same). The latency is on the critical path of block proposal but is bounded by the existing payload build time.
Bot review consensus (Claude/Codex/Greptile/Kimi) flagged that the
final rebuild ran on every loop exit, including the slot-timeout
path where the loop already drained the mempool continuously for
12s. Gate it on `cancel_token.is_cancelled()` so only the race
path pays the extra build.
Also re-word the comment ("synchronous" was misleading — the call
is async via `spawn_blocking + await`) and rename `final_self` to
`blockchain`.
Remove the three Hive Paris "Invalid Missing Ancestor Syncing ReOrg"
entries from KNOWN_FLAKY_TESTS in check-hive-results.sh. Local Hive
runs on this branch passed the full reorg-family pattern twice; with
the race closed there is no reason to keep bypassing them. CI will
verify on the real ethpandaops hive image.
Replace the is_cancelled() guard on the final build_payload pass with a monotonic mempool tx_seq comparison. Catches the slot-timeout select! race in addition to the cancellation race, and skips the rebuild when res is already fresh.
b028dd2 to
ce610ad
Compare
|
| EIP | Bucket | Count |
|---|---|---|
| EIP-7702 | set_code_txs |
24 |
| EIP-7702 | set_code_txs_2 |
15 |
| EIP-7702 | gas |
1 |
| EIP-8037 | state_gas_set_code |
17 |
| EIP-8037 | state_gas_pricing |
1 |
| EIP-8037 | state_gas_sstore |
1 |
| EIP-7928 | block_access_lists_eip7702 |
8 |
| EIP-7928 | block_access_lists |
1 |
| EIP-7778 | gas_accounting |
3 |
| EIP-7708 | transfer_logs |
1 |
| EIP-7976 | refunds |
1 |
| EIP-1344 | chainid (Amsterdam fork-transition fixture) |
1 |
| Total | 74 |
Re-enable once we either:
- (a) bump fixtures to a snobal-devnet-7 release that locks in the new
accounting; or - (b) revert the bal-devnet-7-prep subtraction for bal-devnet-6
compatibility.
Full test list (74)
EIP-7702 — for_amsterdam/prague/eip7702_set_code_tx/set_code_txs/
delegation_clearingdelegation_clearing_and_setdelegation_clearing_failing_txdelegation_clearing_tx_toeoa_tx_after_set_codeext_code_on_chain_delegating_set_codeext_code_on_self_delegating_set_codeext_code_on_self_set_codeext_code_on_set_codemany_delegationsnonce_overflow_after_first_authorizationnonce_validityreset_codeself_code_on_set_codeself_sponsored_set_codeset_code_multiple_valid_authorization_tuples_same_signer_increasing_nonceset_code_multiple_valid_authorization_tuples_same_signer_increasing_nonce_self_sponsoredset_code_to_logset_code_to_non_empty_storage_non_zero_nonceset_code_to_self_destructset_code_to_self_destructing_account_deployed_in_same_txset_code_to_sstoreset_code_to_sstore_then_sloadset_code_to_system_contract
EIP-7702 — for_amsterdam/prague/eip7702_set_code_tx/set_code_txs_2/
call_pointer_to_created_from_create_after_oog_call_againcall_to_precompile_in_pointer_contextcontract_storage_to_pointer_with_storagedelegation_replacement_call_previous_contractdouble_authpointer_measurementspointer_normalpointer_reentrypointer_resets_an_empty_code_account_with_storagepointer_revertspointer_to_pointerpointer_to_precompilepointer_to_staticpointer_to_static_reentrystatic_to_pointer
EIP-7702 — for_amsterdam/prague/eip7702_set_code_tx/gas/
account_warming
EIP-8037 — for_amsterdam/amsterdam/eip8037_state_creation_gas_cost_increase/state_gas_set_code/
auth_refund_block_gas_accountingauth_refund_bypasses_one_fifth_capauth_with_calldata_and_access_listauth_with_multiple_sstoresauthorization_exact_state_gas_boundaryauthorization_to_precompile_addressauthorization_with_sstoreduplicate_signer_authorizationsexisting_account_auth_header_gas_used_uses_worst_caseexisting_account_refundexisting_account_refund_enables_sstoreexisting_auth_with_reverted_execution_preserves_intrinsicmany_authorizations_state_gasmixed_auths_header_gas_used_uses_worst_casemixed_new_and_existing_authsmixed_valid_and_invalid_authsmulti_tx_block_auth_refund_and_sstore
EIP-8037 — state_gas_pricing/
auth_state_gas_scales_with_cpsb
EIP-8037 — state_gas_sstore/
sstore_state_gas_all_tx_types
EIP-7928 — for_amsterdam/amsterdam/eip7928_block_level_access_lists/block_access_lists_eip7702/
bal_7702_delegation_clearbal_7702_delegation_createbal_7702_delegation_updatebal_7702_double_auth_resetbal_7702_double_auth_swapbal_7702_null_address_delegation_no_code_changebal_selfdestruct_to_7702_delegationbal_withdrawal_to_7702_delegation
EIP-7928 — block_access_lists/
bal_all_transaction_types
EIP-7778 — for_amsterdam/amsterdam/eip7778_block_gas_accounting_without_refunds/gas_accounting/
multiple_refund_types_in_one_txsimple_gas_accountingvarying_calldata_costs
EIP-7708 — for_amsterdam/amsterdam/eip7708_eth_transfer_logs/transfer_logs/
transfer_with_all_tx_types
EIP-7976 — for_amsterdam/amsterdam/eip7976_increase_calldata_floor_cost/refunds/
gas_refunds_from_data_floor
EIP-1344 — for_amsterdam/istanbul/eip1344_chainid/chainid/
chainid(Amsterdam fork-transition fixture)
Summary
Closes a race in the payload builder that causes
engine_getPayloadto occasionally return a stale, often empty, payload when a transaction arrives in the narrow window between the loop's most recent rebuild and the cancellation triggered byengine_getPayload.Fix: one additional synchronous
build_payloadcall at the end ofbuild_payload_loop, just before returning. It re-reads the mempool one last time so any transaction that landed during the race window is included.Aim: kill the Hive Paris reorg-flake family
There is a family of four Hive Paris Engine tests that have been flaking on this repo for weeks. Three of them are already on a known-flaky ignore list in
.github/scripts/check-hive-results.sh:62-66:The argument that this fix addresses all four follows.
Shared precondition
The four test names share parameter
EmptyTxs=False. That parameter asserts "this test expects a payload containing at least one transaction." It is set at test-instance generation time, not a behavioral choice the runtime makes.Shared script structure
The
Invalid Missing Ancestor Syncing ReOrgfamily runs this exact sequence:eth_sendRawTransaction— seed the mempoolengine_forkchoiceUpdatedV1with attrs — start buildingengine_getPayloadV1— pull the built payloadengine_newPayloadV1with mutated payload — submit, expect INVALIDThe mutated field changes per variant (
Timestamp,Transaction Value,Transaction Nonce), but step 4 only succeeds when step 3 produced a non-empty payload. So all four share the precondition "step 3 returns a payload with the seeded tx in it." The mutated field is irrelevant to the precondition.The race produces exactly this precondition failure
Captured Hive log on the new failing variant shows:
i.e. step 3 returned an empty payload. The race is in
build_payload_loop:The mempool has the tx since T1 (we verified
add_transactionholds the write lock until insert is complete, so the tx is durably visible to readers). The bug is solely on the build side: pass 1's result is dropped on cancellation.Why this family is the canary
Other engine-api tests either:
EmptyTxs=Truevariants) — empty payload is fine.Modified Geth Modulein the log) — bypasses ethrex's builder.The reorg family doesn't pad — reorg semantics require tight timing between FCU and getPayload, so this family exposes the race more than any other.
Why the fix targets all four
After the loop exits — regardless of how it exited — the fix runs one more synchronous
build_payload(payload). That call:fetch_mempool_transactionsresbefore returningThe fix doesn't care which field the test will mutate. It only guarantees that the returned payload reflects the current mempool state. If the mempool had the tx at any point before getPayload returned, the payload will include it.
Honest limits
check-hive-results.sh:61is wrong. It says "These are hive framework issues, not ethrex bugs." Framework behavior is correct; the empty payload is the bug. Worth re-examining as a follow-up.Fix
crates/blockchain/payload.rs— add one finalbuild_payloadcall after the rebuild loop exits. Errors are downgraded to a warning that keeps the previousres; we never make the payload worse than the loop already produced.Cost
One extra full payload build per
engine_getPayload. For typical mempool depths this is a few milliseconds; geth and reth pay the same cost. The latency lands on the block-proposal critical path but is bounded by the same code path that runs in the rebuild loop today.Affected paths
engine_getPayloadV1/V2/V3/V4only. Non-builder full nodes never callget_payloadand do not hit this code.Test plan
cargo fmt --allcargo check --workspace --all-targetsmake lint-l1make run-hive SIMULATION=ethereum/engine TEST_PATTERN='engine-api/Invalid Missing Ancestor Syncing ReOrg, (Timestamp|Transaction (Value|Nonce))'Hive - Paris Engine testspasses without the entries inKNOWN_FLAKY_TESTSHive - Cancun Engine tests/Engine withdrawal tests/Reorg TestsFollow-up (not in this PR)
After this PR lands and Hive shows clean on the four variants for a few runs:
KNOWN_FLAKY_TESTSin.github/scripts/check-hive-results.sh.getPayloadtime (snapshot-then-build model). That removes the timing dependency by construction.