perf(l1): reduce BAL parallel-path overhead #6639
Conversation
Benchmark Results Comparison
No significant difference was registered for any benchmark run.
Detailed results: BubbleSort, ERC20Approval, ERC20Mint, ERC20Transfer, Factorial, FactorialRecursive, Fibonacci, FibonacciRecursive, ManyHashes, MstoreBench, Push, SstoreBench_no_opt.
75c47b2 to e792377 (force-push)
Benchmark Block Execution Results Comparison Against Main
e792377 to 161f583 (force-push)
Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).
The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):
A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
`for updates in rx { ... }` blocked until channel close (= exec end).
execute_block_parallel sends exactly one batch up front from
bal_to_account_updates, so the drain did nothing useful and only serialized Stage B
(parallel storage roots) behind exec instead of overlapping with it.
Replaced with a single rx.recv() and dropped the FxHashMap merge step
(BAL guarantees one entry per address).
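For context, the shape of the change as a minimal sketch (the real function carries queue bookkeeping and error handling; types and field names here are stand-ins):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

// Toy stand-ins for the real types (illustrative only).
type Address = [u8; 20];
struct AccountUpdate { address: Address }

// Before (sketch): drain until the sender hangs up at the end of execution and
// merge batches by address — Stage B cannot start until exec has finished.
fn stage_a_before(rx: Receiver<Vec<AccountUpdate>>) -> Vec<AccountUpdate> {
    let mut merged: HashMap<Address, AccountUpdate> = HashMap::new();
    for batch in rx {
        for update in batch {
            merged.insert(update.address, update);
        }
    }
    merged.into_values().collect()
}

// After (sketch): exactly one batch arrives up front, so a single recv() frees
// Stage B (parallel storage roots) to run while execution is still in flight.
fn stage_a_after(rx: Receiver<Vec<AccountUpdate>>) -> Vec<AccountUpdate> {
    rx.recv().unwrap_or_default()
}
```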
C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
Bench tooling change. The sleep gates background trie-layer writeback
from bleeding into the next block's per-block timer; 100ms is well
above measured Phase 2 cost on SSD. Cuts bench wall clock by 80% without
affecting the per-block metric. NO effect on production paths.
Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
storage-only updates (info: None, removed: false by construction);
full info coverage with non-empty post (removal impossible, info from
BAL alone). Slow path keeps existing behavior for partial coverage.
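A minimal sketch of the two fast paths, using toy types (the real bal_to_account_updates operates on the actual BAL and account structures; the names below are assumptions):

```rust
// Toy stand-ins for the real types (illustrative only).
struct Info { nonce: u64, balance: u128, code_hash: [u8; 32] }
struct Update { info: Option<Info>, removed: bool, storage: Vec<(u64, u64)> }

fn bal_entry_to_update(
    post_info: Option<Info>,       // None => storage-only entry
    covers_all_info_fields: bool,  // BAL carries nonce + balance + code hash
    storage: Vec<(u64, u64)>,
    read_prestate: impl FnOnce() -> Info,
) -> Update {
    match post_info {
        // Fast path 1: storage-only — info stays None, removal impossible by construction.
        None => Update { info: None, removed: false, storage },
        // Fast path 2: full info coverage with a non-empty post-state — removal is
        // impossible and the info comes from the BAL alone, so skip the prestate read.
        Some(info) if covers_all_info_fields => Update { info: Some(info), removed: false, storage },
        // Slow path (unchanged, simplified here): partial coverage still reads the prestate.
        Some(partial) => {
            let pre = read_prestate();
            Update { info: Some(Info { nonce: partial.nonce, ..pre }), removed: false, storage }
        }
    }
}
```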
Q2. Per-tx GeneralizedDatabase capacity cap at 32
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
sized to bal.accounts().len() (often 100s on stress blocks); p50 tx
touches <10 accounts. Reduced allocator pressure across rayon workers.
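The cap itself is a one-liner; a sketch (constant name assumed):

```rust
// A median tx touches <10 accounts even when the block-level BAL lists hundreds,
// so clamp the per-tx pre-allocation instead of sizing to bal.accounts().len().
const PER_TX_ACCOUNT_CAPACITY: usize = 32;

fn per_tx_capacity(bal_account_count: usize) -> usize {
    bal_account_count.min(PER_TX_ACCOUNT_CAPACITY)
}
```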
Q3. Memoize code_from_bal results across seed_db_from_bal calls
(crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
jump_targets) once per BAL code change before the par_iter; pass cache
via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
scans per code change per block (N = tx count).
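A sketch of the memoization (the BalAccountCodeCache alias matches the rebase notes below; the hashing and jump-target helpers are stand-ins):

```rust
// Toy stand-ins (illustrative only).
type H256 = [u8; 32];
#[derive(Clone)]
struct Code { hash: H256, jump_targets: Vec<usize> }

fn keccak(_bytes: &[u8]) -> H256 { [0u8; 32] }                    // stand-in
fn find_jump_targets(_bytes: &[u8]) -> Vec<usize> { Vec::new() }  // stand-in

type BalAccountCodeCache = Vec<(H256, Option<Code>)>;

// Built once per block, before the rayon par_iter over transactions.
fn build_code_cache(code_changes: &[(H256, Vec<u8>)]) -> BalAccountCodeCache {
    code_changes
        .iter()
        .map(|(key, bytecode)| {
            let code = Code { hash: keccak(bytecode), jump_targets: find_jump_targets(bytecode) };
            (*key, Some(code))
        })
        .collect()
}

// seed_db_from_bal takes the cache as an optional parameter and clones the
// pre-computed Code objects instead of re-hashing on every transaction.
fn seed_db_from_bal(_code_cache: Option<&BalAccountCodeCache>) { /* per-tx seeding elided */ }
```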
Q8. Move per-tx BAL validation into the rayon par_iter closure
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
serial post-exec validation pass (~3 ms median across 200 txs). Drops
current_state and codes inside the closure after validation runs —
they no longer cross the rayon boundary, reducing per-tx allocator
pressure. Closure returns deferred Option<EvmError> so gas-limit check
still takes priority over BAL mismatch errors.
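The deferred-error shape, roughly (a sketch: execution and validation are reduced to placeholders, and the gas limit is a made-up number):

```rust
use rayon::prelude::*;

#[derive(Debug)]
enum EvmError { BalMismatch, GasLimitExceeded }

fn run_parallel(txs: &[u32]) -> Result<(), EvmError> {
    // Each worker validates its tx against the BAL inside the closure and returns
    // any mismatch as data instead of failing immediately.
    let per_tx: Vec<(u64, Option<EvmError>)> = txs
        .par_iter()
        .map(|tx| {
            let gas_used = u64::from(*tx);                                  // stand-in for execution
            let bal_error = (*tx % 2 == 1).then(|| EvmError::BalMismatch);  // stand-in for validation
            // current_state / codes would be dropped here, before crossing the rayon boundary.
            (gas_used, bal_error)
        })
        .collect();

    // Serial post-pass: the gas-limit check runs first, so it still takes priority
    // over any deferred BAL mismatch.
    let total_gas: u64 = per_tx.iter().map(|(gas, _)| gas).sum();
    if total_gas > 30_000_000 {
        return Err(EvmError::GasLimitExceeded);
    }
    if let Some(err) = per_tx.into_iter().find_map(|(_, e)| e) {
        return Err(err);
    }
    Ok(())
}
```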
DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
(crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
RwLock::read_contended on the single account RwLock with 16 rayon
workers hammering it. Sharded concurrent map (64 default shards)
eliminates contention. Sequential paths are unaffected (only two threads access the cache there, so it was never contended).
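The cache swap, roughly (a sketch — the real CachingDatabase holds three caches and a backing-store handle; struct and method names here are assumptions):

```rust
use dashmap::DashMap;
use rustc_hash::FxBuildHasher;

type Address = [u8; 20];
#[derive(Clone, Default)]
struct AccountState { nonce: u64, balance: u128 }

// Before: RwLock<HashMap<Address, AccountState>> — 16 rayon workers contending on one lock.
// After: a sharded concurrent map; reads of different shards never block each other.
struct CachingDatabase {
    accounts: DashMap<Address, AccountState, FxBuildHasher>,
}

impl CachingDatabase {
    fn new() -> Self {
        Self { accounts: DashMap::with_hasher(FxBuildHasher::default()) }
    }

    fn get_or_load(&self, addr: Address, load: impl FnOnce() -> AccountState) -> AccountState {
        // entry() locks only the shard that owns this key.
        self.accounts.entry(addr).or_insert_with(load).clone()
    }
}
```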
Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention); other changes only fire
on the BAL parallel-validation path. No regressions in non-parallel paths.
161f583 to 7e081a6 (force-push)
…tence-idle wait — The trie-update worker channel is sync_channel(0) (rendezvous), so a successful send proves the worker drained its previous iteration and returned to recv(). Add TrieMessage::Ping (no-op) plus a Store::wait_for_persistence_idle() that sends one from spawn_blocking, and call it from the import-bench loop instead of sleeping 100ms. Removes the magic-number sleep and tightens the per-block timer against the actual idle signal rather than a worst-case Phase 2/3 estimate. Bench-tooling change only; no effect on production paths.
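A minimal sketch of the rendezvous handshake (the real implementation lives in Store and wraps the blocking send in spawn_blocking; the plumbing below is simplified to a plain thread):

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

enum TrieMessage {
    Update(Vec<u8>), // real payload elided
    Ping,            // no-op: a successful send proves the worker is back at recv()
}

fn spawn_trie_worker() -> SyncSender<TrieMessage> {
    // Capacity 0 => rendezvous channel: send() blocks until the worker is receiving.
    let (tx, rx): (SyncSender<TrieMessage>, Receiver<TrieMessage>) = sync_channel(0);
    thread::spawn(move || {
        for msg in rx {
            match msg {
                TrieMessage::Update(bytes) => { let _ = bytes; /* write diff layer to disk */ }
                TrieMessage::Ping => {} // nothing to do
            }
        }
    });
    tx
}

// The import-bench loop calls this between blocks instead of sleeping 100ms:
// once the Ping is accepted, the previous block's writeback has fully drained.
fn wait_for_persistence_idle(tx: &SyncSender<TrieMessage>) {
    let _ = tx.send(TrieMessage::Ping);
}
```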
Q8 in the BAL parallel-path perf bundle (7e081a6) moved per-tx BAL validation into the rayon closure. As part of the refactor the `unaccessed_pure_accounts.remove(&header.coinbase)` call was hoisted out of the per-tx loop to run unconditionally on every parallel-path invocation. For 0-tx blocks (empty / withdrawal-only on Amsterdam+) that unconditional removal silently exempts a BAL entry the protocol calls extraneous: fee finalization never runs without a tx, so geth's readerTracker never touches the coinbase either. A BAL coinbase entry on such a block is by construction extraneous and must surface as a validation error. Restoring the original gate (only exempt when at least one tx ran) re-rejects the block. Verified against EELS test_bal_invalid_extraneous_coinbase[empty_block] and [withdrawal_only].
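The restored gate is small; a sketch with stand-in types (variable names follow the description above):

```rust
use std::collections::HashSet;

type Address = [u8; 20];

fn prune_coinbase_exemption(
    unaccessed_pure_accounts: &mut HashSet<Address>,
    coinbase: Address,
    exec_results_is_empty: bool,
) {
    // Only exempt the coinbase when at least one tx actually ran: fee finalization
    // never happens on a 0-tx block, so a BAL coinbase entry there is extraneous
    // and must still surface as a validation error.
    if !exec_results_is_empty {
        unaccessed_pure_accounts.remove(&coinbase);
    }
}
```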
CallFrameBackup::original_account_storage_slots starts each fresh account's inner FxHashMap at capacity 0. The first few SSTOREs in any new tx trigger hashbrown::reserve_rehash 3-4 times in sequence (0 → 4 → 8 → 16). perf record on a 460-block bal-devnet-7 mainnet-mix fixture (200 tx/block, ~65 Mgas) showed hashbrown::reserve_rehash as the 7th hottest leaf at 3.02B samples. After pre-sizing to 8 the same leaf drops to 2.19B, a 27% reduction in that frame and ~0.8% of total CPU recovered. Wall-clock impact is sub-noise on this workload (per-tx CPU savings happen inside rayon workers; wall-clock is bound by the longest tx per block) but the CPU savings compound on heavier workloads where critical-path txs hit the rehash chain. Wastes ~256 B per untouched account; negligible.
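The pre-sizing itself is a one-line change; a sketch with the surrounding backup map simplified (key and value types are stand-ins for the real address and slot types):

```rust
use rustc_hash::{FxBuildHasher, FxHashMap};

// First SSTORE against a not-yet-backed-up account: start the inner map at
// capacity 8 instead of 0, skipping the 0 → 4 → 8 reserve_rehash chain.
fn backup_storage_slot(
    original_account_storage_slots: &mut FxHashMap<[u8; 20], FxHashMap<[u8; 32], [u8; 32]>>,
    account: [u8; 20],
    key: [u8; 32],
    original_value: [u8; 32],
) {
    original_account_storage_slots
        .entry(account)
        .or_insert_with(|| FxHashMap::with_capacity_and_hasher(8, FxBuildHasher::default()))
        .entry(key)
        .or_insert(original_value); // only the first write per slot is backed up
}
```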
🤖 Claude Code Review
🤖 Codex Code Review
No blocking correctness or security findings from static review. The deferred BAL-validation ordering and the trie-worker idle handshake both look coherent. Non-blocking perf note:
Testing gap:
Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary
This PR is a performance bundle for the BAL parallel-execution path rebased onto bal-devnet-7-pr.
Note: change B in the PR description (…)
Confidence Score: 4/5
The parallel execution path changes are logically correct and well-tested; the two minor concerns do not affect production block processing. Core logic is sound. Main risk is the implicit one-batch invariant in handle_merkleization_bal — a future modification sending a second batch would silently produce a wrong state root. A debug_assert after the Ok arm would fully mitigate this.
Files flagged for closest attention: crates/blockchain/blockchain.rs (single-recv invariant and error-path Stage C) and crates/vm/backends/levm/mod.rs (deferred BAL validation fast paths).
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Stage A changed from channel drain to single rx.recv(); allows Stage B to overlap with parallel exec. Introduces a fragile one-batch invariant and triggers Stage C (16 trie opens) on the error path. |
| crates/vm/backends/levm/mod.rs | BAL validation moved into rayon closures, code_cache pre-computed, per-tx DB capacity capped, two fast paths added to bal_to_account_updates. Logic correct; deferred-error order preserved. |
| crates/storage/store.rs | Adds TrieMessage enum with Ping variant; wait_for_persistence_idle() correctly exploits sync_channel(0) rendezvous to signal worker idle state. |
| crates/vm/levm/src/db/mod.rs | RwLock replaced with DashMap<_, _, FxBuildHasher> for all three caches; prefetch methods simplified with par_iter + DashMap entry API. |
| cmd/ethrex/cli.rs | Magic-number sleep replaced by wait_for_persistence_idle(); bench-tool only change with no production effect. |
| crates/vm/levm/Cargo.toml | Adds dashmap 6.1 as a direct dep rather than a workspace dep; minor consistency nit. |
Reviews (1): Last reviewed commit: "perf(l1): pre-size backup_storage_slot i..."
```rust
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => {
        let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
        *max_queue_length = current_length.max(*max_queue_length);
        updates
    }
    Err(_) => {
        // Channel closed without a message — execution failed before
        // bal_to_account_updates ran. Return empty work so the exec
        // error surfaces in execution_result rather than being masked.
        Vec::new()
    }
};
```
Single-recv invariant has no defensive check
handle_merkleization_bal now consumes exactly one message and then never touches rx again. If execute_block_parallel is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel") after the Ok arm would catch any accidental protocol change during development.
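Rendered into the Ok arm of the excerpt above, the suggested guard would look roughly like this (a fragment, not the full function):

```rust
Ok(updates) => {
    let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
    *max_queue_length = current_length.max(*max_queue_length);
    // Guard the one-batch protocol: a second pending message means
    // execute_block_parallel's contract changed and the state root would be wrong.
    debug_assert!(
        rx.try_recv().is_err(),
        "expected exactly one batch from execute_block_parallel"
    );
    updates
}
```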
```rust
Err(_) => {
    // Channel closed without a message — execution failed before
    // bal_to_account_updates ran. Return empty work so the exec
    // error surfaces in execution_result rather than being masked.
    Vec::new()
}
```
Err(_) path continues through Stage C (16 trie opens)
When the channel is closed without a message (execution failure before bal_to_account_updates), the function returns Vec::new() and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty AccountUpdatesList early in the Err arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via execution_result? regardless.
```toml
strum = { version = "0.27.1", features = ["derive"] }
rustc-hash.workspace = true
rayon.workspace = true
dashmap = "6.1"
```
dashmap is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses workspace = true. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.
```suggestion
dashmap.workspace = true
```
Summary
Rebase of #6543's perf bundle on top of `bal-devnet-7-pr` (previous iteration of this branch sat on `bal-devnet-6-pr`; re-cherry-picked the single perf commit onto devnet-7 with 1 conflict in `crates/vm/backends/levm/mod.rs` — see rebase notes).

Bundle of independent improvements to the BAL parallel-execution path (`execute_block_parallel` + `handle_merkleization_bal` + `warm_block_from_bal` + `CachingDatabase`), validated against a 149-block stress fixture (100M gas, 200-500 tx/block, ~25M-gas median blocks).

Headline (per-block medians)
Bundle doubles the speedup margin the parallel path was already providing over sequential.
Changes (independently shippable; combined here for atomic review)
- A. `handle_merkleization_bal` overlap fix (`crates/blockchain/blockchain.rs`): `for updates in rx { ... }` blocked until channel close (= exec end). `execute_block_parallel` sends exactly one batch up front from `bal_to_account_updates`, so the drain did nothing useful and only serialized Stage B (parallel storage roots) behind exec instead of overlapping with it. Replaced with a single `rx.recv()` and dropped the `FxHashMap` merge step (BAL guarantees one entry per address).
- B. Sequential fallback for small blocks (`crates/vm/backends/levm/mod.rs`): added `BAL_PARALLEL_TX_THRESHOLD = 5`. Below the threshold, execution falls through to the sequential path, which produces a BAL during exec; `blockchain.rs` hash-compares produced vs header BAL — same correctness, no parallel constants. Mirrors reth's `SMALL_BLOCK_TX_THRESHOLD`; trips on <1% of mainnet blocks (100-block sample).
- C. import-bench inter-block sleep 500ms → 100ms (`cmd/ethrex/cli.rs`): bench tooling change. The sleep gates background trie-layer writeback from bleeding into the next block's per-block timer; 100ms is well above measured Phase 2 cost on SSD. Cuts bench wall clock by 80% without affecting the per-block metric. NO effect on production paths.
- Q1. Skip the prestate read in `bal_to_account_updates` when the BAL covers all info fields (`crates/vm/backends/levm/mod.rs`): two fast paths added — storage-only updates (`info: None`, `removed: false` by construction); full info coverage with non-empty post (removal impossible, info from BAL alone). Slow path keeps existing behavior for partial coverage.
- Q2. Per-tx `GeneralizedDatabase` capacity cap at 32 (`execute_block_parallel`): previously sized to `bal.accounts().len()` (often 100s on stress blocks); p50 tx touches <10 accounts. Reduced allocator pressure across rayon workers.
- Q3. Memoize `code_from_bal` results across `seed_db_from_bal` calls: pre-compute `Code` objects (hash + jump_targets) once per BAL code change before the par_iter; pass the cache via an optional param to `seed_db_from_bal`. Saves N-1 keccak+jump-target scans per code change per block (N = tx count).
- Q8. Move per-tx BAL validation into the rayon par_iter closure (`execute_block_parallel`): eliminates a serial post-exec validation pass (~3 ms median across 200 txs). Drops `current_state` and `codes` inside the closure after validation runs — they no longer cross the rayon boundary, reducing per-tx allocator pressure. Closure returns a deferred `Option<EvmError>` so the gas-limit check still takes priority over BAL mismatch errors.
- DashMap. `CachingDatabase` `RwLock<HashMap>` → `DashMap<_, _, FxBuildHasher>` (`crates/vm/levm/src/db/mod.rs`): found via `perf record` — 11% of CPU was `RwLock::read_contended` on the single account RwLock with 16 rayon workers hammering it. Sharded concurrent map (64 default shards) eliminates contention. Sequential paths unaffected (only two threads access the cache there, so it was never contended).
- Replace the `import-bench` inter-block sleep with an explicit `Store::wait_for_persistence_idle()` (`cmd/ethrex/cli.rs`, `crates/storage/store.rs`): supersedes change C's magic-number sleep. The trie-update worker channel is `sync_channel(0)` (rendezvous), so a successful send proves the worker drained Phase 2 (disk write of the bottom-most diff layer) and Phase 3 (in-memory layer removal) for the previous block. Added a `TrieMessage::{Update, Ping}` enum and a `wait_for_persistence_idle()` that posts `Ping` from `spawn_blocking`; the bench loop calls it between blocks. Bench-tool only — production paths untouched. Wall-time impact on the 460-block fixture below: ~10× faster bench iteration (~51 s → ~5 s end-to-end), since the previous 100 ms sleep dominated the inter-block gap.

Effect on non-BAL paths (block production, pre-Amsterdam, sequential fallback): DashMap is neutral (low contention), the threshold fallback adds a protective branch, and the other changes only fire on the BAL parallel-validation path.
No regressions in non-parallel paths.
Larger-block sanity run (460-block mainnet-mix, bal-devnet-7 localnet)
To re-validate on a heavier workload than the original 149-block fixture, generated a fresh fixture from a bal-devnet-7 kurtosis localnet (`fixtures/networks/bal-devnet-7-ethrex.yaml`, mainnet-shape spamoor mix at ~670 tx/s: erc20/uniswap/eoa/storagespam/erc721/blobs + devnet-7 storagerefundtx/setcodetx). 460 blocks, 65 Mgas median (44% of the 150 M limit), ~200 txs/block in steady state, 452/460 blocks carry BAL.

Single-run end-to-end on the rebased branch (perf bundle applied):
Per-block breakdown in the dense Amsterdam region (blocks 400-459): validate ~0.17 ms (2%), exec ~7.0 ms (88% — bottleneck), merkle ~0.04 ms (1%, overlap 99%, queue 1), store ~0.75 ms (9%), warmer ~6.2 ms (finishes ~1 ms before exec).
Rebase notes vs prior iteration (on `bal-devnet-6-pr`)

- Branch rebuilt on `origin/bal-devnet-7-pr` with the single perf commit cherry-picked back on top. 1 conflict resolved in `crates/vm/backends/levm/mod.rs`: bal-devnet-7-pr added a `Vec<TxGasBreakdown>` return field to `execute_block_parallel` (and a per-tx breakdown push); the perf commit reshaped the inner `exec_results` tuple from 8 to 7 fields (Q8 drops `current_state`/`codes` inside the closure). Resolution keeps both — the 5-field return tuple including `tx_gas_breakdowns`, and the 7-field per-tx tuple iterated by the post-exec gas-accounting loop.
- Added `type BalAccountCodeCache = Vec<(H256, Option<Code>)>;` to keep `seed_db_from_bal`'s `code_cache` parameter under clippy's `type_complexity` lint (`-D warnings` in `make lint-l1`).
- Gated the `unaccessed_pure_accounts` cleanup on `!exec_results.is_empty()` to fix the 0-tx-block regression in EELS `test_bal_invalid_extraneous_coinbase[empty_block|withdrawal_only]` (the Q8 refactor had hoisted the removal outside the per-tx loop and so silently exempted extraneous coinbase entries on empty / withdrawal-only blocks).

Test plan
- `make test` (CI: `Test - ubuntu-22.04`, `Test - ubuntu-22.04-arm`, `Test - macos-15` all green)
- `make -C tooling/ef_tests/blockchain test` (CI: `EF Tests Check`, `EF Tests Check main`, `EF Tests (no_std crypto)`, `EF Tests Compare` all green)
- Hive (`Hive - Consume Engine Amsterdam` green; other Hive groups — Cancun Engine, Paris Engine, Devp2p, Engine Auth/EC, Engine withdrawal, Rpc Compat — also green)