perf(l1): reduce BAL parallel-path overhead #6639
Conversation
Benchmark Results Comparison
No significant difference was registered for any benchmark run.
Detailed results: BubbleSort, ERC20Approval, ERC20Mint, ERC20Transfer, Factorial, FactorialRecursive, Fibonacci, FibonacciRecursive, ManyHashes, MstoreBench, Push, SstoreBench_no_opt.
75c47b2 to e792377 (force-push)
Benchmark Block Execution Results Comparison Against Main
e792377 to 161f583 (force-push)
Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).
The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):
A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
`for updates in rx { ... }` blocked until channel close (= exec end).
execute_block_parallel sends exactly one batch up front from
bal_to_account_updates, so the drain did nothing useful and only serialized Stage B
(parallel storage roots) behind exec instead of overlapping with it.
Replaced with a single rx.recv() and dropped the FxHashMap merge step
(BAL guarantees one entry per address).
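For context, the shape of the change as a minimal sketch (the real function carries queue bookkeeping and error handling; types and field names here are stand-ins):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

// Toy stand-ins for the real types (illustrative only).
type Address = [u8; 20];
struct AccountUpdate { address: Address }

// Before (sketch): drain until the sender hangs up at the end of execution and
// merge batches by address — Stage B cannot start until exec has finished.
fn stage_a_before(rx: Receiver<Vec<AccountUpdate>>) -> Vec<AccountUpdate> {
    let mut merged: HashMap<Address, AccountUpdate> = HashMap::new();
    for batch in rx {
        for update in batch {
            merged.insert(update.address, update);
        }
    }
    merged.into_values().collect()
}

// After (sketch): exactly one batch arrives up front, so a single recv() frees
// Stage B (parallel storage roots) to run while execution is still in flight.
fn stage_a_after(rx: Receiver<Vec<AccountUpdate>>) -> Vec<AccountUpdate> {
    rx.recv().unwrap_or_default()
}
```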
C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
Bench tooling change. The sleep gates background trie-layer writeback
from bleeding into the next block's per-block timer; 100ms is well
above measured Phase 2 cost on SSD. Cuts bench wall clock by 80% without
affecting the per-block metric. NO effect on production paths.
Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
storage-only updates (info: None, removed: false by construction);
full info coverage with non-empty post (removal impossible, info from
BAL alone). Slow path keeps existing behavior for partial coverage.
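A minimal sketch of the two fast paths, using toy types (the real bal_to_account_updates operates on the actual BAL and account structures; the names below are assumptions):

```rust
// Toy stand-ins for the real types (illustrative only).
struct Info { nonce: u64, balance: u128, code_hash: [u8; 32] }
struct Update { info: Option<Info>, removed: bool, storage: Vec<(u64, u64)> }

fn bal_entry_to_update(
    post_info: Option<Info>,       // None => storage-only entry
    covers_all_info_fields: bool,  // BAL carries nonce + balance + code hash
    storage: Vec<(u64, u64)>,
    read_prestate: impl FnOnce() -> Info,
) -> Update {
    match post_info {
        // Fast path 1: storage-only — info stays None, removal impossible by construction.
        None => Update { info: None, removed: false, storage },
        // Fast path 2: full info coverage with a non-empty post-state — removal is
        // impossible and the info comes from the BAL alone, so skip the prestate read.
        Some(info) if covers_all_info_fields => Update { info: Some(info), removed: false, storage },
        // Slow path (unchanged, simplified here): partial coverage still reads the prestate.
        Some(partial) => {
            let pre = read_prestate();
            Update { info: Some(Info { nonce: partial.nonce, ..pre }), removed: false, storage }
        }
    }
}
```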
Q2. Per-tx GeneralizedDatabase capacity cap at 32
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
sized to bal.accounts().len() (often 100s on stress blocks); p50 tx
touches <10 accounts. Reduced allocator pressure across rayon workers.
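The cap itself is a one-liner; a sketch (constant name assumed):

```rust
// A median tx touches <10 accounts even when the block-level BAL lists hundreds,
// so clamp the per-tx pre-allocation instead of sizing to bal.accounts().len().
const PER_TX_ACCOUNT_CAPACITY: usize = 32;

fn per_tx_capacity(bal_account_count: usize) -> usize {
    bal_account_count.min(PER_TX_ACCOUNT_CAPACITY)
}
```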
Q3. Memoize code_from_bal results across seed_db_from_bal calls
(crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
jump_targets) once per BAL code change before the par_iter; pass cache
via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
scans per code change per block (N = tx count).
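A sketch of the memoization (the BalAccountCodeCache alias matches the rebase notes below; the hashing and jump-target helpers are stand-ins):

```rust
// Toy stand-ins (illustrative only).
type H256 = [u8; 32];
#[derive(Clone)]
struct Code { hash: H256, jump_targets: Vec<usize> }

fn keccak(_bytes: &[u8]) -> H256 { [0u8; 32] }                    // stand-in
fn find_jump_targets(_bytes: &[u8]) -> Vec<usize> { Vec::new() }  // stand-in

type BalAccountCodeCache = Vec<(H256, Option<Code>)>;

// Built once per block, before the rayon par_iter over transactions.
fn build_code_cache(code_changes: &[(H256, Vec<u8>)]) -> BalAccountCodeCache {
    code_changes
        .iter()
        .map(|(key, bytecode)| {
            let code = Code { hash: keccak(bytecode), jump_targets: find_jump_targets(bytecode) };
            (*key, Some(code))
        })
        .collect()
}

// seed_db_from_bal takes the cache as an optional parameter and clones the
// pre-computed Code objects instead of re-hashing on every transaction.
fn seed_db_from_bal(_code_cache: Option<&BalAccountCodeCache>) { /* per-tx seeding elided */ }
```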
Q8. Move per-tx BAL validation into the rayon par_iter closure
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
serial post-exec validation pass (~3 ms median across 200 txs). Drops
current_state and codes inside the closure after validation runs —
they no longer cross the rayon boundary, reducing per-tx allocator
pressure. Closure returns deferred Option<EvmError> so gas-limit check
still takes priority over BAL mismatch errors.
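The deferred-error shape, roughly (a sketch: execution and validation are reduced to placeholders, and the gas limit is a made-up number):

```rust
use rayon::prelude::*;

#[derive(Debug)]
enum EvmError { BalMismatch, GasLimitExceeded }

fn run_parallel(txs: &[u32]) -> Result<(), EvmError> {
    // Each worker validates its tx against the BAL inside the closure and returns
    // any mismatch as data instead of failing immediately.
    let per_tx: Vec<(u64, Option<EvmError>)> = txs
        .par_iter()
        .map(|tx| {
            let gas_used = u64::from(*tx);                                  // stand-in for execution
            let bal_error = (*tx % 2 == 1).then(|| EvmError::BalMismatch);  // stand-in for validation
            // current_state / codes would be dropped here, before crossing the rayon boundary.
            (gas_used, bal_error)
        })
        .collect();

    // Serial post-pass: the gas-limit check runs first, so it still takes priority
    // over any deferred BAL mismatch.
    let total_gas: u64 = per_tx.iter().map(|(gas, _)| gas).sum();
    if total_gas > 30_000_000 {
        return Err(EvmError::GasLimitExceeded);
    }
    if let Some(err) = per_tx.into_iter().find_map(|(_, e)| e) {
        return Err(err);
    }
    Ok(())
}
```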
DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
(crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
RwLock::read_contended on the single account RwLock with 16 rayon
workers hammering it. Sharded concurrent map (64 default shards)
eliminates contention. Sequential paths are unaffected (only two threads access the cache there, so it was never contended).
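The cache swap, roughly (a sketch — the real CachingDatabase holds three caches and a backing-store handle; struct and method names here are assumptions):

```rust
use dashmap::DashMap;
use rustc_hash::FxBuildHasher;

type Address = [u8; 20];
#[derive(Clone, Default)]
struct AccountState { nonce: u64, balance: u128 }

// Before: RwLock<HashMap<Address, AccountState>> — 16 rayon workers contending on one lock.
// After: a sharded concurrent map; reads of different shards never block each other.
struct CachingDatabase {
    accounts: DashMap<Address, AccountState, FxBuildHasher>,
}

impl CachingDatabase {
    fn new() -> Self {
        Self { accounts: DashMap::with_hasher(FxBuildHasher::default()) }
    }

    fn get_or_load(&self, addr: Address, load: impl FnOnce() -> AccountState) -> AccountState {
        // entry() locks only the shard that owns this key.
        self.accounts.entry(addr).or_insert_with(load).clone()
    }
}
```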
Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention); other changes only fire
on the BAL parallel-validation path. No regressions in non-parallel paths.
161f583 to 7e081a6 (force-push)
…tence-idle wait — The trie-update worker channel is sync_channel(0) (rendezvous), so a successful send proves the worker drained its previous iteration and returned to recv(). Add TrieMessage::Ping (no-op) plus a Store::wait_for_persistence_idle() that sends one from spawn_blocking, and call it from the import-bench loop instead of sleeping 100ms. Removes the magic-number sleep and tightens the per-block timer against the actual idle signal rather than a worst-case Phase 2/3 estimate. Bench-tooling change only; no effect on production paths.
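A minimal sketch of the rendezvous handshake (the real implementation lives in Store and wraps the blocking send in spawn_blocking; the plumbing below is simplified to a plain thread):

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

enum TrieMessage {
    Update(Vec<u8>), // real payload elided
    Ping,            // no-op: a successful send proves the worker is back at recv()
}

fn spawn_trie_worker() -> SyncSender<TrieMessage> {
    // Capacity 0 => rendezvous channel: send() blocks until the worker is receiving.
    let (tx, rx): (SyncSender<TrieMessage>, Receiver<TrieMessage>) = sync_channel(0);
    thread::spawn(move || {
        for msg in rx {
            match msg {
                TrieMessage::Update(bytes) => { let _ = bytes; /* write diff layer to disk */ }
                TrieMessage::Ping => {} // nothing to do
            }
        }
    });
    tx
}

// The import-bench loop calls this between blocks instead of sleeping 100ms:
// once the Ping is accepted, the previous block's writeback has fully drained.
fn wait_for_persistence_idle(tx: &SyncSender<TrieMessage>) {
    let _ = tx.send(TrieMessage::Ping);
}
```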
Q8 in the BAL parallel-path perf bundle (7e081a6) moved per-tx BAL validation into the rayon closure. As part of the refactor the `unaccessed_pure_accounts.remove(&header.coinbase)` call was hoisted out of the per-tx loop to run unconditionally on every parallel-path invocation. For 0-tx blocks (empty / withdrawal-only on Amsterdam+) that unconditional removal silently exempts a BAL entry the protocol calls extraneous: fee finalization never runs without a tx, so geth's readerTracker never touches the coinbase either. A BAL coinbase entry on such a block is by construction extraneous and must surface as a validation error. Restoring the original gate (only exempt when at least one tx ran) re-rejects the block. Verified against EELS test_bal_invalid_extraneous_coinbase[empty_block] and [withdrawal_only].
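The restored gate is small; a sketch with stand-in types (variable names follow the description above):

```rust
use std::collections::HashSet;

type Address = [u8; 20];

fn prune_coinbase_exemption(
    unaccessed_pure_accounts: &mut HashSet<Address>,
    coinbase: Address,
    exec_results_is_empty: bool,
) {
    // Only exempt the coinbase when at least one tx actually ran: fee finalization
    // never happens on a 0-tx block, so a BAL coinbase entry there is extraneous
    // and must still surface as a validation error.
    if !exec_results_is_empty {
        unaccessed_pure_accounts.remove(&coinbase);
    }
}
```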
CallFrameBackup::original_account_storage_slots starts each fresh account's inner FxHashMap at capacity 0. The first few SSTOREs in any new tx trigger hashbrown::reserve_rehash 3-4 times in sequence (0 → 4 → 8 → 16). perf record on a 460-block bal-devnet-7 mainnet-mix fixture (200 tx/block, ~65 Mgas) showed hashbrown::reserve_rehash as the 7th hottest leaf at 3.02B samples. After pre-sizing to 8 the same leaf drops to 2.19B, a 27% reduction in that frame and ~0.8% of total CPU recovered. Wall-clock impact is sub-noise on this workload (per-tx CPU savings happen inside rayon workers; wall-clock is bound by the longest tx per block) but the CPU savings compound on heavier workloads where critical-path txs hit the rehash chain. Wastes ~256 B per untouched account; negligible.
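The pre-sizing itself is a one-line change; a sketch with the surrounding backup map simplified (key and value types are stand-ins for the real address and slot types):

```rust
use rustc_hash::{FxBuildHasher, FxHashMap};

// First SSTORE against a not-yet-backed-up account: start the inner map at
// capacity 8 instead of 0, skipping the 0 → 4 → 8 reserve_rehash chain.
fn backup_storage_slot(
    original_account_storage_slots: &mut FxHashMap<[u8; 20], FxHashMap<[u8; 32], [u8; 32]>>,
    account: [u8; 20],
    key: [u8; 32],
    original_value: [u8; 32],
) {
    original_account_storage_slots
        .entry(account)
        .or_insert_with(|| FxHashMap::with_capacity_and_hasher(8, FxBuildHasher::default()))
        .entry(key)
        .or_insert(original_value); // only the first write per slot is backed up
}
```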
🤖 Claude Code Review
🤖 Codex Code Review
No blocking correctness or security findings from static review. The deferred BAL-validation ordering and the trie-worker idle handshake both look coherent. Non-blocking perf note:
Testing gap:
Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary
This PR is a performance bundle for the BAL parallel-execution path rebased onto bal-devnet-7-pr.
Note: change B in the PR description (…)
Confidence Score: 4/5
The parallel execution path changes are logically correct and well-tested; the two minor concerns do not affect production block processing. Core logic is sound. Main risk is the implicit one-batch invariant in handle_merkleization_bal — a future modification sending a second batch would silently produce a wrong state root. A debug_assert after the Ok arm would fully mitigate this.
Files flagged for closest attention: crates/blockchain/blockchain.rs (single-recv invariant and error-path Stage C) and crates/vm/backends/levm/mod.rs (deferred BAL validation fast paths).
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Stage A changed from channel drain to single rx.recv(); allows Stage B to overlap with parallel exec. Introduces a fragile one-batch invariant and triggers Stage C (16 trie opens) on the error path. |
| crates/vm/backends/levm/mod.rs | BAL validation moved into rayon closures, code_cache pre-computed, per-tx DB capacity capped, two fast paths added to bal_to_account_updates. Logic correct; deferred-error order preserved. |
| crates/storage/store.rs | Adds TrieMessage enum with Ping variant; wait_for_persistence_idle() correctly exploits sync_channel(0) rendezvous to signal worker idle state. |
| crates/vm/levm/src/db/mod.rs | RwLock replaced with DashMap<_, _, FxBuildHasher> for all three caches; prefetch methods simplified with par_iter + DashMap entry API. |
| cmd/ethrex/cli.rs | Magic-number sleep replaced by wait_for_persistence_idle(); bench-tool only change with no production effect. |
| crates/vm/levm/Cargo.toml | Adds dashmap 6.1 as a direct dep rather than a workspace dep; minor consistency nit. |
Reviews (1): Last reviewed commit: "perf(l1): pre-size backup_storage_slot i..."
```rust
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => {
        let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
        *max_queue_length = current_length.max(*max_queue_length);
        updates
    }
    Err(_) => {
        // Channel closed without a message — execution failed before
        // bal_to_account_updates ran. Return empty work so the exec
        // error surfaces in execution_result rather than being masked.
        Vec::new()
    }
};
```
Single-recv invariant has no defensive check
handle_merkleization_bal now consumes exactly one message and then never touches rx again. If execute_block_parallel is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel") after the Ok arm would catch any accidental protocol change during development.
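Rendered into the Ok arm of the excerpt above, the suggested guard would look roughly like this (a fragment, not the full function):

```rust
Ok(updates) => {
    let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
    *max_queue_length = current_length.max(*max_queue_length);
    // Guard the one-batch protocol: a second pending message means
    // execute_block_parallel's contract changed and the state root would be wrong.
    debug_assert!(
        rx.try_recv().is_err(),
        "expected exactly one batch from execute_block_parallel"
    );
    updates
}
```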
```rust
Err(_) => {
    // Channel closed without a message — execution failed before
    // bal_to_account_updates ran. Return empty work so the exec
    // error surfaces in execution_result rather than being masked.
    Vec::new()
}
```
Err(_) path continues through Stage C (16 trie opens)
When the channel is closed without a message (execution failure before bal_to_account_updates), the function returns Vec::new() and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty AccountUpdatesList early in the Err arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via execution_result? regardless.
```toml
strum = { version = "0.27.1", features = ["derive"] }
rustc-hash.workspace = true
rayon.workspace = true
dashmap = "6.1"
```
dashmap is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses workspace = true. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.
```suggestion
dashmap.workspace = true
```
Summary
Rebase of #6543's perf bundle on top of `bal-devnet-7-pr` (previous iteration of this branch sat on `bal-devnet-6-pr`; re-cherry-picked the single perf commit onto devnet-7 with 1 conflict in `crates/vm/backends/levm/mod.rs` — see rebase notes).

Bundle of independent improvements to the BAL parallel-execution path (`execute_block_parallel` + `handle_merkleization_bal` + `warm_block_from_bal` + `CachingDatabase`), validated against a 149-block stress fixture (100M gas, 200-500 tx/block, ~25M-gas median blocks).

Headline (per-block medians)
Bundle doubles the speedup margin the parallel path was already providing over sequential.
Changes (independently shippable; combined here for atomic review)
- A. `handle_merkleization_bal` overlap fix (`crates/blockchain/blockchain.rs`): `for updates in rx { ... }` blocked until channel close (= exec end). `execute_block_parallel` sends exactly one batch up front from `bal_to_account_updates`, so the drain did nothing useful and only serialized Stage B (parallel storage roots) behind exec instead of overlapping with it. Replaced with a single `rx.recv()` and dropped the `FxHashMap` merge step (BAL guarantees one entry per address).
- B. Sequential fallback for small blocks (`crates/vm/backends/levm/mod.rs`): added `BAL_PARALLEL_TX_THRESHOLD = 5`. Below the threshold, execution falls through to the sequential path, which produces a BAL during exec; `blockchain.rs` hash-compares produced vs header BAL — same correctness, no parallel constants. Mirrors reth's `SMALL_BLOCK_TX_THRESHOLD`; trips on <1% of mainnet blocks (100-block sample).
- C. import-bench inter-block sleep 500ms → 100ms (`cmd/ethrex/cli.rs`): bench tooling change. The sleep gates background trie-layer writeback from bleeding into the next block's per-block timer; 100ms is well above measured Phase 2 cost on SSD. Cuts bench wall clock by 80% without affecting the per-block metric. NO effect on production paths.
- Q1. Skip the prestate read in `bal_to_account_updates` when the BAL covers all info fields (`crates/vm/backends/levm/mod.rs`): two fast paths added — storage-only updates (`info: None`, `removed: false` by construction); full info coverage with non-empty post (removal impossible, info from BAL alone). Slow path keeps existing behavior for partial coverage.
- Q2. Per-tx `GeneralizedDatabase` capacity cap at 32 (`execute_block_parallel`): previously sized to `bal.accounts().len()` (often 100s on stress blocks); p50 tx touches <10 accounts. Reduced allocator pressure across rayon workers.
- Q3. Memoize `code_from_bal` results across `seed_db_from_bal` calls: pre-compute `Code` objects (hash + jump_targets) once per BAL code change before the par_iter; pass the cache via an optional param to `seed_db_from_bal`. Saves N-1 keccak+jump-target scans per code change per block (N = tx count).
- Q8. Move per-tx BAL validation into the rayon par_iter closure (`execute_block_parallel`): eliminates a serial post-exec validation pass (~3 ms median across 200 txs). Drops `current_state` and `codes` inside the closure after validation runs — they no longer cross the rayon boundary, reducing per-tx allocator pressure. Closure returns a deferred `Option<EvmError>` so the gas-limit check still takes priority over BAL mismatch errors.
- DashMap. `CachingDatabase` `RwLock<HashMap>` → `DashMap<_, _, FxBuildHasher>` (`crates/vm/levm/src/db/mod.rs`): found via `perf record` — 11% of CPU was `RwLock::read_contended` on the single account RwLock with 16 rayon workers hammering it. Sharded concurrent map (64 default shards) eliminates contention. Sequential paths unaffected (only two threads access the cache there, so it was never contended).
- Replace the `import-bench` inter-block sleep with an explicit `Store::wait_for_persistence_idle()` (`cmd/ethrex/cli.rs`, `crates/storage/store.rs`): supersedes change C's magic-number sleep. The trie-update worker channel is `sync_channel(0)` (rendezvous), so a successful send proves the worker drained Phase 2 (disk write of the bottom-most diff layer) and Phase 3 (in-memory layer removal) for the previous block. Added a `TrieMessage::{Update, Ping}` enum and a `wait_for_persistence_idle()` that posts `Ping` from `spawn_blocking`; the bench loop calls it between blocks. Bench-tool only — production paths untouched. Wall-time impact on the 460-block fixture below: ~10× faster bench iteration (~51 s → ~5 s end-to-end), since the previous 100 ms sleep dominated the inter-block gap.

Effect on non-BAL paths (block production, pre-Amsterdam, sequential fallback): DashMap is neutral (low contention), the threshold fallback adds a protective branch, and the other changes only fire on the BAL parallel-validation path.
No regressions in non-parallel paths.
Larger-block sanity run (460-block mainnet-mix, bal-devnet-7 localnet)
To re-validate on a heavier workload than the original 149-block fixture, generated a fresh fixture from a bal-devnet-7 kurtosis localnet (`fixtures/networks/bal-devnet-7-ethrex.yaml`, mainnet-shape spamoor mix at ~670 tx/s: erc20/uniswap/eoa/storagespam/erc721/blobs + devnet-7 storagerefundtx/setcodetx). 460 blocks, 65 Mgas median (44% of the 150 M limit), ~200 txs/block in steady state, 452/460 blocks carry BAL.

Single-run end-to-end on the rebased branch (perf bundle applied):
Per-block breakdown in the dense Amsterdam region (blocks 400-459): validate ~0.17 ms (2%), exec ~7.0 ms (88% — bottleneck), merkle ~0.04 ms (1%, overlap 99%, queue 1), store ~0.75 ms (9%), warmer ~6.2 ms (finishes ~1 ms before exec).
Rebase notes vs prior iteration (on `bal-devnet-6-pr`)

- Branch rebuilt on `origin/bal-devnet-7-pr` with the single perf commit cherry-picked back on top. 1 conflict resolved in `crates/vm/backends/levm/mod.rs`: bal-devnet-7-pr added a `Vec<TxGasBreakdown>` return field to `execute_block_parallel` (and a per-tx breakdown push); the perf commit reshaped the inner `exec_results` tuple from 8 to 7 fields (Q8 drops `current_state`/`codes` inside the closure). Resolution keeps both — the 5-field return tuple including `tx_gas_breakdowns`, and the 7-field per-tx tuple iterated by the post-exec gas-accounting loop.
- Added `type BalAccountCodeCache = Vec<(H256, Option<Code>)>;` to keep `seed_db_from_bal`'s `code_cache` parameter under clippy's `type_complexity` lint (`-D warnings` in `make lint-l1`).
- Gated the `unaccessed_pure_accounts` cleanup on `!exec_results.is_empty()` to fix the 0-tx-block regression in EELS `test_bal_invalid_extraneous_coinbase[empty_block|withdrawal_only]` (the Q8 refactor had hoisted the removal outside the per-tx loop and so silently exempted extraneous coinbase entries on empty / withdrawal-only blocks).

Test plan
- `make test` (CI: `Test - ubuntu-22.04`, `Test - ubuntu-22.04-arm`, `Test - macos-15` all green)
- `make -C tooling/ef_tests/blockchain test` (CI: `EF Tests Check`, `EF Tests Check main`, `EF Tests (no_std crypto)`, `EF Tests Compare` all green)
- Hive (`Hive - Consume Engine Amsterdam` green; other Hive groups — Cancun Engine, Paris Engine, Devp2p, Engine Auth/EC, Engine withdrawal, Rpc Compat — also green)