
perf(l1): reduce BAL parallel-path overhead #6639

Open
edg-l wants to merge 6 commits into bal-devnet-7-pr from perf/bal-parallel-overhead-rebased

Conversation


@edg-l edg-l commented May 13, 2026

Summary

Rebase of #6543's perf bundle on top of bal-devnet-7-pr (previous iteration of this branch sat on bal-devnet-6-pr; re-cherry-picked the single perf commit onto devnet-7 with 1 conflict in crates/vm/backends/levm/mod.rs — see rebase notes).

Bundle of independent improvements to the BAL parallel-execution path (execute_block_parallel + handle_merkleization_bal + warm_block_from_bal + CachingDatabase), validated against a 149-block stress fixture (100M gas, 200-500 tx/block, ~25M-gas median blocks).

Headline (per-block medians)

| Metric | Sequential | Parallel (no bundle) | + bundle | vs seq | vs par-base |
|---|---|---|---|---|---|
| Ggas/s | 1.78 | 2.88 | 3.64 | +104.3% | +26.4% |
| total (ms) | 23.86 | 14.43 | 11.44 | -52.1% | -20.7% |
| exec (ms) | 21.97 | 12.94 | 6.67 | -69.6% | -48.5% |
| warmer (ms) | 7.41 | 5.39 | 3.93 | -47.0% | -27.1% |
| store (ms) | 1.60 | 1.19 | 1.25 | -21.9% | +5.0% |

Bundle doubles the speedup margin the parallel path was already providing over sequential.

Changes (independently shippable; combined here for atomic review)

  • A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs): for updates in rx { ... } blocked until channel close (= exec end). execute_block_parallel sends exactly one batch up front from bal_to_account_updates, so the drain added nothing useful and merely serialized Stage B (parallel storage roots) after exec instead of overlapping with it. Replaced with a single rx.recv() and dropped the FxHashMap merge step (BAL guarantees one entry per address).
  • B. Adaptive threshold for BAL parallel exec (crates/vm/backends/levm/mod.rs): added BAL_PARALLEL_TX_THRESHOLD = 5. Below the threshold, execution falls through to the sequential path, which produces a BAL during exec; blockchain.rs hash-compares the produced BAL against the header BAL — same correctness, none of the parallel path's constant overheads. Mirrors reth's SMALL_BLOCK_TX_THRESHOLD; trips on <1% of mainnet blocks (100-block sample).
  • C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs): bench-tooling change. The sleep keeps background trie-layer writeback from bleeding into the next block's per-block timer; 100ms is well above the measured Phase 2 cost on SSD. Cuts bench wall clock by 80% without affecting the per-block metric. NO effect on production paths.
  • Q1. Skip prestate read in bal_to_account_updates when BAL covers all info fields (crates/vm/backends/levm/mod.rs): two fast paths added — storage-only updates (info: None, removed: false by construction); full info coverage with non-empty post (removal impossible, info from BAL alone). Slow path keeps existing behavior for partial coverage.
  • Q2. Per-tx GeneralizedDatabase capacity cap at 32 (execute_block_parallel): previously sized to bal.accounts().len() (often 100s on stress blocks); p50 tx touches <10 accounts. Reduced allocator pressure across rayon workers.
  • Q3. Memoize code_from_bal results across seed_db_from_bal calls: pre-compute Code objects (hash + jump_targets) once per BAL code change before the par_iter; pass cache via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target scans per code change per block (N = tx count).
  • Q8. Move per-tx BAL validation into the rayon par_iter closure (execute_block_parallel): eliminates a serial post-exec validation pass (~3 ms median across 200 txs). Drops current_state and codes inside the closure after validation runs — they no longer cross the rayon boundary, reducing per-tx allocator pressure. Closure returns deferred Option<EvmError> so gas-limit check still takes priority over BAL mismatch errors.
  • DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher> (crates/vm/levm/src/db/mod.rs): found via perf record — 11% of CPU was RwLock::read_contended on the single account RwLock with 16 rayon workers hammering it. Sharded concurrent map (64 default shards) eliminates contention. Sequential paths unaffected (only 2 threads access the cache, weren't contended).
  • D. Replace import-bench inter-block sleep with explicit Store::wait_for_persistence_idle() (cmd/ethrex/cli.rs, crates/storage/store.rs): supersedes change C's magic-number sleep. The trie-update worker channel is sync_channel(0) (rendezvous), so a successful send proves the worker drained Phase 2 (disk write of bottom-most diff layer) and Phase 3 (in-memory layer removal) for the previous block. Added TrieMessage::{Update, Ping} enum and a wait_for_persistence_idle() that posts Ping from spawn_blocking; bench loop calls it between blocks. Bench-tool only — production paths untouched. Wall-time impact on the 460-block fixture below: ~10× faster bench iteration (~51 s → ~5 s end-to-end), since the previous 100 ms sleep dominated inter-block gap.
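To make the DashMap change's rationale concrete, here is a stdlib-only sketch of why a sharded map removes read contention: N independent `RwLock<HashMap>` shards selected by key hash mean that readers of different keys rarely touch the same lock. This is an illustration of the sharding idea the dashmap crate implements, not ethrex's actual CachingDatabase; all names here are hypothetical.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::sync::RwLock;

// Illustrative sharded concurrent map: each shard has its own lock, so 16
// workers hitting distinct keys mostly take 16 distinct locks instead of one.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
    hasher: RandomState,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(n_shards: usize) -> Self {
        Self {
            shards: (0..n_shards).map(|_| RwLock::new(HashMap::new())).collect(),
            hasher: RandomState::new(),
        }
    }

    // Pick the shard by key hash — the same routing dashmap does internally.
    fn shard_for(&self, key: &K) -> usize {
        let mut h = self.hasher.build_hasher();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: K, value: V) {
        let idx = self.shard_for(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        let idx = self.shard_for(key);
        self.shards[idx].read().unwrap().get(key).cloned()
    }
}

fn main() {
    let map: ShardedMap<u64, String> = ShardedMap::new(64);
    map.insert(1, "account-a".to_string());
    map.insert(2, "account-b".to_string());
    assert_eq!(map.get(&1).as_deref(), Some("account-a"));
    assert_eq!(map.get(&3), None);
    println!("ok");
}
```

With one global `RwLock<HashMap>`, every cache read serializes on the same lock word; with 64 shards the probability that two concurrent reads collide on a shard is small, which is why the `read_contended` CPU cost disappears.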

Effect on non-BAL paths (block production, pre-Amsterdam, sequential fallback): DashMap is neutral (low contention), threshold-fallback adds a protective branch, other changes only fire on the BAL parallel-validation path. No regressions in non-parallel paths.

Larger-block sanity run (460-block mainnet-mix, bal-devnet-7 localnet)

To re-validate on a heavier workload than the original 149-block fixture, a fresh fixture was generated from a bal-devnet-7 kurtosis localnet (fixtures/networks/bal-devnet-7-ethrex.yaml, mainnet-shape spamoor mix at ~670 tx/s: erc20/uniswap/eoa/storagespam/erc721/blobs + devnet-7 storagerefundtx/setcodetx). 460 blocks, 65 Mgas median (44% of the 150M limit), ~200 txs/block in steady state; 452/460 blocks carry a BAL.

Single-run end-to-end on the rebased branch (perf bundle applied):

| Metric | Value |
|---|---|
| Wall time | 5.06 s |
| Per-block median | ~8 ms |
| Per-block Ggas/s median | 7.67 |
| Per-block Ggas/s p90 | 8.32 |
| Per-block Ggas/s max | 13.49 |
| Median tx count | ~200 |
| Warmer/exec overlap | 99% |

Per-block breakdown in the dense Amsterdam region (blocks 400-459): validate ~0.17 ms (2%), exec ~7.0 ms (88% — bottleneck), merkle ~0.04 ms (1%, overlap 99%, queue 1), store ~0.75 ms (9%), warmer ~6.2 ms (finishes ~1 ms before exec).

Rebase notes vs prior iteration (on bal-devnet-6-pr)

  • Branch hard-reset to origin/bal-devnet-7-pr and the single perf commit cherry-picked back on top. 1 conflict resolved in crates/vm/backends/levm/mod.rs: bal-devnet-7-pr added a Vec<TxGasBreakdown> return field to execute_block_parallel (and a per-tx breakdown push); the perf commit reshaped the inner exec_results tuple from 8 to 7 fields (Q8 drops current_state/codes inside the closure). Resolution keeps both — the 5-field return tuple including tx_gas_breakdowns, and the 7-field per-tx tuple iterated by the post-exec gas-accounting loop.
  • Type-alias touch-up: introduced type BalAccountCodeCache = Vec<(H256, Option<Code>)>; to keep seed_db_from_bal's code_cache parameter under clippy's type_complexity lint (-D warnings in make lint-l1).
  • Follow-up: gated coinbase exemption in the post-exec unaccessed_pure_accounts cleanup on !exec_results.is_empty() to fix the 0-tx-block regression in EELS test_bal_invalid_extraneous_coinbase[empty_block|withdrawal_only] (Q8 refactor had hoisted the removal outside the per-tx loop and so silently exempted extraneous coinbase entries on empty / withdrawal-only blocks).
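The Q3 memoization mentioned above (and the BalAccountCodeCache alias) boils down to: derive each code object once per block, before the per-tx fan-out, instead of once per tx. The sketch below shows that shape with stdlib only — `DefaultHasher` stands in for keccak, and the JUMPDEST scan is deliberately simplified (the real one must skip PUSH immediates); `Code`, `analyze_code`, and `build_code_cache` are hypothetical names, not ethrex's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
struct Code {
    hash: u64,                // stand-in for the keccak code hash
    jump_targets: Vec<usize>, // byte offsets of JUMPDEST (0x5b) opcodes
}

// The expensive per-code-change derivation: hash + jump-target scan.
// Simplification: a real scan must not count 0x5b bytes inside PUSH data.
fn analyze_code(bytecode: &[u8]) -> Code {
    let mut h = DefaultHasher::new();
    bytecode.hash(&mut h);
    let jump_targets = bytecode
        .iter()
        .enumerate()
        .filter(|(_, b)| **b == 0x5b)
        .map(|(i, _)| i)
        .collect();
    Code { hash: h.finish(), jump_targets }
}

// Build the cache once per block, before fanning out to per-tx workers.
// Each of the N txs then indexes into the cache instead of re-deriving,
// saving N-1 derivations per code change.
fn build_code_cache(code_changes: &[Vec<u8>]) -> Vec<Code> {
    code_changes.iter().map(|c| analyze_code(c)).collect()
}

fn main() {
    let changes = vec![vec![0x60, 0x00, 0x5b], vec![0x5b, 0x5b]];
    let cache = build_code_cache(&changes);
    assert_eq!(cache[0].jump_targets, vec![2]);
    assert_eq!(cache[1].jump_targets, vec![0, 1]);
}
```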

Test plan

  • make test (CI: Test - ubuntu-22.04, Test - ubuntu-22.04-arm, Test - macos-15 all green)
  • make -C tooling/ef_tests/blockchain test (CI: EF Tests Check, EF Tests Check main, EF Tests (no_std crypto), EF Tests Compare all green)
  • Hive BAL coverage (CI: Hive - Consume Engine Amsterdam green; other Hive groups — Cancun Engine, Paris Engine, Devp2p, Engine Auth/EC, Engine withdrawal, Rpc Compat — also green)
  • Larger 460-block mainnet-mix fixture replaces the original 149-block stress fixture re-run; bench numbers reported in the "Larger-block sanity run" section above


github-actions Bot commented May 13, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known Issues section the L1 workflow appends to each ef-tests job summary and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened
test-stateless to extract the entire for_amsterdam/ tree from the
zkevm bundle and run all of it under --features stateless; combined with
this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch /
BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

@github-actions github-actions Bot added L1 Ethereum client performance Block execution throughput and performance in general labels May 13, 2026

github-actions Bot commented May 13, 2026

Lines of code report

Total lines added: 82
Total lines removed: 34
Total lines changed: 116

Detailed view
+----------------------------------------+-------+------+
| File                                   | Lines | Diff |
+----------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs | 2529  | -5   |
+----------------------------------------+-------+------+
| ethrex/crates/storage/store.rs         | 2671  | +15  |
+----------------------------------------+-------+------+
| ethrex/crates/vm/backends/levm/mod.rs  | 2482  | +67  |
+----------------------------------------+-------+------+
| ethrex/crates/vm/levm/src/db/mod.rs    | 116   | -29  |
+----------------------------------------+-------+------+


github-actions Bot commented May 13, 2026

Benchmark Results Comparison

No significant difference was registered for any benchmark run.

Detailed Results

Benchmark Results: BubbleSort

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_BubbleSort | 2.977 ± 0.014 | 2.961 | 3.007 | 1.08 ± 0.01 |
| main_levm_BubbleSort | 2.758 ± 0.026 | 2.725 | 2.802 | 1.00 ± 0.01 |
| pr_revm_BubbleSort | 2.963 ± 0.016 | 2.951 | 2.995 | 1.08 ± 0.01 |
| pr_levm_BubbleSort | 2.747 ± 0.024 | 2.719 | 2.794 | 1.00 |

Benchmark Results: ERC20Approval

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Approval | 986.9 ± 8.8 | 976.1 | 1004.8 | 1.00 |
| main_levm_ERC20Approval | 1063.3 ± 7.3 | 1052.6 | 1080.8 | 1.08 ± 0.01 |
| pr_revm_ERC20Approval | 1001.5 ± 10.0 | 989.5 | 1014.9 | 1.01 ± 0.01 |
| pr_levm_ERC20Approval | 1054.4 ± 8.6 | 1041.8 | 1066.5 | 1.07 ± 0.01 |

Benchmark Results: ERC20Mint

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Mint | 136.0 ± 1.5 | 133.3 | 138.1 | 1.01 ± 0.02 |
| main_levm_ERC20Mint | 163.9 ± 3.2 | 161.9 | 172.8 | 1.22 ± 0.03 |
| pr_revm_ERC20Mint | 134.3 ± 2.3 | 132.4 | 140.1 | 1.00 |
| pr_levm_ERC20Mint | 162.2 ± 0.7 | 161.3 | 163.3 | 1.21 ± 0.02 |

Benchmark Results: ERC20Transfer

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Transfer | 234.0 ± 1.8 | 231.5 | 237.1 | 1.00 |
| main_levm_ERC20Transfer | 269.9 ± 8.9 | 264.4 | 294.7 | 1.15 ± 0.04 |
| pr_revm_ERC20Transfer | 235.4 ± 2.6 | 232.3 | 240.2 | 1.01 ± 0.01 |
| pr_levm_ERC20Transfer | 267.0 ± 1.8 | 264.7 | 270.0 | 1.14 ± 0.01 |

Benchmark Results: Factorial

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Factorial | 222.2 ± 1.5 | 221.0 | 225.4 | 1.00 |
| main_levm_Factorial | 263.5 ± 50.1 | 243.6 | 404.1 | 1.19 ± 0.23 |
| pr_revm_Factorial | 223.1 ± 0.7 | 221.9 | 223.9 | 1.00 ± 0.01 |
| pr_levm_Factorial | 245.6 ± 3.9 | 242.2 | 255.8 | 1.11 ± 0.02 |

Benchmark Results: FactorialRecursive

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_FactorialRecursive | 1.634 ± 0.035 | 1.577 | 1.707 | 1.01 ± 0.04 |
| main_levm_FactorialRecursive | 9.216 ± 0.062 | 9.159 | 9.371 | 5.67 ± 0.19 |
| pr_revm_FactorialRecursive | 1.624 ± 0.053 | 1.511 | 1.702 | 1.00 |
| pr_levm_FactorialRecursive | 9.148 ± 0.029 | 9.093 | 9.180 | 5.63 ± 0.19 |

Benchmark Results: Fibonacci

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Fibonacci | 198.8 ± 1.1 | 196.8 | 200.1 | 1.00 ± 0.01 |
| main_levm_Fibonacci | 225.1 ± 4.6 | 221.4 | 233.0 | 1.14 ± 0.03 |
| pr_revm_Fibonacci | 198.3 ± 1.9 | 194.0 | 200.5 | 1.00 |
| pr_levm_Fibonacci | 223.1 ± 4.2 | 219.3 | 232.2 | 1.13 ± 0.02 |

Benchmark Results: FibonacciRecursive

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_FibonacciRecursive | 852.8 ± 6.3 | 839.6 | 859.4 | 1.37 ± 0.02 |
| main_levm_FibonacciRecursive | 625.6 ± 2.7 | 621.7 | 630.7 | 1.01 ± 0.01 |
| pr_revm_FibonacciRecursive | 843.8 ± 8.1 | 828.7 | 854.6 | 1.36 ± 0.02 |
| pr_levm_FibonacciRecursive | 621.4 ± 7.2 | 611.6 | 634.6 | 1.00 |

Benchmark Results: ManyHashes

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ManyHashes | 8.3 ± 0.1 | 8.2 | 8.3 | 1.00 |
| main_levm_ManyHashes | 9.8 ± 0.1 | 9.7 | 9.9 | 1.19 ± 0.01 |
| pr_revm_ManyHashes | 8.3 ± 0.1 | 8.2 | 8.4 | 1.00 ± 0.01 |
| pr_levm_ManyHashes | 9.8 ± 0.1 | 9.6 | 10.1 | 1.18 ± 0.02 |

Benchmark Results: MstoreBench

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_MstoreBench | 258.5 ± 6.8 | 253.3 | 270.4 | 1.15 ± 0.03 |
| main_levm_MstoreBench | 224.6 ± 1.1 | 223.2 | 226.6 | 1.00 |
| pr_revm_MstoreBench | 258.7 ± 6.2 | 253.5 | 274.3 | 1.15 ± 0.03 |
| pr_levm_MstoreBench | 224.7 ± 1.0 | 223.5 | 226.8 | 1.00 ± 0.01 |

Benchmark Results: Push

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Push | 294.0 ± 1.4 | 292.4 | 296.6 | 1.07 ± 0.01 |
| main_levm_Push | 275.6 ± 1.5 | 273.2 | 277.2 | 1.00 |
| pr_revm_Push | 295.3 ± 2.2 | 292.6 | 299.1 | 1.07 ± 0.01 |
| pr_levm_Push | 276.1 ± 1.7 | 274.1 | 278.8 | 1.00 ± 0.01 |

Benchmark Results: SstoreBench_no_opt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_SstoreBench_no_opt | 165.3 ± 2.1 | 162.5 | 168.2 | 1.65 ± 0.02 |
| main_levm_SstoreBench_no_opt | 101.9 ± 3.4 | 100.0 | 111.2 | 1.02 ± 0.03 |
| pr_revm_SstoreBench_no_opt | 166.0 ± 1.9 | 162.2 | 168.6 | 1.65 ± 0.02 |
| pr_levm_SstoreBench_no_opt | 100.4 ± 0.4 | 99.9 | 100.9 | 1.00 |

@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from 75c47b2 to e792377 Compare May 13, 2026 09:36

github-actions Bot commented May 13, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 60.862 ± 0.144 | 60.662 | 61.090 | 1.00 ± 0.00 |
| head | 60.679 ± 0.158 | 60.431 | 60.957 | 1.00 |

@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from e792377 to 161f583 Compare May 13, 2026 11:00
Base automatically changed from bal-devnet-6-pr to main May 13, 2026 13:07
@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from 161f583 to 7e081a6 Compare May 14, 2026 08:05
@edg-l edg-l changed the base branch from main to bal-devnet-7-pr May 14, 2026 08:06
edg-l added 5 commits May 14, 2026 10:16
…tence-idle wait

The trie-update worker channel is sync_channel(0) (rendezvous), so a
successful send proves the worker drained its previous iteration and
returned to recv(). Add TrieMessage::Ping (no-op) plus a
Store::wait_for_persistence_idle() that sends one from spawn_blocking,
and call it from the import-bench loop instead of sleeping 100ms.

Removes the magic-number sleep and tightens the per-block timer against
the actual idle signal rather than a worst-case Phase 2/3 estimate.
Bench-tooling change only; no effect on production paths.
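The rendezvous handshake described above can be demonstrated with stdlib channels alone. In this sketch (hypothetical names mirroring the commit's TrieMessage shape, not ethrex's actual types), the `Ping` send can only complete once the worker has fully processed the preceding `Update` and looped back to `recv()` — exactly the idle proof the bench loop relies on.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Hypothetical mirror of the PR's TrieMessage { Update, Ping } shape.
enum TrieMessage {
    Update(u64), // stands in for a trie diff-layer write
    Ping,        // no-op whose successful handoff signals "worker is idle"
}

// Returns what the worker persisted; the interesting part is *when* the
// Ping send completes, not the return value.
fn wait_demo() -> u64 {
    // Capacity 0 => rendezvous: send() blocks until the worker is at recv().
    let (tx, rx) = sync_channel::<TrieMessage>(0);

    let worker = thread::spawn(move || {
        let mut persisted = 0u64;
        while let Ok(msg) = rx.recv() {
            match msg {
                TrieMessage::Update(n) => {
                    thread::sleep(Duration::from_millis(10)); // pretend Phase 2/3 work
                    persisted += n;
                }
                TrieMessage::Ping => {} // being received is the whole signal
            }
        }
        persisted
    });

    tx.send(TrieMessage::Update(1)).unwrap();
    // Because the channel is rendezvous, this send only returns once the
    // worker has finished the Update above and looped back to recv() —
    // i.e. persistence is provably idle at this point.
    tx.send(TrieMessage::Ping).unwrap();

    drop(tx); // close the channel so the worker loop exits
    worker.join().unwrap()
}

fn main() {
    assert_eq!(wait_demo(), 1);
    println!("persistence idle confirmed");
}
```

Note that this proof breaks the moment the channel gains buffering — with `sync_channel(1)` the Ping would be accepted while the worker is still mid-Update, which is the false-positive risk the reviewers flag below the fold.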
Q8 in the BAL parallel-path perf bundle (7e081a6) moved per-tx BAL
validation into the rayon closure. As part of the refactor the
`unaccessed_pure_accounts.remove(&header.coinbase)` call was hoisted
out of the per-tx loop to run unconditionally on every parallel-path
invocation.

For 0-tx blocks (empty / withdrawal-only on Amsterdam+) that unconditional
removal silently exempts a BAL entry the protocol calls extraneous: fee
finalization never runs without a tx, so geth's readerTracker never touches
the coinbase either. A BAL coinbase entry on such a block is by
construction extraneous and must surface as a validation error.

Restoring the original gate (only exempt when at least one tx ran)
re-rejects the block. Verified against EELS
test_bal_invalid_extraneous_coinbase[empty_block] and [withdrawal_only].
CallFrameBackup::original_account_storage_slots starts each fresh account's
inner FxHashMap at capacity 0. The first few SSTOREs in any new tx trigger
hashbrown::reserve_rehash 3-4 times in sequence (0 → 4 → 8 → 16).

perf record on a 460-block bal-devnet-7 mainnet-mix fixture (200 tx/block,
~65 Mgas) showed hashbrown::reserve_rehash as the 7th hottest leaf at
3.02B samples. After pre-sizing to 8 the same leaf drops to 2.19B, a 27%
reduction in that frame and ~0.8% of total CPU recovered.

Wall-clock impact is sub-noise on this workload (per-tx CPU savings happen
inside rayon workers; wall-clock is bound by the longest tx per block) but
the CPU savings compound on heavier workloads where critical-path txs hit
the rehash chain.

Wastes ~256 B per untouched account; negligible.
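The pre-sizing fix above has a simple stdlib shape: reserve a handful of slots up front so the first few inserts never trigger the incremental rehash chain. The names below are illustrative, not ethrex's CallFrameBackup, and `std::collections::HashMap` (hashbrown underneath) stands in for the FxHashMap in the commit.

```rust
use std::collections::HashMap;

// Sketch: pre-size the per-account backup map so the first SSTOREs skip the
// 0 -> 4 -> 8 -> 16 rehash chain the commit message describes.
fn backup_original_slots(writes: &[(u64, u64)]) -> HashMap<u64, u64> {
    // with_capacity(8) guarantees at least 8 inserts without reallocating,
    // at the cost of one slightly larger up-front allocation per account.
    let mut slots = HashMap::with_capacity(8);
    for &(slot, original_value) in writes {
        // A backup keeps the *first* (pre-state) value it sees per slot.
        slots.entry(slot).or_insert(original_value);
    }
    slots
}

fn main() {
    let writes = [(1, 10), (2, 20), (1, 99)];
    let slots = backup_original_slots(&writes);
    assert!(slots.capacity() >= 8); // one allocation, no rehash chain
    assert_eq!(slots[&1], 10); // the later write didn't clobber the backup
    println!("slots backed up: {}", slots.len());
}
```

The trade-off is exactly the one stated in the commit: a fixed small over-allocation per account versus repeated `reserve_rehash` work on the hot path.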
@edg-l edg-l marked this pull request as ready for review May 15, 2026 14:52
@edg-l edg-l requested review from a team, ManuelBilbao and ilitteri as code owners May 15, 2026 14:52
@ethrex-project-sync ethrex-project-sync Bot moved this to In Review in ethrex_l1 May 15, 2026
@github-actions

🤖 Claude Code Review



PR Review: perf(l1): reduce BAL parallel-path overhead

Overall Assessment

This is a well-engineered performance bundle with clear, documented reasoning for each optimization. The changes show strong understanding of the BAL parallel execution model. Test coverage is solid (EF tests, Hive, mainnet-mix fixture). A few correctness subtleties warrant attention.


Correctness Concerns

blockchain.rs: Single-recv assumption is fragile (medium severity)

```rust
// Line 82
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => { ... updates }
    Err(_) => Vec::new()
};
```

The change replaces a draining loop with a single recv(), relying on the invariant that execute_block_parallel sends exactly one batch. This is documented in the comment, and verified to be true today (mod.rs:1086-1089). However:

  • The contract lives entirely in code comments. If a future refactor sends a second batch (e.g. for incremental streaming), the second batch is silently dropped — no panic, no error, just a truncated state root.
  • Consider adding a debug assertion: debug_assert!(rx.try_recv().is_err(), "merkleizer received unexpected second batch") after the recv(). This costs nothing in release builds but would catch protocol drift in CI.
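The suggested guard is cheap to demonstrate with a stdlib channel. This sketch (hypothetical batch type, not ethrex's AccountUpdate) consumes the single expected batch and then `debug_assert!`s that nothing else is pending — free in release builds, loud in debug/CI if a refactor ever starts sending twice.

```rust
use std::sync::mpsc::{channel, Receiver};

// Consume the one expected batch, then assert the channel is drained.
fn recv_single_batch(rx: Receiver<Vec<u32>>) -> Vec<u32> {
    let updates = rx.recv().unwrap_or_default();
    // Compiled out in release builds; in debug it catches a future second
    // send() instead of silently dropping it.
    debug_assert!(
        rx.try_recv().is_err(),
        "merkleizer received unexpected second batch"
    );
    updates
}

fn main() {
    let (tx, rx) = channel();
    tx.send(vec![1, 2, 3]).unwrap();
    drop(tx); // sender done: exactly one batch was ever in flight
    assert_eq!(recv_single_batch(rx), vec![1, 2, 3]);
    println!("single-batch invariant held");
}
```

One caveat of this pattern: `try_recv()` only observes messages already sent, so a second batch sent *after* the check still slips through — it narrows the window rather than closing it, which is acceptable for a debug-only tripwire.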

bal_to_account_updates: Fast path 1 removed: false assumption

```rust
// ~line 740 (mod.rs)
if !has_info_changes {
    updates.push(AccountUpdate { ..., removed: false, info: None, ... });
    continue;
}
```

The logic is correct: without info changes, balance/nonce/code_hash are identical in pre- and post-state, so the account cannot transition to empty (removed = post_empty && !pre_empty is always false). The reasoning holds. But the comment ("can't be a removal … by definition") glosses over the "pre-state might already be empty" edge case — it's actually fine because if pre was empty AND storage changes arrived, the pre-empty invariant was already false. Worth noting for future readers.

unread_storage_reads cleanup: semantic equivalence confirmed

The refactoring from per-tx current_state iteration to pre-computed destroyed_addresses/read_keys vecs is semantically identical to the serial loop. The retain(|&(a, _)| a != *addr) call for destroyed accounts is O(n) per destroyed account, same as before — acceptable given destroyed accounts are rare.

Validation error ordering change

In the old serial loop, BAL validation for tx N short-circuited before tx N+1 ran. Now all tx validations execute in parallel, and only the first error (by tx index, since exec_results is sorted) is surfaced after the gas check. This is a behavior-preserving change for the common case, but if validation error messages ever become important for debugging, the silent discard of later errors could be surprising. Consider logging suppressed errors at warn! level.


Correctness Fix (Positive)

Coinbase exemption for 0-tx blocks (mod.rs:1395):

```rust
if !unaccessed_pure_accounts.is_empty() && !exec_results.is_empty() {
    unaccessed_pure_accounts.remove(&header.coinbase);
```

The added && !exec_results.is_empty() guard is a genuine correctness improvement. Empty blocks / withdrawal-only blocks never trigger fee finalization, so a BAL entry for the coinbase in such blocks is legitimately extraneous and should fail validation. This matches the EELS test (test_bal_invalid_extraneous_coinbase) and is correct.


Design / Structural Notes

wait_for_persistence_idle: Rendezvous correctness verified

The Ping-based idle detection is elegant. Confirmed that trie_upd_tx/trie_upd_rx are created with sync_channel(0) (store.rs:1557), so the rendezvous semantics hold: a successful send proves the worker is back at recv(). The only gap is that this invariant isn't enforced at the type level — if someone changes sync_channel(0) to sync_channel(1), Pings could buffer and the method would return a false positive. A comment at the channel creation site pointing to wait_for_persistence_idle would help protect against this:

```rust
// INVARIANT: capacity must stay 0 (rendezvous) — wait_for_persistence_idle
// depends on send-blocking to detect worker idle. See Store::wait_for_persistence_idle.
let (trie_upd_tx, trie_upd_rx) = std::sync::mpsc::sync_channel(0);
```

code_cache index binding (mod.rs:1138–1147 and ~947–967)

code_cache[acct_idx][code_idx] assumes code_cache is built from bal.accounts() in the same order as accounts_by_min_index resolves its .1 field. Both are derived from the same bal.accounts() slice, so this holds today. It's an implicit invariant — a brief comment at the cache[acct_idx] access site noting "acct_idx is an index into bal.accounts()" would make the coupling explicit.

BalAccountCodeCache type name

```rust
type BalAccountCodeCache = Vec<(H256, Option<Code>)>;
```

The name is slightly misleading: it's not a cache for the whole BAL, it's a per-account cache of code-change entries. PerAccountCodeChanges or BalAccountCodeChanges would be more descriptive, though this is minor.


Performance Observations

DashMap migration is well-motivated. The switch from RwLock<FxHashMap> removes the write-lock bottleneck under rayon parallelism. Using FxBuildHasher preserves the fast hash for short fixed-size keys (Address, H256). The TOCTOU double-fetch pattern on cache miss is retained from the old design and is acceptable (idempotent reads from backing store).

PER_TX_DB_CAPACITY = 32 is a reasonable constant. The comment notes "most txs touch <10 accounts" — hashbrown's with_capacity(32) guarantees at least 32 entries before a resize (it rounds the bucket count up to keep the load factor under 7/8), so typical txs never reallocate.

or_insert_with(|| FxHashMap::with_capacity_and_hasher(8, ...)) in gen_db.rs: The pre-sizing comment is well-supported by the flamegraph reference. A capacity of 8 means one initial alloc handles the common case of 5–8 SSTOREs per account per call frame cleanly.


Minor Nits

  • prefetch_accounts and prefetch_storage now return the try_for_each result directly (no trailing Ok(())). This is idiomatic Rust and correct, but is a style break from the rest of the file which uses explicit Ok(()). Consistent either way.
  • The expect("checked by has_full_info_coverage") calls in fast path 2 are safe and the message is useful. Prefer keeping them over unwrap() for diagnosability.
  • drop(current_state); drop(codes); on lines 1322-1323 are explicit drops that the comment justifies (to avoid crossing the rayon boundary). The drops are correct but redundant since both are consumed by the validation closure before the Ok((...)) return. Rust would drop them at end-of-scope anyway, but the explicit drops reinforce the intent.

Summary

  • Merkleization overlap fix: correct, but the single-recv contract should be guarded with a debug assertion.
  • Fast paths in bal_to_account_updates: logically sound; the storage-only fast path corner case around pre-empty accounts is correctly handled.
  • TrieMessage::Ping: elegant, correct, but the sync_channel(0) invariant needs protection at the creation site.
  • DashMap migration: clean, removes the PoisonError code path, measurably correct.
  • Coinbase exemption fix: genuine correctness improvement.
  • Code cache: correct, implicit index invariant worth documenting.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

No blocking correctness or security findings from static review. The deferred BAL-validation ordering and the trie-worker idle handshake both look coherent.

Non-blocking perf note:

  1. crates/vm/backends/levm/mod.rs:747 still calls prefetch_accounts for every write account before the new fast-path split at crates/vm/backends/levm/mod.rs:791. That means storage-only and full-info/non-empty accounts still pay the underlying get_account_state lookup, so the new fast paths mostly avoid a later cache hit rather than the actual DB read. If this PR’s goal is BAL-path latency reduction, filtering prefetches down to slow-path candidates would capture more of the intended win.

Testing gap:

  • I couldn’t run the targeted EIP-7928/EIP-8037 test slice here. cargo test is blocked in this environment by read-only ~/.cargo / ~/.rustup state plus an unfetched git dependency (libssz), so this review is source-only.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt


greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR is a performance bundle for the BAL parallel-execution path rebased onto bal-devnet-7-pr, delivering ~+26% throughput over the unpatched parallel baseline through eight independent improvements.

  • Overlap fix (blockchain.rs): Stage A now recv() once instead of draining the channel, allowing Stage B (parallel storage roots) to overlap with still-in-flight rayon execution; FxHashMap merge step dropped since BAL guarantees one entry per address.
  • DB contention eliminated (db/mod.rs): RwLock<FxHashMap> caches replaced with DashMap<_, _, FxBuildHasher>, removing ~11% CPU cost from read_contended under 16 rayon workers; prefetch_accounts/prefetch_storage simplified with par_iter + entry.or_insert.
  • Per-tx allocator pressure and BAL validation parallelised (levm/mod.rs): GeneralizedDatabase capacity capped at 32, code objects pre-computed once per BAL code change, per-tx validation moved inside the rayon closure with deferred errors surfaced after the gas-limit check.

Note: change B in the PR description (BAL_PARALLEL_TX_THRESHOLD = 5 adaptive fallback) is described but not present in the diff — worth confirming whether it was intentionally dropped during the rebase.

Confidence Score: 4/5

The parallel execution path changes are logically correct and well-tested; the two minor concerns do not affect production block processing.

Core logic is sound. Main risk is the implicit one-batch invariant in handle_merkleization_bal — a future modification sending a second batch would silently produce a wrong state root. A debug_assert after the Ok arm would fully mitigate this.

crates/blockchain/blockchain.rs (single-recv invariant and error-path Stage C) and crates/vm/backends/levm/mod.rs (deferred BAL validation fast paths)

Important Files Changed

| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Stage A changed from channel drain to single rx.recv(); allows Stage B to overlap with parallel exec. Introduces a fragile one-batch invariant and triggers Stage C (16 trie opens) on the error path. |
| crates/vm/backends/levm/mod.rs | BAL validation moved into rayon closures, code_cache pre-computed, per-tx DB capacity capped, two fast paths added to bal_to_account_updates. Logic correct; deferred-error order preserved. |
| crates/storage/store.rs | Adds TrieMessage enum with Ping variant; wait_for_persistence_idle() correctly exploits sync_channel(0) rendezvous to signal worker idle state. |
| crates/vm/levm/src/db/mod.rs | RwLock replaced with DashMap<_, _, FxBuildHasher> for all three caches; prefetch methods simplified with par_iter + DashMap entry API. |
| cmd/ethrex/cli.rs | Magic-number sleep replaced by wait_for_persistence_idle(); bench-tool only change with no production effect. |
| crates/vm/levm/Cargo.toml | Adds dashmap 6.1 as a direct dep rather than a workspace dep; minor consistency nit. |
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
crates/blockchain/blockchain.rs:870-882
**Single-recv invariant has no defensive check**

`handle_merkleization_bal` now consumes exactly one message and then never touches `rx` again. If `execute_block_parallel` is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive `debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel")` after the `Ok` arm would catch any accidental protocol change during development.
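A minimal model of the suggested guard, with `AccountUpdate` simplified to `u64` and a hypothetical function name; only the channel protocol (exactly one batch) follows the review's description.

```rust
use std::sync::mpsc;

fn recv_single_batch(rx: &mpsc::Receiver<Vec<u64>>) -> Vec<u64> {
    let updates = match rx.recv() {
        Ok(updates) => updates,
        Err(_) => Vec::new(), // channel closed: execution failed upstream
    };
    // Cheap in debug builds, compiled out in release: trips immediately
    // if a future change starts sending a second batch.
    debug_assert!(
        rx.try_recv().is_err(),
        "expected exactly one batch from execute_block_parallel"
    );
    updates
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(vec![1, 2, 3]).unwrap();
    drop(tx);
    assert_eq!(recv_single_batch(&rx), vec![1, 2, 3]);
    println!("single-batch invariant checked");
}
```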

### Issue 2 of 3
crates/blockchain/blockchain.rs:876-881
**`Err(_)` path continues through Stage C (16 trie opens)**

When the channel is closed without a message (execution failure before `bal_to_account_updates`), the function returns `Vec::new()` and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty `AccountUpdatesList` early in the `Err` arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via `execution_result?` regardless.
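The proposed early return can be sketched like this. `AccountUpdatesList` and the staging are simplified stand-ins; only the control flow (Err arm returns immediately instead of falling through to Stage C) reflects the suggestion.

```rust
use std::sync::mpsc;

#[derive(Debug, Default, PartialEq)]
struct AccountUpdatesList {
    updates: Vec<u64>,
}

fn handle_merkleization(rx: &mpsc::Receiver<Vec<u64>>) -> AccountUpdatesList {
    let updates = match rx.recv() {
        Ok(updates) => updates,
        // Execution failed before bal_to_account_updates: skip Stage C
        // entirely; the caller still sees the error via execution_result?.
        Err(_) => return AccountUpdatesList::default(),
    };
    // ...Stage B/C (storage roots, 16-shard state trie) would run here...
    AccountUpdatesList { updates }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(vec![7]).unwrap();
    drop(tx);
    assert_eq!(handle_merkleization(&rx).updates, vec![7]);

    let (tx2, rx2) = mpsc::channel::<Vec<u64>>();
    drop(tx2); // closed without a message
    assert_eq!(handle_merkleization(&rx2), AccountUpdatesList::default());
    println!("error path returned early with empty updates");
}
```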

### Issue 3 of 3
crates/vm/levm/Cargo.toml:23
`dashmap` is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses `workspace = true`. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.

```suggestion
dashmap.workspace = true
```

Reviews (1): Last reviewed commit: "perf(l1): pre-size backup_storage_slot i..."

Comment on lines +870 to +882
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => {
        let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
        *max_queue_length = current_length.max(*max_queue_length);
        updates
    }
    Err(_) => {
        // Channel closed without a message — execution failed before
        // bal_to_account_updates ran. Return empty work so the exec
        // error surfaces in execution_result rather than being masked.
        Vec::new()
    }
};

P2 Single-recv invariant has no defensive check

handle_merkleization_bal now consumes exactly one message and then never touches rx again. If execute_block_parallel is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel") after the Ok arm would catch any accidental protocol change during development.


Comment on lines +876 to +881
Err(_) => {
// Channel closed without a message — execution failed before
// bal_to_account_updates ran. Return empty work so the exec
// error surfaces in execution_result rather than being masked.
Vec::new()
}

P2 Err(_) path continues through Stage C (16 trie opens)

When the channel is closed without a message (execution failure before bal_to_account_updates), the function returns Vec::new() and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty AccountUpdatesList early in the Err arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via execution_result? regardless.


Comment thread crates/vm/levm/Cargo.toml
strum = { version = "0.27.1", features = ["derive"] }
rustc-hash.workspace = true
rayon.workspace = true
dashmap = "6.1"

P2 dashmap is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses workspace = true. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.

Suggested change:

```suggestion
dashmap.workspace = true
```


Labels

L1 Ethereum client performance Block execution throughput and performance in general

Projects

Status: In Review
Status: Todo
