Skip to content

fix(net): exit solidity node block sync on shutdown#6804

Open
317787106 wants to merge 5 commits into
tronprotocol:release_v4.8.2from
317787106:hotfix/fix_reference
Open

fix(net): exit solidity node block sync on shutdown#6804
317787106 wants to merge 5 commits into
tronprotocol:release_v4.8.2from
317787106:hotfix/fix_reference

Conversation

@317787106
Copy link
Copy Markdown
Collaborator

@317787106 317787106 commented May 29, 2026

What does this PR do?

  1. TronNetDelegate — wrap LockSupport.park() in a while (!hitDown && !Thread.currentThread().isInterrupted()) loop so that spurious unparks (e.g. from Thread.interrupt() in close()) cannot trigger the System.exit(0) path prematurely, and so the hit-thread exits cleanly when the Spring context is destroyed.

  2. SolidityNode — fix a shutdown race in getBlockByNum(): when a gRPC exception is thrown while the node is shutting down (!flag || isHitDown()), break out immediately instead of sleeping exceptionSleepTime (1 s) and retrying.

  3. ShieldedReceiveTest.pushSameSkAndScanAndSpend — fix a test isolation bug and a performance regression:

    • The test relied on the genesis bypass in validBlock() (latestBlockHeaderNumber == 0 → return true) to skip DPoS schedule validation. When prior tests cause the consensus background task to produce blocks, this bypass no longer applies and the block push fails with ValidateScheduleException.
    • Replace hardcoded System.currentTimeMillis() timestamps with nextScheduledTime(), which looks up the nearest slot where the witness is actually scheduled according to DposSlot.
    • Stop ConsensusService before manual block pushes: DposTask shares the same localwitness key and would race to produce blocks at the same slot, triggering switchFork and making the test slow.
  4. reference.conf — remove the stale "Keys that cannot auto-bind" comment block; those implementation details belong in the source classes, not in the config template.

  5. README.md — surface reference.conf alongside config.conf in the configuration-file list so users know a built-in defaults template exists (added in v4.8.2).

Why are these changes required?

  • The single LockSupport.park() call could be woken by Thread.interrupt() from close(), causing the process to reach System.exit(0) before hitDown was ever set. With the while-loop fix, close() calling interrupt() caused a busy-spin because park() returns immediately when interrupted but the loop never exited — adding the isInterrupted() guard lets the thread exit cleanly.
  • During shutdown, a transient gRPC exception in getBlockByNum() triggered a 1-second sleep and an unnecessary retry loop, delaying clean exit and producing noisy error logs.
  • pushSameSkAndScanAndSpend passed in isolation because the chain started at genesis (block 0), bypassing DPoS validation. In a full test run, the consensus background task produces blocks before this test runs, advancing the head past 0 and exposing the missing schedule-aware timestamp logic.
  • The reference.conf comment described internal normalisation details that have since been refactored into the Java classes.
  • reference.conf was introduced in v4.8.2 but not mentioned in the README.

This PR has been tested by:

  • Unit Tests
    • testGetBlockByNumNoErrorOnExceptionDuringShutdown: verifies the shutdown race fix — method breaks in < 500 ms; without the fix a 1-second sleep fires first.
    • pushSameSkAndScanAndSpend: now stable in both single-test and full-suite runs.
  • Manual Testing

Follow up

Extra details

@github-actions github-actions Bot requested review from 3for and kuny0707 May 29, 2026 11:02
@halibobo1205 halibobo1205 added this to the GreatVoyage-v4.8.2 milestone Jun 1, 2026
@317787106 317787106 force-pushed the hotfix/fix_reference branch from c11f829 to fc3551b Compare June 1, 2026 09:01
Comment thread framework/src/test/java/org/tron/program/SolidityNodeTest.java
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@317787106 317787106 force-pushed the hotfix/fix_reference branch from fd65a90 to f6ccd65 Compare June 1, 2026 10:05
Comment thread framework/src/main/java/org/tron/core/net/TronNetDelegate.java
boolean ok2 = dbManager.pushTransaction(transactionCap2);
Assert.assertTrue(ok2);
} finally {
consensusService.start();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD] pushSameSkAndScanAndSpend still appears to leak state into subsequent tests through the consensus background task.

ShieldedReceiveTest.java (line 2409) calls consensusService.stop(), and ShieldedReceiveTest.java (line 2537) attempts to restore it with consensusService.start() in the finally block.

However, DposTask.stop() sets isRunning = false (DposTask.java:82), while DposTask.init() does not reset it back to true (DposTask.java:46). As a result, after start() creates a new executor, the newly submitted runnable immediately exits because isRunning is still false, so block production is not actually resumed.

This means the test can still affect subsequent tests running in the same Spring context.

Consider either:

  • resetting isRunning = true in DposTask.init() before creating the executor, or
  • isolating this test in a dedicated context/fixture instead of stopping and restarting a shared singleton service.

Copy link
Copy Markdown
Collaborator Author

@317787106 317787106 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, other testcases don't use consensusService. But reset DposTask's isRunning by reflection in finally block is still ok.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD] The test appears to work around a real production issue via reflection rather than fixing the root cause.

DposTask.stop() sets isRunning = false, but DposTask.init() never resets it (DposTask.java:43,84). As a result, any ConsensusService.stop() → start() cycle leaves block production permanently disabled—not just in tests, but in production as well.

The actual fix seems to be a single line:

isRunning = true;

at the beginning of DposTask.init() (or start()).

That would also allow the test to simply call stop()/start() without resorting to reflection.

As written, the underlying defect remains, and the test reaches into a private volatile field via reflection. This is fairly brittle—a field rename could silently break the test at runtime.

Recommend fixing DposTask itself and removing the reflection from the test.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main code should not be modified solely to make the test cases pass.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is not about modifying production code for the sake of a test. The test exposed a potential production issue: any ConsensusService.stop() → start() cycle leaves block production permanently disabled.

If such a lifecycle transition cannot occur in production, then this can be considered a test-only concern and safely ignored. Otherwise, it may indicate a real bug in the service lifecycle handling.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lifecycle transition ConsensusService.stop() → start() cannot occur in production. So there is no need to change it.

Comment thread framework/src/main/java/org/tron/program/SolidityNode.java Outdated
@317787106 317787106 changed the title fix(doc): optimze README.md and reference.conf comment fix(net): exit solidity node block sync on shutdown Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants