Skip to content

feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335

Open
Jurij89 wants to merge 3 commits into
mainfrom
feat/immediate-rpc-failover
Open

feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335
Jurij89 wants to merge 3 commits into
mainfrom
feat/immediate-rpc-failover

Conversation

@Jurij89

@Jurij89 Jurij89 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

On any retryable RPC error (HTTP 429 / 5xx / network), a multi-RPC node now fails over to the next configured endpoint immediately for both reads and writes, instead of burning ~7.5s of same-endpoint retry first. This makes the multiple-RPC config actually behave like failover — the whole point of configuring backups.

The failover engine shipped dormant in #684/rc.12, but multi-RPC read failover was effectively broken: the ethers FallbackProvider (quorum:1) only advances to a backup on a 4s stall, not on a fast error, so a 429 was thrown to the caller rather than failed over. This PR makes immediate failover real:

  • Reads (multi-RPC): per-endpoint FetchRequest retries=0 and the FallbackProvider is removed. Every read routes through explicit readWithFailover / contractReadWithFailover loops over the bare this.providers[]. First retryable error → next endpoint at once; all exhausted → typed RPC_ENDPOINTS_EXHAUSTED.
  • Writes: a bounded set-retry inside sendSignedTransactionAndWait re-broadcasts the same signed tx across endpoints. The seam is signer-free, so re-sign is structurally impossible (no new nonce / no second tx); the WAL onBroadcast checkpoint fires once upstream; the per-wallet KeyedSerializer lock is held across all passes.
  • OBS-1 (V10 publish): populate/sign now fails over via populateAndSignAcrossProviders with a shared forcedReapprove latch (rc.12: intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes (per-publish re-approve race; #870 residual) #888) — exactly one forced approve per publish across every endpoint, zero new on-chain txs introduced by failover.
  • Single-RPC is unchanged (per-endpoint retries=5, bounded RPC_ENDPOINTS_EXHAUSTED→503, cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894).
  • Stall caps: 4s per attempt for point reads; RPC_LOG_SCAN_TIMEOUT_MS=30s for wide getLogs/queryFilter scans via a multiAttemptTimeoutMs opt that caps multi-RPC only — single-RPC stays uncapped so a cold-start backfill getLogs can't permanently stall the publisher poller.
  • Process-wide host-only failover logging extended to reads (never a full RPC URL — a configured URL may carry an API key).

Review depth: an upfront red-team validated the design (caught the FallbackProvider quorum-1 trap); a dedicated tx-safety pass verified 9 invariants + 3 OBS-1 guards line-by-line on the real dispatch seam and signed off; and an independent adversarial review of the diff caught one reliability bug (the 4s cap aborting wide getLogs → permanent poller stall), which is fixed and regression-tested here.

Related

Files changed

File What
packages/chain/src/evm-adapter-base.ts readWithFailover/contractReadWithFailover/withRunner read loops; FallbackProvider removed (bare primary); S2 set-retry in sendSignedTransactionAndWait; populateAndSignAcrossProviders + shared-latch OBS-1 recovery; all ~60 reads routed
packages/chain/src/evm-adapter-constants.ts RPC_LOG_SCAN_TIMEOUT_MS=30s, RPC_ENDPOINT_SET_RETRIES, RPC_ENDPOINT_SET_RETRY_BACKOFF_MS
packages/chain/src/evm-adapter-rpc.ts boundedRetryFetchRequest(url, maxRetries) (multi=0/single=5); withTimeout via createRpcTimeoutError; RPC_TIMEOUT retryable
packages/chain/src/evm-adapter-events.ts wide queryFilter/getLogs reads via contractReadWithFailover with multiAttemptTimeoutMs (30s) — avoids the 4s poller-stall
packages/chain/src/evm-adapter-{publish,conviction,identity,context-graph,storage-reads,random-sampling,ack-sign}.ts contract-view reads routed through the failover loops
packages/chain/src/rpc-failover-log.ts host-only failover logging extended to the read path
packages/chain/test/* (5 modified, 6 new) read-mock migration to this.providers[]; real-loopback read/write failover gates; readwithfailover-loop + the log-scan cap regression tests; write-tx-safety-invariants; mock-parity exemptions for the new protected helpers

Test plan

  • pnpm --filter @origintrail-official/dkg-chain exec vitest run701 pass / 1 skip (S3 breaker placeholder) / 0 fail (42 files)
  • pnpm run build:runtime:packages — green across all 15 packages (chain change propagates to agent/publisher/cli with no API break)
  • Real-loopback failover gates: a 429 on the primary fails over to a healthy backup for reads, writes, V10 publish, and the bespoke getMaxKaNumber loop (host-only logs verified: 127.0.0.1 present, API key never, never a full URL)
  • Tx-safety invariants asserted on the real dispatch → sendSignedTransactionAndWait → populateAndSignAcrossProviders seam: sign once, set-retry never re-signs, WAL fires once, exactly one forced approve across failover×TooLowAllowance, no resubmit on revert, lock held across passes
  • Log-scan cap regression: a multi-RPC wide read >4s completes with the 30s cap (and aborts→fails-over under the old 4s default); single-RPC stays uncapped (cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894)
  • CI: full integration / Tornado EVM jobs (live-daemon paths are native-blocked locally → validated in CI)

🤖 Generated with Claude Code

On any retryable RPC error (429/5xx/network), multi-RPC adapters now fail
over to the next configured endpoint IMMEDIATELY instead of burning ~7.5s
of same-endpoint retry. Reads and writes both. Single-RPC is unchanged
(per-endpoint retries=5, #894).

- Multi-RPC: per-endpoint FetchRequest retries=0 and the ethers
  FallbackProvider REMOVED. All reads route through explicit
  readWithFailover / contractReadWithFailover loops over the bare
  this.providers[]. (The FallbackProvider's quorum:1 only advances on a 4s
  STALL, not a fast error, so retries=0 alone would have broken read
  failover — the explicit loop is what makes immediate failover real.)
- Writes: a bounded set-retry inside sendSignedTransactionAndWait
  re-broadcasts the SAME signed tx across endpoints (signer-free seam, so
  re-sign is structurally impossible; WAL onBroadcast fires once upstream;
  the per-wallet KeyedSerializer lock is held across all passes).
- OBS-1: V10 populate/sign fails over via populateAndSignAcrossProviders
  with a SHARED forcedReapprove latch (#888) — exactly one forced approve
  per publish across every endpoint, zero new on-chain txs from failover.
- Per-attempt stall caps: 4s for point reads; RPC_LOG_SCAN_TIMEOUT_MS=30s
  for wide getLogs/queryFilter scans via a multiAttemptTimeoutMs opt that
  caps multi-RPC attempts ONLY — single-RPC stays uncapped so a cold-start
  backfill getLogs can't permanently stall the publisher poller (#894).
- Process-wide host-only failover logging extended to reads.

Tx-safety: 9 invariants + 3 OBS-1 guards verified on the real dispatch
seam (no re-sign, WAL once, single shared approve latch). Chain suite 701
pass / 1 skip (S3 circuit-breaker staged as a fast-follow). Full runtime
build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
Comment thread packages/chain/src/evm-adapter-base.ts Outdated
Comment thread packages/chain/src/evm-adapter-base.ts Outdated
Comment thread packages/chain/src/evm-adapter-base.ts
Comment thread packages/chain/src/evm-adapter-base.ts
Comment thread packages/chain/src/evm-adapter-base.ts Outdated
Comment thread packages/chain/src/evm-adapter-base.ts Outdated
…adProvider, withRunner

Addresses 6 otReviewAgent comments (all read-path):

- readWithFailover / contractReadWithFailover now take an optional
  `isRetryable` classifier (default isRetryableRpcError). getMaxKaNumberForAuthor's
  bespoke per-endpoint staticCall loop is replaced by a single readWithFailover
  call with an absent-view classifier — it now gets the per-attempt withTimeout
  it lacked (a hung primary fails over instead of hanging), with no copied
  loop/logging/exhaustion. The existing absent-view outer catch (scan /
  bytecode-confirm) is unchanged. [#1, #5]
- contractReadWithFailover defaults to a new `isContractViewRetryable`
  (= isRetryableRpcError MINUS BAD_DATA): a contract-view BAD_DATA is a
  deterministic ABI/empty-decode error, not an RPC outage, so it now surfaces
  directly instead of failing over across endpoints and masking the real error
  as RPC_ENDPOINTS_EXHAUSTED (restores pre-PR FallbackProvider behavior). Scoped
  to view reads only; global isRetryableRpcError untouched (writes/cli/agent). [#2]
- Removed the obsolete public getReadProvider() (zero callers; it returned the
  bare primary and silently bypassed multi-RPC read failover) + its mock-parity
  exemption + two mechanical test refs. [#3, #4]
- Tightened withRunner<T extends Contract | Wallet>: a concrete `.connect` probe
  (no cast-through-unknown / any on the production boundary) with the test-double
  passthrough as its own guarded branch. [#6]

Tests: BAD_DATA-not-retryable-for-views (+ 429 contrast), opts.isRetryable
precedence both ways, getMaxKaNumber hung-primary timeout-then-failover, and the
preserved absent-view boundary. Chain suite 706 pass / 1 skip; full build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
Comment thread packages/chain/src/evm-adapter-base.ts
Comment thread packages/chain/test/readwithfailover-loop.unit.test.ts Outdated
Comment thread packages/chain/src/evm-adapter-base.ts
Comment thread packages/chain/src/evm-adapter-events.ts Outdated
Comment thread packages/chain/src/evm-adapter-base.ts
Comment thread packages/chain/src/evm-adapter-events.ts Outdated
…t rebind helpers, events scan helper

Addresses the reviewer's second round (7 items):

- Restored getReadProvider() as @deprecated (returns the bare primary and does
  NOT fail over — callers should use the adapter read methods, which route
  through readWithFailover). Removing a public method from the published
  EVMChainAdapter export is a breaking change. [B-1]
- Trimmed the large doc-comment blocks on readWithFailover and the V10
  populate/broadcast helpers down to the stable contract (retry classifier,
  timeout semantics, purity), dropping the PR-stage/incident prose. [B-3]
- Extracted queryFilterWithFailover in events.ts that bakes in the 30s
  LOG_SCAN_OPTS; all 9 event scans use it, so the wide-log timeout is owned
  once by the module instead of call-site discipline. [B-4]
- Split withRunner into rebindContract + rebindSigner with a concrete `.connect`
  probe — no cast-through-unknown, no production branch for non-ethers test
  doubles. rebindSigner is byte-identical to the old signer path. [A/#6]
- Migrated the affected tests to connect-capable mocks (a shared connectable()
  helper) so the real rebindContract/rebindSigner run; the failover test now
  uses provider-specific connect doubles and asserts the BACKUP-connected view
  is called (verifies per-provider rebinding). [B-2] Added a single-RPC
  bounded-retry-budget guard [B-5] and a listenForEvents wide-queryFilter
  log-scan-timeout test [B-6]. mock-parity exemptions updated for the new
  protected helpers (rebindContract/rebindSigner/queryFilterWithFailover).

Chain suite 709 pass / 1 skip; full runtime build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
* `p => contract.connect(p).someView(args)`) and MUST be a PURE read — no
* sign / broadcast / WAL — since it may execute on more than one provider.
*/
protected async readWithFailover<T>(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This adds a substantial transport/failover subsystem directly into EVMChainAdapterBase, which is already a 3k+ line mix of provider wiring, contract resolution, wallet selection, approvals, publish orchestration, receipt polling, and error enrichment. The new read loop, contract rebinding, signer rebinding, populate/sign loop, and retry policy would be much easier to reason about as a focused RpcFailoverClient/ContractReadRunner module owned by the base class rather than more protected methods on the god class. That extraction would also let the tests exercise the failover policy directly instead of reaching through as any protected seams.

await this.init();
const cgs = this.requireContextGraphStorage();
return Boolean(await cgs.isContextGraphActive(contextGraphId));
return Boolean(await this.contractReadWithFailover(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This contractReadWithFailover(label, contract, c => c.method(...)) shape now recurs across the adapter modules, so every domain method has to carry transport labels, rebinding lambdas, and failover mechanics inline. That leaks the transport abstraction into business-facing code and makes future reads easy to implement inconsistently. Please hide this behind a narrower contract-view dispatcher or failover-bound contract facade, e.g. readContract(cgs, 'cgStorage.isContextGraphActive', 'isContextGraphActive', contextGraphId), so the call sites stay about the chain concept rather than the failover plumbing.

protected async readWithFailover<T>(
label: string,
fn: (provider: JsonRpcProvider) => Promise<T>,
opts?: {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The timeout policy is encoded as two optional numeric knobs whose interaction is non-obvious (attemptTimeoutMs caps single-RPC, multiAttemptTimeoutMs intentionally does not). That forces every caller to understand subtle failover policy to choose the right option and makes accidental misuse likely as more reads move onto this helper. Consider replacing these loose options with named policies/constants such as pointRead, wideLogScan, and failOpenFundingRead, with the helper owning the actual timeout matrix.

// NO-OP self-rebind (connect returns the mock), so the REAL rebindContract runs
// (a genuine rebindContract regression would still be caught). Idempotent —
// leaves a MEANINGFUL .connect (e.g. token → tokenWithSigner) untouched.
function connectable<T>(mock: T): T {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nit: connectable is now copied into multiple test files with the same mutable mock behavior. Since the production rebind boundary made this helper part of the test contract, put it in a shared test utility instead of maintaining parallel copies that can drift.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants