feat(chain): immediate RPC failover for reads and writes (multi-RPC) by Jurij89 · Pull Request #1335 · OriginTrail/dkg

Jurij89 · 2026-06-25T23:33:48Z

Summary

On any retryable RPC error (HTTP 429 / 5xx / network), a multi-RPC node now fails over to the next configured endpoint immediately for both reads and writes, instead of burning ~7.5s of same-endpoint retry first. This makes the multiple-RPC config actually behave like failover — the whole point of configuring backups.

The failover engine shipped dormant in #684/rc.12, but multi-RPC read failover was effectively broken: the ethers FallbackProvider (quorum:1) only advances to a backup on a 4s stall, not on a fast error, so a 429 was thrown to the caller rather than failed over. This PR makes immediate failover real:

Reads (multi-RPC): per-endpoint FetchRequest retries=0 and the FallbackProvider is removed. Every read routes through explicit readWithFailover / contractReadWithFailover loops over the bare this.providers[]. First retryable error → next endpoint at once; all exhausted → typed RPC_ENDPOINTS_EXHAUSTED.
Writes: a bounded set-retry inside sendSignedTransactionAndWait re-broadcasts the same signed tx across endpoints. The seam is signer-free, so re-sign is structurally impossible (no new nonce / no second tx); the WAL onBroadcast checkpoint fires once upstream; the per-wallet KeyedSerializer lock is held across all passes.
OBS-1 (V10 publish): populate/sign now fails over via populateAndSignAcrossProviders with a shared forcedReapprove latch (rc.12: intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes (per-publish re-approve race; #870 residual) #888) — exactly one forced approve per publish across every endpoint, zero new on-chain txs introduced by failover.
Single-RPC is unchanged (per-endpoint retries=5, bounded RPC_ENDPOINTS_EXHAUSTED→503, cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894).
Stall caps: 4s per attempt for point reads; RPC_LOG_SCAN_TIMEOUT_MS=30s for wide getLogs/queryFilter scans via a multiAttemptTimeoutMs opt that caps multi-RPC only — single-RPC stays uncapped so a cold-start backfill getLogs can't permanently stall the publisher poller.
Process-wide host-only failover logging extended to reads (never a full RPC URL — a configured URL may carry an API key).

Review depth: an upfront red-team validated the design (caught the FallbackProvider quorum-1 trap); a dedicated tx-safety pass verified 9 invariants + 3 OBS-1 guards line-by-line on the real dispatch seam and signed off; and an independent adversarial review of the diff caught one reliability bug (the 4s cap aborting wide getLogs → permanent poller stall), which is fixed and regression-tested here.

Builds on fix(multi-rpc): address PR #1329 follow-up review comments #1332 (typed ChainRpcTransportError boundary) and the multi-RPC defaults of feat(chain,cli): enable multi-RPC by default — failover, 503/504 parity, observability #1329
Failover engine: feat: add multi-rpc failover #684 / rc.12 (FallbackProvider + write failover + WAL/nonce idempotency)
Preserves cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894 (single-RPC bounded retry), rc.12: intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes (per-publish re-approve race; #870 residual) #888 (forced re-approve), Publisher op-wallet nonce race on back-to-back publishes → tentative kaId:0 #953 (per-wallet nonce ordering)
getMaxKaNumberForAuthor absent-view handling preserved (dkg_knowledge_asset_create aborts on testnet/mainnet: unbounded eth_getLogs (block 0 to latest) overflows the RPC 2000-block range during KA-number-floor reconcile #1080 / fix(#1080): on-chain getMaxKaNumberForAuthor view + client wiring (no more unbounded eth_getLogs) #1082)
Deferred follow-ups: the per-endpoint circuit-breaker (S3, INV-8) and a maximally-discriminating G-OBS1b regression-test script — both intentionally out of scope to keep this PR focused; the shared-latch guarantee is already structurally proven + tested.

Files changed

File	What
`packages/chain/src/evm-adapter-base.ts`	`readWithFailover`/`contractReadWithFailover`/`withRunner` read loops; FallbackProvider removed (bare primary); S2 set-retry in `sendSignedTransactionAndWait`; `populateAndSignAcrossProviders` + shared-latch OBS-1 recovery; all ~60 reads routed
`packages/chain/src/evm-adapter-constants.ts`	`RPC_LOG_SCAN_TIMEOUT_MS=30s`, `RPC_ENDPOINT_SET_RETRIES`, `RPC_ENDPOINT_SET_RETRY_BACKOFF_MS`
`packages/chain/src/evm-adapter-rpc.ts`	`boundedRetryFetchRequest(url, maxRetries)` (multi=0/single=5); `withTimeout` via `createRpcTimeoutError`; `RPC_TIMEOUT` retryable
`packages/chain/src/evm-adapter-events.ts`	wide `queryFilter`/`getLogs` reads via `contractReadWithFailover` with `multiAttemptTimeoutMs` (30s) — avoids the 4s poller-stall
`packages/chain/src/evm-adapter-{publish,conviction,identity,context-graph,storage-reads,random-sampling,ack-sign}.ts`	contract-view reads routed through the failover loops
`packages/chain/src/rpc-failover-log.ts`	host-only failover logging extended to the read path
`packages/chain/test/*` (5 modified, 6 new)	read-mock migration to `this.providers[]`; real-loopback read/write failover gates; `readwithfailover-loop` + the log-scan cap regression tests; `write-tx-safety-invariants`; mock-parity exemptions for the new protected helpers

Test plan

pnpm --filter @origintrail-official/dkg-chain exec vitest run — 701 pass / 1 skip (S3 breaker placeholder) / 0 fail (42 files)
pnpm run build:runtime:packages — green across all 15 packages (chain change propagates to agent/publisher/cli with no API break)
Real-loopback failover gates: a 429 on the primary fails over to a healthy backup for reads, writes, V10 publish, and the bespoke getMaxKaNumber loop (host-only logs verified: 127.0.0.1 present, API key never, never a full URL)
Tx-safety invariants asserted on the real dispatch → sendSignedTransactionAndWait → populateAndSignAcrossProviders seam: sign once, set-retry never re-signs, WAL fires once, exactly one forced approve across failover×TooLowAllowance, no resubmit on revert, lock held across passes
Log-scan cap regression: a multi-RPC wide read >4s completes with the 30s cap (and aborts→fails-over under the old 4s default); single-RPC stays uncapped (cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894)
CI: full integration / Tornado EVM jobs (live-daemon paths are native-blocked locally → validated in CI)

🤖 Generated with Claude Code

On any retryable RPC error (429/5xx/network), multi-RPC adapters now fail over to the next configured endpoint IMMEDIATELY instead of burning ~7.5s of same-endpoint retry. Reads and writes both. Single-RPC is unchanged (per-endpoint retries=5, #894). - Multi-RPC: per-endpoint FetchRequest retries=0 and the ethers FallbackProvider REMOVED. All reads route through explicit readWithFailover / contractReadWithFailover loops over the bare this.providers[]. (The FallbackProvider's quorum:1 only advances on a 4s STALL, not a fast error, so retries=0 alone would have broken read failover — the explicit loop is what makes immediate failover real.) - Writes: a bounded set-retry inside sendSignedTransactionAndWait re-broadcasts the SAME signed tx across endpoints (signer-free seam, so re-sign is structurally impossible; WAL onBroadcast fires once upstream; the per-wallet KeyedSerializer lock is held across all passes). - OBS-1: V10 populate/sign fails over via populateAndSignAcrossProviders with a SHARED forcedReapprove latch (#888) — exactly one forced approve per publish across every endpoint, zero new on-chain txs from failover. - Per-attempt stall caps: 4s for point reads; RPC_LOG_SCAN_TIMEOUT_MS=30s for wide getLogs/queryFilter scans via a multiAttemptTimeoutMs opt that caps multi-RPC attempts ONLY — single-RPC stays uncapped so a cold-start backfill getLogs can't permanently stall the publisher poller (#894). - Process-wide host-only failover logging extended to reads. Tx-safety: 9 invariants + 3 OBS-1 guards verified on the real dispatch seam (no re-sign, WAL once, single shared approve latch). Chain suite 701 pass / 1 skip (S3 circuit-breaker staged as a fast-follow). Full runtime build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp

…adProvider, withRunner Addresses 6 otReviewAgent comments (all read-path): - readWithFailover / contractReadWithFailover now take an optional `isRetryable` classifier (default isRetryableRpcError). getMaxKaNumberForAuthor's bespoke per-endpoint staticCall loop is replaced by a single readWithFailover call with an absent-view classifier — it now gets the per-attempt withTimeout it lacked (a hung primary fails over instead of hanging), with no copied loop/logging/exhaustion. The existing absent-view outer catch (scan / bytecode-confirm) is unchanged. [#1, #5] - contractReadWithFailover defaults to a new `isContractViewRetryable` (= isRetryableRpcError MINUS BAD_DATA): a contract-view BAD_DATA is a deterministic ABI/empty-decode error, not an RPC outage, so it now surfaces directly instead of failing over across endpoints and masking the real error as RPC_ENDPOINTS_EXHAUSTED (restores pre-PR FallbackProvider behavior). Scoped to view reads only; global isRetryableRpcError untouched (writes/cli/agent). [#2] - Removed the obsolete public getReadProvider() (zero callers; it returned the bare primary and silently bypassed multi-RPC read failover) + its mock-parity exemption + two mechanical test refs. [#3, #4] - Tightened withRunner<T extends Contract | Wallet>: a concrete `.connect` probe (no cast-through-unknown / any on the production boundary) with the test-double passthrough as its own guarded branch. [#6] Tests: BAD_DATA-not-retryable-for-views (+ 429 contrast), opts.isRetryable precedence both ways, getMaxKaNumber hung-primary timeout-then-failover, and the preserved absent-view boundary. Chain suite 706 pass / 1 skip; full build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp

@deprecated

…t rebind helpers, events scan helper Addresses the reviewer's second round (7 items): - Restored getReadProvider() as @deprecated (returns the bare primary and does NOT fail over — callers should use the adapter read methods, which route through readWithFailover). Removing a public method from the published EVMChainAdapter export is a breaking change. [B-1] - Trimmed the large doc-comment blocks on readWithFailover and the V10 populate/broadcast helpers down to the stable contract (retry classifier, timeout semantics, purity), dropping the PR-stage/incident prose. [B-3] - Extracted queryFilterWithFailover in events.ts that bakes in the 30s LOG_SCAN_OPTS; all 9 event scans use it, so the wide-log timeout is owned once by the module instead of call-site discipline. [B-4] - Split withRunner into rebindContract + rebindSigner with a concrete `.connect` probe — no cast-through-unknown, no production branch for non-ethers test doubles. rebindSigner is byte-identical to the old signer path. [A/#6] - Migrated the affected tests to connect-capable mocks (a shared connectable() helper) so the real rebindContract/rebindSigner run; the failover test now uses provider-specific connect doubles and asserts the BACKUP-connected view is called (verifies per-provider rebinding). [B-2] Added a single-RPC bounded-retry-budget guard [B-5] and a listenForEvents wide-queryFilter log-scan-timeout test [B-6]. mock-parity exemptions updated for the new protected helpers (rebindContract/rebindSigner/queryFilterWithFailover). Chain suite 709 pass / 1 skip; full runtime build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp

otReviewAgent · 2026-06-26T02:15:14Z

+   * `p => contract.connect(p).someView(args)`) and MUST be a PURE read — no
+   * sign / broadcast / WAL — since it may execute on more than one provider.
+   */
+  protected async readWithFailover<T>(


🟡 Issue: This adds a substantial transport/failover subsystem directly into EVMChainAdapterBase, which is already a 3k+ line mix of provider wiring, contract resolution, wallet selection, approvals, publish orchestration, receipt polling, and error enrichment. The new read loop, contract rebinding, signer rebinding, populate/sign loop, and retry policy would be much easier to reason about as a focused RpcFailoverClient/ContractReadRunner module owned by the base class rather than more protected methods on the god class. That extraction would also let the tests exercise the failover policy directly instead of reaching through as any protected seams.

otReviewAgent · 2026-06-26T02:15:14Z

    await this.init();
    const cgs = this.requireContextGraphStorage();
-    return Boolean(await cgs.isContextGraphActive(contextGraphId));
+    return Boolean(await this.contractReadWithFailover(


🟡 Issue: This contractReadWithFailover(label, contract, c => c.method(...)) shape now recurs across the adapter modules, so every domain method has to carry transport labels, rebinding lambdas, and failover mechanics inline. That leaks the transport abstraction into business-facing code and makes future reads easy to implement inconsistently. Please hide this behind a narrower contract-view dispatcher or failover-bound contract facade, e.g. readContract(cgs, 'cgStorage.isContextGraphActive', 'isContextGraphActive', contextGraphId), so the call sites stay about the chain concept rather than the failover plumbing.

otReviewAgent · 2026-06-26T02:15:14Z

+  protected async readWithFailover<T>(
+    label: string,
+    fn: (provider: JsonRpcProvider) => Promise<T>,
+    opts?: {


🟡 Issue: The timeout policy is encoded as two optional numeric knobs whose interaction is non-obvious (attemptTimeoutMs caps single-RPC, multiAttemptTimeoutMs intentionally does not). That forces every caller to understand subtle failover policy to choose the right option and makes accidental misuse likely as more reads move onto this helper. Consider replacing these loose options with named policies/constants such as pointRead, wideLogScan, and failOpenFundingRead, with the helper owning the actual timeout matrix.

otReviewAgent · 2026-06-26T02:15:14Z

+// NO-OP self-rebind (connect returns the mock), so the REAL rebindContract runs
+// (a genuine rebindContract regression would still be caught). Idempotent —
+// leaves a MEANINGFUL .connect (e.g. token → tokenWithSigner) untouched.
+function connectable<T>(mock: T): T {


🔵 Nit: connectable is now copied into multiple test files with the same mutable mock behavior. Since the production rebind boundary made this helper part of the test contract, put it in a shared test utility instead of maintaining parallel copies that can drift.

otReviewAgent reviewed Jun 25, 2026

View reviewed changes

otReviewAgent reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335

feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335
Jurij89 wants to merge 3 commits into
mainfrom
feat/immediate-rpc-failover

Jurij89 commented Jun 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

otReviewAgent Jun 26, 2026

Uh oh!

otReviewAgent Jun 26, 2026

Uh oh!

otReviewAgent Jun 26, 2026

Uh oh!

otReviewAgent Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Jurij89 commented Jun 25, 2026

Summary

Related

Files changed

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

otReviewAgent Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

otReviewAgent Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

otReviewAgent Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

otReviewAgent Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants