feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335
feat(chain): immediate RPC failover for reads and writes (multi-RPC)#1335Jurij89 wants to merge 3 commits into
Conversation
On any retryable RPC error (429/5xx/network), multi-RPC adapters now fail over to the next configured endpoint IMMEDIATELY instead of burning ~7.5s of same-endpoint retry. Reads and writes both. Single-RPC is unchanged (per-endpoint retries=5, #894). - Multi-RPC: per-endpoint FetchRequest retries=0 and the ethers FallbackProvider REMOVED. All reads route through explicit readWithFailover / contractReadWithFailover loops over the bare this.providers[]. (The FallbackProvider's quorum:1 only advances on a 4s STALL, not a fast error, so retries=0 alone would have broken read failover — the explicit loop is what makes immediate failover real.) - Writes: a bounded set-retry inside sendSignedTransactionAndWait re-broadcasts the SAME signed tx across endpoints (signer-free seam, so re-sign is structurally impossible; WAL onBroadcast fires once upstream; the per-wallet KeyedSerializer lock is held across all passes). - OBS-1: V10 populate/sign fails over via populateAndSignAcrossProviders with a SHARED forcedReapprove latch (#888) — exactly one forced approve per publish across every endpoint, zero new on-chain txs from failover. - Per-attempt stall caps: 4s for point reads; RPC_LOG_SCAN_TIMEOUT_MS=30s for wide getLogs/queryFilter scans via a multiAttemptTimeoutMs opt that caps multi-RPC attempts ONLY — single-RPC stays uncapped so a cold-start backfill getLogs can't permanently stall the publisher poller (#894). - Process-wide host-only failover logging extended to reads. Tx-safety: 9 invariants + 3 OBS-1 guards verified on the real dispatch seam (no re-sign, WAL once, single shared approve latch). Chain suite 701 pass / 1 skip (S3 circuit-breaker staged as a fast-follow). Full runtime build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
…adProvider, withRunner Addresses 6 otReviewAgent comments (all read-path): - readWithFailover / contractReadWithFailover now take an optional `isRetryable` classifier (default isRetryableRpcError). getMaxKaNumberForAuthor's bespoke per-endpoint staticCall loop is replaced by a single readWithFailover call with an absent-view classifier — it now gets the per-attempt withTimeout it lacked (a hung primary fails over instead of hanging), with no copied loop/logging/exhaustion. The existing absent-view outer catch (scan / bytecode-confirm) is unchanged. [#1, #5] - contractReadWithFailover defaults to a new `isContractViewRetryable` (= isRetryableRpcError MINUS BAD_DATA): a contract-view BAD_DATA is a deterministic ABI/empty-decode error, not an RPC outage, so it now surfaces directly instead of failing over across endpoints and masking the real error as RPC_ENDPOINTS_EXHAUSTED (restores pre-PR FallbackProvider behavior). Scoped to view reads only; global isRetryableRpcError untouched (writes/cli/agent). [#2] - Removed the obsolete public getReadProvider() (zero callers; it returned the bare primary and silently bypassed multi-RPC read failover) + its mock-parity exemption + two mechanical test refs. [#3, #4] - Tightened withRunner<T extends Contract | Wallet>: a concrete `.connect` probe (no cast-through-unknown / any on the production boundary) with the test-double passthrough as its own guarded branch. [#6] Tests: BAD_DATA-not-retryable-for-views (+ 429 contrast), opts.isRetryable precedence both ways, getMaxKaNumber hung-primary timeout-then-failover, and the preserved absent-view boundary. Chain suite 706 pass / 1 skip; full build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
…t rebind helpers, events scan helper Addresses the reviewer's second round (7 items): - Restored getReadProvider() as @deprecated (returns the bare primary and does NOT fail over — callers should use the adapter read methods, which route through readWithFailover). Removing a public method from the published EVMChainAdapter export is a breaking change. [B-1] - Trimmed the large doc-comment blocks on readWithFailover and the V10 populate/broadcast helpers down to the stable contract (retry classifier, timeout semantics, purity), dropping the PR-stage/incident prose. [B-3] - Extracted queryFilterWithFailover in events.ts that bakes in the 30s LOG_SCAN_OPTS; all 9 event scans use it, so the wide-log timeout is owned once by the module instead of call-site discipline. [B-4] - Split withRunner into rebindContract + rebindSigner with a concrete `.connect` probe — no cast-through-unknown, no production branch for non-ethers test doubles. rebindSigner is byte-identical to the old signer path. [A/#6] - Migrated the affected tests to connect-capable mocks (a shared connectable() helper) so the real rebindContract/rebindSigner run; the failover test now uses provider-specific connect doubles and asserts the BACKUP-connected view is called (verifies per-provider rebinding). [B-2] Added a single-RPC bounded-retry-budget guard [B-5] and a listenForEvents wide-queryFilter log-scan-timeout test [B-6]. mock-parity exemptions updated for the new protected helpers (rebindContract/rebindSigner/queryFilterWithFailover). Chain suite 709 pass / 1 skip; full runtime build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012DGBSrmN6Y8Vsgb7x46amp
| * `p => contract.connect(p).someView(args)`) and MUST be a PURE read — no | ||
| * sign / broadcast / WAL — since it may execute on more than one provider. | ||
| */ | ||
| protected async readWithFailover<T>( |
There was a problem hiding this comment.
🟡 Issue: This adds a substantial transport/failover subsystem directly into EVMChainAdapterBase, which is already a 3k+ line mix of provider wiring, contract resolution, wallet selection, approvals, publish orchestration, receipt polling, and error enrichment. The new read loop, contract rebinding, signer rebinding, populate/sign loop, and retry policy would be much easier to reason about as a focused RpcFailoverClient/ContractReadRunner module owned by the base class rather than more protected methods on the god class. That extraction would also let the tests exercise the failover policy directly instead of reaching through as any protected seams.
| await this.init(); | ||
| const cgs = this.requireContextGraphStorage(); | ||
| return Boolean(await cgs.isContextGraphActive(contextGraphId)); | ||
| return Boolean(await this.contractReadWithFailover( |
There was a problem hiding this comment.
🟡 Issue: This contractReadWithFailover(label, contract, c => c.method(...)) shape now recurs across the adapter modules, so every domain method has to carry transport labels, rebinding lambdas, and failover mechanics inline. That leaks the transport abstraction into business-facing code and makes future reads easy to implement inconsistently. Please hide this behind a narrower contract-view dispatcher or failover-bound contract facade, e.g. readContract(cgs, 'cgStorage.isContextGraphActive', 'isContextGraphActive', contextGraphId), so the call sites stay about the chain concept rather than the failover plumbing.
| protected async readWithFailover<T>( | ||
| label: string, | ||
| fn: (provider: JsonRpcProvider) => Promise<T>, | ||
| opts?: { |
There was a problem hiding this comment.
🟡 Issue: The timeout policy is encoded as two optional numeric knobs whose interaction is non-obvious (attemptTimeoutMs caps single-RPC, multiAttemptTimeoutMs intentionally does not). That forces every caller to understand subtle failover policy to choose the right option and makes accidental misuse likely as more reads move onto this helper. Consider replacing these loose options with named policies/constants such as pointRead, wideLogScan, and failOpenFundingRead, with the helper owning the actual timeout matrix.
| // NO-OP self-rebind (connect returns the mock), so the REAL rebindContract runs | ||
| // (a genuine rebindContract regression would still be caught). Idempotent — | ||
| // leaves a MEANINGFUL .connect (e.g. token → tokenWithSigner) untouched. | ||
| function connectable<T>(mock: T): T { |
There was a problem hiding this comment.
🔵 Nit: connectable is now copied into multiple test files with the same mutable mock behavior. Since the production rebind boundary made this helper part of the test contract, put it in a shared test utility instead of maintaining parallel copies that can drift.
Summary
On any retryable RPC error (HTTP 429 / 5xx / network), a multi-RPC node now fails over to the next configured endpoint immediately for both reads and writes, instead of burning ~7.5s of same-endpoint retry first. This makes the multiple-RPC config actually behave like failover — the whole point of configuring backups.
The failover engine shipped dormant in #684/rc.12, but multi-RPC read failover was effectively broken: the ethers
FallbackProvider(quorum:1) only advances to a backup on a 4s stall, not on a fast error, so a 429 was thrown to the caller rather than failed over. This PR makes immediate failover real:FetchRequestretries=0 and theFallbackProvideris removed. Every read routes through explicitreadWithFailover/contractReadWithFailoverloops over the barethis.providers[]. First retryable error → next endpoint at once; all exhausted → typedRPC_ENDPOINTS_EXHAUSTED.sendSignedTransactionAndWaitre-broadcasts the same signed tx across endpoints. The seam is signer-free, so re-sign is structurally impossible (no new nonce / no second tx); the WALonBroadcastcheckpoint fires once upstream; the per-walletKeyedSerializerlock is held across all passes.populateAndSignAcrossProviderswith a sharedforcedReapprovelatch (rc.12: intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes (per-publish re-approve race; #870 residual) #888) — exactly one forced approve per publish across every endpoint, zero new on-chain txs introduced by failover.RPC_ENDPOINTS_EXHAUSTED→503, cli: daemon-http-behavior-extra RPC-failover 503 test times out — "Daemon did not become ready within 45s" (surfaced by #892) #894).RPC_LOG_SCAN_TIMEOUT_MS=30sfor widegetLogs/queryFilterscans via amultiAttemptTimeoutMsopt that caps multi-RPC only — single-RPC stays uncapped so a cold-start backfillgetLogscan't permanently stall the publisher poller.Review depth: an upfront red-team validated the design (caught the
FallbackProviderquorum-1 trap); a dedicated tx-safety pass verified 9 invariants + 3 OBS-1 guards line-by-line on the real dispatch seam and signed off; and an independent adversarial review of the diff caught one reliability bug (the 4s cap aborting widegetLogs→ permanent poller stall), which is fixed and regression-tested here.Related
ChainRpcTransportErrorboundary) and the multi-RPC defaults of feat(chain,cli): enable multi-RPC by default — failover, 503/504 parity, observability #1329getMaxKaNumberForAuthorabsent-view handling preserved (dkg_knowledge_asset_create aborts on testnet/mainnet: unbounded eth_getLogs (block 0 to latest) overflows the RPC 2000-block range during KA-number-floor reconcile #1080 / fix(#1080): on-chain getMaxKaNumberForAuthor view + client wiring (no more unbounded eth_getLogs) #1082)Files changed
packages/chain/src/evm-adapter-base.tsreadWithFailover/contractReadWithFailover/withRunnerread loops; FallbackProvider removed (bare primary); S2 set-retry insendSignedTransactionAndWait;populateAndSignAcrossProviders+ shared-latch OBS-1 recovery; all ~60 reads routedpackages/chain/src/evm-adapter-constants.tsRPC_LOG_SCAN_TIMEOUT_MS=30s,RPC_ENDPOINT_SET_RETRIES,RPC_ENDPOINT_SET_RETRY_BACKOFF_MSpackages/chain/src/evm-adapter-rpc.tsboundedRetryFetchRequest(url, maxRetries)(multi=0/single=5);withTimeoutviacreateRpcTimeoutError;RPC_TIMEOUTretryablepackages/chain/src/evm-adapter-events.tsqueryFilter/getLogsreads viacontractReadWithFailoverwithmultiAttemptTimeoutMs(30s) — avoids the 4s poller-stallpackages/chain/src/evm-adapter-{publish,conviction,identity,context-graph,storage-reads,random-sampling,ack-sign}.tspackages/chain/src/rpc-failover-log.tspackages/chain/test/*(5 modified, 6 new)this.providers[]; real-loopback read/write failover gates;readwithfailover-loop+ the log-scan cap regression tests;write-tx-safety-invariants; mock-parity exemptions for the new protected helpersTest plan
pnpm --filter @origintrail-official/dkg-chain exec vitest run— 701 pass / 1 skip (S3 breaker placeholder) / 0 fail (42 files)pnpm run build:runtime:packages— green across all 15 packages (chain change propagates to agent/publisher/cli with no API break)getMaxKaNumberloop (host-only logs verified:127.0.0.1present, API key never, never a full URL)dispatch → sendSignedTransactionAndWait → populateAndSignAcrossProvidersseam: sign once, set-retry never re-signs, WAL fires once, exactly one forced approve across failover×TooLowAllowance, no resubmit on revert, lock held across passes🤖 Generated with Claude Code