fix(telemetry): raise span queue size and make telemetry shutdown idempotent #24121

Merged: alexghr merged 1 commit into merge-train/spartan from cb/otel-resilience on Jun 16, 2026
@AztecBot

What

Addresses three operational issues seen on staging.

1. Spans dropped: queue full (maxQueueSize=2048)

BatchSpanProcessor dropping spans: queue full (maxQueueSize=2048). … 97007 total.

  • Raised the default OTEL_BSP_MAX_QUEUE_SIZE (otelBspMaxQueueSize) from 2048 → 16384 in telemetry-client/src/config.ts, and matched the fallback default in MonitoredBatchSpanProcessor.
  • A bigger queue alone doesn't drain faster: the SDK exports only 512 spans per scheduled flush by default, so a high span rate keeps the queue full. Added a maxExportBatchSize default of 2048 (clamped to maxQueueSize) so the larger queue actually drains.
  • Both remain overridable via env (OTEL_BSP_MAX_QUEUE_SIZE).
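
The interaction between the two defaults can be sketched as follows. This is a minimal illustration, not the code in telemetry-client/src/config.ts: the helper name `resolveBspOptions` and the `BspOptions` type are hypothetical, and because the description names only OTEL_BSP_MAX_QUEUE_SIZE, the sketch reads that one variable and leaves the batch size at its default.

```typescript
// Hypothetical helper for illustration; names are not taken from the PR.
interface BspOptions {
  maxQueueSize: number;
  maxExportBatchSize: number;
}

// Defaults as stated in the PR description.
const DEFAULT_MAX_QUEUE_SIZE = 16384;
const DEFAULT_MAX_EXPORT_BATCH_SIZE = 2048;

function resolveBspOptions(env: Record<string, string | undefined>): BspOptions {
  // Fall back to the default when the variable is unset or not a positive integer.
  const parsed = Number.parseInt(env["OTEL_BSP_MAX_QUEUE_SIZE"] ?? "", 10);
  const maxQueueSize =
    Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_QUEUE_SIZE;

  // Clamp the batch size so a single export never exceeds the queue capacity.
  const maxExportBatchSize = Math.min(DEFAULT_MAX_EXPORT_BATCH_SIZE, maxQueueSize);

  return { maxQueueSize, maxExportBatchSize };
}
```

With no override the sketch yields a 16384-span queue drained 2048 spans at a time; if an operator sets a queue smaller than 2048, the batch size shrinks to match.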

2. Metrics shutdown order / double shutdown

BatchSpanProcessor shutting down with 97567 total spans dropped   (logged twice)
invalid attempt to force flush after MeterProvider shutdown
invalid attempt to force flush after LoggerProvider shutdown
shutdown may only be called once per MeterProvider
shutdown may only be called once per LoggerProvider

Root cause: the telemetry client is shared between the aztec-node and an embedded prover-node. Both stop paths call telemetry.stop() (aztec-node/.../server.ts via tryStop, and prover-node.ts), so the meter/logger/trace providers are flushed-and-shut-down twice. The OTEL providers throw on the second shutdown() and on forceFlush() after shutdown.

Fix: made OpenTelemetryClient.stop() idempotent by memoizing the shutdown promise, and made flush() a no-op once shutdown has started. Repeated stop() calls now resolve to the same single shutdown.
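
The memoization pattern described above might look roughly like this. It is a sketch under assumptions, not the actual OpenTelemetryClient: the `Provider` interface and the class name `IdempotentTelemetryClient` are hypothetical stand-ins for the meter, logger, and trace providers and their owner.

```typescript
// Hypothetical stand-in for an OTEL provider; only the two calls we need.
interface Provider {
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

class IdempotentTelemetryClient {
  private readonly providers: Provider[];
  // Set on the first stop() call and reused by every later call.
  private shutdownPromise: Promise<void> | undefined;

  constructor(providers: Provider[]) {
    this.providers = providers;
  }

  async flush(): Promise<void> {
    // Once shutdown has started, flushing would throw in the providers, so skip it.
    if (this.shutdownPromise !== undefined) {
      return;
    }
    await Promise.all(this.providers.map(p => p.forceFlush()));
  }

  stop(): Promise<void> {
    // Memoize: concurrent and repeated callers all share one shutdown.
    if (this.shutdownPromise === undefined) {
      this.shutdownPromise = this.shutdownAll();
    }
    return this.shutdownPromise;
  }

  private async shutdownAll(): Promise<void> {
    await Promise.all(this.providers.map(p => p.shutdown()));
  }
}
```

Because `stop()` stores the promise before returning it, a second caller (for example, the embedded prover-node stopping after the aztec-node) awaits the same in-flight shutdown instead of triggering a new one.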

3. prover-node → prover-broker ← prover-agent resilience

Error while retrying JsonRpcClient request to http://…prover-broker…:8080: connect ECONNREFUSED

How many retries before giving up? They never give up — retries are indefinite.

The broker connections are built with createProvingJobBrokerClient(...) using the proverBrokerBackoff generator (prover-client/src/proving_broker/rpc.ts):

  • prover-node → broker: start_node.ts (makeTracedFetch(proverBrokerBackoff, …))
  • prover-agent → broker: start_prover_agent.ts (makeTracedFetch(proverBrokerBackoff, …))

proverBrokerBackoff is an infinite generator (backoff 1, 1, 1, 2, 4, 4, 4, … seconds, capped at 4s), and retry() only gives up when the backoff generator is exhausted — which never happens here. The ECONNREFUSED line is the per-attempt error logged by retry() while the broker pod is unavailable, not a terminal failure; once the broker pod comes back the next attempt succeeds. So these connections are already resilient to pods cycling — no code change was needed here, and this PR does not alter that behavior. (start_node.ts already comments "Retry indefinitely until the epoch proving times out and the chain reorgs".)
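
To make the "never exhausted" point concrete, here is a small sketch of an infinite capped backoff generator and a retry loop that stops only when its generator ends. Both `cappedBackoff` and `retryWithBackoff` are hypothetical stand-ins written for this illustration; they are not the repository's proverBrokerBackoff or retry() implementations, and the injected `sleep` parameter is an assumption made so the loop can be exercised without real delays.

```typescript
// Yields 1, 1, 1, 2 and then 4 forever (seconds). It never returns,
// so a consumer that waits for exhaustion will never give up.
function* cappedBackoff(): Generator<number> {
  for (const seconds of [1, 1, 1, 2]) {
    yield seconds;
  }
  while (true) {
    yield 4;
  }
}

// Retries fn until it succeeds, sleeping for each delay the iterator yields.
// Gives up (rethrows the last error) only if the iterator is exhausted.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  backoff: Iterator<number>,
  sleep: (seconds: number) => Promise<void>,
): Promise<T> {
  while (true) {
    try {
      return await fn();
    } catch (err) {
      const next = backoff.next();
      if (next.done) {
        throw err;
      }
      await sleep(next.value);
    }
  }
}
```

With a finite iterator the loop would eventually rethrow; with an infinite generator like the one above, each failed attempt is followed by another, which matches the behavior described for the broker connections.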

Tests

  • monitored_batch_span_processor.test.ts: warns when the queue fills; no longer drops spans at the old 2048 limit now that the default queue is larger.
  • otel.test.ts (new): stop() shuts the providers down exactly once across multiple/concurrent calls; flush() is a no-op after stop().

Local yarn build/test was not run — this workspace copy is not bootstrapped (the noir submodule isn't checked out and the full barretenberg→noir build chain is required). CI validates build + tests.


Created by claudebox · group: slackbot

@AztecBot AztecBot added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure claudebox Owned by claudebox. it can push to this PR. labels Jun 16, 2026
@alexghr alexghr marked this pull request as ready for review June 16, 2026 12:12
@alexghr alexghr enabled auto-merge (squash) June 16, 2026 12:13
@alexghr alexghr merged commit 05942c3 into merge-train/spartan Jun 16, 2026
43 of 47 checks passed
@alexghr alexghr deleted the cb/otel-resilience branch June 16, 2026 12:13