fix(telemetry): raise span queue size and make telemetry shutdown idempotent by AztecBot · Pull Request #24121 · AztecProtocol/aztec-packages

AztecBot · 2026-06-16T09:22:36Z

What

Addresses three operational issues seen on staging.

1. Spans dropped: queue full (`maxQueueSize=2048`)

BatchSpanProcessor dropping spans: queue full (maxQueueSize=2048). … 97007 total.

Raised the default OTEL_BSP_MAX_QUEUE_SIZE (otelBspMaxQueueSize) from 2048 → 16384 in telemetry-client/src/config.ts, and matched the fallback default in MonitoredBatchSpanProcessor.
A bigger queue alone doesn't drain faster: the SDK exports only 512 spans per scheduled flush by default, so a high span rate keeps the queue full. Added a maxExportBatchSize default of 2048 (clamped to maxQueueSize) so the larger queue actually drains.
Both remain overridable via env (OTEL_BSP_MAX_QUEUE_SIZE).
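As a minimal sketch of the defaulting and clamping described above (not the actual code in telemetry-client/src/config.ts), the resolution could look like the following. The helper names `resolveBspConfig` and `parsePositiveInt` are hypothetical, and reading `OTEL_BSP_MAX_EXPORT_BATCH_SIZE` is an assumption based on the standard OpenTelemetry variable name; only `OTEL_BSP_MAX_QUEUE_SIZE` is named in this PR.

```typescript
// Hypothetical sketch: resolve BatchSpanProcessor sizing from env-like input.
interface BspConfig {
  maxQueueSize: number;
  maxExportBatchSize: number;
}

// Defaults taken from this PR's description.
const DEFAULT_MAX_QUEUE_SIZE = 16384;
const DEFAULT_MAX_EXPORT_BATCH_SIZE = 2048;

// Falls back to the default when the value is missing, non-numeric, or <= 0.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function resolveBspConfig(env: Record<string, string | undefined>): BspConfig {
  const maxQueueSize = parsePositiveInt(env["OTEL_BSP_MAX_QUEUE_SIZE"], DEFAULT_MAX_QUEUE_SIZE);
  // Assumed variable name (standard OTel spec); the repo may wire this differently.
  const requestedBatch = parsePositiveInt(
    env["OTEL_BSP_MAX_EXPORT_BATCH_SIZE"],
    DEFAULT_MAX_EXPORT_BATCH_SIZE,
  );
  // A batch can never exceed the queue, so clamp it.
  return { maxQueueSize, maxExportBatchSize: Math.min(requestedBatch, maxQueueSize) };
}
```

With no overrides this yields a 16384-span queue drained 2048 spans at a time; setting the queue below 2048 shrinks the batch size to match.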

2. Metrics shutdown order / double shutdown

BatchSpanProcessor shutting down with 97567 total spans dropped   (logged twice)
invalid attempt to force flush after MeterProvider shutdown
invalid attempt to force flush after LoggerProvider shutdown
shutdown may only be called once per MeterProvider
shutdown may only be called once per LoggerProvider

Root cause: the telemetry client is shared between the aztec-node and an embedded prover-node. Both stop paths call telemetry.stop() (aztec-node/.../server.ts via tryStop, and prover-node.ts), so the meter/logger/trace providers are flushed-and-shut-down twice. The OTEL providers throw on the second shutdown() and on forceFlush() after shutdown.

Fix: made OpenTelemetryClient.stop() idempotent by memoizing the shutdown promise, and made flush() a no-op once shutdown has started. Repeated stop() calls now resolve to the same single shutdown.
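The memoization pattern can be sketched as follows. This is an illustration under assumptions, not the real `OpenTelemetryClient`: the `Provider` interface and the `MemoizedShutdownClient` class are hypothetical stand-ins for the meter/logger/trace providers and the client.

```typescript
// Hypothetical stand-in for an OTEL provider that rejects repeated shutdowns.
interface Provider {
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

class MemoizedShutdownClient {
  private readonly providers: Provider[];
  // Set on the first stop() call; every later call returns the same promise.
  private shutdownPromise: Promise<void> | undefined;

  constructor(providers: Provider[]) {
    this.providers = providers;
  }

  // No-op once shutdown has started, so providers never see a flush after shutdown.
  async flush(): Promise<void> {
    if (this.shutdownPromise) {
      return;
    }
    await Promise.all(this.providers.map(p => p.forceFlush()));
  }

  // Deliberately not `async`: an async method would wrap the memoized promise
  // in a fresh one on every call.
  stop(): Promise<void> {
    if (!this.shutdownPromise) {
      this.shutdownPromise = (async () => {
        for (const p of this.providers) {
          await p.forceFlush();
          await p.shutdown();
        }
      })();
    }
    return this.shutdownPromise;
  }
}
```

Because the promise is stored before any `await` yields control back to callers, two owners calling `stop()` concurrently both wait on the same single shutdown.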

3. prover-node → prover-broker ← prover-agent resilience

Error while retrying JsonRpcClient request to http://…prover-broker…:8080: connect ECONNREFUSED

How many retries before giving up? They never give up — retries are indefinite.

All three broker connections are built with createProvingJobBrokerClient(...) using the proverBrokerBackoff generator (prover-client/src/proving_broker/rpc.ts):

prover-node → broker: start_node.ts (makeTracedFetch(proverBrokerBackoff, …))
prover-agent → broker: start_prover_agent.ts (makeTracedFetch(proverBrokerBackoff, …))

proverBrokerBackoff is an infinite generator (backoff 1, 1, 1, 2, 4, 4, 4, … seconds, capped at 4s), and retry() only gives up when the backoff generator is exhausted — which never happens here. The ECONNREFUSED line is the per-attempt error logged by retry() while the broker pod is unavailable, not a terminal failure; once the broker pod comes back the next attempt succeeds. So these connections are already resilient to pods cycling — no code change was needed here, and this PR does not alter that behavior. (start_node.ts already comments "Retry indefinitely until the epoch proving times out and the chain reorgs".)
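To illustrate why these retries never terminate, here is a simplified sketch of an infinite capped backoff generator paired with a retry loop that gives up only when the generator is exhausted. `cappedBackoff` and `retryWithBackoff` are hypothetical names; the real `proverBrokerBackoff` and `retry()` in the repo may differ in signature and logging.

```typescript
// Hypothetical sketch: yields 1, 1, 1, 2, 4 and then 4 forever (seconds).
function* cappedBackoff(): Generator<number> {
  for (const seconds of [1, 1, 1, 2, 4]) {
    yield seconds;
  }
  while (true) {
    yield 4;
  }
}

// Retries `fn` until it succeeds. It rethrows only if the backoff iterator
// reports `done`, which an infinite generator never does.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  backoff: Iterator<number>,
  sleep: (seconds: number) => Promise<void>,
  onError?: (err: unknown) => void,
): Promise<T> {
  while (true) {
    try {
      return await fn();
    } catch (err) {
      // Per-attempt error report, analogous to the ECONNREFUSED log line.
      onError?.(err);
      const next = backoff.next();
      if (next.done) {
        throw err;
      }
      await sleep(next.value);
    }
  }
}
```

The `sleep` function is injected so the loop can be exercised without real delays; in production it would typically wrap a timer.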

Tests

monitored_batch_span_processor.test.ts: warns when the queue fills; does not drop below the old 2048 default now that the default queue is larger.
otel.test.ts (new): stop() shuts the providers down exactly once across multiple/concurrent calls; flush() is a no-op after stop().

Local yarn build/test was not run — this workspace copy is not bootstrapped (the noir submodule isn't checked out and the full barretenberg→noir build chain is required). CI validates build + tests.

Created by claudebox · group: slackbot


fix(telemetry): raise span queue size and make telemetry shutdown ide…

0e50d46


AztecBot added labels Jun 16, 2026: ci-draft (run CI on draft PRs), ci-no-fail-fast (sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure), claudebox (owned by claudebox; it can push to this PR)

alexghr approved these changes Jun 16, 2026


alexghr marked this pull request as ready for review June 16, 2026 12:12

alexghr enabled auto-merge (squash) June 16, 2026 12:13

alexghr merged commit 05942c3 into merge-train/spartan Jun 16, 2026
43 of 47 checks passed

alexghr deleted the cb/otel-resilience branch June 16, 2026 12:13

AztecBot mentioned this pull request Jun 16, 2026 in feat: merge-train/spartan #24094 (Open)


