Skip to content

feat(telemetry): core-node observability foundation — OTLP logs + traces + metrics#1317

Open
Bojan131 wants to merge 20 commits into
mainfrom
feat/core-node-log-collection
Open

feat(telemetry): core-node observability foundation — OTLP logs + traces + metrics#1317
Bojan131 wants to merge 20 commits into
mainfrom
feat/core-node-log-collection

Conversation

@Bojan131

@Bojan131 Bojan131 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What & why

Task: enable log collection on core nodes → expanded per review into a proper observability foundation: opt-in OpenTelemetry logs + traces + metrics, off by default, built on V10's existing Logger seam. Nodes are independently operated, so the model is local-first + opt-in forwarding + at-source redaction.

Architecture

Call sites use the ambient @opentelemetry/api (withSpan/getMetrics in dkg-core); the SDK registers once at daemon boot, only when enabled and an endpoint resolves. Disabled ⇒ true no-op (no provider, no outbound calls, no per-call-site guards).

In this PR

  • Logs — structured export via OtlpLogWorker (bounded buffer + retry/backoff + redaction); records gain dkg.* resource/record attrs + trace_id/span_id.
  • Traces — explicit spans: agent.publish, agent.publish_from_swm, publisher.ack_collect/ack_peer_request/ack_verify_identity/storage_ack_handler, chain.eth_call/eth_getLogs/tx_send/tx_submit/tx_wait, sync.request/sync.response. Duration/status/exception; inbound boundaries are fresh roots.
  • Metrics — low-cardinality counters/histograms (dkg.publish.*, dkg.ack.*, dkg.chain.rpc.*, dkg.sync.*); bounded attribute keys only (no peer/tx/cg/op ids as labels).
  • Configtelemetry.{logs,traces,metrics}, env-first (OTEL_EXPORTER_OTLP_*), no TBD prod defaults; a signal registers only when its endpoint resolves. syslog still supported. Shutdown flush.
  • Node-side log collection (original scope) — per-node + fleet Grafana dashboards, Alloy bridge for the existing Loki 2.5.0, runbook + operator guide + deploy bundle in tools/log-collection-poc/.

Verified

  • Full graph builds; core (21), publisher ACK/host-mode (52), telemetry (9) suites green.
  • Tests: disabled no-op, span ERROR status + exception, metric label-cardinality allowlist, log↔trace correlation, redaction, non-blocking exporter failure.
  • Live: a real dkg daemon → local collector on 127.0.0.1:4318 exported logs+traces+metrics; payloads contained all 11 span names (publish → ACK collect → per-peer → verify → RPC), all 8 metric instruments, resource identity, and dkg.operation_id on logs.

Remaining (access-gated, not code)

Deploy on the OriginTrail fleet (Alloy on the Loki host, Cloudflare ingest hostname + token, node configs) — see tools/log-collection-poc/production/MANAGER-HANDOFF.md. Real collector endpoints for testnet/mainnet traces/metrics to be provided by ops (inert until then, by design).

🤖 Generated with Claude Code

Bojan131 and others added 2 commits June 24, 2026 14:38
Adds vendor-neutral OpenTelemetry (OTLP/HTTP) log forwarding for V10 nodes,
built on the existing Logger sink seam rather than rebuilding V8's Graylog.

- core: extract canonical `LogRecord` type; add `log-redaction.ts`
  (`createLogRedactor`) — at-source, key-name-driven scrubbing of wallet
  keys / mnemonics / tokens / JWTs, conservatively leaving public 0x hashes
  and Merkle roots intact.
- node-ui: `OtlpLogWorker` — hand-rolled OTLP/HTTP logs exporter (the JS Logs
  SDK is still "Development"); bounded drop-oldest buffer, batched async flush,
  exponential backoff honoring Retry-After, outbound-only, never blocks the node.
- cli: fan out the daemon `Logger.setSink` to local SQLite (full fidelity) +
  a single redacted copy to syslog and/or OTLP; add `config.telemetry.logs`
  ({exporter,endpoint,token,level,redact,bufferMaxEntries}) + per-network
  `otlpLogs` endpoint; boot + runtime exporter dispatch. Forwarding stays OFF
  by default; local logging is always on.
- tools/log-collection-poc: reference OTel Collector -> Loki -> Grafana stack
  + sample sender driving the real redactor->exporter pipeline.

Verified: 10 redaction tests + 7 OTLP-worker tests (real local HTTP server,
incl. 503-retry and 400-drop) + full core suite (1114) green; live end-to-end
through the stack confirmed logs land in Loki, dkg.* correlation metadata is
queryable, and a wallet private key + mnemonic arrive [REDACTED].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the hosted-fleet "pick a node, see its logs" workflow first-class:

- node identity is now emitted as OTel `service.instance.id` (= node name)
  and `deployment.environment` (= network), which Loki promotes to index
  labels (`service_instance_id`, `deployment_environment`) with no Loki/
  collector config change — so they can drive Grafana dashboard variables.
  dkg.* stay as structured metadata for correlation.
- add tools/log-collection-poc/grafana-dashboard-dkg-node-logs.json: a
  Node + Environment selector, regex filter, log-volume-by-level panel, and
  the built-in time range — importable into any Grafana with a Loki source.
- sample sender takes [nodeName] [network] so multiple named nodes can be
  simulated.

Verified live: two named nodes (testnet-core-01 / mainnet-core-01) ingested;
Loki label_values(service_instance_id) returns both; the dashboard's Node
dropdown filters to a single node's logs over a chosen time range.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: retry_exhausted

Wires V10 OTLP node logs into a pre-existing Loki 2.5.0 (which lacks native
OTLP), verified end-to-end locally against Loki 2.5.0:

- tools/log-collection-poc/production/config.alloy — Grafana Alloy: OTLP/HTTP
  receiver → loki.write to the existing local Loki, promoting node identity
  (service.name, service.instance.id, deployment.environment, dkg.node_role)
  to index labels for per-node Grafana selection.
- docker-compose.alloy.yml — run Alloy on the Loki host (network_mode host).
- docker-compose.sim.yml — local Loki-2.5.0 + Alloy rig that validated the
  whole path (per-node labels + redaction survive the bridge).
- grafana-dashboard-dkg-node-logs.json — Node + Level selectors, regex filter,
  json|line_format clean display, volume-by-level; queries validated on 2.5.0.
- node-config.example.json + RUNBOOK.md — exact node config + deploy steps
  (Alloy on host, Cloudflare-tunnel ingest hostname + bearer token, point nodes).
- sample sender gains OTLP_TOKEN env for authed pushes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: retry_exhausted

Bojan131 and others added 2 commits June 24, 2026 16:05
Turnkey, copy-paste artifacts so the access-gated deploy is zero-guesswork:
- cloudflared-config.example.yml — ingest hostname → Alloy (no Access policy).
- alloy.systemd.service — non-Docker alternative for running Alloy on the host.
- smoke-test.sh — post-deploy check (push one OTLP log + token → confirm it
  reaches Loki; verify tokenless push is rejected).
- MANAGER-HANDOFF.md — DONE / VERIFIED (incl. real-daemon proof) / BLOCKED-on-
  access summary with the exact 3 remaining steps.

Verified the REAL dkg daemon (not a harness) ships logs end-to-end: booted an
edge node with telemetry.logs.exporter=otlp → 30 live operational log lines
landed in a real Loki 2.5.0 via Alloy, queryable by service_instance_id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ication note

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10046239}

Bojan131 and others added 3 commits June 24, 2026 17:01
- grafana-dashboard-dkg-fleet-logs.json — fleet overview (active-node count,
  log volume per node, errors per node, recent fleet errors). Imported into
  polaris and verified ('Active nodes = 2' against test data).
- OPERATOR-GUIDE.md — self-serve enablement for any node operator (their own
  backend), complementing the OriginTrail-fleet RUNBOOK.
- example-alerts.md — node-quiet, error-spike, and redaction-leak alert rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
These evm-module/deployments/localhost/* files were untracked build artifacts
in the worktree and got swept into an earlier 'git add -A'. They are unrelated
to log collection and not tracked on main. Removing from the PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ocalhost/ dir should have been dropped)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: retry_exhausted

Comment thread packages/core/src/log-redaction.ts
Comment thread tools/log-collection-poc/production/RUNBOOK.md Outdated
Comment thread packages/cli/src/daemon/lifecycle.ts Outdated
Comment thread packages/node-ui/test/otlp-log-worker.test.ts Outdated
Comment thread packages/node-ui/src/otlp-log-worker.ts Outdated
Comment thread packages/node-ui/src/otlp-log-worker.ts
…correlation, config

- core/src/telemetry-api.ts: call-site facade (getTracer, withSpan, linkedSpan,
  currentTraceIds, activeSpanContext) + low-cardinality metric instruments
  (getMetrics/rebuildMetrics). Global-API model → no-op when no provider.
- node-ui/src/telemetry.ts: real initTelemetry (NodeTracerProvider + MeterProvider
  + OTLP/proto exporters + shared Resource + W3C/AsyncLocalStorage via register());
  shutdownTelemetry; idempotent; no-op when disabled/no endpoint. (Logs stay on
  OtlpLogWorker — OTel Logs SDK still Development.)
- core/src/logger.ts: attach active trace_id/span_id to every LogRecord;
  OtlpLogWorker emits them as top-level OTLP fields. Redaction/buffer untouched.
- cli config: telemetry.traces/metrics blocks; lifecycle registers the SDK at boot
  (env OTEL_EXPORTER_OTLP_* > config; no TBD prod defaults) + flushes on shutdown.

Deps: @opentelemetry/api (core); sdk-trace-node, sdk-metrics, exporter-*-otlp-proto,
resources, semantic-conventions (node-ui). Full graph builds; 21 core tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10094535}

…ain + tests

Spans (explicit boundaries, duration/status/exception): agent.publish,
agent.publish_from_swm, sync.request, sync.response (inbound root),
publisher.ack_collect, publisher.ack_peer_request, publisher.ack_verify_identity,
publisher.storage_ack_handler (inbound root), chain.eth_call (×3 seams),
chain.eth_getLogs, chain.tx_send/tx_submit/tx_wait.

Metrics (low-cardinality, bounded attrs only): dkg.publish.total/.duration,
dkg.ack.quorum/peer/verify/handler.total, dkg.chain.rpc.total/.duration/.failover.total,
dkg.sync.request/response.total. No peer_id/cg_id/tx_hash/op_id as labels.

Tests (node-ui/test/telemetry.test.ts, 9): disabled=no-op, span ERROR status +
exception recording, metric label-cardinality allowlist, log↔trace correlation.
Full graph builds; publisher ACK/host-mode suite (52) green; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10163598}

…ector for verification

- otel-collector-config.yaml: add traces + metrics pipelines (debug exporter)
  alongside logs→Loki, so a node exports all three signals to one collector.
- mock-otlp-collector.mjs: zero-dep local OTLP receiver for verifying export
  without Docker.

Verified live: a real dkg daemon (telemetry on, endpoints → :4318) exported
logs+traces+metrics; captured payloads contained all 11 span names
(agent.publish, publisher.ack_collect/peer/verify/handler, chain.eth_call/
getLogs/tx_send, sync.request/response), all 8 metric instruments, the resource
identity (service.instance.id/dkg.network/dkg.node.role/dkg.peer_id), and
dkg.operation_id on logs. enabled=false → no-op (unit-tested).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Bojan131

Copy link
Copy Markdown
Contributor Author

Review addressed — moved from "OTLP log PoC" to an observability foundation (logs + traces + metrics)

Commits: a30318d2e (SDK foundation + log correlation + config), 80e536d76 (instrumentation + tests), a841d005d (collector pipelines + live-verification harness). Full graph builds; core (21), publisher ACK/host-mode (52), and new telemetry (9) suites green.

Architecture: call sites use the ambient @opentelemetry/api (withSpan/getMetrics in dkg-core); the SDK registers once at daemon boot only when enabled+endpoint resolves. So instrumentation is compiled-in but inert (true no-op) when disabled — no per-call-site guards, no DI threading.

1. Logs ✅

  • Structured DKG log export retained on the existing worker (bounded drop-oldest buffer + Retry-After backoff + at-source redaction) — unchanged.
  • Does not claim full capture — journald/full host logs remain the infra/Ansible OTel-agent path (noted in docs).
  • Records now carry resource attrs (service.name=dkg-node, service.instance.id, dkg.node.name/role, dkg.peer_id, dkg.network, dkg.chain) + record attrs (dkg.operation_id/name, dkg.module) + trace_id/span_id (top-level OTLP fields) when a span is active.

2. Traces ✅ (explicit spans, not blanket wrapping)

agent.publish, agent.publish_from_swm, publisher.ack_collect, publisher.ack_peer_request, publisher.ack_verify_identity, publisher.storage_ack_handler (inbound root), protocol/sync.request, sync.response (inbound root), chain.eth_call (3 seams), chain.eth_getLogs, chain.tx_send/tx_submit/tx_wait. Each records duration, status, and exception attrs; failover loops use one span + per-attempt events (not span-per-attempt). Inbound boundaries are fresh roots (no W3C propagation over libp2p; cross-node correlation via dkg.operation_id).

3. Metrics ✅ (low-cardinality)

dkg.publish.total/.duration, dkg.ack.quorum/peer/verify/handler.total, dkg.chain.rpc.total/.duration/.failover.total, dkg.sync.request/response.total. Attributes bounded to method/module/protocol_id/chain/role/outcome/error_type/decline_code(fixed enum)/result/retryable. No kaId/tx-hash/peer-id/operation-id/cg-id as labels — enforced by a unit test asserting an attribute-key allowlist.

4. Transport/config ✅

  • No hardcoded TBD prod endpoints. Endpoints resolve env-first (OTEL_EXPORTER_OTLP_*) then config; a signal registers only when an endpoint resolves — never a guessed default.
  • App → local collector is the documented path. OTLP/HTTP-proto export. Graylog-native OTel needs gRPC; not claimed (docs say "point at a collector"; traces/metrics need Tempo/Prometheus etc.).

5. Backward compatibility ✅

  • telemetry.enabled=false ⇒ all remote export is no-op (unit-tested; no provider registered).
  • Local SQLite + dashboard logging unchanged; redaction/buffer untouched.
  • Telemetry never blocks publish/sync/query (async batched export; failures swallowed) and never crashes the daemon (init wrapped, shutdown flush is best-effort).
  • Existing syslog exporter still explicitly supported.

Acceptance criteria ✅

  • Disabled ⇒ zero outbound calls (test telemetry — disabled is a total no-op).
  • Live: a real daemon with a local collector on 127.0.0.1:4318 exported logs + traces + metrics; captured OTLP payloads contained all 11 span names (incl. agent.publishpublisher.ack_collectack_peer_requestack_verify_identitychain.eth_call), all 8 metric instruments, the resource identity, and dkg.operation_id on logs.
  • Logs searchable by dkg.operation_id and correlated to trace/span ids (test log ↔ trace correlation).
  • Tests cover redaction, non-blocking exporter failure, disabled no-op, span ERROR status + exception, and metric label cardinality.

Open item (ops, not code): real collector hostnames for testnet/mainnet — until provided, traces/metrics stay inert on those networks by design (no guessed prod URL).

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10165655}

@Bojan131 Bojan131 changed the title feat(telemetry): opt-in OTLP log collection for core nodes (PoC) feat(telemetry): core-node observability foundation — OTLP logs + traces + metrics Jun 25, 2026
…tric

Closes the last review 'minimum spans' / metrics item: ProtocolRouter.send is
now wrapped (span 'protocol_router.send' + dkg.protocol_router.send.total/
.duration, protocol_id label). send body moved to private sendInner; the public
send is a thin span+metric wrapper (no-op when telemetry off). protocol-router
suite (94) + telemetry suite (9) green; full graph builds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Bojan131

Copy link
Copy Markdown
Contributor Author

Follow-up (): closed the last item from the review's Minimum spans + Metrics lists — protocol_router.send is now an explicit span, plus a P2P protocol send duration histogram (dkg.protocol_router.send.duration) and dkg.protocol_router.send.total (label: protocol_id, outcome). ProtocolRouter.send body moved to a private sendInner; the public method is a thin span+metric wrapper (no-op when telemetry is off). protocol-router suite (94) + telemetry suite (9) green; full graph builds. Every span and metric the review enumerated is now implemented.

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10168322}

- 🔴 redaction: auth-scheme credentials ('authorization: Bearer <token>') now
  redact the token too, not just the scheme word (log-redaction.ts) + test.
- 🟡 daemon telemetry routing now unit-tested: extracted resolveOtelSignals /
  resolveLogExporterMode (cli/src/telemetry-config.ts) used by lifecycle.ts;
  test covers env precedence, no-TBD-default, per-signal gating, exporter mode.
- 🟡 OtlpLogWorker overflow test now proves drop-OLDEST (survivors = newest 3),
  not merely capped.
- 🟡 RUNBOOK: corrected Alloy bind address to 0.0.0.0:4318 (matches config.alloy)
  + firewall note.
- 🔵 OtlpLogWorker.minLevel narrowed to a LogLevel union.
- 🔵 droppedNonRetryable now surfaced in the diagnostic message (no dead state).

Full graph builds; redaction (11), node-ui worker+telemetry (16), cli routing (7) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10174996}

…llection

# Conflicts:
#	packages/agent/src/dkg-agent-publish.ts

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10175178}

- agent: span attr read this.chain.chainId threw TypeError when this.chain is
  undefined (e.g. the publish-literal-size unit test), masking the expected
  OVERSIZED_RDF_LITERAL throw. Optional-chain the guard (this.chain?.chainId) in
  _publish + publishFromSharedMemory.
- chain: instrumentation added two internal helpers to EVMChainAdapter; the
  mock-adapter parity test flagged them. Renamed to _rpcOutcomeForError /
  _sendContractTransactionInner (the test's '_'-prefix = internal convention),
  so they're excluded from the public-API parity check without a new exemption.

Verified locally: mock-adapter-parity (15) + publish-literal-size (2) green;
chain + agent build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: Error: turn/start: turn/start failed: Input exceeds the maximum length of 1048576 characters. (code -32602), data: {"input_error_code":"input_too_large","max_chars":1048576,"actual_chars":10175191}

…llection

# Conflicts:
#	packages/chain/src/evm-adapter-base.ts
Comment thread packages/cli/src/daemon/lifecycle.ts Outdated
@@ -0,0 +1,324 @@
{

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Do not commit generated localhost deployment artifacts

What's wrong
These files are generated environment output, not maintainable source. Committing full localhost deployment artifacts bloats the repository and reviews, creates high-churn merge noise, and makes real deployment records harder to audit. It also mixes local node state with the code changes, which is a weak ownership boundary for deployment data.

Example
The PR adds files like packages/evm-module/deployments/localhost/Ask.json and ContextGraphStorage.json with local addresses, block hashes, transaction hashes, full ABIs, bytecode, receipts, and embedded compiler metadata. This is about 32 generated files and roughly 9.6 MB under deployments/localhost.

Suggested direction
Keep source-controlled deployment data limited to canonical network artifacts or intentionally small fixtures. Local devnet artifacts should be regenerated by setup scripts, or represented by a compact checked-in manifest if deterministic local bootstrap needs one.

For Agents
Remove the new generated files under packages/evm-module/deployments/localhost unless there is a documented canonical fixture contract for them. Prefer a deploy script, compact manifest, or focused test fixture containing only fields tests actually consume. Add an ignore rule or fixture allowlist so local deployment output does not keep returning in future PRs.

lastRetryable = err;
if (i < this.providers.length - 1) {
noteRpcFailover(`${label} broadcast`, this.rpcUrls[i], err, this.rpcUrls[i + 1]);
return withSpan(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Centralize chain RPC telemetry instead of weaving it through failover logic

What's wrong
The new tracing and metric calls are repeated across every success, error, and exhaustion branch of already delicate RPC failover flows. That turns core chain transport code into instrumentation-heavy control flow and makes future changes brittle because every branch must remember to emit the same low-cardinality fields consistently.

Example
broadcastSignedTransactionWithFailover and getTransactionReceiptWithFailover now each allocate metrics, track startedAt, duplicate method/chain attributes, record duration in several branches, and classify ok/error/timeout/exhausted outcomes inline. The same pattern recurs in ensureV10ApproveTrac, resolveContract, getIdentityId, and log fetching.

Suggested direction
Introduce a small local helper such as withChainRpcMetrics or recordChainRpcOutcome(method, startedAt, attrs) so each failover loop only decides the transport result once, while one boundary owns duration/outcome/failover counter recording. That helper can also avoid the clever Parameters<Parameters<typeof withSpan>[1]>[0] span type by making the span boundary explicit.

For Agents
Look at packages/chain/src/evm-adapter-base.ts around broadcastSignedTransactionWithFailover, getTransactionReceiptWithFailover, _sendContractTransactionInner, ensureV10ApproveTrac, resolveContract, getIdentityId, and fetchEventLogsPage. Preserve metric names, labels, span names, and outcome classification, but move common record/finally/classification mechanics into one helper boundary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Centralize chain RPC instrumentation instead of inlining it into every failover path

What's wrong
The PR weaves telemetry bookkeeping directly into several already busy chain paths. That makes failover/control-flow logic harder to scan and creates repeated metric branches that can drift the next time a label, duration rule, or outcome mapping changes.

Example
broadcastSignedTransactionWithFailover records chainRpcTotal and chainRpcDuration on success, already-known, non-retryable error, and exhaustion; getTransactionReceiptWithFailover repeats the same structure, and ensureTokenAllowance adds another try/catch/finally wrapper for eth_call. A new RPC path now has to copy this pattern to stay consistent.

Suggested direction
Extract a small withChainRpcTelemetry / recordChainRpcOutcome helper that owns span creation, duration recording, outcome classification, and bounded labels. Leave the domain methods responsible for provider iteration and chain-specific decisions.

For Agents
In packages/chain/src/evm-adapter-base.ts, introduce a focused helper near _rpcOutcomeForError for chain RPC spans/metrics/duration/outcome recording. Preserve the existing failover behavior and current label values, then update the changed RPC call sites to use the helper. Add focused coverage that success, non-retryable error, timeout, and exhausted paths still emit the same metric attributes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Centralize chain RPC telemetry instead of recording it in every failover branch

What's wrong
The PR interleaves telemetry bookkeeping with every return and throw path in several already-complex failover routines. That makes the control flow harder to scan and creates a maintenance trap: every future RPC path must manually remember exactly which metric calls and labels belong on each branch.

Example
broadcastSignedTransactionWithFailover records the same total/duration label set on success, already-known, non-retryable error, and exhaustion paths; the same pattern appears again in receipt lookup, eth_call, and eth_getLogs. A new RPC outcome or label change now requires coordinated edits across all of those branches.

Suggested direction
Add a private helper such as withChainRpcTelemetry / recordChainRpcOutcome that owns startedAt, chainRpcTotal, chainRpcDuration, and chainRpcFailoverTotal emission. Keep the failover loops focused on provider selection and error handling.

For Agents
Refactor the new telemetry in packages/chain/src/evm-adapter-base.ts into one small chain-RPC instrumentation helper. Preserve the current span names, metric names, low-cardinality labels, and failover behavior; the chain RPC telemetry tests should still prove the emitted labels stay bounded.

… + gitignore the dir

The 2nd main-merge's 'git add -A' re-swept the untracked
packages/evm-module/deployments/localhost/*.json (regenerated by a local EVM
integration test run) into the PR — ~36k lines, unrelated to this change and
not tracked on main. Remove them, revert localhost_contracts.json to main, and
add the localhost/ dir to .gitignore so it can't recur.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread packages/node-ui/test/telemetry.test.ts
* `withSpan('publisher.ack_collect')` wrapper. Preserves the exact
* control flow + error propagation of the original method body.
*/
private async collectInner(ctx: {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This instrumentation pushes ACKCollector past the 1k-line boundary instead of decomposing it

What's wrong
The new collectInner wrapper and inline telemetry add another abstraction layer to an already large consensus-critical class. The wrapper mostly duplicates the public method’s parameters into a second local contract, then re-destructures them. That makes the file larger and harder to scan without removing any concepts from the reader’s head.

Example
This PR takes packages/publisher/src/ack-collector.ts from 964 lines on the base ref to 1095 lines, mostly by adding the collect wrapper, collectInner parameter object, and inline span/metric recording in the per-peer request path.

Suggested direction
Decompose before adding more code here. A cleaner shape would keep collect as the behavior owner and move span/metric recording into a small wrapper/helper such as recordAckRound, recordAckPeerRequest, and recordAckVerify, or split the per-peer request/verification logic out of the class entirely.

For Agents
In packages/publisher/src/ack-collector.ts, preserve ACK collection behavior but extract telemetry into focused helpers or a small ACK telemetry module. Avoid duplicating the public collect parameter surface; prove publish ACK collection and per-peer ACK/decline/transport metric behavior still match existing expectations.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Do not push ACK collection past 1k lines with a mirror-parameter inner method

What's wrong
The telemetry wrapper adds an artificial collectInner(ctx) layer whose context object mirrors the original method params plus REQUIRED_ACKS and params. This does not simplify the ACK collector; it adds a second contract and helps push the file to 1,095 lines, crossing the 1k-line smell threshold for code added by this PR.

Example
collect destructures the incoming params, repackages the same fields into collectInner, collectInner redeclares those fields in a context type, then destructures them again. Any future collection parameter now has to be kept in sync across the public param type, the wrapper object, and the inner destructuring.

Suggested direction
Keep one canonical parameter boundary. Either wrap the original method body directly and use small telemetry helpers for result/failure recording, or extract real ACK subcomponents instead of moving the whole previous method behind a context bag.

For Agents
In packages/publisher/src/ack-collector.ts, remove the mirror collectInner(ctx) boundary or replace it with a meaningful extraction. Preserve collect behavior and metric outcomes. If decomposing, move real subflows such as per-peer ACK request or identity verification into focused helpers/modules so the file drops back under a healthy size boundary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This pushes ACKCollector past the 1k-line boundary without real decomposition

What's wrong
The PR takes a file that was just under the project’s healthy-size threshold and grows it past 1k lines. The added split mostly wraps the old body rather than reducing concepts, so the collector now has more ceremony and more inline cross-cutting telemetry while remaining one large module.

Example
The file was 966 lines on main and is now 1095 lines. The new collectInner wrapper takes a 12-field object plus the original params, and additional per-peer / verifier spans are inserted into the collector’s hot loop.

Suggested direction
Extract an AckTelemetry helper or split the collector into focused pieces before adding instrumentation. The public collect path should not need to carry both duplicated destructured fields and the full original params just to support a span wrapper.

For Agents
Decompose the ACK collector changes so the file returns below the 1k-line boundary or has a strong structural split. Preserve the current ACK collection behavior and metric/span labels; move telemetry wrapper logic into a focused helper or split peer request / identity verification into smaller modules.

/** Duration buckets (ms) for chain RPC histograms (faster floor). */
const RPC_DURATION_BUCKETS = [25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 120000];

export interface DkgMetrics {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Raw metric instruments leak telemetry policy into every call site

What's wrong
The interface documents low-cardinality labels, but it gives callers unrestricted Counter and Histogram handles. That turns telemetry into scattered ad-hoc object literals across chain, publisher, sync, and agent code. The design relies on every future call site remembering the label contract manually.

Example
DkgMetrics exposes chainRpcTotal, chainRpcDuration, ackPeerTotal, etc. as raw OTel instruments. Call sites then repeat label construction and timing logic, for example the repeated rpc_method/outcome/retryable/chain_id blocks in evm-adapter-base.ts and the ackPeerTotal updates in ack-collector.ts.

Suggested direction
Make the core telemetry facade enforce the model: recordChainRpc, recordPublish, recordAckPeer, recordSyncResponse, etc. should own allowed labels, duration recording, and outcome classification. That would delete much of the repeated try/catch metric boilerplate and keep high-cardinality policy in one place.

For Agents
Refactor packages/core/src/telemetry-api.ts to expose domain-level recorder functions instead of raw instruments. Then update chain/publisher/sync/agent call sites to call those recorders while preserving emitted metric names and label values; tests should assert the same low-cardinality labels are produced.

this.timer.unref();
}

stop(): void {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The log worker exposes a synchronous stop while doing async shutdown work

What's wrong
stop() marks the worker stopped, clears the timer, then launches void this.flush(true). That hides an asynchronous finalization step behind a synchronous API, which makes the daemon lifecycle harder to reason about and prevents callers from composing shutdown atomically.

Example
stopTelemetry() calls stopOtlpExporter(), which calls OtlpLogWorker.stop(). That method starts a final async flush but returns immediately, so callers cannot express or enforce an orderly telemetry shutdown.

Suggested direction
Make the boundary explicit: either stop() should be async and await the final flush, or there should be a separate shutdown() method that daemon teardown awaits. Fire-and-forget is fine for periodic flushing, but not for the lifecycle API.

For Agents
Change OtlpLogWorker.stop or add shutdown(): Promise<void> in packages/node-ui/src/otlp-log-worker.ts, then await it from the daemon shutdown path. Preserve the non-blocking push() behavior and keep flush errors swallowed or reported according to the existing contract.

logPusher?.push(entry);
// Fan out a single redacted copy to every active remote shipper.
if (logPusher || otlpExporter) {
const safe = redactForRemote(entry);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Remote log redaction wiring is not verified

What's wrong
The PR tests the redaction helper and the OTLP worker separately, but not the daemon wiring that decides which copy leaves the node. This is the privacy-critical behavior: local logs should remain full-fidelity, while syslog/OTLP receive only the redacted record.

Example
A future edit that accidentally changes the fan-out to otlpExporter?.push(entry) would leak operationalWalletPrivateKey=... remotely. The redactor tests and OtlpLogWorker tests would still pass because neither test proves Logger.setSink sends the redacted copy to the remote shippers while keeping the local DB record unchanged.

Suggested direction
Add a lifecycle-level or extracted-helper test that exercises the actual Logger.setSink fan-out path with sensitive log content and configured telemetry modes.

For Agents
Add a focused daemon-sink wiring test, or extract the sink fan-out into a small helper that can be unit-tested. Mock/stub dashDb.insertLog, LogPushWorker.push, and OtlpLogWorker.push; emit a log containing a private key; assert the DB gets the original message, both remote shippers get [REDACTED], and logs.exporter: 'none' leaves remote pushers inactive.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Remote redaction wiring is not verified at the daemon sink

What's wrong
The existing tests prove the redactor can scrub a standalone record and that OtlpLogWorker can send records, but they do not exercise the trust-boundary code that actually applies redaction before forwarding daemon logs. Because this path handles secrets leaving the node, a regression that accidentally pushes entry instead of safe, or fails to apply configured extra redact keys, would not be caught.

Example
A daemon-level test should log operationalWalletPrivateKey=0x... with remote telemetry enabled and assert that dashDb.insertLog(...) receives the original message while both logPusher.push(...) and otlpExporter.push(...) receive [REDACTED], including an extra key from telemetry.logs.redact.

Suggested direction
Add a focused test around the daemon log sink fan-out, not just the standalone redactor and standalone OTLP worker.

For Agents
Look at the Logger.setSink block in packages/cli/src/daemon/lifecycle.ts. Extract the fan-out/redaction logic or add a focused daemon unit with fake DB, syslog worker, and OTLP worker. Preserve full-fidelity local logs and prove every remote shipper gets only the redacted copy.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Remote redaction is not verified at the daemon fan-out boundary

What's wrong
The PR adds at-source redaction before remote log shipping, but the tests only verify the pure redactor and the OTLP worker separately. They do not prove that daemon log forwarding applies the redactor before either remote shipper receives a record.

Example
A regression that changed line 2012 to const safe = entry, or pushed entry directly to otlpExporter, would still leave packages/core/test/log-redaction.test.ts green because those tests only call the redactor directly; they never exercise the daemon sink that forwards records off-node.

Suggested direction
Cover the actual Logger.setSink fan-out behavior, not only createLogRedactor in isolation, because this is the trust boundary where secrets either stay local or leave the node.

For Agents
Add a focused test around the daemon log fan-out path, or extract a small helper from lifecycle.ts, using fake dashboard/remote shippers. Prove local storage receives the original message while remote shippers receive a redacted message for privateKey=... and configured extra keys.

…ll-signal docs

Coworker review on PR #1317 (5 items):

1. Emit dkg.chain on logs. OtlpLogWorker gains a chainId option; the daemon
   passes config.chain?.chainId so logs carry the same dkg.chain resource attr
   as traces/metrics ({service_name="dkg-node", dkg_chain="base:8453"} queries
   now work across all signals).

2. Normalize on dotted OTel-style resource keys for logs to match
   traces/metrics: dkg.node_name -> dkg.node.name, dkg.node_role ->
   dkg.node.role. Loki sanitizes dots to underscores, so the promoted labels
   (dkg_node_role, etc.) are unchanged and dashboards are unaffected; the Alloy
   promotion hint is updated to the dotted source key so the attribute still
   resolves.

3. node-config.example.json now shows all three signals (logs + traces +
   metrics) with the real config keys, and documents that logs use a
   hand-rolled OTLP/HTTP exporter while traces/metrics use the OTel SDK.

4. config.alloy states it is logs-only for the current polaris backend and
   carries commented traces->Tempo / metrics->Prometheus(Mimir) routing ready
   to enable.

5. RUNBOOK + OPERATOR-GUIDE call out that telemetry.logs.exporter defaults to
   syslog when unset, so hosted nodes MUST set "exporter": "otlp" explicitly,
   and clarify the logs(custom OTLP) vs traces+metrics(OTel SDK) transport split.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
};
}

let metricsCache: DkgMetrics | undefined;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Avoid the global metrics cache/rebuild contract

What's wrong
getMetrics() caches instruments against whatever global meter exists on first use, while rebuildMetrics() must be called from another package after provider registration. That is a hidden boot-order dependency in a facade whose goal is to keep instrumentation call sites simple and inert when telemetry is disabled.

Example
If any code calls getMetrics() before initTelemetry() registers the real meter, the cache binds to the no-op meter. Metrics only recover because initTelemetry() knows to call rebuildMetrics(). A future CLI path, library embedder, or test setup can miss that side effect and silently keep no-op instruments.

Suggested direction
Replace the public rebuildMetrics() contract with a clearer installation boundary, such as a metrics registry installed by telemetry bootstrap, or proxy recorders that resolve the active meter internally without exposing cache invalidation to other packages.

For Agents
Update packages/core/src/telemetry-api.ts and packages/node-ui/src/telemetry.ts so the metrics facade has a single explicit ownership boundary. Preserve disabled no-op behavior and existing instrument names/labels. Prove that recording before and after telemetry init does not require callers to know about a rebuild step.

…llection

# Conflicts:
#	packages/chain/src/evm-adapter-base.ts
Comment thread packages/node-ui/test/otlp-log-worker.test.ts Outdated
Comment thread packages/node-ui/src/telemetry.ts
…strumentation in tests (#1317 review)

Addresses the red otReviewAgent findings on PR #1317:

🔴 Disable/enable telemetry now controls ALL signals, not just logs.
  Previously `telemetry.enabled` and the runtime settings toggle only
  started/stopped the log exporters; the OTel trace/metric SDK was
  initialized once at boot and torn down only at shutdown. So disabling
  telemetry kept exporting traces/metrics, enabling from a boot-disabled
  state never started them, and a log-exporter start failure flipped the
  master `enabled=false` (killing metrics/traces too).
  - lifecycle.ts: extract startOtelSdk(); startTelemetry() now starts OTel
    SDK + log exporter; stopTelemetry() is async and awaits shutdownTelemetry();
    boot + settings-toggle both go through the same path; a failed log
    shipper no longer disables the master gate or the other signals.
  - telemetry.ts: shutdownTelemetry() clears the OTel API globals
    (trace.disable()/metrics.disable() + rebuildMetrics()) so a live
    re-enable can register fresh providers instead of silently no-opping.

🔴 Metric test now exercises a REAL instrumented call site.
  - chain/test/chain-rpc-telemetry.unit.test.ts (new): drives the actual
    contractReadWithFailover seam under an in-memory meter and asserts
    dkg.chain.rpc.total carries ONLY {rpc_method,outcome,retryable,chain_id}
    — mutation-verified to fail if a high-cardinality label (e.g. rpc_url)
    is added. Adds @opentelemetry/api + sdk-metrics as chain devDeps.
  - node-ui/test/telemetry.test.ts: new initTelemetry() bootstrap test that
    exports a span + metric to a local OTLP collector (covers the real
    OTLP/proto exporter + register()/rebuildMetrics() path).

🔴 OTLP retry test no longer counts the failed 503 as delivery evidence.
  - otlp-log-worker.test.ts: the server now records response status; the
    503→200 case asserts the record arrived on the SUCCESSFUL request.

🔴 redaction of `authorization: Bearer <token>` — verified already fixed
  (auth-scheme alternation) and guarded by an existing regression test;
  no code change needed.

All affected suites green: node-ui 1475, chain 711, core redaction 11,
cli telemetry-config 7.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
export type { LlmConfig, LlmChatRequest, LlmChatMessage, LlmStreamEvent, LlmCompletionResult, LlmCapabilities } from './llm/types.js';
export { initTelemetry, recordGauge, setOperationSpan, isTelemetryConfigured } from './telemetry.js';
export type { TelemetryConfig } from './telemetry.js';
export { initTelemetry, shutdownTelemetry, isTelemetryConfigured } from './telemetry.js';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Public telemetry exports were removed and the init config shape now silently ignores old callers

What's wrong
@origintrail-official/dkg-node-ui is a published package, and this PR replaces the telemetry export surface instead of extending it. That breaks existing TypeScript imports immediately, and worse, callers that pass the old metricsEndpoint object shape to initTelemetry get a silent no-op because the new implementation only checks metrics.endpoint.

Example
Existing code that does import { initTelemetry, recordGauge, setOperationSpan, type TelemetryConfig } from '@origintrail-official/dkg-node-ui' no longer compiles because those named exports/types are gone. Existing JS that calls initTelemetry({ enabled: true, metricsEndpoint: 'http://collector/v1/metrics' }) still calls a function, but no metrics provider is registered because the new code only reads cfg.metrics?.endpoint.

Suggested direction
Keep deprecated compatibility shims for TelemetryConfig, recordGauge, setOperationSpan, and the old metricsEndpoint option, or explicitly version this as a breaking API change and update all documented consumers.

For Agents
Look at packages/node-ui/src/index.ts and packages/node-ui/src/telemetry.ts. Preserve backwards-compatible exports and normalize the old TelemetryConfig shape to the new nested TelemetryInitConfig, or make this an explicit major-version/API migration with all consumers updated. A regression test should import the old names and call initTelemetry({ enabled: true, metricsEndpoint }) to prove metrics still initialize or fail loudly with a documented migration.

privateQuads?: Quad[],
opts?: PublishOpts,
): Promise<PublishResult> {
return withSpan('agent.publish', async (span) => {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Publish telemetry is duplicated across both publish entry points

What's wrong
The same cross-cutting observability sequence is pasted into two very busy publish flows in a 4.5k-line file. That makes future telemetry changes easy to apply to one path and miss in the other, and it makes the publish logic less direct without adding a real abstraction.

Example
The direct path at lines 1296-1304 and 1440-1459 mirrors the SWM path at lines 4143-4150 and 4282-4380: derive chain id, start a timer, set publish status, mark failed spans, and emit publishTotal / publishDuration.

Suggested direction
Use a helper such as withPublishTelemetry({ spanName, source, contextGraphId, chainId, attributes }, fn) that records status and metrics around the existing publish body. Let each method supply only the attributes that differ.

For Agents
In packages/agent/src/dkg-agent-publish.ts, factor the added publish telemetry into a small helper used by _publish and publishFromSharedMemory. Preserve span names (agent.publish, agent.publish_from_swm), source labels (direct, swm), status handling, and metric attributes.

params,
});
span.setAttribute('dkg.collected_acks', result.acks.length);
getMetrics().ackQuorumTotal.add(1, {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: ACK metrics are not verified through the real ACK paths

What's wrong
The PR adds several ACK telemetry counters in the collector and responder paths, but the added telemetry tests only manually emit facade counters. Existing ACK behavior tests exercise these flows with the default no-op meter, so they do not verify that the new instrumentation records the expected outcomes on success, quorum failure, peer decline, transport error, identity verification, or inbound handler reset.

Example
If the ackQuorumTotal.add(... outcome: 'reached') call were removed, or the QuorumUnmetError catch stopped recording timeout/impossible, the current ACK tests would still pass because they assert only the ACK result or thrown error. The same gap applies to peer decline/transport and storage-ack handler ack/decline/reset outcomes.

Suggested direction
Add focused metric assertions to the existing publisher ACK tests rather than only facade-level manual counter emissions.

For Agents
Extend packages/publisher/test/ack-collector.test.ts and packages/publisher/test/storage-ack-handler.test.ts with an in-memory OTel meter plus rebuildMetrics(). Drive the existing real collect/handler scenarios and assert dkg.ack.quorum.total, dkg.ack.peer.total, dkg.ack.verify.total, and dkg.ack.handler.total outcomes and bounded label keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants