Component(s)
exporter/loadbalancing
What happened?
Title: loadbalancingexporter: failed endpoint retries re-route already-delivered records, causing duplicates on healthy endpoints
Component: exporter/loadbalancing
Version: v0.145.0
Collector: otelcol-contrib
Description
When one or more static resolver endpoints become unavailable, the loadbalancing exporter delivers duplicate records to the remaining healthy endpoints on every retry cycle. The duplicate rate scales with the number of failed endpoints and the Tier 1 retry window: with one endpoint down we observe approximately 3× the normal record volume on the surviving endpoints; with two endpoints down the rate grows non-linearly and approaches 100% duplication. This behavior was observed under two distinct sub-exporter configurations.
Steps to Reproduce
Configure loadbalancingexporter with a static resolver pointing to 4 endpoints and retry_on_failure enabled at the Tier 1 level.
Begin sending logs through the pipeline.
Stop one of the 4 gateway endpoints (process kill, not graceful shutdown).
Observe delivery counts on the remaining 3 healthy endpoints.
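For reference, the setup in the steps above can be sketched as a minimal collector config. This is an illustrative sketch, not our production excerpt: hostnames are placeholders, and the split of retry/queue settings reflects the tiers described in this report.

```yaml
exporters:
  loadbalancing/n2a:
    routing_key: traceID          # logs lacking a traceID fall back to a random per-record key
    protocol:
      otlp:
        tls:
          insecure: true
        sending_queue:
          enabled: false          # Configuration B; true (the default) is Configuration A
        retry_on_failure:
          enabled: false          # rely on the Tier 1 (loadbalancing) retry instead
    resolver:
      static:
        hostnames:
          - gateway-1.example.com:4317
          - gateway-2.example.com:4317
          - gateway-3.example.com:4317
          - gateway-4.example.com:4317
    retry_on_failure:
      enabled: true               # Tier 1 retry; its window drives the duplicate volume
```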
Expected Behavior
Records that were successfully acknowledged by healthy endpoints before the retry are not retransmitted. Only records that failed delivery are retried.
Observed Behavior
Configuration A — protocol.otlp.sending_queue.enabled: true (default):
The per-endpoint async queue accepts batches destined for a downed endpoint and returns no error to the loadbalancing layer. The loadbalancing exporter does not retry. Records accumulate silently in the sub-exporter queue; duplicate delivery does not begin until the queue reaches capacity and the first error surfaces to the loadbalancing layer (~40 minutes at our traffic volume with queue_size: 1000). Once that threshold is crossed, the duplicate behavior described below begins.
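A quick back-of-the-envelope consistent with the ~40-minute figure (this assumes a roughly constant enqueue rate; the derived rate is an inference from our numbers, not a measured value):

```python
queue_size = 1000            # sending_queue.queue_size on the sub-exporter
minutes_to_fill = 40         # observed delay before "sending queue is full" surfaces
requests_per_minute = queue_size / minutes_to_fill
print(requests_per_minute)   # 25.0 requests/min routed to the downed endpoint
```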
Configuration B — protocol.otlp.sending_queue.enabled: false:
With the async queue removed, errors from the failed endpoint surface immediately to the loadbalancing layer. On every retry, records that were already successfully delivered to healthy endpoints are retransmitted to those endpoints. The healthy endpoints show approximately 3× normal ingest rate with one of four endpoints down, beginning immediately after the first failure rather than after a delay. Splunk confirms the records are genuine duplicates (same _raw, same timestamp).
In both configurations, the duplicate records are real — they are not an artifact of indexing or search. Configuration B makes the problem visible immediately; Configuration A conceals it until the sub-exporter queue is exhausted.
Configuration (relevant excerpt — Configuration B)
Additional Notes
Logs without a traceID attribute receive a random() routing key per record. All of our log traffic falls into this category.
The behavior is reproducible and consistent across test runs.
Downstream sink (Splunk HEC) confirms duplicates are real; they are not an artifact of indexing or search.
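The mechanism we believe we are seeing can be illustrated with a toy simulation. This is NOT the exporter's actual code, just a sketch of the failure mode: each record gets a fresh random routing key per attempt (the no-traceID case), a downed endpoint fails its shard, and the whole batch — including shards already delivered — is retried.

```python
import random

def simulate(num_records=1000, endpoints=4, down=(0,), max_attempts=3, seed=7):
    rng = random.Random(seed)
    delivered = [0] * endpoints            # deliveries seen by each endpoint
    copies = [0] * num_records             # successful deliveries per record
    for _ in range(max_attempts):
        shard_failed = False
        for r in range(num_records):
            ep = rng.randrange(endpoints)  # random() routing key per record
            if ep in down:
                shard_failed = True        # downed endpoint rejects its shard
            else:
                delivered[ep] += 1         # healthy endpoint ingests (again)
                copies[r] += 1
        if not shard_failed:
            break                          # clean attempt: no batch-level retry
    return delivered, copies

delivered, copies = simulate()
# With 1 of 4 endpoints down and 3 attempts, each healthy endpoint sees
# roughly 3x its fair share (~750 deliveries instead of ~250 unique records).
```

With three attempts this reproduces the ~3× ingest rate we observe on healthy endpoints; the duplicate factor tracks the retry count, which is why a longer Tier 1 retry window makes it worse.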
Collector version
0.145.0
Environment information
Environment
OS: RHEL 9
Compiler (if manually compiled): --
OpenTelemetry Collector configuration
Log output
Log 1 — queue_sender.go:50 — Dropping data (Tier 1 queue exhausted)

{"hostname":"ocp-wdc-n2a-int-1-compute-a-bb0a43ef-9ddpp","kubernetes":{"annotations":{"k8s.ovn.org/pod-networks":"{\"default\":{\"ip_addresses\":[\"10.58.9.13/23\"],\"mac_address\":\"0a:58:0a:3a:09:0d\",\"gateway_ips\":[\"10.58.8.1\"],\"routes\":[{\"dest\":\"10.58.0.0/15\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"172.29.0.0/16\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"169.254.169.5/32\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"100.65.0.0/16\",\"nextHop\":\"10.58.8.1\"}],\"ip_address\":\"10.58.9.13/23\",\"gateway_ip\":\"10.58.8.1\",\"role\":\"primary\"}}","k8s.v1.cni.cncf.io/network-status":"[{\n \"name\": \"ovn-kubernetes\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.58.9.13\"\n ],\n \"mac\": \"0a:58:0a:3a:09:0d\",\n \"default\": true,\n \"dns\": {}\n}]","openshift.io/scc":"privileged"},"container_id":"cri-o://d9ec5b451914e7b284ca3892838178158452916a3aa36fd29698b80c3015f4c0","container_image":"infra-release-docker-local.registry.paychex.com/moneng/ocp-otelcollector:latest","container_image_id":"infra-release-docker-local.registry.paychex.com/moneng/ocp-otelcollector@sha256:41dbd942df344ead0dd1c565e57db2e9aae30e85eea6f7dbab4ca77e2a3b475e","container_iostream":"stderr","container_name":"otelcollector","labels":{"controller-revision-hash":"7b5b9b86cf","name":"otelcollector","pod-template-generation":"1"},"namespace_id":"cad83375-1281-4d10-a367-9e3d1a48905d","namespace_labels":{"kubernetes_io_metadata_name":"splunk-fwd","pod-security_kubernetes_io_audit":"privileged","pod-security_kubernetes_io_audit-version":"latest","pod-security_kubernetes_io_warn":"privileged","pod-security_kubernetes_io_warn-version":"latest"},"namespace_name":"splunk-fwd","pod_id":"09dc1f48-714f-4ab9-af10-6fd56fe33935","pod_ip":"10.58.9.13","pod_name":"otelcollector-gx8hw","pod_owner":"DaemonSet/otelcollector"},"level":"default","log_source":"container","log_type":"application","message":"2026-04-10T22:27:59.912-0400\terror\tinternal/base_exporter.go:115\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{\"resource\": {\"service.instance.id\": \"adef8a20-bbd9-49c7-900a-639e71696195\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"otelapn2ah2.paychex.com:4317\", \"error\": \"request will be cancelled before next retry: rpc error: code = DeadlineExceeded desc = context deadline exceeded\", \"rejected_items\": 3}","openshift":{"cluster_id":"52301906-c63f-4d57-ac4a-549d4a68614a","labels":{"cluster":"ocp-wdc-n2a-int-1","cluster_number":"1","cluster_type":"int","datacenter":"webster","environment":"n2a"},"sequence":1775874479921729300},"timestamp":"2026-04-11T02:27:59.912572738Z"}

Log 2 — base_exporter.go:115 — Rejecting data (Tier 2 retry exhausted, no queue)

{
  "hostname": "ocp-wdc-n2a-int-1-compute-a-<node-id>",
  "kubernetes": {
    "annotations": {
      "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"<pod-cidr-ip>/23\"],\"mac_address\":\"<mac>\",\"gateway_ips\":[\"<gateway-ip>\"],\"routes\":[...],\"role\":\"primary\"}}",
      "k8s.v1.cni.cncf.io/network-status": "[{\"name\":\"ovn-kubernetes\",\"interface\":\"eth0\",\"ips\":[\"<pod-ip>\"],\"mac\":\"<mac>\",\"default\":true,\"dns\":{}}]",
      "openshift.io/scc": "privileged"
    },
    "container_id": "cri-o://<container-id>",
    "container_image": "<internal-registry>/moneng/ocp-otelcollector:latest",
    "container_image_id": "<internal-registry>/moneng/ocp-otelcollector@sha256:<digest>",
    "container_iostream": "stderr",
    "container_name": "otelcollector",
    "labels": {
      "controller-revision-hash": "<hash>",
      "name": "otelcollector",
      "pod-template-generation": "1"
    },
    "namespace_id": "<namespace-uuid>",
    "namespace_labels": {
      "kubernetes_io_metadata_name": "splunk-fwd",
      "pod-security_kubernetes_io_audit": "privileged",
      "pod-security_kubernetes_io_audit-version": "latest",
      "pod-security_kubernetes_io_warn": "privileged",
      "pod-security_kubernetes_io_warn-version": "latest"
    },
    "namespace_name": "splunk-fwd",
    "pod_id": "<pod-uuid>",
    "pod_ip": "<pod-ip>",
    "pod_name": "otelcollector-<suffix>",
    "pod_owner": "DaemonSet/otelcollector"
  },
  "level": "default",
  "log_source": "container",
  "log_type": "application",
  "message": "2026-04-10T22:27:59.912-0400\terror\tinternal/base_exporter.go:115\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{\"resource\": {\"service.instance.id\": \"<uuid>\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"otelapn2ah2.paychex.com:4317\", \"error\": \"request will be cancelled before next retry: rpc error: code = DeadlineExceeded desc = context deadline exceeded\", \"rejected_items\": 3}",
  "openshift": {
    "cluster_id": "<cluster-uuid>",
    "labels": {
      "cluster": "ocp-wdc-n2a-int-1",
      "cluster_number": "1",
      "cluster_type": "int",
      "datacenter": "webster",
      "environment": "n2a"
    },
    "sequence": "<sequence>"
  },
  "timestamp": "2026-04-11T02:27:59.912572738Z"
}

Log 3 — base_exporter.go:114 — Rejecting data (Configuration A, Tier 2 sending queue full)

{
  "hostname": "ocp-wdc-n2a-int-1-compute-a-<node-id>",
  "kubernetes": {
    "annotations": {
      "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"<pod-cidr-ip>/23\"],\"mac_address\":\"<mac>\",\"gateway_ips\":[\"<gateway-ip>\"],\"routes\":[...],\"role\":\"primary\"}}",
      "k8s.v1.cni.cncf.io/network-status": "[{\"name\":\"ovn-kubernetes\",\"interface\":\"eth0\",\"ips\":[\"<pod-ip>\"],\"mac\":\"<mac>\",\"default\":true,\"dns\":{}}]",
      "openshift.io/scc": "privileged"
    },
    "container_id": "cri-o://<container-id>",
    "container_image": "<internal-registry>/moneng/ocp-otelcollector:latest",
    "container_image_id": "<internal-registry>/moneng/ocp-otelcollector@sha256:<digest>",
    "container_iostream": "stderr",
    "container_name": "otelcollector",
    "labels": {
      "controller-revision-hash": "<hash>",
      "name": "otelcollector",
      "pod-template-generation": "1"
    },
    "namespace_id": "<namespace-uuid>",
    "namespace_labels": {
      "kubernetes_io_metadata_name": "splunk-fwd",
      "pod-security_kubernetes_io_audit": "privileged",
      "pod-security_kubernetes_io_audit-version": "latest",
      "pod-security_kubernetes_io_warn": "privileged",
      "pod-security_kubernetes_io_warn-version": "latest"
    },
    "namespace_name": "splunk-fwd",
    "pod_id": "<pod-uuid>",
    "pod_ip": "<pod-ip>",
    "pod_name": "otelcollector-<suffix>",
    "pod_owner": "DaemonSet/otelcollector"
  },
  "level": "default",
  "log_source": "container",
  "log_type": "application",
  "message": "2026-04-10T23:07:59.976-0400\terror\tinternal/base_exporter.go:114\tExporting failed. Rejecting data.\t{\"resource\": {\"service.instance.id\": \"<uuid>\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"<gateway-host>:4317\", \"error\": \"sending queue is full\", \"rejected_items\": 4}",
  "openshift": {
    "cluster_id": "<cluster-uuid>",
    "labels": {
      "cluster": "ocp-wdc-n2a-int-1",
      "cluster_number": "1",
      "cluster_type": "int",
      "datacenter": "webster",
      "environment": "n2a"
    },
    "sequence": "<sequence>"
  },
  "timestamp": "2026-04-11T03:07:59.977205702Z"
}

Additional context
No response