Component(s)
exporter/loadbalancing
What happened?
Title: loadbalancingexporter: failed endpoint retries re-route already-delivered records, causing duplicates on healthy endpoints
Component: exporter/loadbalancing
Version: v0.145.0
Collector: otelcol-contrib
Description
When one or more static resolver endpoints become unavailable, the loadbalancing exporter delivers duplicate records to the remaining healthy endpoints on every retry cycle. The duplicate rate scales with the number of failed endpoints and the Tier 1 retry window: with one endpoint down we observe approximately 3× the normal record volume on the surviving endpoints; with two endpoints down the rate grows non-linearly and approaches 100% duplication. This behavior was observed under two distinct sub-exporter configurations.
Steps to Reproduce
Configure loadbalancingexporter with a static resolver pointing to 4 endpoints and retry_on_failure enabled at the Tier 1 level.
Begin sending logs through the pipeline.
Stop one of the 4 gateway endpoints (process kill, not graceful shutdown).
Observe delivery counts on the remaining 3 healthy endpoints.
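For reference, the setup in the steps above can be sketched as a minimal collector config. This is an illustrative sketch, not our production excerpt: hostnames are placeholders, and the split of retry/queue settings reflects the tiers described in this report.

```yaml
exporters:
  loadbalancing/n2a:
    routing_key: traceID          # logs lacking a traceID fall back to a random per-record key
    protocol:
      otlp:
        tls:
          insecure: true
        sending_queue:
          enabled: false          # Configuration B; true (the default) is Configuration A
        retry_on_failure:
          enabled: false          # rely on the Tier 1 (loadbalancing) retry instead
    resolver:
      static:
        hostnames:
          - gateway-1.example.com:4317
          - gateway-2.example.com:4317
          - gateway-3.example.com:4317
          - gateway-4.example.com:4317
    retry_on_failure:
      enabled: true               # Tier 1 retry; its window drives the duplicate volume
```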
Expected Behavior
Records that were successfully acknowledged by healthy endpoints before the retry are not retransmitted. Only records that failed delivery are retried.
Observed Behavior
Configuration A — protocol.otlp.sending_queue.enabled: true (default):
The per-endpoint async queue accepts batches destined for a downed endpoint and returns no error to the loadbalancing layer. The loadbalancing exporter does not retry. Records accumulate silently in the sub-exporter queue; duplicate delivery does not begin until the queue reaches capacity and the first error surfaces to the loadbalancing layer (~40 minutes at our traffic volume with queue_size: 1000). Once that threshold is crossed, the duplicate behavior described below begins.
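A quick back-of-the-envelope consistent with the ~40-minute figure (this assumes a roughly constant enqueue rate; the derived rate is an inference from our numbers, not a measured value):

```python
queue_size = 1000            # sending_queue.queue_size on the sub-exporter
minutes_to_fill = 40         # observed delay before "sending queue is full" surfaces
requests_per_minute = queue_size / minutes_to_fill
print(requests_per_minute)   # 25.0 requests/min routed to the downed endpoint
```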
Configuration B — protocol.otlp.sending_queue.enabled: false:
With the async queue removed, errors from the failed endpoint surface immediately to the loadbalancing layer. On every retry, records that were already successfully delivered to healthy endpoints are retransmitted to those endpoints. The healthy endpoints show approximately 3× normal ingest rate with one of four endpoints down, beginning immediately after the first failure rather than after a delay. Splunk confirms the records are genuine duplicates (same _raw, same timestamp).
In both configurations, the duplicate records are real — they are not an artifact of indexing or search. Configuration B makes the problem visible immediately; Configuration A conceals it until the sub-exporter queue is exhausted.
Configuration (relevant excerpt — Configuration B)
Additional Notes
Logs without a traceID attribute receive a random() routing key per record. All of our log traffic falls into this category.
The behavior is reproducible and consistent across test runs.
Downstream sink (Splunk HEC) confirms duplicates are real; they are not an artifact of indexing or search.
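The mechanism we believe we are seeing can be illustrated with a toy simulation. This is NOT the exporter's actual code, just a sketch of the failure mode: each record gets a fresh random routing key per attempt (the no-traceID case), a downed endpoint fails its shard, and the whole batch — including shards already delivered — is retried.

```python
import random

def simulate(num_records=1000, endpoints=4, down=(0,), max_attempts=3, seed=7):
    rng = random.Random(seed)
    delivered = [0] * endpoints            # deliveries seen by each endpoint
    copies = [0] * num_records             # successful deliveries per record
    for _ in range(max_attempts):
        shard_failed = False
        for r in range(num_records):
            ep = rng.randrange(endpoints)  # random() routing key per record
            if ep in down:
                shard_failed = True        # downed endpoint rejects its shard
            else:
                delivered[ep] += 1         # healthy endpoint ingests (again)
                copies[r] += 1
        if not shard_failed:
            break                          # clean attempt: no batch-level retry
    return delivered, copies

delivered, copies = simulate()
# With 1 of 4 endpoints down and 3 attempts, each healthy endpoint sees
# roughly 3x its fair share (~750 deliveries instead of ~250 unique records).
```

With three attempts this reproduces the ~3× ingest rate we observe on healthy endpoints; the duplicate factor tracks the retry count, which is why a longer Tier 1 retry window makes it worse.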
Collector version
0.145.0
Environment information
Environment
OS: RHEL 9
Compiler (if manually compiled): --
OpenTelemetry Collector configuration
Log output
Log 1 — queue_sender.go:50 — Dropping data (Tier 1 queue exhausted)

{"hostname":"ocp-wdc-n2a-int-1-compute-a-bb0a43ef-9ddpp","kubernetes":{"annotations":{"k8s.ovn.org/pod-networks":"{\"default\":{\"ip_addresses\":[\"10.58.9.13/23\"],\"mac_address\":\"0a:58:0a:3a:09:0d\",\"gateway_ips\":[\"10.58.8.1\"],\"routes\":[{\"dest\":\"10.58.0.0/15\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"172.29.0.0/16\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"169.254.169.5/32\",\"nextHop\":\"10.58.8.1\"},{\"dest\":\"100.65.0.0/16\",\"nextHop\":\"10.58.8.1\"}],\"ip_address\":\"10.58.9.13/23\",\"gateway_ip\":\"10.58.8.1\",\"role\":\"primary\"}}","k8s.v1.cni.cncf.io/network-status":"[{\n \"name\": \"ovn-kubernetes\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.58.9.13\"\n ],\n \"mac\": \"0a:58:0a:3a:09:0d\",\n \"default\": true,\n \"dns\": {}\n}]","openshift.io/scc":"privileged"},"container_id":"cri-o://d9ec5b451914e7b284ca3892838178158452916a3aa36fd29698b80c3015f4c0","container_image":"infra-release-docker-local.registry.paychex.com/moneng/ocp-otelcollector:latest","container_image_id":"infra-release-docker-local.registry.paychex.com/moneng/ocp-otelcollector@sha256:41dbd942df344ead0dd1c565e57db2e9aae30e85eea6f7dbab4ca77e2a3b475e","container_iostream":"stderr","container_name":"otelcollector","labels":{"controller-revision-hash":"7b5b9b86cf","name":"otelcollector","pod-template-generation":"1"},"namespace_id":"cad83375-1281-4d10-a367-9e3d1a48905d","namespace_labels":{"kubernetes_io_metadata_name":"splunk-fwd","pod-security_kubernetes_io_audit":"privileged","pod-security_kubernetes_io_audit-version":"latest","pod-security_kubernetes_io_warn":"privileged","pod-security_kubernetes_io_warn-version":"latest"},"namespace_name":"splunk-fwd","pod_id":"09dc1f48-714f-4ab9-af10-6fd56fe33935","pod_ip":"10.58.9.13","pod_name":"otelcollector-gx8hw","pod_owner":"DaemonSet/otelcollector"},"level":"default","log_source":"container","log_type":"application","message":"2026-04-10T22:27:59.912-0400\terror\tinternal/base_exporter.go:115\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{\"resource\": {\"service.instance.id\": \"adef8a20-bbd9-49c7-900a-639e71696195\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"otelapn2ah2.paychex.com:4317\", \"error\": \"request will be cancelled before next retry: rpc error: code = DeadlineExceeded desc = context deadline exceeded\", \"rejected_items\": 3}","openshift":{"cluster_id":"52301906-c63f-4d57-ac4a-549d4a68614a","labels":{"cluster":"ocp-wdc-n2a-int-1","cluster_number":"1","cluster_type":"int","datacenter":"webster","environment":"n2a"},"sequence":1775874479921729300},"timestamp":"2026-04-11T02:27:59.912572738Z"}

Log 2 — base_exporter.go:115 — Rejecting data (Tier 2 retry exhausted, no queue)

{
  "hostname": "ocp-wdc-n2a-int-1-compute-a-<node-id>",
  "kubernetes": {
    "annotations": {
      "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"<pod-cidr-ip>/23\"],\"mac_address\":\"<mac>\",\"gateway_ips\":[\"<gateway-ip>\"],\"routes\":[...],\"role\":\"primary\"}}",
      "k8s.v1.cni.cncf.io/network-status": "[{\"name\":\"ovn-kubernetes\",\"interface\":\"eth0\",\"ips\":[\"<pod-ip>\"],\"mac\":\"<mac>\",\"default\":true,\"dns\":{}}]",
      "openshift.io/scc": "privileged"
    },
    "container_id": "cri-o://<container-id>",
    "container_image": "<internal-registry>/moneng/ocp-otelcollector:latest",
    "container_image_id": "<internal-registry>/moneng/ocp-otelcollector@sha256:<digest>",
    "container_iostream": "stderr",
    "container_name": "otelcollector",
    "labels": {
      "controller-revision-hash": "<hash>",
      "name": "otelcollector",
      "pod-template-generation": "1"
    },
    "namespace_id": "<namespace-uuid>",
    "namespace_labels": {
      "kubernetes_io_metadata_name": "splunk-fwd",
      "pod-security_kubernetes_io_audit": "privileged",
      "pod-security_kubernetes_io_audit-version": "latest",
      "pod-security_kubernetes_io_warn": "privileged",
      "pod-security_kubernetes_io_warn-version": "latest"
    },
    "namespace_name": "splunk-fwd",
    "pod_id": "<pod-uuid>",
    "pod_ip": "<pod-ip>",
    "pod_name": "otelcollector-<suffix>",
    "pod_owner": "DaemonSet/otelcollector"
  },
  "level": "default",
  "log_source": "container",
  "log_type": "application",
  "message": "2026-04-10T22:27:59.912-0400\terror\tinternal/base_exporter.go:115\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{\"resource\": {\"service.instance.id\": \"<uuid>\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"otelapn2ah2.paychex.com:4317\", \"error\": \"request will be cancelled before next retry: rpc error: code = DeadlineExceeded desc = context deadline exceeded\", \"rejected_items\": 3}",
  "openshift": {
    "cluster_id": "<cluster-uuid>",
    "labels": {
      "cluster": "ocp-wdc-n2a-int-1",
      "cluster_number": "1",
      "cluster_type": "int",
      "datacenter": "webster",
      "environment": "n2a"
    },
    "sequence": "<sequence>"
  },
  "timestamp": "2026-04-11T02:27:59.912572738Z"
}

Log 3 — base_exporter.go:114 — Rejecting data (Configuration A, Tier 2 sending queue full)

{
  "hostname": "ocp-wdc-n2a-int-1-compute-a-<node-id>",
  "kubernetes": {
    "annotations": {
      "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"<pod-cidr-ip>/23\"],\"mac_address\":\"<mac>\",\"gateway_ips\":[\"<gateway-ip>\"],\"routes\":[...],\"role\":\"primary\"}}",
      "k8s.v1.cni.cncf.io/network-status": "[{\"name\":\"ovn-kubernetes\",\"interface\":\"eth0\",\"ips\":[\"<pod-ip>\"],\"mac\":\"<mac>\",\"default\":true,\"dns\":{}}]",
      "openshift.io/scc": "privileged"
    },
    "container_id": "cri-o://<container-id>",
    "container_image": "<internal-registry>/moneng/ocp-otelcollector:latest",
    "container_image_id": "<internal-registry>/moneng/ocp-otelcollector@sha256:<digest>",
    "container_iostream": "stderr",
    "container_name": "otelcollector",
    "labels": {
      "controller-revision-hash": "<hash>",
      "name": "otelcollector",
      "pod-template-generation": "1"
    },
    "namespace_id": "<namespace-uuid>",
    "namespace_labels": {
      "kubernetes_io_metadata_name": "splunk-fwd",
      "pod-security_kubernetes_io_audit": "privileged",
      "pod-security_kubernetes_io_audit-version": "latest",
      "pod-security_kubernetes_io_warn": "privileged",
      "pod-security_kubernetes_io_warn-version": "latest"
    },
    "namespace_name": "splunk-fwd",
    "pod_id": "<pod-uuid>",
    "pod_ip": "<pod-ip>",
    "pod_name": "otelcollector-<suffix>",
    "pod_owner": "DaemonSet/otelcollector"
  },
  "level": "default",
  "log_source": "container",
  "log_type": "application",
  "message": "2026-04-10T23:07:59.976-0400\terror\tinternal/base_exporter.go:114\tExporting failed. Rejecting data.\t{\"resource\": {\"service.instance.id\": \"<uuid>\", \"service.name\": \"otelcol-contrib\", \"service.version\": \"0.145.0\"}, \"otelcol.component.id\": \"loadbalancing/n2a\", \"otelcol.component.kind\": \"exporter\", \"otelcol.signal\": \"logs\", \"endpoint\": \"<gateway-host>:4317\", \"error\": \"sending queue is full\", \"rejected_items\": 4}",
  "openshift": {
    "cluster_id": "<cluster-uuid>",
    "labels": {
      "cluster": "ocp-wdc-n2a-int-1",
      "cluster_number": "1",
      "cluster_type": "int",
      "datacenter": "webster",
      "environment": "n2a"
    },
    "sequence": "<sequence>"
  },
  "timestamp": "2026-04-11T03:07:59.977205702Z"
}

Additional context
No response