[receiver/prometheusremotewrite] Handle NaN and overflow buckets for native histograms #47745
dashpole wants to merge 1 commit into open-telemetry:main
Conversation
Force-pushed from 96b5389 to 5b094ab
Pull request overview
This PR updates the Prometheus Remote Write (PRW) receiver’s native exponential histogram translation to better align with Prometheus compatibility requirements around stale NaN sums and overflow buckets.
Changes:
- Set `NoRecordedValue` on exponential histogram datapoints when the incoming sum is Prometheus "stale NaN".
- Detect and drop overflow buckets (buckets that would exceed the IEEE float range) and subtract their counts from the datapoint's total count.
- Add unit tests covering stale-NaN sums and overflow-bucket dropping behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| receiver/prometheusremotewritereceiver/receiver.go | Implements stale-NaN flagging for exponential histograms and drops overflow buckets while adjusting the overall count. |
| receiver/prometheusremotewritereceiver/receiver_test.go | Adds unit tests for stale-NaN sum handling and for dropping overflow buckets with count adjustment. |
| .chloggen/prwreceiver-nan-nativeoverflow.yaml | Adds a changelog entry for the bug fix. |
```go
if droppedCount > 0 && !value.IsStaleNaN(histogram.Sum) {
	dp.SetCount(dp.Count() - droppedCount)
}
```
dp.SetCount(dp.Count() - droppedCount) can underflow if the incoming histogram has an inconsistent Count (e.g., dropped overflow buckets sum to more than the reported total count). In Go this will wrap to a very large uint64. Consider guarding with an explicit comparison (e.g., clamp to 0 and/or log/drop the datapoint) before subtracting.
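To illustrate the underflow concern, a small self-contained sketch (`clampedSub` is a hypothetical helper, not code from this PR):

```go
package main

import "fmt"

// clampedSub subtracts dropped from count, clamping at zero instead of
// letting the unsigned subtraction wrap around on underflow.
func clampedSub(count, dropped uint64) uint64 {
	if dropped > count {
		return 0
	}
	return count - dropped
}

func main() {
	var count, dropped uint64 = 5, 7
	fmt.Println(count - dropped)            // wraps: 18446744073709551614
	fmt.Println(clampedSub(count, dropped)) // 0
	fmt.Println(clampedSub(10, 7))          // 3
}
```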
Suggested change:

```go
count := dp.Count()
if droppedCount > count {
	prw.settings.Logger.Info("Clamping Native Histogram count to zero due to inconsistent dropped overflow bucket count",
		zapcore.Field{Key: "timeseries", Type: zapcore.StringType, String: ls.Get("__name__")})
	dp.SetCount(0)
} else {
	dp.SetCount(count - droppedCount)
}
```
```go
// The next bucket (i_max + 1) is the +Inf overflow bucket, which is also allowed.
// Buckets beyond that must be dropped.
// See https://prometheus.io/docs/specs/native_histograms/#schema for more information.
overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1
```
overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1 is doing a float Pow for an integer power-of-two. Since Schema is constrained to [-4..8] for exponential histograms, this can be computed exactly with integer/bit operations (or math.Ldexp) and stored as an integer limit, avoiding float math and repeated float64(k) conversions in the bucket loops.
Suggested change:

```diff
-overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1
+overflowLimit := math.Ldexp(1024, int(histogram.Schema)) + 1
```
```diff
 for i := int32(0); i < span.Offset; i++ {
-	buckets.Append(uint64(0))
+	if float64(k) <= overflowLimit {
+		buckets.Append(uint64(0))
+	}
 	k++
 }
```

```diff
 for i := uint32(0); i < span.Length; i++ {
 	bucketCount += deltas[bucketIdx]
 	bucketIdx++
-	buckets.Append(uint64(bucketCount))
+	if float64(k) <= overflowLimit {
+		buckets.Append(uint64(bucketCount))
+	} else {
+		droppedCount += uint64(bucketCount)
+	}
```
Both bucket conversion loops compare float64(k) against overflowLimit on every iteration. Since bucket indices and the overflow limit are integral in this schema range, consider switching k/overflowLimit to an integer type and comparing integers to reduce per-bucket overhead and remove any float edge cases.
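A sketch of the integer-only variant this comment proposes (`integerOverflowLimit` is a hypothetical helper name; the receiver computes the limit inline):

```go
package main

import "fmt"

// integerOverflowLimit computes 1024*2^schema + 1 with pure integer
// shifts, valid for the exponential-histogram schema range [-4, 8].
func integerOverflowLimit(schema int32) int64 {
	if schema >= 0 {
		return int64(1024)<<schema + 1
	}
	return int64(1024)>>(-schema) + 1
}

func main() {
	fmt.Println(integerOverflowLimit(0))  // 1025
	fmt.Println(integerOverflowLimit(3))  // 8193
	fmt.Println(integerOverflowLimit(-4)) // 65
	// Bucket index k can then be kept as an integer and compared directly,
	// with no per-iteration float64 conversion:
	k := int64(1025)
	fmt.Println(k <= integerOverflowLimit(0)) // true
}
```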
Description
This PR fixes several issues in native histogram handling in the PRW receiver.
Link to tracking issue
Fixes #47728
Testing
Added unit tests.
cc @krajorama to double-check my native histogram bucket logic.