Skip to content

[receiver/prometheusremotewrite] Handle NaN and overflow buckets for native histograms#47745

Open
dashpole wants to merge 1 commit intoopen-telemetry:mainfrom
dashpole:exphist_prwreceiver
Open

[receiver/prometheusremotewrite] Handle NaN and overflow buckets for native histograms#47745
dashpole wants to merge 1 commit intoopen-telemetry:mainfrom
dashpole:exphist_prwreceiver

Conversation

@dashpole
Copy link
Copy Markdown
Contributor

Description

This fixes some handling of native histograms in the PRW receiver:

  • NaN sums need to have the corresponding datapoint flag set
  • Overflow buckets need to be dropped, and removed from the overall count. These can only exist if the histogram includes observations that are greater than MaxFloat64.

Link to tracking issue

Fixes #47728

Testing

Added unit test

@krajorama to double-check my native hist bucket logic.

@dashpole dashpole force-pushed the exphist_prwreceiver branch from 96b5389 to 5b094ab Compare April 18, 2026 18:31
@dashpole dashpole marked this pull request as ready for review April 18, 2026 19:06
@dashpole dashpole requested review from a team and ArthurSens as code owners April 18, 2026 19:06
@dashpole dashpole requested a review from Copilot April 18, 2026 19:06
@github-actions github-actions bot requested a review from perebaj April 18, 2026 19:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Prometheus Remote Write (PRW) receiver’s native exponential histogram translation to better align with Prometheus compatibility requirements around stale NaN sums and overflow buckets.

Changes:

  • Set NoRecordedValue on exponential histogram datapoints when the incoming sum is Prometheus “stale NaN”.
  • Detect and drop overflow buckets (buckets that would exceed IEEE float range) and subtract their counts from the datapoint total count.
  • Add unit tests covering stale-NaN sums and overflow-bucket dropping behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
receiver/prometheusremotewritereceiver/receiver.go Implements stale-NaN flagging for exponential histograms and drops overflow buckets while adjusting the overall count.
receiver/prometheusremotewritereceiver/receiver_test.go Adds unit tests for stale-NaN sum handling and for dropping overflow buckets + count adjustment.
.chloggen/prwreceiver-nan-nativeoverflow.yaml Adds a changelog entry for the bug fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

if droppedCount > 0 && !value.IsStaleNaN(histogram.Sum) {
dp.SetCount(dp.Count() - droppedCount)
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dp.SetCount(dp.Count() - droppedCount) can underflow if the incoming histogram has an inconsistent Count (e.g., dropped overflow buckets sum to more than the reported total count). In Go this will wrap to a very large uint64. Consider guarding with an explicit comparison (e.g., clamp to 0 and/or log/drop the datapoint) before subtracting.

Suggested change
dp.SetCount(dp.Count() - droppedCount)
count := dp.Count()
if droppedCount > count {
prw.settings.Logger.Info("Clamping Native Histogram count to zero due to inconsistent dropped overflow bucket count",
zapcore.Field{Key: "timeseries", Type: zapcore.StringType, String: ls.Get("__name__")})
dp.SetCount(0)
} else {
dp.SetCount(count - droppedCount)
}

Copilot uses AI. Check for mistakes.
// The next bucket (i_max + 1) is the +Inf overflow bucket, which is also allowed.
// Buckets beyond that must be dropped.
// See https://prometheus.io/docs/specs/native_histograms/#schema for more information.
overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1 is doing a float Pow for an integer power-of-two. Since Schema is constrained to [-4..8] for exponential histograms, this can be computed exactly with integer/bit operations (or math.Ldexp) and stored as an integer limit, avoiding float math and repeated float64(k) conversions in the bucket loops.

Suggested change
overflowLimit := 1024*math.Pow(2, float64(histogram.Schema)) + 1
overflowLimit := math.Ldexp(1024, int(histogram.Schema)) + 1

Copilot uses AI. Check for mistakes.
Comment on lines 775 to +789
for i := int32(0); i < span.Offset; i++ {
buckets.Append(uint64(0))
if float64(k) <= overflowLimit {
buckets.Append(uint64(0))
}
k++
}
}
for i := uint32(0); i < span.Length; i++ {
bucketCount += deltas[bucketIdx]
bucketIdx++
buckets.Append(uint64(bucketCount))

if float64(k) <= overflowLimit {
buckets.Append(uint64(bucketCount))
} else {
droppedCount += uint64(bucketCount)
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both bucket conversion loops compare float64(k) against overflowLimit on every iteration. Since bucket indices and the overflow limit are integral in this schema range, consider switching k/overflowLimit to an integer type and comparing integers to reduce per-bucket overhead and remove any float edge cases.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[receiver/prometheusremotewrite] Handle Stale NaNs and overflow buckets

3 participants