fix: replace hardcoded fill value with dynamic min_nonzero/10 in get_binned_data #1862

Open
mukund1985 wants to merge 1 commit into evidentlyai:main from mukund1985:fix/fill-zeroes-dynamic-value

@mukund1985 mukund1985 commented Apr 21, 2026

Summary

Fixes #334

The zero-fill logic in get_binned_data used a hardcoded threshold and fallback of 0.0001. When all non-zero percentages in a distribution are smaller than 0.0001 (common with large datasets or rare categories), the fill value becomes larger than legitimate data values. This makes KL-divergence and other stattest calculations incorrect — the fill is supposed to be a negligible epsilon, not a dominant value.
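To illustrate why the size of the fill matters, here is a minimal sketch that is not taken from the evidently code base. It computes KL divergence for a toy pair of binned distributions in which one reference bin is empty. The helpers `kl_divergence` and `fill_zero_bins` and all sample values are hypothetical. The only point is that the result shifts with the fill, so the fill should stay well below every real bin value.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) for two equal-length lists of bin percentages."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fill_zero_bins(percents, fill):
    """Return a copy of `percents` with every zero bin replaced by `fill`."""
    return [fill if x == 0 else x for x in percents]


# Hypothetical binned percentages: the reference has one empty bin and its
# smallest real value is 1e-5.
current = [0.99998, 0.00001, 0.00001]
reference = [0.99999, 0.00001, 0.0]

min_nonzero = min(x for x in reference if x != 0)

# Compare a fill larger than the smallest real value with one that is
# ten times smaller than it.
kl_large_fill = kl_divergence(current, fill_zero_bins(reference, 1e-4))
kl_small_fill = kl_divergence(current, fill_zero_bins(reference, min_nonzero / 10))

print(f"fill=1e-4:           KL = {kl_large_fill:.3e}")
print(f"fill=min_nonzero/10: KL = {kl_small_fill:.3e}")
```

With the larger fill, the empty bin is treated as holding more mass than bins that actually contain data, which pulls the divergence down. The proportional fill keeps the empty bin below every observed bin.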

Root cause:

# BEFORE — fill can exceed real values when min_nonzero <= 0.0001
np.place(reference_percents, reference_percents == 0,
    min(reference_percents[reference_percents != 0]) / 10**6
    if min(reference_percents[reference_percents != 0]) <= 0.0001
    else 0.0001)

Fix:

# AFTER — always proportional to the smallest real value in that array
ref_nonzero = reference_percents[reference_percents != 0]
if len(ref_nonzero) > 0:
    np.place(reference_percents, reference_percents == 0, min(ref_nonzero) / 10)

Because percentages are non-negative, min_nonzero / 10 is strictly smaller than every non-zero value in the same array, whatever the scale. The empty-array guard prevents errors when one side has no non-zero entries.
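The fill rule can be tried in isolation with a small sketch like the one below. `fill_zero_bins` is a hypothetical standalone helper, not a function in evidently, and the sample array is made up. It passes the fill to `np.place` as a one-element list, which NumPy repeats for every masked position.

```python
import numpy as np


def fill_zero_bins(percents: np.ndarray) -> np.ndarray:
    """Replace zero bins in place with min_nonzero / 10 and return the array.

    Hypothetical helper that mirrors the patched logic for a single array.
    """
    nonzero = percents[percents != 0]
    # Guard: with no non-zero entries there is no scale to derive a fill
    # from, and taking the minimum of an empty array would raise ValueError.
    if len(nonzero) > 0:
        np.place(percents, percents == 0, [nonzero.min() / 10])
    return percents


# Made-up percentages where the smallest real value is below 0.0001.
reference_percents = np.array([0.99997, 0.00002, 0.00001, 0.0])
fill_zero_bins(reference_percents)
print(reference_percents)

# An all-zero array passes through unchanged instead of raising.
print(fill_zero_bins(np.zeros(3)))
```

Dividing by a fixed factor ties the fill to the array's own scale, so no separate threshold constant is needed.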

Changes

  • src/evidently/legacy/calculations/stattests/utils.py — pandas implementation
  • src/evidently/legacy/spark/calculations/stattests/utils.py — Spark implementation

Both use identical logic with their respective parameter names (feel_zeroes / fill_zeroes).

Test plan

  • All 156 existing stattest unit tests pass (pytest tests/multitest/metrics/test_data_drift.py tests/stattests/ -v)
  • Smoke test confirms fill value equals exactly min_nonzero / 10 and old hardcoded 0.0001 is gone
  • Verified fix handles edge case where all values in one array are zero, so there are no non-zero entries (empty guard)
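The smoke checks listed above could look roughly like the following pytest-style sketch. It runs against `fill_zeroes_step`, a hypothetical stand-in for the patched zero-fill step, rather than against `get_binned_data` itself, whose full signature is not shown in this PR.

```python
import numpy as np


def fill_zeroes_step(percents):
    """Hypothetical stand-in for the patched zero-fill step."""
    nonzero = percents[percents != 0]
    if len(nonzero) > 0:
        np.place(percents, percents == 0, [nonzero.min() / 10])
    return percents


def test_fill_is_min_nonzero_over_ten():
    # Every real value here is below the old 0.0001 constant.
    percents = fill_zeroes_step(np.array([0.0, 0.00002, 0.00005, 0.0]))
    assert percents[0] == 0.00002 / 10
    assert percents[3] == 0.00002 / 10
    assert 0.0001 not in percents


def test_all_zero_array_is_left_untouched():
    # No non-zero entries: the guard skips the fill instead of raising.
    assert not fill_zeroes_step(np.zeros(4)).any()


def test_array_without_zeroes_is_unchanged():
    original = np.array([0.25, 0.75])
    assert np.array_equal(fill_zeroes_step(original.copy()), original)
```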

…binned_data

Fixes evidentlyai#334

The zero-fill logic in `get_binned_data` used a hardcoded threshold and
fallback value of 0.0001. When all non-zero percentages were smaller than
0.0001 (e.g. for large datasets or rare categories), the fill value could
be *larger* than legitimate data values. This caused KL-divergence and
other stattest calculations to produce incorrect results because the fill
was supposed to be a negligible epsilon, not a dominant value.

Fix: always use `min(nonzero_values) / 10` as the fill value. This is
guaranteed to be strictly smaller than any real data value, regardless of
scale. Empty-array guards prevent errors when one side has no non-zero
entries.

Applied to both pandas (`calculations/stattests/utils.py`) and Spark
(`spark/calculations/stattests/utils.py`) implementations.
@mukund1985 mukund1985 force-pushed the fix/fill-zeroes-dynamic-value branch from 9b9872c to 99a8b3c Compare April 22, 2026 21:13
@mukund1985
Author

Hey, just flagging — ran the existing test suite locally and everything passes. The fix is pretty minimal, just handling that edge case. Happy to change the approach if there's a better way to do it, just let me know.

@mukund1985
Author

@Liraim — would appreciate a review when you get a chance. Tests pass locally, happy to make any changes needed.

@mukund1985
Author

@DimaAmega — looks like CI hasn't triggered yet, could you approve the workflow run when you get a chance?

github-actions Bot commented May 1, 2026

📚 Artifacts deployed to GitHub Pages: https://evidentlyai.github.io/evidently/ci/#pr-1862-fix-fill-zeroes-dynamic-value

Development

Successfully merging this pull request may close these issues.

The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case.
