fix: replace hardcoded fill value with dynamic min_nonzero/10 in get_binned_data#1862
Open
mukund1985 wants to merge 1 commit intoevidentlyai:mainfrom
Open
fix: replace hardcoded fill value with dynamic min_nonzero/10 in get_binned_data#1862mukund1985 wants to merge 1 commit intoevidentlyai:mainfrom
mukund1985 wants to merge 1 commit intoevidentlyai:mainfrom
Conversation
…binned_data Fixes evidentlyai#334 The zero-fill logic in `get_binned_data` used a hardcoded threshold and fallback value of 0.0001. When all non-zero percentages were smaller than 0.0001 (e.g. for large datasets or rare categories), the fill value could be *larger* than legitimate data values. This caused KL-divergence and other stattest calculations to produce incorrect results because the fill was supposed to be a negligible epsilon, not a dominant value. Fix: always use `min(nonzero_values) / 10` as the fill value. This is guaranteed to be strictly smaller than any real data value, regardless of scale. Empty-array guards prevent errors when one side has no non-zero entries. Applied to both pandas (`calculations/stattests/utils.py`) and Spark (`spark/calculations/stattests/utils.py`) implementations.
9b9872c to
99a8b3c
Compare
Author
|
Hey, just flagging — ran the existing test suite locally and everything passes. The fix is pretty minimal, just handling that edge case. Happy to change the approach if there's a better way to do it, just let me know. |
Author
|
@Liraim — would appreciate a review when you get a chance. Tests pass locally, happy to make any changes needed. |
Author
|
@DimaAmega — looks like CI hasn't triggered yet, could you approve the workflow run when you get a chance? |
|
📚 Artifacts deployed to GitHub Pages: https://evidentlyai.github.io/evidently/ci/#pr-1862-fix-fill-zeroes-dynamic-value |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #334
The zero-fill logic in
get_binned_dataused a hardcoded threshold and fallback of0.0001. When all non-zero percentages in a distribution are smaller than0.0001(common with large datasets or rare categories), the fill value becomes larger than legitimate data values. This makes KL-divergence and other stattest calculations incorrect — the fill is supposed to be a negligible epsilon, not a dominant value.Root cause:
Fix:
min_nonzero / 10is guaranteed strictly smaller than any real data value at any scale. The empty-array guard prevents errors when one side has no non-zero entries.Changes
src/evidently/legacy/calculations/stattests/utils.py— pandas implementationsrc/evidently/legacy/spark/calculations/stattests/utils.py— Spark implementationBoth use identical logic with their respective parameter names (
feel_zeroes/fill_zeroes).Test plan
pytest tests/multitest/metrics/test_data_drift.py tests/stattests/ -v)min_nonzero / 10and old hardcoded0.0001is gone