feat: add ROUGE score descriptor and dataset metric (#1318) #1863
Open
mukund1985 wants to merge 1 commit into evidentlyai:main from
Conversation
Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as requested in issue evidentlyai#1318.

What is added:

- `RougeScore` descriptor (`src/evidently/descriptors/_rouge_score.py`)
  - Row-level ROUGE computation between prediction and reference columns
  - Supports `rouge1`, `rouge2`, `rougeL`, `rougeLsum` variants
  - Supports `f` (F1), `precision`, `recall` score types
  - NaN-safe: treats missing values as empty strings
  - Follows the same pattern as `TextLength`
- `RougeScoreFeature` (`src/evidently/legacy/features/rouge_score_feature.py`)
  - Legacy `GeneratedFeature` layer following the `SemanticSimilarityFeature` pattern
  - Lazy import of `rouge_score` inside `generate_feature()`
  - Registered in the features `_registry.py`
- `RougeScoreMetric` (`src/evidently/metrics/text_evals.py`)
  - Dataset-level metric returning the mean ROUGE score across all rows
  - Custom `SingleValueMetric` + `SingleValueCalculation` (not `ColumnSummaryMetric`)
  - Renders a counter widget (current/reference means) + histogram distribution
  - Supports current-only and current+reference comparisons
  - Default test: `eq(Reference(relative=0.1))` when reference data is present
  - Registered in `core/registries/metrics.py`

Dependency:

- Added `rouge-score>=0.1.2` to the `[llm]` optional extra in `pyproject.toml`
- Added `rouge_score.*` to the mypy `ignore_missing_imports` overrides
- Lazy import (no module-level import): errors only if the feature is used without installing `evidently[llm]`

Tests: 31 tests, all passing

- `tests/features/test_rouge_score_feature.py` (13 tests)
- `tests/descriptors/test_rouge_score_descriptor.py` (11 tests)
- `tests/metrics/test_rouge_score_metric.py` (7 tests)
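The lazy-import behaviour described above (importing `rouge_score` only inside `generate_feature()`) can be sketched generically. The `lazy_import` helper and its error message below are illustrative assumptions, not code from this PR:

```python
def lazy_import(module_name: str, extra: str):
    """Import a module only when first needed.

    Raises ImportError with an install hint if the optional dependency
    is missing, instead of failing when the package itself is imported.
    """
    import importlib

    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with `pip install evidently[{extra}]`."
        ) from e
```

Calling something like `lazy_import("rouge_score", "llm")` inside the feature's compute path means that `import evidently` never pulls in the optional dependency, and the helpful error only surfaces when the ROUGE feature is actually used.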
Force-pushed 7526e15 to 468976d
Author: This is my first feature PR here so apologies if I've missed anything obvious. I tried to follow the same patterns as SemanticSimilarity and TextLength when putting it together. 31 tests pass locally. Also happy to split it into smaller PRs if that makes reviewing easier — just say the word.
Author: @Liraim — would appreciate a review when you get a chance. Tests pass locally, happy to make any changes needed.
Author: @DimaAmega — looks like CI hasn't triggered yet, could you approve the workflow run when you get a chance?
Summary
Closes #1318
Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score as a row-level Descriptor and a dataset-level Metric, as requested in the issue.
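To illustrate how a dataset-level aggregate relates to the row-level score, here is a minimal sketch of averaging a per-row score over a dataset while treating missing values as empty strings, matching the NaN-safe behaviour the PR describes. The function name and signature are hypothetical, not the PR's actual code:

```python
import math
from typing import Callable, List, Optional

def mean_rowwise_score(
    predictions: List[Optional[str]],
    references: List[Optional[str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Average a per-row score across a dataset.

    None/NaN values are replaced with empty strings before scoring,
    so missing data lowers the score instead of raising an error.
    """
    def clean(text):
        if text is None or (isinstance(text, float) and math.isnan(text)):
            return ""
        return text

    scores = [score_fn(clean(p), clean(r)) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Any row-level scorer (such as a ROUGE computation) can be plugged in as `score_fn`; the dataset-level metric is then simply the mean of the row-level descriptor values.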
Changes
`RougeScore` descriptor — row-level

File: `src/evidently/descriptors/_rouge_score.py`

Computes the ROUGE score between a prediction column and a reference column for each row. Returns a numeric value in [0, 1].
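The actual computation delegates to the `rouge-score` package; purely for illustration, ROUGE-1 (unigram overlap) with its precision, recall, and F1 variants can be sketched in plain Python as:

```python
from collections import Counter
from typing import NamedTuple

class Rouge1Score(NamedTuple):
    precision: float
    recall: float
    f: float

def rouge1(prediction: str, reference: str) -> Rouge1Score:
    """Toy ROUGE-1: unigram overlap between prediction and reference.

    Real implementations (e.g. the rouge-score package) additionally
    handle tokenisation, stemming, and the other ROUGE variants.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return Rouge1Score(0.0, 0.0, 0.0)
    # Multiset intersection: each shared token counts at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return Rouge1Score(precision, recall, f)
```

For example, `rouge1("the cat is on the mat", "the cat sat on the mat")` shares 5 of 6 unigrams in each direction, giving precision = recall = F1 = 5/6 ≈ 0.83, which is always in [0, 1].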
`RougeScoreMetric` — dataset-level aggregate

File: `src/evidently/metrics/text_evals.py`

A dedicated `SingleValueMetric` (not `ColumnSummaryMetric`) that computes the mean ROUGE score across all rows. Renders a counter widget + histogram distribution. Supports current-only and current+reference comparisons.

`RougeScoreFeature` — legacy feature layer

File: `src/evidently/legacy/features/rouge_score_feature.py`

Follows the same pattern as `SemanticSimilarityFeature`. Registered in `_registry.py`.

Parameters
Both `RougeScore` and `RougeScoreMetric` expose:

| Parameter | Options | Default |
| --- | --- | --- |
| `rouge_type` | `rouge1`, `rouge2`, `rougeL`, `rougeLsum` | `rouge1` |
| `score_type` | `f` (F1), `precision`, `recall` | `f` |

Dependency
Added `rouge-score>=0.1.2` to the `[llm]` optional extra — same group as `sentence-transformers` and `transformers`. Import is lazy (inside function bodies), so it only fails if the feature is used without `pip install evidently[llm]`.

Added `rouge_score.*` to the `[[tool.mypy.overrides]]` ignore list.

Tests
31 tests across 3 files, all passing:

- `tests/features/test_rouge_score_feature.py` — 13 tests (parametrised variants, NaN handling, invalid params, feature name)
- `tests/descriptors/test_rouge_score_descriptor.py` — 11 tests (all ROUGE types, precision/recall, alias, column listing, validation)
- `tests/metrics/test_rouge_score_metric.py` — 7 tests (score range, with/without reference, identical texts, all variants)