
feat: add ROUGE score descriptor and dataset metric (#1318)#1863

Open
mukund1985 wants to merge 1 commit into evidentlyai:main from mukund1985:feature/rouge-score-descriptor

Conversation

@mukund1985 commented Apr 21, 2026

Summary

Closes #1318

Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score as a row-level Descriptor and a dataset-level Metric, as requested in the issue.

Changes

RougeScore descriptor — row-level

File: src/evidently/descriptors/_rouge_score.py

Computes ROUGE score between a prediction column and a reference column for each row. Returns a numeric value in [0, 1].

from evidently import Dataset, DataDefinition
from evidently.descriptors import RougeScore

dataset = Dataset.from_pandas(df, data_definition=DataDefinition(), descriptors=[
    RougeScore("response", "ground_truth", rouge_type="rouge1", alias="ROUGE-1 F1"),
    RougeScore("response", "ground_truth", rouge_type="rouge2", score_type="recall"),
    RougeScore("response", "ground_truth", rouge_type="rougeL"),
])
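For intuition, ROUGE-1 is clipped unigram overlap between the two texts. A minimal pure-Python sketch of what the row-level score computes — illustrative only, the actual descriptor delegates to the rouge-score package:

```python
from collections import Counter

def rouge1(prediction, reference, score_type="f"):
    """Sketch of ROUGE-1: clipped unigram overlap between two texts.

    None/empty inputs fall back to empty strings, mirroring the
    descriptor's NaN-safe behaviour.
    """
    pred_tokens = (prediction or "").lower().split()
    ref_tokens = (reference or "").lower().split()
    # Multiset intersection clips repeated tokens to their matched count.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if score_type == "precision":
        return precision
    if score_type == "recall":
        return recall
    # F1: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

print(rouge1("the cat sat", "the cat sat"))  # identical texts -> 1.0
```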

RougeScoreMetric — dataset-level aggregate

File: src/evidently/metrics/text_evals.py

A dedicated SingleValueMetric (not ColumnSummaryMetric) that computes the mean ROUGE score across all rows. Renders a counter widget + histogram distribution. Supports current-only and current+reference comparisons.

from evidently import Report
from evidently.metrics import RougeScoreMetric

report = Report([
    RougeScoreMetric(
        prediction_column="response",
        reference_column="ground_truth",
        rouge_type="rouge1",
    )
])
result = report.run(current_dataset, reference_dataset)
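The aggregation behind the widget is straightforward; a sketch of what the counter and histogram summarise, using hypothetical per-row scores:

```python
import numpy as np

# Hypothetical per-row ROUGE scores produced by the descriptor.
row_scores = np.array([0.82, 0.47, 1.0, 0.63])

# The metric's single value: the mean score across all rows.
mean_score = row_scores.mean()  # 0.73

# The distribution histogram: ROUGE is bounded in [0, 1],
# so fixed-range bins are safe.
counts, edges = np.histogram(row_scores, bins=5, range=(0.0, 1.0))
```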

RougeScoreFeature — legacy feature layer

File: src/evidently/legacy/features/rouge_score_feature.py

Follows the same pattern as SemanticSimilarityFeature. Registered in _registry.py.

Parameters

Both RougeScore and RougeScoreMetric expose:

Parameter    Values                               Default
rouge_type   rouge1, rouge2, rougeL, rougeLsum    rouge1
score_type   f (F1), precision, recall            f

Dependency

Added rouge-score>=0.1.2 to the [llm] optional extra — same group as sentence-transformers and transformers. Import is lazy (inside function bodies), so it only fails if the feature is used without pip install evidently[llm].

Added rouge_score.* to [[tool.mypy.overrides]] ignore list.

Tests

31 tests across 3 files, all passing:

  • tests/features/test_rouge_score_feature.py — 13 tests (parametrised variants, NaN handling, invalid params, feature name)
  • tests/descriptors/test_rouge_score_descriptor.py — 11 tests (all ROUGE types, precision/recall, alias, column listing, validation)
  • tests/metrics/test_rouge_score_metric.py — 7 tests (score range, with/without reference, identical texts, all variants)

Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
as requested in issue evidentlyai#1318.

What is added:

RougeScore descriptor (src/evidently/descriptors/_rouge_score.py)
- Row-level ROUGE computation between prediction and reference columns
- Supports rouge1, rouge2, rougeL, rougeLsum variants
- Supports f (F1), precision, recall score types
- NaN-safe: treats missing values as empty strings
- Follows the same pattern as TextLength

RougeScoreFeature (src/evidently/legacy/features/rouge_score_feature.py)
- Legacy GeneratedFeature layer following SemanticSimilarityFeature pattern
- Lazy import of rouge_score inside generate_feature()
- Registered in features _registry.py

RougeScoreMetric (src/evidently/metrics/text_evals.py)
- Dataset-level metric returning mean ROUGE across all rows
- Custom SingleValueMetric + SingleValueCalculation (not ColumnSummaryMetric)
- Renders counter widget (current/reference means) + histogram distribution
- Supports current-only and current+reference comparisons
- Default test: eq(Reference(relative=0.1)) when reference data is present
- Registered in core/registries/metrics.py

Dependency:
- Added rouge-score>=0.1.2 to [llm] optional extra in pyproject.toml
- Added rouge_score.* to mypy ignore_missing_imports overrides
- Lazy import (no module-level import): error only if feature is used
  without installing evidently[llm]

Tests: 31 tests, all passing
- tests/features/test_rouge_score_feature.py (13 tests)
- tests/descriptors/test_rouge_score_descriptor.py (11 tests)
- tests/metrics/test_rouge_score_metric.py (7 tests)
@mukund1985 force-pushed the feature/rouge-score-descriptor branch from 7526e15 to 468976d on April 22, 2026 at 21:13
@mukund1985 (Author)

This is my first feature PR here so apologies if I've missed anything obvious. I tried to follow the same patterns as SemanticSimilarity and TextLength when putting it together. 31 tests pass locally. Also happy to split it into smaller PRs if that makes reviewing easier — just say the word.

@mukund1985 (Author)

@Liraim — would appreciate a review when you get a chance. Tests pass locally, happy to make any changes needed.

@mukund1985 (Author)

@DimaAmega — looks like CI hasn't triggered yet, could you approve the workflow run when you get a chance?


Development

Successfully merging this pull request may close these issues.

Add a new ROUGE metric to Evidently
