feat: add ROUGE score descriptor and dataset metric (#1318) #1863
Open
mukund1985 wants to merge 1 commit into evidentlyai:main from
Conversation
Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as requested in issue evidentlyai#1318.

What is added:

- `RougeScore` descriptor (`src/evidently/descriptors/_rouge_score.py`)
  - Row-level ROUGE computation between prediction and reference columns
  - Supports `rouge1`, `rouge2`, `rougeL`, `rougeLsum` variants
  - Supports `f` (F1), `precision`, `recall` score types
  - NaN-safe: treats missing values as empty strings
  - Follows the same pattern as `TextLength`
- `RougeScoreFeature` (`src/evidently/legacy/features/rouge_score_feature.py`)
  - Legacy `GeneratedFeature` layer following the `SemanticSimilarityFeature` pattern
  - Lazy import of `rouge_score` inside `generate_feature()`
  - Registered in the features `_registry.py`
- `RougeScoreMetric` (`src/evidently/metrics/text_evals.py`)
  - Dataset-level metric returning the mean ROUGE score across all rows
  - Custom `SingleValueMetric` + `SingleValueCalculation` (not `ColumnSummaryMetric`)
  - Renders a counter widget (current/reference means) + histogram distribution
  - Supports current-only and current+reference comparisons
  - Default test: `eq(Reference(relative=0.1))` when reference data is present
  - Registered in `core/registries/metrics.py`

Dependency:

- Added `rouge-score>=0.1.2` to the `[llm]` optional extra in `pyproject.toml`
- Added `rouge_score.*` to the mypy `ignore_missing_imports` overrides
- Lazy import (no module-level import): errors only if the feature is used without installing `evidently[llm]`

Tests: 31 tests, all passing

- `tests/features/test_rouge_score_feature.py` (13 tests)
- `tests/descriptors/test_rouge_score_descriptor.py` (11 tests)
- `tests/metrics/test_rouge_score_metric.py` (7 tests)
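The lazy-import behaviour described above (importing `rouge_score` only inside `generate_feature()`) can be sketched generically. The `lazy_import` helper and its error message below are illustrative assumptions, not code from this PR:

```python
def lazy_import(module_name: str, extra: str):
    """Import a module only when first needed.

    Raises ImportError with an install hint if the optional dependency
    is missing, instead of failing when the package itself is imported.
    """
    import importlib

    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with `pip install evidently[{extra}]`."
        ) from e
```

Calling something like `lazy_import("rouge_score", "llm")` inside the feature's compute path means that `import evidently` never pulls in the optional dependency, and the helpful error only surfaces when the ROUGE feature is actually used.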
Force-pushed 7526e15 to 468976d
Author: This is my first feature PR here so apologies if I've missed anything obvious. I tried to follow the same patterns as SemanticSimilarity and TextLength when putting it together. 31 tests pass locally. Also happy to split it into smaller PRs if that makes reviewing easier — just say the word.
Author: @Liraim — would appreciate a review when you get a chance. Tests pass locally, happy to make any changes needed.
Author: @DimaAmega — looks like CI hasn't triggered yet, could you approve the workflow run when you get a chance?
Summary
Closes #1318
Implements ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score as a row-level Descriptor and a dataset-level Metric, as requested in the issue.
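To illustrate how a dataset-level aggregate relates to the row-level score, here is a minimal sketch of averaging a per-row score over a dataset while treating missing values as empty strings, matching the NaN-safe behaviour the PR describes. The function name and signature are hypothetical, not the PR's actual code:

```python
import math
from typing import Callable, List, Optional

def mean_rowwise_score(
    predictions: List[Optional[str]],
    references: List[Optional[str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Average a per-row score across a dataset.

    None/NaN values are replaced with empty strings before scoring,
    so missing data lowers the score instead of raising an error.
    """
    def clean(text):
        if text is None or (isinstance(text, float) and math.isnan(text)):
            return ""
        return text

    scores = [score_fn(clean(p), clean(r)) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Any row-level scorer (such as a ROUGE computation) can be plugged in as `score_fn`; the dataset-level metric is then simply the mean of the row-level descriptor values.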
Changes
`RougeScore` descriptor — row-level

File: `src/evidently/descriptors/_rouge_score.py`

Computes the ROUGE score between a prediction column and a reference column for each row. Returns a numeric value in [0, 1].
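The actual computation delegates to the `rouge-score` package; purely for illustration, ROUGE-1 (unigram overlap) with its precision, recall, and F1 variants can be sketched in plain Python as:

```python
from collections import Counter
from typing import NamedTuple

class Rouge1Score(NamedTuple):
    precision: float
    recall: float
    f: float

def rouge1(prediction: str, reference: str) -> Rouge1Score:
    """Toy ROUGE-1: unigram overlap between prediction and reference.

    Real implementations (e.g. the rouge-score package) additionally
    handle tokenisation, stemming, and the other ROUGE variants.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return Rouge1Score(0.0, 0.0, 0.0)
    # Multiset intersection: each shared token counts at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return Rouge1Score(precision, recall, f)
```

For example, `rouge1("the cat is on the mat", "the cat sat on the mat")` shares 5 of 6 unigrams in each direction, giving precision = recall = F1 = 5/6 ≈ 0.83, which is always in [0, 1].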
`RougeScoreMetric` — dataset-level aggregate

File: `src/evidently/metrics/text_evals.py`

A dedicated `SingleValueMetric` (not `ColumnSummaryMetric`) that computes the mean ROUGE score across all rows. Renders a counter widget + histogram distribution. Supports current-only and current+reference comparisons.

`RougeScoreFeature` — legacy feature layer

File: `src/evidently/legacy/features/rouge_score_feature.py`

Follows the same pattern as `SemanticSimilarityFeature`. Registered in `_registry.py`.

Parameters
Both `RougeScore` and `RougeScoreMetric` expose:

| Parameter | Options | Default |
| --- | --- | --- |
| `rouge_type` | `rouge1`, `rouge2`, `rougeL`, `rougeLsum` | `rouge1` |
| `score_type` | `f` (F1), `precision`, `recall` | `f` |

Dependency
Added `rouge-score>=0.1.2` to the `[llm]` optional extra — same group as `sentence-transformers` and `transformers`. Import is lazy (inside function bodies), so it only fails if the feature is used without `pip install evidently[llm]`.

Added `rouge_score.*` to the `[[tool.mypy.overrides]]` ignore list.

Tests
31 tests across 3 files, all passing:

- `tests/features/test_rouge_score_feature.py` — 13 tests (parametrised variants, NaN handling, invalid params, feature name)
- `tests/descriptors/test_rouge_score_descriptor.py` — 11 tests (all ROUGE types, precision/recall, alias, column listing, validation)
- `tests/metrics/test_rouge_score_metric.py` — 7 tests (score range, with/without reference, identical texts, all variants)