feat(ci): add RunnerContext and RegressionError for experiment GH action by wochinge · Pull Request #1635 · langfuse/langfuse-python

wochinge · 2026-04-20T15:13:50Z

Summary

Adds the SDK-side primitives consumed by the upcoming langfuse/experiment-action GitHub Action (LFE-9241).

RunnerContext wraps Langfuse.run_experiment with action-injected defaults (data, dataset_version, name, run_name, metadata). Users can override any default on the call site; metadata is merged with user-supplied keys winning on collision. Supports both LocalExperimentItem and DatasetItem.
RegressionError lets a user's experiment function signal a CI gate failure. Exposes optional structured metric / value / threshold fields so the action can render a targeted callout in the PR comment; falls back to a free-form message or a default string.
Linear

Usage (from the proposal)

from langfuse import RegressionError, RunnerContext
from langfuse.experiment import ExperimentResult

def experiment(context: RunnerContext) -> ExperimentResult:
    result = context.run_experiment(
        task=my_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[avg_accuracy],
    )
    if result.run_score_values["avg_accuracy"] < 0.9:
        raise RegressionError(
            result=result,
            metric="avg_accuracy",
            value=result.run_score_values["avg_accuracy"],
            threshold=0.9,
        )
    return result

Out of scope

The langfuse/experiment-action GitHub Action itself
PR-comment rendering
Multi-metric regression reporting (deferred — can land non-breaking as regressions: List[Regression] later)

🤖 Generated with Claude Code

Disclaimer: Experimental PR review

Greptile Summary

This PR adds two new SDK primitives — RunnerContext and RegressionError — intended for use with the upcoming langfuse/experiment-action GitHub Action. RunnerContext wraps Langfuse.run_experiment with action-injected defaults (dataset, name, run name, metadata) that users can override per-call, with user-supplied metadata keys winning on collision. RegressionError is a structured exception that lets a user's experiment function signal a CI gate failure with optional metric/value/threshold fields.

Confidence Score: 4/5

Safe to merge with two minor improvements worth addressing before the companion GitHub Action ships.

Both findings are P2, but the value/threshold-without-metric case silently drops structured diagnostic fields from the exception message, which could mislead action consumers. Neither is a hard blocker, but fixing them before the companion GitHub Action ships would avoid surprising behaviour.

langfuse/experiment.py — RunnerContext.run_experiment mutable defaults and RegressionError message formatting.

Important Files Changed

Filename	Overview
langfuse/experiment.py	Adds RunnerContext (wraps Langfuse.run_experiment with CI-injected defaults, metadata merging, and required-field validation) and RegressionError (structured exception for CI gate failures); minor mutable-default-arg and message-formatting issues.
langfuse/init.py	Exports RunnerContext and RegressionError from the top-level package; straightforward and correct.
tests/unit/test_experiment.py	Comprehensive unit tests covering defaults, call-site overrides, metadata merging, validation, and a signature-drift guard against Langfuse.run_experiment.

Sequence Diagram

sequenceDiagram
    participant A as experiment-action
    participant RC as RunnerContext
    participant L as Langfuse.run_experiment
    participant U as user experiment()

    A->>RC: RunnerContext(client, data, name, run_name, metadata, ...)
    A->>U: experiment(context=rc)
    U->>RC: context.run_experiment(task=..., [overrides])
    RC->>RC: resolve name / data / run_name (call overrides ctx defaults)
    RC->>RC: merge metadata (user keys win on collision)
    RC->>L: run_experiment(name, data, run_name, metadata, ...)
    L-->>RC: ExperimentResult
    RC-->>U: ExperimentResult
    alt regression detected
        U->>U: raise RegressionError(result, metric, value, threshold)
        U-->>A: RegressionError propagates
        A->>A: fail workflow, render PR callout
    else success
        U-->>A: ExperimentResult
    end

Prompt To Fix All With AI

This is a comment left during a code review.
Path: langfuse/experiment.py
Line: 1128-1130

Comment:
**Mutable default arguments**

`evaluators=[]` and `run_evaluators=[]` use mutable list literals as defaults — a classic Python gotcha. Although these lists are only forwarded to `client.run_experiment` and not mutated here, using `[]` as a default means all callers that omit the argument share the *same* list object at definition time. If any downstream code ever mutates the list, the default is silently corrupted for future calls. The idiomatic fix is `None` with an `or []` guard (matching how `Langfuse.run_experiment` itself should be fixed), or `default_factory` in a dataclass.

```suggestion
        evaluators: Optional[List[EvaluatorFunction]] = None,
        composite_evaluator: Optional["CompositeEvaluatorFunction"] = None,
        run_evaluators: Optional[List[RunEvaluatorFunction]] = None,
```
And in the body, replace:
```python
evaluators=evaluators or [],
run_evaluators=run_evaluators or [],
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: langfuse/experiment.py
Line: 1196-1202

Comment:
**`value`/`threshold` silently dropped from message when `metric` is `None`**

If a caller provides `value=0.5, threshold=0.9` but omits `metric`, the branch falls through to `"Experiment regression detected"` and the structured fields are never reflected in `str(exc)`. The action's rendering path uses `str(exc)` as a fallback, so those diagnostics would be lost without any warning.

Consider guarding against this, or at least including them in the default message when present:

```suggestion
        if message is not None:
            formatted = message
        elif metric is not None:
            formatted = f"Regression on `{metric}`: {value} (threshold {threshold})"
        elif value is not None or threshold is not None:
            formatted = f"Experiment regression detected: value={value}, threshold={threshold}"
        else:
            formatted = "Experiment regression detected"
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "test: rename test_runner_context.py to t..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

Adds the SDK-side primitives consumed by the upcoming `langfuse/experiment-action` GitHub Action (LFE-9241): - `RunnerContext` wraps `Langfuse.run_experiment` with action-injected defaults (data, dataset_version, name, run_name, metadata). Users can override any default on the call site; metadata is merged with user-supplied keys winning on collision. - `RegressionError` lets users signal a CI gate failure and optionally pass structured `metric`/`value`/`threshold` fields so the action can render a callout in the PR comment. Both live in a dedicated `langfuse/ci.py` module so the CI surface stays isolated from the general experiment API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eriment module Relocates the CI-action primitives from the standalone `langfuse/ci.py` module into `langfuse/experiment.py` alongside the other experiment types. Deletes `langfuse/ci.py` and renames the tests accordingly. The public import paths (`from langfuse import RunnerContext, RegressionError`) are unchanged. `CompositeEvaluatorFunction` is imported under `TYPE_CHECKING` to avoid a circular import with `langfuse.batch_evaluation`. The signature-drift guard now resolves the forward reference via `typing.get_type_hints(..., localns=...)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the module name now that RunnerContext and RegressionError live in `langfuse.experiment`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-22T08:36:08Z

@claude review

- RunnerContext no longer carries `name` or `run_name` as context-level defaults. `name` is now required on every `run_experiment` call (supports the action's directory-of-experiments mode where each script must name itself). `run_name` passes straight through to `Langfuse.run_experiment`. - RegressionError gains three typed `@overload` signatures (minimal, free-form message, structured metric/value/threshold) so type checkers enforce that `metric` and `value` are supplied together. At runtime, partial structured input falls back to the default message instead of rendering misleading `None` placeholders in PR comments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hassiebp

One SDK-side API comment to keep the action and SDK contract lined up.

hassiebp · 2026-04-22T15:01:28Z

+class RunnerContext:
+    """Wraps :meth:`Langfuse.run_experiment` with CI-injected defaults.
+
+    Intended for use with the ``langfuse/experiment-action`` GitHub Action


RunnerContext feels like the right abstraction here, and I think the action side should lean into it as the single public integration path: construct the context, then call experiment(context) explicitly instead of also teaching users to read action-specific env vars. Since this is likely to become the primary authoring model, I’d optimize for that clean contract end-to-end.

wochinge commented Apr 21, 2026

View reviewed changes

Comment thread langfuse/ci.py Outdated

wochinge requested a review from hassiebp April 21, 2026 11:14

This comment has been minimized.

Sign in to view

test: rename test_runner_context.py to test_experiment.py

ef1aa71

Mirrors the module name now that RunnerContext and RegressionError live in `langfuse.experiment`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wochinge marked this pull request as ready for review April 22, 2026 08:36

greptile-apps Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread langfuse/experiment.py

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread langfuse/experiment.py Outdated

Comment thread langfuse/experiment.py

Comment thread langfuse/experiment.py

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread langfuse/experiment.py

hassiebp reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): add RunnerContext and RegressionError for experiment GH action#1635

feat(ci): add RunnerContext and RegressionError for experiment GH action#1635
wochinge wants to merge 4 commits intomainfrom
worktree-runner-context

wochinge commented Apr 20, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hassiebp left a comment

Uh oh!

hassiebp Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wochinge commented Apr 20, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Usage (from the proposal)

Out of scope

Disclaimer: Experimental PR review

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hassiebp left a comment

Choose a reason for hiding this comment

Uh oh!

hassiebp Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wochinge commented Apr 20, 2026 •

edited by greptile-apps Bot

Loading