Skip to content

feat(ci): add RunnerContext and RegressionError for experiment GH action#1635

Open
wochinge wants to merge 4 commits intomainfrom
worktree-runner-context
Open

feat(ci): add RunnerContext and RegressionError for experiment GH action#1635
wochinge wants to merge 4 commits intomainfrom
worktree-runner-context

Conversation

@wochinge
Copy link
Copy Markdown
Contributor

@wochinge wochinge commented Apr 20, 2026

Summary

Adds the SDK-side primitives consumed by the upcoming langfuse/experiment-action GitHub Action (LFE-9241).

  • RunnerContext wraps Langfuse.run_experiment with action-injected defaults (data, dataset_version, name, run_name, metadata). Users can override any default on the call site; metadata is merged with user-supplied keys winning on collision. Supports both LocalExperimentItem and DatasetItem.
  • RegressionError lets a user's experiment function signal a CI gate failure. Exposes optional structured metric / value / threshold fields so the action can render a targeted callout in the PR comment; falls back to a free-form message or a default string.
  • Linear

Usage (from the proposal)

from langfuse import RegressionError, RunnerContext
from langfuse.experiment import ExperimentResult

def experiment(context: RunnerContext) -> ExperimentResult:
    result = context.run_experiment(
        task=my_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[avg_accuracy],
    )
    if result.run_score_values["avg_accuracy"] < 0.9:
        raise RegressionError(
            result=result,
            metric="avg_accuracy",
            value=result.run_score_values["avg_accuracy"],
            threshold=0.9,
        )
    return result

Out of scope

  • The langfuse/experiment-action GitHub Action itself
  • PR-comment rendering
  • Multi-metric regression reporting (deferred — can land non-breaking as regressions: List[Regression] later)

🤖 Generated with Claude Code

Disclaimer: Experimental PR review

Greptile Summary

This PR adds two new SDK primitives — RunnerContext and RegressionError — intended for use with the upcoming langfuse/experiment-action GitHub Action. RunnerContext wraps Langfuse.run_experiment with action-injected defaults (dataset, name, run name, metadata) that users can override per-call, with user-supplied metadata keys winning on collision. RegressionError is a structured exception that lets a user's experiment function signal a CI gate failure with optional metric/value/threshold fields.

Confidence Score: 4/5

Safe to merge with two minor improvements worth addressing before the companion GitHub Action ships.

Both findings are P2, but the value/threshold-without-metric case silently drops structured diagnostic fields from the exception message, which could mislead action consumers. Neither is a hard blocker, but fixing them before the companion GitHub Action ships would avoid surprising behaviour.

langfuse/experiment.py — RunnerContext.run_experiment mutable defaults and RegressionError message formatting.

Important Files Changed

Filename Overview
langfuse/experiment.py Adds RunnerContext (wraps Langfuse.run_experiment with CI-injected defaults, metadata merging, and required-field validation) and RegressionError (structured exception for CI gate failures); minor mutable-default-arg and message-formatting issues.
langfuse/init.py Exports RunnerContext and RegressionError from the top-level package; straightforward and correct.
tests/unit/test_experiment.py Comprehensive unit tests covering defaults, call-site overrides, metadata merging, validation, and a signature-drift guard against Langfuse.run_experiment.

Sequence Diagram

sequenceDiagram
    participant A as experiment-action
    participant RC as RunnerContext
    participant L as Langfuse.run_experiment
    participant U as user experiment()

    A->>RC: RunnerContext(client, data, name, run_name, metadata, ...)
    A->>U: experiment(context=rc)
    U->>RC: context.run_experiment(task=..., [overrides])
    RC->>RC: resolve name / data / run_name (call overrides ctx defaults)
    RC->>RC: merge metadata (user keys win on collision)
    RC->>L: run_experiment(name, data, run_name, metadata, ...)
    L-->>RC: ExperimentResult
    RC-->>U: ExperimentResult
    alt regression detected
        U->>U: raise RegressionError(result, metric, value, threshold)
        U-->>A: RegressionError propagates
        A->>A: fail workflow, render PR callout
    else success
        U-->>A: ExperimentResult
    end
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: langfuse/experiment.py
Line: 1128-1130

Comment:
**Mutable default arguments**

`evaluators=[]` and `run_evaluators=[]` use mutable list literals as defaults — a classic Python gotcha. Although these lists are only forwarded to `client.run_experiment` and not mutated here, using `[]` as a default means all callers that omit the argument share the *same* list object at definition time. If any downstream code ever mutates the list, the default is silently corrupted for future calls. The idiomatic fix is `None` with an `or []` guard (matching how `Langfuse.run_experiment` itself should be fixed), or `default_factory` in a dataclass.

```suggestion
        evaluators: Optional[List[EvaluatorFunction]] = None,
        composite_evaluator: Optional["CompositeEvaluatorFunction"] = None,
        run_evaluators: Optional[List[RunEvaluatorFunction]] = None,
```
And in the body, replace:
```python
evaluators=evaluators or [],
run_evaluators=run_evaluators or [],
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: langfuse/experiment.py
Line: 1196-1202

Comment:
**`value`/`threshold` silently dropped from message when `metric` is `None`**

If a caller provides `value=0.5, threshold=0.9` but omits `metric`, the branch falls through to `"Experiment regression detected"` and the structured fields are never reflected in `str(exc)`. The action's rendering path uses `str(exc)` as a fallback, so those diagnostics would be lost without any warning.

Consider guarding against this, or at least including them in the default message when present:

```suggestion
        if message is not None:
            formatted = message
        elif metric is not None:
            formatted = f"Regression on `{metric}`: {value} (threshold {threshold})"
        elif value is not None or threshold is not None:
            formatted = f"Experiment regression detected: value={value}, threshold={threshold}"
        else:
            formatted = "Experiment regression detected"
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "test: rename test_runner_context.py to t..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

Adds the SDK-side primitives consumed by the upcoming
`langfuse/experiment-action` GitHub Action (LFE-9241):

- `RunnerContext` wraps `Langfuse.run_experiment` with action-injected
  defaults (data, dataset_version, name, run_name, metadata). Users can
  override any default on the call site; metadata is merged with
  user-supplied keys winning on collision.
- `RegressionError` lets users signal a CI gate failure and optionally
  pass structured `metric`/`value`/`threshold` fields so the action can
  render a callout in the PR comment.

Both live in a dedicated `langfuse/ci.py` module so the CI surface stays
isolated from the general experiment API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread langfuse/ci.py Outdated
@wochinge wochinge requested a review from hassiebp April 21, 2026 11:14
…eriment module

Relocates the CI-action primitives from the standalone `langfuse/ci.py`
module into `langfuse/experiment.py` alongside the other experiment
types. Deletes `langfuse/ci.py` and renames the tests accordingly.

The public import paths (`from langfuse import RunnerContext,
RegressionError`) are unchanged.

`CompositeEvaluatorFunction` is imported under `TYPE_CHECKING` to avoid
a circular import with `langfuse.batch_evaluation`. The
signature-drift guard now resolves the forward reference via
`typing.get_type_hints(..., localns=...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@blacksmith-sh

This comment has been minimized.

Mirrors the module name now that RunnerContext and RegressionError
live in `langfuse.experiment`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wochinge wochinge marked this pull request as ready for review April 22, 2026 08:36
@github-actions
Copy link
Copy Markdown

@claude review

Comment thread langfuse/experiment.py
Comment thread langfuse/experiment.py Outdated
Comment thread langfuse/experiment.py
Comment thread langfuse/experiment.py
- RunnerContext no longer carries `name` or `run_name` as context-level
  defaults. `name` is now required on every `run_experiment` call
  (supports the action's directory-of-experiments mode where each
  script must name itself). `run_name` passes straight through to
  `Langfuse.run_experiment`.
- RegressionError gains three typed `@overload` signatures (minimal,
  free-form message, structured metric/value/threshold) so type
  checkers enforce that `metric` and `value` are supplied together.
  At runtime, partial structured input falls back to the default
  message instead of rendering misleading `None` placeholders in PR
  comments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread langfuse/experiment.py
Copy link
Copy Markdown
Contributor

@hassiebp hassiebp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One SDK-side API comment to keep the action and SDK contract lined up.

Comment thread langfuse/experiment.py
class RunnerContext:
"""Wraps :meth:`Langfuse.run_experiment` with CI-injected defaults.

Intended for use with the ``langfuse/experiment-action`` GitHub Action
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RunnerContext feels like the right abstraction here, and I think the action side should lean into it as the single public integration path: construct the context, then call experiment(context) explicitly instead of also teaching users to read action-specific env vars. Since this is likely to become the primary authoring model, I’d optimize for that clean contract end-to-end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants