Merged
Changes from 25 commits
42 commits
ba4220d
Add llamacpp dependency and update gitignore with generated directories
ErlisLushtaku Feb 14, 2026
d2a5a42
Add documentation for llamacpp in Readme
ErlisLushtaku Feb 14, 2026
a828adb
Document direnv usage for environment variables management
ErlisLushtaku Feb 15, 2026
0dcebf9
narrow down transformers dependency to fix version mismatch
ErlisLushtaku Feb 15, 2026
d60073b
Add max_model_len param for VLLM in order to prevent OOM errors
ErlisLushtaku Feb 15, 2026
38f63ee
Fix completion loading and EuroLLM-9B example
ErlisLushtaku Feb 15, 2026
6f5e0fc
Remove `direnv` documentation
ErlisLushtaku Feb 17, 2026
42ff2ae
Revert stylistic (formatting) changes and add more documentation for …
ErlisLushtaku Feb 17, 2026
8fcb032
Rename OPENJURY_EVAL_DATA to OPENJURY_DATA
ErlisLushtaku Feb 17, 2026
df958af
Merge main
ErlisLushtaku Feb 21, 2026
35856f2
Revert changes in gitignore
ErlisLushtaku Feb 21, 2026
6a11182
Handle models with max_position_embeddings when we pass max_model_len
ErlisLushtaku Feb 21, 2026
fecd3ed
Revert EuroLLM-9B-Instruct to EuroLLM-9B since there is a default cha…
ErlisLushtaku Feb 21, 2026
0b4eaec
fix tests
ErlisLushtaku Feb 22, 2026
29340b0
Change test github workflow to use uv instead of pip for a more robus…
ErlisLushtaku Feb 22, 2026
2c294f1
Move dev dependencies to dependency-group
ErlisLushtaku Feb 22, 2026
4be61bf
Revert comment removal
ErlisLushtaku Feb 22, 2026
51d2597
Add pre-commit hook
ErlisLushtaku Feb 22, 2026
8dee7b2
add project scripts and move slurmpilot to dev group
ErlisLushtaku Feb 23, 2026
fdc9410
fix LlamaCpp bug with ChatTemplate
ErlisLushtaku Mar 2, 2026
48c5373
Add MT-Bench multi-turn evaluation support
ErlisLushtaku Mar 2, 2026
648a9be
Merge branch 'main' into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 2, 2026
14f747e
fix result formatting
ErlisLushtaku Mar 2, 2026
e67ea79
remove double environment variable
ErlisLushtaku Mar 2, 2026
4089be8
remove accidental duplications
ErlisLushtaku Mar 2, 2026
03f5cce
Refactor
ErlisLushtaku Mar 4, 2026
8ffe3a6
Remove duplication between prompt templates
ErlisLushtaku Mar 4, 2026
b877f11
add temperature argument
ErlisLushtaku Mar 9, 2026
c2056b5
add option for making mt-bench consistent with the original one from …
ErlisLushtaku Mar 9, 2026
41cd15d
Merge branch 'main' into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 9, 2026
0ca66c5
remove redundant print statement
ErlisLushtaku Mar 10, 2026
a295305
move mt-bench logic from the entrypoint
ErlisLushtaku Mar 17, 2026
0fb9700
Remove stale unused entries for fastchat mode
ErlisLushtaku Mar 17, 2026
e5670ea
Merge origin/main into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 17, 2026
6dd78fd
Refactor mt-bench eval helpers into shared runtime module
ErlisLushtaku Mar 17, 2026
0094eea
move cli args and parsing to separate util to remove dependencies on …
ErlisLushtaku Mar 18, 2026
f522e5b
refactor to address comments on PR
ErlisLushtaku Mar 24, 2026
6a851c3
remove openjury mode for mt-bench keeping only the original version
ErlisLushtaku Mar 31, 2026
caaa079
Merge remote-tracking branch 'origin/main' into erlislushtaku/feat/ad…
ErlisLushtaku Mar 31, 2026
2e8e04e
Restore code and fix after merge/refactor
ErlisLushtaku Mar 31, 2026
5a314a7
format
ErlisLushtaku Mar 31, 2026
8c91606
fix ci
ErlisLushtaku Mar 31, 2026
21 changes: 20 additions & 1 deletion README.md
@@ -22,7 +22,7 @@ Compared to other libraries, here is a breakdown of features:
 | **Arena-Hard-Auto** | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
 | **Lighteval** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **Evalchemy** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
-| **OpenJury** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
+| **OpenJury** | | ✅ | ✅ | ✅ | ✅ | ✅ |
Collaborator

💪


The table was compiled in October 2025. If a library has since implemented a missing feature, please open an issue
or send a PR, and we will be happy to update the information.
@@ -172,10 +172,29 @@ python openjury/generate_and_evaluate.py \

 This override applies to all vLLM models in the run. For remote providers (OpenAI, Together, OpenRouter), the flag is ignored since they handle templates server-side.

+### MT-Bench (Multi-Turn Evaluation)
+
+MT-Bench evaluates multi-turn conversation ability using 80 two-turn questions across 8 categories
+(writing, roleplay, reasoning, math, coding, extraction, STEM, humanities).
+It uses category-dependent judge prompts and reference answers for math/reasoning/coding.
+Questions are automatically downloaded from the [LMSYS MT-Bench HuggingFace space](https://huggingface.co/spaces/lmsys/mt-bench).
+
+```bash
+uv run python openjury/generate_and_evaluate.py \
+--dataset mt-bench \
+--model_A VLLM/Qwen/Qwen2.5-7B-Instruct \
+--model_B OpenRouter/openai/gpt-4o \
+--judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
+--n_instructions 10
+```
+
+Results include per-category and per-turn win rate breakdowns. Use `--swap_mode both` to correct for judge position bias.
+
 ## 📊 Supported Datasets

 | Dataset | Description |
 |-----------------------|------------------------------------------------------------------------------------------------|
+| `mt-bench` | 80 multi-turn (2-turn) questions across 8 categories ([LMSYS MT-Bench](https://arxiv.org/abs/2306.05685)) |
 | `alpaca-eval` | General instruction-following benchmark |
 | `arena-hard` | More challenging evaluation suite |
 | `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
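
The README text above mentions per-category win rates and using `--swap_mode both` against judge position bias. The following is a minimal sketch of that idea, not OpenJury's implementation. It assumes the judge returns a preference score in [0, 1], where 0 favors the first-listed answer and 1 the second. The record layout and the names `debias` and `win_rate_by_category` are hypothetical.

```python
from collections import defaultdict


def debias(pref_ab: float, pref_ba: float) -> float:
    """Average one judgment with its swapped-order counterpart.

    pref_ab: score when model A is listed first (1.0 means B wins).
    pref_ba: score when model B is listed first (1.0 means A wins),
             so it is flipped before averaging.
    """
    return (pref_ab + (1.0 - pref_ba)) / 2.0


def win_rate_by_category(records: list[dict]) -> dict[str, float]:
    """Fraction of battles model B wins per category (ties count as half)."""
    totals: dict[str, list[float]] = defaultdict(list)
    for r in records:
        score = debias(r["pref_ab"], r["pref_ba"])
        # Map the averaged score to win (1), tie (0.5), or loss (0) for B.
        outcome = 1.0 if score > 0.5 else 0.0 if score < 0.5 else 0.5
        totals[r["category"]].append(outcome)
    return {cat: sum(v) / len(v) for cat, v in totals.items()}


records = [
    # The judge prefers whichever answer is listed second: pure position bias.
    {"category": "math", "pref_ab": 1.0, "pref_ba": 1.0},
    # The judge prefers model B in both orderings.
    {"category": "math", "pref_ab": 1.0, "pref_ba": 0.0},
    # The judge prefers model A in both orderings.
    {"category": "coding", "pref_ab": 0.0, "pref_ba": 1.0},
]
rates = win_rate_by_category(records)
```

In this toy data the first record is a judgment driven only by answer order; averaging the two orderings turns it into a tie instead of a spurious win for model B.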
23 changes: 12 additions & 11 deletions openjury/evaluate.py
@@ -15,6 +15,7 @@
     data_root,
     download_hf,
     do_inference,
+    truncate,
 )


@@ -51,14 +52,22 @@ def get_regexp_match(self, s: str, regex: str, group_index: int = 1):

 def load_judge_system_and_user_prompt(
     provide_explanation: bool = True,
+    multi_turn: bool = False,
 ) -> tuple[str, str]:
     # Prepare judge
     with open(Path(__file__).parent / "prompts" / "system-prompt.txt", "r") as f:
         system_prompt = str(f.read())

-    prompt_filename = (
-        "prompt-with-explanation.txt" if provide_explanation else "prompt.txt"
-    )
+    if multi_turn:
+        prompt_filename = (
+            "prompt-multi-turn-with-explanation.txt"
+            if provide_explanation
+            else "prompt-multi-turn.txt"
+        )
+    else:
+        prompt_filename = (
+            "prompt-with-explanation.txt" if provide_explanation else "prompt.txt"
+        )
     with open(Path(__file__).parent / "prompts" / prompt_filename, "r") as f:
         user_prompt_template = str(f.read())
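
The prompt selection in this hunk depends on two flags, which gives four filenames. One possible alternative to the nested conditional is a lookup table, sketched here under the assumption that the four filenames in the diff are the only variants. `PROMPT_FILES` and `select_prompt_filename` are hypothetical names and not part of OpenJury.

```python
# Keys are (multi_turn, provide_explanation); values are the filenames from the diff.
PROMPT_FILES: dict[tuple[bool, bool], str] = {
    (False, False): "prompt.txt",
    (False, True): "prompt-with-explanation.txt",
    (True, False): "prompt-multi-turn.txt",
    (True, True): "prompt-multi-turn-with-explanation.txt",
}


def select_prompt_filename(
    provide_explanation: bool = True, multi_turn: bool = False
) -> str:
    """Return the judge prompt filename for the given mode."""
    return PROMPT_FILES[(multi_turn, provide_explanation)]
```

A table like this keeps every variant visible in one place, at the cost of one more module-level name; the conditional in the diff is equally valid for four cases.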

@@ -240,14 +249,6 @@ def annotate_battles(
         [("system", system_prompt), ("user", user_prompt_template)]
     )

-    def truncate(s: str, max_len: int | None = None):
-        if not isinstance(s, str):
-            return ""
-        if max_len is not None:
-            return s[:max_len]
-        else:
-            return s
-
     inputs = prompt_template.batch(
         [
             {
94 changes: 87 additions & 7 deletions openjury/generate.py
@@ -4,16 +4,10 @@
 from openjury.utils import (
     do_inference,
     make_model,
+    truncate,
 )


-def truncate(s: str, max_len: int | None = None):
-    if max_len is not None:
-        return s[:max_len]
-    else:
-        return s
-
-
 def generate_instructions(
     instructions: pd.Series,
     model: str,
@@ -57,6 +51,92 @@ def generate_instructions(
     return df_outputs


+def generate_multiturn(
+    questions: pd.DataFrame,
+    model: str,
+    truncate_input_chars: int | None = 8192,
+    max_tokens: int | None = 8192,
+    use_tqdm: bool = True,
+    **model_kwargs,
+) -> pd.DataFrame:
+    """Generate two-turn completions for MT-Bench style questions.
+
+    Generates turn 1 answers first, then uses them as conversation context
+    to generate turn 2 answers.
+
+    Args:
+        questions: DataFrame with columns turn_1, turn_2, and index instruction_index.
+        model: Model specification string (e.g. "VLLM/model-name").
+        **model_kwargs: Provider-specific options forwarded to make_model
+            (e.g. max_model_len, chat_template for VLLM).
+    Returns:
+        DataFrame with columns: instruction_index, completion_turn_1, completion_turn_2
+    """
+    chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)
+
+    system_prompt = "You are a helpful assistant."
Collaborator

@kargibora kargibora Mar 3, 2026

Maybe we can use a better system_prompt. What does MT-Bench use?

Collaborator

Good point. We also have a naive default in general (this is not blocking for this PR, as we can change or improve it later).
Using the arena-hard prompt would make the most sense to me, since that benchmark is in some sense more refined than MT-Bench.

Collaborator Author

@ErlisLushtaku ErlisLushtaku Mar 11, 2026

I added the mt-bench prompts (and other changes to reproduce their setup) here.

+    turn1_template = ChatPromptTemplate.from_messages(
+        [("system", system_prompt), ("user", "{user_prompt}")]
+    )
+
+    turn1_inputs = turn1_template.batch(
+        [
+            {"user_prompt": truncate(row["turn_1"], max_len=truncate_input_chars)}
+            for _, row in questions.iterrows()
+        ]
+    )
+
+    print(f"Generating turn 1 completions ({len(turn1_inputs)} questions).")
+    completions_turn_1 = do_inference(
+        chat_model=chat_model,
+        inputs=turn1_inputs,
+        use_tqdm=use_tqdm,
+    )
+
+    turn2_inputs = []
+    for (_, row), t1_answer in zip(questions.iterrows(), completions_turn_1):
+        if row["turn_2"] is None:
+            turn2_inputs.append(
+                turn1_template.invoke(
+                    {"user_prompt": "No follow-up question."}
+                )
+            )
+        else:
+            multi_turn_template = ChatPromptTemplate.from_messages(
+                [
+                    ("system", system_prompt),
+                    ("user", "{turn_1}"),
+                    ("assistant", "{turn_1_answer}"),
+                    ("user", "{turn_2}"),
+                ]
+            )
+            turn2_inputs.append(
+                multi_turn_template.invoke(
+                    {
+                        "turn_1": truncate(row["turn_1"], max_len=truncate_input_chars),
+                        "turn_1_answer": truncate(str(t1_answer), max_len=truncate_input_chars),
+                        "turn_2": truncate(row["turn_2"], max_len=truncate_input_chars),
+                    }
+                )
+            )
+
+    print(f"Generating turn 2 completions ({len(turn2_inputs)} questions).")
+    completions_turn_2 = do_inference(
+        chat_model=chat_model,
+        inputs=turn2_inputs,
+        use_tqdm=use_tqdm,
+    )
+
+    df_outputs = pd.DataFrame(
+        data={
+            "instruction_index": questions.index.tolist(),
+            "completion_turn_1": completions_turn_1,
+            "completion_turn_2": completions_turn_2,
+        },
+    )
+    return df_outputs
+
+
 def generate_base(
     instructions: pd.Series,
     model: str,
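
The two-pass flow of `generate_multiturn` can be illustrated without LangChain or a real model. In this sketch, `fake_model` is a stand-in that echoes the last user message, and `build_turn1_messages`, `build_turn2_messages` and `is_missing` are hypothetical helpers that mirror the message layout in the diff. The sketch also treats NaN as a missing follow-up, because a DataFrame may represent a missing `turn_2` as NaN rather than `None`; whether that happens in OpenJury depends on how the questions are loaded, which this diff does not show.

```python
import math

SYSTEM_PROMPT = "You are a helpful assistant."


def is_missing(value) -> bool:
    """True for None and float NaN, the two usual encodings of a missing cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_turn1_messages(turn_1: str) -> list[tuple[str, str]]:
    """System prompt followed by the first user question."""
    return [("system", SYSTEM_PROMPT), ("user", turn_1)]


def build_turn2_messages(turn_1: str, answer_1: str, turn_2) -> list[tuple[str, str]]:
    """Replay turn 1 and its answer as context, then ask the follow-up."""
    if is_missing(turn_2):
        # Same placeholder the diff uses when there is no second turn.
        return build_turn1_messages("No follow-up question.")
    return build_turn1_messages(turn_1) + [("assistant", answer_1), ("user", turn_2)]


def fake_model(messages: list[tuple[str, str]]) -> str:
    """Stand-in for a chat model: echoes the last message and the context size."""
    return f"answer to: {messages[-1][1]} ({len(messages)} messages seen)"


questions = [
    {"turn_1": "Write a haiku.", "turn_2": "Now translate it."},
    {"turn_1": "What is 2 + 2?", "turn_2": float("nan")},
]

# Pass 1: answer every first turn.
answers_1 = [fake_model(build_turn1_messages(q["turn_1"])) for q in questions]
# Pass 2: answer every second turn with the first exchange as context.
answers_2 = [
    fake_model(build_turn2_messages(q["turn_1"], a1, q["turn_2"]))
    for q, a1 in zip(questions, answers_1)
]
```

Running all first turns before any second turn, as the diff does, lets each pass be sent to the model as a single batch.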