Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.0.18] - 2026-05-25

### Removed

- **`cooperbench._proxy` module and the `--openai-base-url` / `--openai-model` CLI flags are gone.** Both existed because `claude_code` (which wraps the Anthropic CLI that only speaks `/v1/messages`) was assumed to need a LiteLLM translation layer to reach an OpenAI-compatible vLLM. That assumption was wrong: vLLM v0.17.1+ implements the Anthropic Messages API natively at the same `/v1/messages` path, so claude-code can be pointed straight at the vLLM endpoint with `--base-url`. Removing the auto-spawned LiteLLM also removes a class of bugs we kept hitting from LiteLLM version drift (`/v1/responses` auto-rewrite on `litellm>=1.82` when the inbound request has `thinking={"type":"enabled"}` — claude-code 2.1.x sends it by default; `litellm_params.stream: false` being ignored by some provider prefixes; intermittent `API Error: Content block not found` from vLLM's streaming `tool_call` extractor desynchronizing block_start / block_delta events).

### Changed

- **`--base-url` now points straight at a vLLM-served model.** Existing `--base-url` / `--auth-token` flags are kept and are the only knobs you need. `ANTHROPIC_BASE_URL` is forwarded into the task container; the adapter rewrites `localhost` / `127.0.0.1` → `host.docker.internal`, adds the matching `--add-host` to the container, injects a placeholder auth token if you didn't supply one, and writes `~/.claude/settings.json` with `CLAUDE_CODE_ATTRIBUTION_HEADER=0` (KV-cache perf fix on vLLM/llama.cpp). Real Anthropic runs (no `--base-url`) are unaffected.
- **`docs/QWEN_LOCAL.md`** rewritten to show the single-command direct flow:
```
cooperbench run --base-url https://your-vllm-host -m Qwen/Qwen3.5-9B \
-a claude_code --setting coop -s lite -c 2 --no-auto-eval
```
No LiteLLM, no proxy subprocess, no extras.

### Verified

- Direct curl against `https://cooperbench--qwen35-9b-128k-serve.modal.run/v1/messages`: tool conversation returns proper Anthropic `tool_use` blocks with parsed `input`, `stop_reason: "tool_use"`; streaming returns proper `content_block_start` → `content_block_delta` → `content_block_stop` ordering with no missed start events.
- End-to-end coop run with the new flow on the same `anyhow_task` pair that was failing in `0.0.17` against the older `--openai-base-url` proxy path: agents iterate over multiple tool rounds against vLLM directly. (Adapter-level behavior unchanged from `0.0.17`; only the routing layer simplified.)

## [0.0.17] - 2026-05-25

### Fixed
Expand Down
144 changes: 52 additions & 92 deletions docs/QWEN_LOCAL.md
Original file line number Diff line number Diff line change
@@ -1,118 +1,92 @@
# Running CooperBench against a self-hosted (Qwen / Llama / etc.) endpoint
# Running CooperBench against a self-hosted Qwen (or any vLLM endpoint)

CooperBench's `claude_code` adapter drives the official `claude-code`
CLI, which only speaks Anthropic's `/v1/messages` API. To run it
against any other model you put a translation proxy in between:
CooperBench's `claude_code` adapter wraps the official `claude-code` CLI,
which speaks Anthropic's `/v1/messages` API. vLLM v0.17.1+ implements
that same API natively — so claude-code can talk to a vLLM server
**directly, with no translation proxy in between**.

```
claude-code (Anthropic format)
LiteLLM proxy ← you run this; it translates Anthropic ↔ OpenAI
your OpenAI-compatible inference server (vLLM, llama.cpp, ...)
claude-code (Anthropic /v1/messages) ───► vLLM /v1/messages
```

This document covers the canonical reproducible setup using only the
PyPI distribution — no repo checkout required.

## Prerequisites

- Docker (CooperBench runs each task in a container)
- Redis on `localhost:6379` for coop messaging:
```
docker run -d --name cb-redis -p 6379:6379 redis:7-alpine
```
- An OpenAI-compatible endpoint URL serving your model
- Python ≥ 3.12
- A vLLM (v0.17.1+) endpoint serving your model with tool-calling
enabled. Reference serve flags (Qwen3.5-9B at 128k):
```
vllm serve Qwen/Qwen3.5-9B \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```

## Install

```bash
pip install cooperbench # adapter + CLI
pip install 'litellm[proxy]' # translation proxy (used internally)
pip install cooperbench
```

## Canonical single-command run (Qwen3.5-9B on Modal as the example)
That's it. No `litellm[proxy]`, no extras.

## Run

```bash
cooperbench run \
--openai-base-url https://cooperbench--qwen35-9b-128k-serve.modal.run/v1 \
--openai-model Qwen/Qwen3.5-9B \
--base-url https://your-vllm-host.example.com \
--auth-token dummy \
-m Qwen/Qwen3.5-9B \
-a claude_code \
--setting coop \
-s lite \
-r dspy_task -t 8394 -f 3,4 \
-c 2 \
-s lite -c 2 \
--no-auto-eval
```

Logs land in `./logs/<run-name>/coop/<repo>/<task>/<features>/`.

### What that does under the hood

- Picks a free local port.
- Spawns `litellm --model openai/Qwen/Qwen3.5-9B --api_base <openai-base-url> ...`
bound to that port, with `OPENAI_API_KEY=dummy` in the child env.
- Polls `/health/liveliness` until the proxy is up.
- Sets `ANTHROPIC_BASE_URL=http://localhost:<port>` and a placeholder
`ANTHROPIC_AUTH_TOKEN` for the duration of the run.
- Tears down the proxy subprocess when the run exits (also on Ctrl-C).

### Why those flags

- `--openai-base-url` — the OpenAI-compatible endpoint (vLLM, llama.cpp, ...).
- `--openai-model` — the model name sent to that endpoint. Defaults to
the value of `-m` if omitted.
- `-m Qwen/Qwen3.5-9B` — model name sent to claude-code (must contain
`qwen` so the adapter's model registry picks the small-context
profile).
That's the whole flow. claude-code (inside the task container) issues
`POST /v1/messages` to your vLLM, and vLLM responds in Anthropic format
directly.

### What each flag does

- `--base-url` — vLLM endpoint. Bare host or host+`/v1`; claude-code
appends `/v1/messages` itself. Auto-rewritten to
`host.docker.internal` for container reachability when it's a local URL.
- `--auth-token` — placeholder for vLLM (no auth needed); claude-code
requires *some* credential env var to start.
- `-m Qwen/Qwen3.5-9B` — model name sent to vLLM. Must match
vLLM's `--served-model-name`. The substring `qwen` (case-insensitive)
is also how the adapter's `_MODEL_PROFILES` picks the small-context
profile (tighter Read/MCP budgets + stripped tool surface).
- `-a claude_code` — selects the Claude Code adapter.

## Manual-proxy escape hatch

If you already have an Anthropic-format proxy running (or want to share
one across multiple `cooperbench run` invocations), use `--base-url` /
`--auth-token` instead of `--openai-base-url`:

```bash
# Start your own proxy somewhere
litellm --model openai/Qwen/Qwen3.5-9B \
--api_base https://cooperbench--qwen35-9b-128k-serve.modal.run/v1 \
--port 4000 ...

# Point cooperbench at it (no auto-spawn)
cooperbench run --base-url http://localhost:4000 --auth-token any \
-m Qwen/Qwen3.5-9B ...
```

`--openai-base-url` and `--base-url` are mutually exclusive.

## How the adapter behaves with a custom endpoint
## What the adapter does for you

When `--base-url` is set, `src/cooperbench/agents/claude_code/adapter.py`:

1. Forwards `ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` into the task
container (rewriting `localhost` → `host.docker.internal`).
2. Adds `--add-host=host.docker.internal:host-gateway` so the container
can reach the host proxy.
3. Preserves the model name verbatim (the proxy controls naming).
4. Writes `~/.claude/settings.json` with
container, rewriting `localhost` / `127.0.0.1` →
`host.docker.internal` so the container can reach a host-side endpoint.
2. Adds `--add-host=host.docker.internal:host-gateway` to make that
rewrite resolve.
3. Preserves the model name verbatim (no provider-prefix strip — vLLM
controls naming via `--served-model-name`).
4. Injects a placeholder auth token if `--base-url` is set without one
(claude-code refuses to start without a credential env var).
5. Writes `~/.claude/settings.json` inside the container with
`CLAUDE_CODE_ATTRIBUTION_HEADER=0` — that header otherwise busts the
KV cache on vLLM/llama.cpp (~90% slowdown).
5. Looks up the model name (case-insensitive substring) in
KV cache on vLLM/llama.cpp backends (~90% slowdown).
6. Looks up the model name (case-insensitive substring) in
`_MODEL_PROFILES`. For `qwen`, applies:
- `max_output_tokens=4096`
- `file_read_max_tokens=4000`
- `mcp_max_output_tokens=2000`
- `disallowed_tools=SMALL_CONTEXT_DISALLOWED_TOOLS`

Profile values fill defaults; explicit `config` keys override.
A model name without a registry match (e.g. `gpt-5.5`) still gets
routing + attribution-header fix but keeps Claude Code's stock tool
surface and budgets.
Real Anthropic runs (i.e. no `--base-url`) are unaffected by any of this.

## Adding another small-context model

Expand All @@ -131,9 +105,8 @@ _MODEL_PROFILES = {
}
```

The key is matched as a case-insensitive substring against the model
name passed via `-m`. Cut a release after merging so PyPI users pick it
up.
The key matches as a case-insensitive substring against `-m`. Cut a
release after merging.

## Inspecting a run

Expand All @@ -144,7 +117,7 @@ logs/<run-name>/coop/<repo>/<task>/<features>/
├── agent{N}.patch # diff each agent produced (N = feature_id)
├── agent1_stream.jsonl # raw claude-code stream events
├── agent2_stream.jsonl
├── agent1_session.jsonl # claude-code session JSONL (tool calls, messages)
├── agent1_session.jsonl # session JSONL (tool calls, messages)
├── agent2_session.jsonl
├── agent1_sent.jsonl # per-agent coop messaging log
├── agent2_sent.jsonl
Expand All @@ -154,16 +127,3 @@ logs/<run-name>/coop/<repo>/<task>/<features>/

The `*_session.jsonl` files are the most useful — one JSON line per
tool call, tool result, or assistant message.

## Local-dev shortcuts (optional)

For convenience when working out of a repo checkout there are two
helper files that bundle the proxy invocation:

- `scripts/qwen_proxy.yaml` — equivalent to the inline `litellm` flags
above
- `scripts/serve_qwen_proxy.sh` — `litellm --config <yaml> --port ...`
wrapper

Neither is required for PyPI users — they're just easier to edit than
a long CLI invocation when you're iterating on the proxy config.
2 changes: 1 addition & 1 deletion src/cooperbench/__about__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
"""Version information for CooperBench."""

__version__ = "0.0.17"
__version__ = "0.0.18"
Loading
Loading