cooperbench · akhatua2 · May 26, 2026 · May 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.0.18] - 2026-05-25
+
+### Removed
+
+- **`cooperbench._proxy` module and the `--openai-base-url` / `--openai-model` CLI flags are gone.** Both existed because `claude_code` (which wraps the Anthropic CLI, and that CLI only speaks `/v1/messages`) was assumed to need a LiteLLM translation layer to reach an OpenAI-compatible vLLM. That assumption was wrong: vLLM v0.17.1+ implements the Anthropic Messages API natively at the same `/v1/messages` path, so claude-code can be pointed straight at the vLLM endpoint with `--base-url`. Removing the auto-spawned LiteLLM also removes a class of bugs we kept hitting from LiteLLM version drift: (1) the `/v1/responses` auto-rewrite on `litellm>=1.82` when the inbound request has `thinking={"type":"enabled"}`, which claude-code 2.1.x sends by default; (2) `litellm_params.stream: false` being ignored by some provider prefixes; (3) intermittent `API Error: Content block not found` from vLLM's streaming `tool_call` extractor desynchronizing block_start / block_delta events.
+
+### Changed
+
+- **`--base-url` now points straight at a vLLM-served model.** Existing `--base-url` / `--auth-token` flags are kept and are the only knobs you need. `ANTHROPIC_BASE_URL` is forwarded into the task container; the adapter rewrites `localhost` / `127.0.0.1` → `host.docker.internal`, adds the matching `--add-host` to the container, injects a placeholder auth token if you didn't supply one, and writes `~/.claude/settings.json` with `CLAUDE_CODE_ATTRIBUTION_HEADER=0` (KV-cache perf fix on vLLM/llama.cpp). Real Anthropic runs (no `--base-url`) are unaffected.
+- **`docs/QWEN_LOCAL.md`** rewritten to show the single-command direct flow:
+  ```
+  cooperbench run --base-url https://your-vllm-host.example.com -m Qwen/Qwen3.5-9B \
+    -a claude_code --setting coop -s lite -c 2 --no-auto-eval
+  ```
+  No LiteLLM, no proxy subprocess, no extras.
+
+### Verified
+
+- Direct curl against `https://cooperbench--qwen35-9b-128k-serve.modal.run/v1/messages`: tool conversation returns proper Anthropic `tool_use` blocks with parsed `input`, `stop_reason: "tool_use"`; streaming returns proper `content_block_start` → `content_block_delta` → `content_block_stop` ordering with no missed start events.
+- End-to-end coop run with the new flow on the same `anyhow_task` pair that was failing in `0.0.17` against the older `--openai-base-url` proxy path: agents iterate over multiple tool rounds against vLLM directly. (Adapter-level behavior unchanged from `0.0.17`; only the routing layer simplified.)
+
 ## [0.0.17] - 2026-05-25
 
 ### Fixed

diff --git a/docs/QWEN_LOCAL.md b/docs/QWEN_LOCAL.md
@@ -1,118 +1,92 @@
-# Running CooperBench against a self-hosted (Qwen / Llama / etc.) endpoint
+# Running CooperBench against a self-hosted Qwen (or any vLLM endpoint)
 
-CooperBench's `claude_code` adapter drives the official `claude-code`
-CLI, which only speaks Anthropic's `/v1/messages` API. To run it
-against any other model you put a translation proxy in between:
+CooperBench's `claude_code` adapter wraps the official `claude-code` CLI,
+which speaks Anthropic's `/v1/messages` API. vLLM v0.17.1+ implements
+that same API natively — so claude-code can talk to a vLLM server
+**directly, with no translation proxy in between**.
 
 ```
-claude-code (Anthropic format)
-       │
-       ▼
-   LiteLLM proxy   ←  you run this; it translates Anthropic ↔ OpenAI
-       │
-       ▼
-your OpenAI-compatible inference server (vLLM, llama.cpp, ...)
+claude-code (Anthropic /v1/messages) ───► vLLM /v1/messages
 ```
 
-This document covers the canonical reproducible setup using only the
-PyPI distribution — no repo checkout required.
-
 ## Prerequisites
 
 - Docker (CooperBench runs each task in a container)
 - Redis on `localhost:6379` for coop messaging:
   ```
   docker run -d --name cb-redis -p 6379:6379 redis:7-alpine
   ```
-- An OpenAI-compatible endpoint URL serving your model
-- Python ≥ 3.12
+- A vLLM (v0.17.1+) endpoint serving your model with tool-calling
+  enabled. Reference serve flags (Qwen3.5-9B at 128k):
+  ```
+  vllm serve Qwen/Qwen3.5-9B \
+    --max-model-len 131072 \
+    --enable-auto-tool-choice \
+    --tool-call-parser qwen3_coder
+  ```
 
 ## Install
 
 ```bash
-pip install cooperbench           # adapter + CLI
-pip install 'litellm[proxy]'      # translation proxy (used internally)
+pip install cooperbench
 ```
 
-## Canonical single-command run (Qwen3.5-9B on Modal as the example)
+That's it. No `litellm[proxy]`, no extras.
+
+## Run
 
 ```bash
 cooperbench run \
-  --openai-base-url https://cooperbench--qwen35-9b-128k-serve.modal.run/v1 \
-  --openai-model Qwen/Qwen3.5-9B \
+  --base-url https://your-vllm-host.example.com \
+  --auth-token dummy \
   -m Qwen/Qwen3.5-9B \
   -a claude_code \
   --setting coop \
-  -s lite \
-  -r dspy_task -t 8394 -f 3,4 \
-  -c 2 \
+  -s lite -c 2 \
   --no-auto-eval
 ```
 
-Logs land in `./logs/<run-name>/coop/<repo>/<task>/<features>/`.
-
-### What that does under the hood
-
-- Picks a free local port.
-- Spawns `litellm --model openai/Qwen/Qwen3.5-9B --api_base <openai-base-url> ...`
-  bound to that port, with `OPENAI_API_KEY=dummy` in the child env.
-- Polls `/health/liveliness` until the proxy is up.
-- Sets `ANTHROPIC_BASE_URL=http://localhost:<port>` and a placeholder
-  `ANTHROPIC_AUTH_TOKEN` for the duration of the run.
-- Tears down the proxy subprocess when the run exits (also on Ctrl-C).
-
-### Why those flags
-
-- `--openai-base-url` — the OpenAI-compatible endpoint (vLLM, llama.cpp, ...).
-- `--openai-model` — the model name sent to that endpoint. Defaults to
-  the value of `-m` if omitted.
-- `-m Qwen/Qwen3.5-9B` — model name sent to claude-code (must contain
-  `qwen` so the adapter's model registry picks the small-context
-  profile).
+That's the whole flow. claude-code (inside the task container) issues
+`POST /v1/messages` to your vLLM, and vLLM responds in Anthropic format
+directly.
+
+### What each flag does
+
+- `--base-url` — vLLM endpoint. Bare host or host+`/v1`; claude-code
+  appends `/v1/messages` itself. Auto-rewritten to
+  `host.docker.internal` for container reachability when it's a local URL.
+- `--auth-token` — placeholder for vLLM (no auth needed); claude-code
+  requires *some* credential env var to start.
+- `-m Qwen/Qwen3.5-9B` — model name sent to vLLM. Must match
+  vLLM's `--served-model-name`. The substring `qwen` (case-insensitive)
+  is also how the adapter's `_MODEL_PROFILES` picks the small-context
+  profile (tighter Read/MCP budgets + stripped tool surface).
 - `-a claude_code` — selects the Claude Code adapter.
 
-## Manual-proxy escape hatch
-
-If you already have an Anthropic-format proxy running (or want to share
-one across multiple `cooperbench run` invocations), use `--base-url` /
-`--auth-token` instead of `--openai-base-url`:
-
-```bash
-# Start your own proxy somewhere
-litellm --model openai/Qwen/Qwen3.5-9B \
-  --api_base https://cooperbench--qwen35-9b-128k-serve.modal.run/v1 \
-  --port 4000 ...
-
-# Point cooperbench at it (no auto-spawn)
-cooperbench run --base-url http://localhost:4000 --auth-token any \
-  -m Qwen/Qwen3.5-9B ...
-```
-
-`--openai-base-url` and `--base-url` are mutually exclusive.
-
-## How the adapter behaves with a custom endpoint
+## What the adapter does for you
 
 When `--base-url` is set, `src/cooperbench/agents/claude_code/adapter.py`:
 
 1. Forwards `ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` into the task
-   container (rewriting `localhost` → `host.docker.internal`).
-2. Adds `--add-host=host.docker.internal:host-gateway` so the container
-   can reach the host proxy.
-3. Preserves the model name verbatim (the proxy controls naming).
-4. Writes `~/.claude/settings.json` with
+   container, rewriting `localhost` / `127.0.0.1` →
+   `host.docker.internal` so the container can reach a host-side endpoint.
+2. Adds `--add-host=host.docker.internal:host-gateway` to make that
+   rewrite resolve.
+3. Preserves the model name verbatim (no provider-prefix strip — vLLM
+   controls naming via `--served-model-name`).
+4. Injects a placeholder auth token if `--base-url` is set without one
+   (claude-code refuses to start without a credential env var).
+5. Writes `~/.claude/settings.json` inside the container with
    `CLAUDE_CODE_ATTRIBUTION_HEADER=0` — that header otherwise busts the
-   KV cache on vLLM/llama.cpp (~90% slowdown).
-5. Looks up the model name (case-insensitive substring) in
+   KV cache on vLLM/llama.cpp backends (~90% slowdown).
+6. Looks up the model name (case-insensitive substring) in
    `_MODEL_PROFILES`. For `qwen`, applies:
    - `max_output_tokens=4096`
    - `file_read_max_tokens=4000`
    - `mcp_max_output_tokens=2000`
    - `disallowed_tools=SMALL_CONTEXT_DISALLOWED_TOOLS`
 
-Profile values fill defaults; explicit `config` keys override.
-A model name without a registry match (e.g. `gpt-5.5`) still gets
-routing + attribution-header fix but keeps Claude Code's stock tool
-surface and budgets.
+Real Anthropic runs (i.e. no `--base-url`) are unaffected by any of this.
 
 ## Adding another small-context model
 
@@ -131,9 +105,8 @@ _MODEL_PROFILES = {
 }
 ```
 
-The key is matched as a case-insensitive substring against the model
-name passed via `-m`. Cut a release after merging so PyPI users pick it
-up.
+The key matches as a case-insensitive substring against `-m`. Cut a
+release after merging.
 
 ## Inspecting a run
 
@@ -144,7 +117,7 @@ logs/<run-name>/coop/<repo>/<task>/<features>/
 ├── agent{N}.patch            # diff each agent produced (N = feature_id)
 ├── agent1_stream.jsonl       # raw claude-code stream events
 ├── agent2_stream.jsonl
-├── agent1_session.jsonl      # claude-code session JSONL (tool calls, messages)
+├── agent1_session.jsonl      # session JSONL (tool calls, messages)
 ├── agent2_session.jsonl
 ├── agent1_sent.jsonl         # per-agent coop messaging log
 ├── agent2_sent.jsonl
@@ -154,16 +127,3 @@ logs/<run-name>/coop/<repo>/<task>/<features>/
 
 The `*_session.jsonl` files are the most useful — one JSON line per
 tool call, tool result, or assistant message.
-
-## Local-dev shortcuts (optional)
-
-For convenience when working out of a repo checkout there are two
-helper files that bundle the proxy invocation:
-
-- `scripts/qwen_proxy.yaml` — equivalent to the inline `litellm` flags
-  above
-- `scripts/serve_qwen_proxy.sh` — `litellm --config <yaml> --port ...`
-  wrapper
-
-Neither is required for PyPI users — they're just easier to edit than
-a long CLI invocation when you're iterating on the proxy config.
diff --git a/src/cooperbench/__about__.py b/src/cooperbench/__about__.py
@@ -1,3 +1,3 @@
 """Version information for CooperBench."""
 
-__version__ = "0.0.17"
+__version__ = "0.0.18"
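
Review note on the `0.0.18` "Verified" changelog entry: the streaming check described there (every `content_block_delta` and `content_block_stop` preceded by a matching `content_block_start`) could be automated with something like the sketch below. The function name `check_block_ordering` is hypothetical and not part of CooperBench. It assumes the stream has already been parsed into dicts carrying the `type` and `index` fields that Anthropic's streaming events use.

```python
from typing import Iterable


def check_block_ordering(events: Iterable[dict]) -> list[str]:
    """Return ordering violations; an empty list means the stream is well-formed."""
    open_blocks: set[int] = set()
    problems: list[str] = []
    for event in events:
        kind = event.get("type")
        index = event.get("index")
        if kind == "content_block_start":
            if index in open_blocks:
                problems.append(f"block {index} started twice")
            open_blocks.add(index)
        elif kind == "content_block_delta":
            if index not in open_blocks:
                problems.append(f"delta for block {index} without a start")
        elif kind == "content_block_stop":
            if index not in open_blocks:
                problems.append(f"stop for block {index} without a start")
            open_blocks.discard(index)
        # Other event types (message_start, ping, ...) do not affect block state.
    for index in sorted(open_blocks):
        problems.append(f"block {index} never stopped")
    return problems


if __name__ == "__main__":
    well_formed = [
        {"type": "message_start"},
        {"type": "content_block_start", "index": 0},
        {"type": "content_block_delta", "index": 0},
        {"type": "content_block_stop", "index": 0},
        {"type": "message_stop"},
    ]
    print(check_block_ordering(well_formed))
```

Feeding it the parsed events from a captured `agent{N}_stream.jsonl` or a raw SSE response would flag the "missed start event" pattern that the changelog associates with the `Content block not found` error, assuming those captures expose the same event fields.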