3 changes: 1 addition & 2 deletions .claude/skills/docker-build/SKILL.md
@@ -1,7 +1,6 @@
---
name: docker-build
description: Build an LMDeploy Docker image and push it to the inner registry.
disable-model-invocation: true
---

# Docker Build & Push
@@ -26,7 +25,7 @@ If any are missing, stop and tell the user to set them before proceeding.
BRANCH=$(git branch --show-current | sed 's/[^a-zA-Z0-9._-]/-/g')
SHA=$(git rev-parse --short=7 HEAD)
TAG="${BRANCH}-${SHA}"
IMAGE="${LMDEPLOY_REGISTRY}/lmdeploy:${TAG}"
IMAGE="${LMDEPLOY_REGISTRY}/ailab-puyu-puyu_gpu/lmdeploy-dev:lmdeploy-${TAG}"
```

Print the computed image name so the user can confirm.
204 changes: 204 additions & 0 deletions .claude/skills/submit-llm-eval/SKILL.md
@@ -0,0 +1,204 @@
---
name: submit-eval
description: Use when submitting a model eval task to the auto-eval platform
disable-model-invocation: true
---

# Submit Eval Task

Submit a model evaluation task to the auto-eval platform.

## Prerequisites

Read `~/.eval/config` and verify these required keys are present:

```
AUTO_EVAL_TOKEN
FEISHU_EVAL_WEBHOOK
USER
OPENAI_API_BASE
AUTO_EVAL_API_URL
```

If any are missing, stop and tell the user to populate `~/.eval/config`.

Also verify `~/.eval/model.yaml` exists. If missing, stop and tell the user to create it.

## 1. Gather inputs

Read `~/.eval/model.yaml` and present the list of available model keys to the user. Ask them to select one or more models. The model key (e.g. `Qwen3.5-35B-A3B`) is the `model_abbr`. User input is matched case-insensitively — `qwen3.5-35b-a3b` matches `Qwen3.5-35B-A3B`. The original casing from the YAML key is used in the payload.

Then ask for:

- **backend** — `pytorch` or `turbomind`. This determines the Dockerfile used for building and is passed as `--backend` in `infer_extra_params`.
- **instances** — number of inference instances (integer, used to compute `end_num`)
- **datasets** — comma-separated dataset keys (looked up in `~/.eval/config`)
- **image** (optional) — Docker image for the eval container

If the user selected multiple models, repeat steps 3-10 for each model.

## 2. Look up model config

For each selected model, read its entry from `~/.eval/model.yaml`. Extract `model_path` and all other fields.

All fields except `model_path` are passed as CLI flags to `infer_extra_params`, mapping each key to `--{key} {value}`. The `--backend` flag comes from the user's backend input (step 1), not from model.yaml. For example:

```yaml
tp: 2
reasoning_parser: qwen-qwq
tool_call_parser: qwen
```

with `backend=turbomind` produces:

```
--tp 2 --backend turbomind --reasoning-parser qwen-qwq --tool-call-parser qwen
```

Keys with underscores are converted to hyphens for the CLI flag name.
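The mapping rule can be sketched as follows (a hypothetical helper; assumes the model.yaml entry is loaded as a dict and that `tp` appears first, matching the example above where `--backend` follows `--tp`):

```python
def build_infer_extra_params(model_cfg: dict, backend: str) -> str:
    """Map model.yaml fields (minus model_path) to CLI flags.

    Keys keep their YAML order; underscores become hyphens; --backend
    comes from the user's input, not from model.yaml.
    """
    flags = []
    for key, value in model_cfg.items():
        if key == "model_path":
            continue
        flags.append(f"--{key.replace('_', '-')} {value}")
        if key == "tp":  # --backend follows --tp, per the example above
            flags.append(f"--backend {backend}")
    return " ".join(flags)
```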

## 3. Resolve image

If the user provided an `image`, use it. Otherwise, invoke the `docker-build` skill to build and push an image from the current branch. Use the `image` variable it produces.

- If backend is `turbomind`, tell `docker-build` to use the full build mode (`docker/Dockerfile`).
- If backend is `pytorch`, tell `docker-build` to use the patch build mode (`docker/Dockerfile_patch`).

## 4. Resolve datasets

Parse the comma-separated `datasets` input. For each key, look up its value in `~/.eval/config`. If a key is not found, stop and list available dataset keys.

Combine the values into `subdataset`: `[*val1, *val2, ...]`
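The lookup-and-combine step can be sketched like this (hypothetical helper; assumes `~/.eval/config` has been parsed into a dict):

```python
def resolve_subdataset(datasets_input: str, config: dict) -> str:
    """Resolve comma-separated dataset keys against the config mapping
    and combine the values into the subdataset string."""
    keys = [k.strip() for k in datasets_input.split(",")]
    missing = [k for k in keys if k not in config]
    if missing:
        # Stop and list available dataset keys, per the rule above.
        raise KeyError(f"unknown dataset keys {missing}; available: {sorted(config)}")
    return "[" + ", ".join(config[k] for k in keys) + "]"
```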

## 5. Compute derived fields

Compute the timestamp-padded model name:

```bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
MODEL_ABBR_PADDED="${model_abbr}-${TIMESTAMP}"
```

Compute `infer_extra_params` from all model.yaml fields except `model_path` (see step 2 for the mapping rule).

Compute resource fields:

- `gpu_num` = `tp`
- `cpu` = `16 * tp`
- `memory` = `128000 * tp`, serialized as a string (e.g. `tp=2` → `"256000"`)
- `end_num` = `instances + 1`
- `tokenizer_path` = `model_path`
- `output_dir` = `"./{USER}/{MODEL_ABBR_PADDED}"`
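The resource arithmetic above can be sketched as (hypothetical helper, for illustration only):

```python
def derive_resources(tp: int, instances: int) -> dict:
    """Compute the resource fields of step 5 from tp and instances."""
    return {
        "gpu_num": tp,
        "cpu": 16 * tp,
        "memory": str(128000 * tp),  # serialized as a string in the payload
        "end_num": instances + 1,
    }
```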

## 6. Assemble model_infer_config

Build as a Python dict string with these fields:

```python
{
'type': 'OpenAISDKStreaming',
'key': 'sk-admin',
'openai_api_base': ['{OPENAI_API_BASE}'],
'query_per_second': 8,
'batch_size': 32,
'max_workers': 8,
'temperature': 1,
'tokenizer_path': '{model_path}',
'retry': 50,
'max_out_len': 128000,
'max_seq_len': 128000,
'extra_body': {
'top_k': 20,
'repetition_penalty': 1.0,
'top_p': 0.95,
},
'verbose': True,
}
```

If the model has a `reasoning_parser` field, add to `extra_body`:

```python
'chat_template_kwargs': {'enable_thinking': True},
```

And add at the top level:

```python
'pred_postprocessor': {'type': 'extract-non-reasoning-content'},
```

## 7. Compute model_infer_config_base64

```bash
echo -n '{model_infer_config}' | base64 -w 0
```
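Note that the dict string from step 6 itself contains single quotes, which makes embedding it in a single-quoted shell argument fragile. A Python equivalent sidesteps the quoting entirely and matches `echo -n` (no trailing newline):

```python
import base64


def encode_model_infer_config(config_str: str) -> str:
    """Base64-encode the model_infer_config string without a trailing
    newline, equivalent to `echo -n ... | base64 -w 0`."""
    return base64.b64encode(config_str.encode("utf-8")).decode("ascii")
```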

## 8. Assemble infer_backend_config

```json
{
"end_num": {instances + 1},
"gpu_num": {tp},
"memory": "{128000 * tp}",
"cpu": {16 * tp},
"parallelism": "TP",
"oc_cpu": "1",
"oc_mem": 4000,
"model": "{MODEL_ABBR_PADDED}",
"model_path": "{model_path}",
"image": "{image}",
"infer_engine": "lmdeploy",
"infer_extra_params": "{infer_extra_params}",
"delete": "false",
"start_infer": "true",
"node_num": 1
}
```

## 9. Assemble full payload

Build the JSON body:

```json
{
"job_name": "api_eval_v4",
"param": {
"cluster": "yidian",
"workspace_id": "evalservice_gpu",
"model_abbr": "{MODEL_ABBR_PADDED}",
"user": "{USER}",
"model_infer_config": "{model_infer_config as string}",
"llm_judger_config": "",
"infer_worker_nums": 8,
"eval_nums": "15",
"eval_type": "chat_objective",
"auto_eval_version": "ld_0122_oc_0524d49_v2",
"ocp_version": "fullbench_v2_0",
"subdataset": "{subdataset}",
"fast_infer": "true",
"output_dir": "{output_dir}",
"eval_only": "false",
"cli_extra": "",
"dataset_max_out_len": "128000",
"feishu_token": "{FEISHU_EVAL_WEBHOOK}",
"model_infer_config_base64": "{model_infer_config_base64}",
"infer_backend_config": {infer_backend_config}
}
}
```

## 10. Submit

Execute the curl command:

```bash
curl -s -X POST "${AUTO_EVAL_API_URL}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${AUTO_EVAL_TOKEN}" \
  -w '\nHTTP %{http_code}\n' \
  -d @- <<'JSON'
{payload}
JSON
```

Report the HTTP status and response body to the user.
6 changes: 2 additions & 4 deletions .gitignore
@@ -4,13 +4,10 @@ __pycache__/
*$py.class
.vscode/
.idea/
.cursor/
# C extensions
*.so

# skills
.cursor/
!.claude/skills/docker-build/
!.claude/skills/docker-build/SKILL.md

# Distribution / packaging
.Python
@@ -51,6 +48,7 @@ htmlcov/
.cache
*build*/
!builder/
!.claude/skills/docker-build/
lmdeploy/lib/
lmdeploy/bin/
dist/
149 changes: 149 additions & 0 deletions docs/superpowers/specs/2026-04-29-submit-eval-skill-design.md
@@ -0,0 +1,149 @@
# submit-eval Skill Design

## Overview

A Claude Code skill that submits a model evaluation task to the auto-eval platform. The API URL is read from `~/.eval/config`. It reads model and dataset config from user-maintained files, computes derived fields, optionally builds a Docker image via the `docker-build` skill, and submits the request via curl.

## User Inputs

| Input | Required | Example | Notes |
| ------------ | -------- | --------------------------------------- | --------------------------------------------------------------------------------- |
| `model_abbr` | Yes | `Qwen3.5-35B` | Key in `~/.eval/models/model.yaml`; padded with `-yyyymmdd-hhmmss` at submit time |
| `instances` | Yes | `3` | Number of inference instances; used to compute `end_num` |
| `datasets` | Yes | `mmlu_pro, ifeval, aime2026` | Comma-separated keys (looked up in `~/.eval/config`) |
| `image` | No | `<your-registry>/lmdeploy:main-abc1234` | Docker image. If omitted, triggers `docker-build` skill |

## Config Files

### `~/.eval/config` (KEY=VALUE)

```
AUTO_EVAL_TOKEN=<your-token>
FEISHU_EVAL_WEBHOOK=<your-feishu-webhook-url>
USER=lvhan
OPENAI_API_BASE=<your-api-base-url>
AUTO_EVAL_API_URL=<your-auto-eval-api-url>
mmlu_pro=*mmlu_pro_datasets
ifeval=*ifeval_datasets
aime2026=*aime2026_datasets
```
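Reading this KEY=VALUE format can be sketched with a minimal parser (hypothetical; assumes no multi-line values and treats everything after the first `=` as the value):

```python
def parse_eval_config(text: str) -> dict:
    """Parse the ~/.eval/config KEY=VALUE format into a dict,
    skipping blank lines and lines without an '='."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config
```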

### `~/.eval/models/model.yaml` (YAML)

```yaml
Qwen3.5-35B:
model_path: /mnt/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B/snapshots/b1fc3d59ae0ab1e4279e04a8dd0fc4dc361fc2b6
tp: 2
backend: turbomind
reasoning_parser: qwen-qwq
# tool_call_parser: ... (optional)

Qwen3-32B:
model_path: /mnt/shared-storage-gpfs2/.../snapshots/...
tp: 2
backend: turbomind
reasoning_parser: qwen-qwq
```

## Hardcoded Defaults

| Field | Default |
| ------------------- | ----------------------- |
| `job_name` | `api_eval_v4` |
| `cluster` | `yidian` |
| `workspace_id` | `evalservice_gpu` |
| `eval_type` | `chat_objective` |
| `auto_eval_version` | `ld_0122_oc_0524d49_v2` |
| `ocp_version` | `fullbench_v2_0` |
| `fast_infer` | `true` |
| `eval_only` | `false` |
| `parallelism` | `TP` |
| `infer_engine` | `lmdeploy` |
| `delete` | `false` |
| `start_infer` | `true` |
| `node_num` | `1` |
| `oc_cpu` | `1` |
| `oc_mem` | `4000` |
| `infer_worker_nums` | `8` |
| `eval_nums` | `15` |
| `llm_judger_config` | `""` |
| `cli_extra` | `""` |

## Derived Fields

| Field | Derivation |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `model_abbr` (padded) | `{model_abbr}-{yyyymmdd-hhmmss}` using current timestamp at submit time |
| `subdataset` | From `datasets` input: look up each key in `~/.eval/config`, combine values as `[*val1, *val2, ...]` |
| `infer_extra_params` | `--tp {tp} --backend {backend} --reasoning-parser {reasoning_parser}` + optional `--tool-call-parser {tool_call_parser}` |
| `gpu_num` | = `tp` |
| `cpu` | = `16 * tp` |
| `memory` | = `"128000 * tp"` (as string, e.g. tp=2 → `"256000"`) |
| `end_num` | = `instances + 1` |
| `tokenizer_path` | = `model_path` |
| `output_dir` | `./{user}/{model_abbr_padded}` |
| `model_infer_config` | Assembled dict string (see below) |
| `model_infer_config_base64` | `echo -n '{model_infer_config}' \| base64` |
| `infer_backend_config` | Assembled dict (see below) |

### model_infer_config structure

Assembled as a Python dict string with:

- `type`: `OpenAISDKStreaming`
- `key`: `sk-admin`
- `openai_api_base`: `['{OPENAI_API_BASE}']` (from `~/.eval/config`)
- `query_per_second`: `8`
- `batch_size`: `32`
- `max_workers`: `8`
- `temperature`: `1`
- `tokenizer_path`: `{model_path}`
- `retry`: `50`
- `max_out_len`: `128000`
- `max_seq_len`: `128000`
- `extra_body`: `{top_k: 20, repetition_penalty: 1.0, top_p: 0.95}`
  - If `reasoning_parser` is set, also include `chat_template_kwargs: {enable_thinking: True}` in `extra_body`; otherwise omit it
- `pred_postprocessor`: `{type: 'extract-non-reasoning-content'}` (only if `reasoning_parser` is set)
- `verbose`: `True`

### infer_backend_config structure

Assembled as a dict with:

- `end_num`: `{instances + 1}`
- `gpu_num`: `{tp}`
- `memory`: `"{128000 * tp}"`
- `cpu`: `{16 * tp}`
- `parallelism`: `TP`
- `oc_cpu`: `1`
- `oc_mem`: `4000`
- `model`: `{model_abbr_padded}`
- `model_path`: `{model_path}`
- `image`: `{image}`
- `infer_engine`: `lmdeploy`
- `infer_extra_params`: `{infer_extra_params}`
- `delete`: `false`
- `start_infer`: `true`
- `node_num`: `1`

## Flow

1. **Read `~/.eval/config`** — verify `AUTO_EVAL_TOKEN`, `FEISHU_EVAL_WEBHOOK`, `USER`, `OPENAI_API_BASE`, `AUTO_EVAL_API_URL` are present. Stop if missing.
2. **Gather inputs** — ask user for `model_abbr`, `instances`, `datasets`, and optionally `image`.
3. **Look up model** — read `~/.eval/models/model.yaml`, find the entry for `model_abbr`. Stop if not found.
4. **Resolve image** — if user provided `image`, use it; else invoke the `docker-build` skill to build and push an image from the current branch.
5. **Resolve datasets** — parse comma-separated `datasets` input, look up each key in `~/.eval/config`, combine into `subdataset` value.
6. **Compute derived fields** — pad `model_abbr` with timestamp, assemble `infer_extra_params`, `gpu_num`, `cpu`, `memory`, `end_num`, `model_infer_config`, `model_infer_config_base64`, `infer_backend_config`, `output_dir`.
7. **Assemble payload** — build the full JSON body with defaults + user inputs + computed fields.
8. **Submit** — execute `curl -X POST` to `AUTO_EVAL_API_URL` with the payload and `AUTO_EVAL_TOKEN` as Bearer token. Report the response.

## Skill File Location

`/workspace/lmdeploy/.claude/skills/submit-eval/SKILL.md`

## Error Handling

- Missing `~/.eval/config` or incomplete keys → stop and tell user to create/populate it
- Model not found in `~/.eval/models/model.yaml` → stop and list available models
- Dataset key not found in `~/.eval/config` → stop and list available dataset keys
- curl failure → report the HTTP status and response body