3 changes: 1 addition & 2 deletions .claude/skills/docker-build/SKILL.md
@@ -1,7 +1,6 @@
---
name: docker-build
description: Build an LMDeploy Docker image and push it to the inner registry.
disable-model-invocation: true
---

# Docker Build & Push
@@ -26,7 +25,7 @@ If any are missing, stop and tell the user to set them before proceeding.
BRANCH=$(git branch --show-current | sed 's/[^a-zA-Z0-9._-]/-/g')
SHA=$(git rev-parse --short=7 HEAD)
TAG="${BRANCH}-${SHA}"
IMAGE="${LMDEPLOY_REGISTRY}/lmdeploy:${TAG}"
IMAGE="${LMDEPLOY_REGISTRY}/ailab-puyu-puyu_gpu/lmdeploy-dev:lmdeploy-${TAG}"
```

Print the computed image name so the user can confirm.
204 changes: 204 additions & 0 deletions .claude/skills/submit-llm-eval/SKILL.md
@@ -0,0 +1,204 @@
---
name: submit-eval
description: Use when submitting a model eval task to the auto-eval platform
disable-model-invocation: true
---

# Submit Eval Task

Submit a model evaluation task to the auto-eval platform.

## Prerequisites

Read `~/.eval/config` and verify these required keys are present:

```
AUTO_EVAL_TOKEN
FEISHU_EVAL_WEBHOOK
USER
OPENAI_API_BASE
AUTO_EVAL_API_URL
```

If any are missing, stop and tell the user to populate `~/.eval/config`.

Also verify `~/.eval/model.yaml` exists. If missing, stop and tell the user to create it.

## 1. Gather inputs

Read `~/.eval/model.yaml` and present the list of available model keys to the user. Ask them to select one or more models. The model key (e.g. `Qwen3.5-35B-A3B`) is the `model_abbr`. User input is matched case-insensitively — `qwen3.5-35b-a3b` matches `Qwen3.5-35B-A3B`. The original casing from the YAML key is used in the payload.

Then ask for:

- **backend** — `pytorch` or `turbomind`. This determines the Dockerfile used for building and is passed as `--backend` in `infer_extra_params`.
- **instances** — number of inference instances (integer, used to compute `end_num`)
- **datasets** — comma-separated dataset keys (looked up in `~/.eval/config`)
- **image** (optional) — Docker image for the eval container

If the user selected multiple models, repeat steps 3-10 for each model.

## 2. Look up model config

For each selected model, read its entry from `~/.eval/model.yaml`. Extract `model_path` and all other fields.

All fields except `model_path` are passed as CLI flags to `infer_extra_params`, mapping each key to `--{key} {value}`. The `--backend` flag comes from the user's backend input (step 1), not from model.yaml. For example:

```yaml
tp: 2
reasoning_parser: qwen-qwq
tool_call_parser: qwen
```

with `backend=turbomind` produces:

```
--tp 2 --backend turbomind --reasoning-parser qwen-qwq --tool-call-parser qwen
```

Keys with underscores are converted to hyphens for the CLI flag name.
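The mapping rule can be sketched as follows (a hypothetical helper; assumes the model.yaml entry is loaded as a dict and that `tp` appears first, matching the example above where `--backend` follows `--tp`):

```python
def build_infer_extra_params(model_cfg: dict, backend: str) -> str:
    """Map model.yaml fields (minus model_path) to CLI flags.

    Keys keep their YAML order; underscores become hyphens; --backend
    comes from the user's input, not from model.yaml.
    """
    flags = []
    for key, value in model_cfg.items():
        if key == "model_path":
            continue
        flags.append(f"--{key.replace('_', '-')} {value}")
        if key == "tp":  # --backend follows --tp, per the example above
            flags.append(f"--backend {backend}")
    return " ".join(flags)
```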

## 3. Resolve image

If the user provided an `image`, use it. Otherwise, invoke the `docker-build` skill to build and push an image from the current branch. Use the `image` variable it produces.

- If backend is `turbomind`, tell `docker-build` to use the full build mode (`docker/Dockerfile`).
- If backend is `pytorch`, tell `docker-build` to use the patch build mode (`docker/Dockerfile_patch`).

## 4. Resolve datasets

Parse the comma-separated `datasets` input. For each key, look up its value in `~/.eval/config`. If a key is not found, stop and list available dataset keys.

Combine the values into `subdataset`: `[*val1, *val2, ...]`
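The lookup-and-combine step can be sketched like this (hypothetical helper; assumes `~/.eval/config` has been parsed into a dict):

```python
def resolve_subdataset(datasets_input: str, config: dict) -> str:
    """Resolve comma-separated dataset keys against the config mapping
    and combine the values into the subdataset string."""
    keys = [k.strip() for k in datasets_input.split(",")]
    missing = [k for k in keys if k not in config]
    if missing:
        # Stop and list available dataset keys, per the rule above.
        raise KeyError(f"unknown dataset keys {missing}; available: {sorted(config)}")
    return "[" + ", ".join(config[k] for k in keys) + "]"
```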

## 5. Compute derived fields

Compute the timestamp-padded model name:

```bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
MODEL_ABBR_PADDED="${model_abbr}-${TIMESTAMP}"
```

Compute `infer_extra_params` from all model.yaml fields except `model_path` (see step 2 for the mapping rule).

Compute resource fields:

- `gpu_num` = `tp`
- `cpu` = `16 * tp`
- `memory` = `128000 * tp`, serialized as a string (e.g. `tp=2` → `"256000"`)
- `end_num` = `instances + 1`
- `tokenizer_path` = `model_path`
- `output_dir` = `"./{USER}/{MODEL_ABBR_PADDED}"`
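The resource arithmetic above can be sketched as (hypothetical helper, for illustration only):

```python
def derive_resources(tp: int, instances: int) -> dict:
    """Compute the resource fields of step 5 from tp and instances."""
    return {
        "gpu_num": tp,
        "cpu": 16 * tp,
        "memory": str(128000 * tp),  # serialized as a string in the payload
        "end_num": instances + 1,
    }
```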

## 6. Assemble model_infer_config

Build as a Python dict string with these fields:

```python
{
'type': 'OpenAISDKStreaming',
'key': 'sk-admin',
'openai_api_base': ['{OPENAI_API_BASE}'],
'query_per_second': 8,
'batch_size': 32,
'max_workers': 8,
'temperature': 1,
'tokenizer_path': '{model_path}',
'retry': 50,
'max_out_len': 128000,
'max_seq_len': 128000,
'extra_body': {
'top_k': 20,
'repetition_penalty': 1.0,
'top_p': 0.95,
},
'verbose': True,
}
```

If the model has a `reasoning_parser` field, add to `extra_body`:

```python
'chat_template_kwargs': {'enable_thinking': True},
```

And add at the top level:

```python
'pred_postprocessor': {'type': 'extract-non-reasoning-content'},
```

## 7. Compute model_infer_config_base64

```bash
echo -n '{model_infer_config}' | base64 -w 0
```
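Note that the dict string from step 6 itself contains single quotes, which makes embedding it in a single-quoted shell argument fragile. A Python equivalent sidesteps the quoting entirely and matches `echo -n` (no trailing newline):

```python
import base64


def encode_model_infer_config(config_str: str) -> str:
    """Base64-encode the model_infer_config string without a trailing
    newline, equivalent to `echo -n ... | base64 -w 0`."""
    return base64.b64encode(config_str.encode("utf-8")).decode("ascii")
```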

## 8. Assemble infer_backend_config

```json
{
"end_num": {instances + 1},
"gpu_num": {tp},
"memory": "{128000 * tp}",
"cpu": {16 * tp},
"parallelism": "TP",
"oc_cpu": "1",
"oc_mem": 4000,
"model": "{MODEL_ABBR_PADDED}",
"model_path": "{model_path}",
"image": "{image}",
"infer_engine": "lmdeploy",
"infer_extra_params": "{infer_extra_params}",
"delete": "false",
"start_infer": "true",
"node_num": 1
}
```

## 9. Assemble full payload

Build the JSON body:

```json
{
"job_name": "api_eval_v4",
"param": {
"cluster": "yidian",
"workspace_id": "evalservice_gpu",
"model_abbr": "{MODEL_ABBR_PADDED}",
"user": "{USER}",
"model_infer_config": "{model_infer_config as string}",
"llm_judger_config": "",
"infer_worker_nums": 8,
"eval_nums": "15",
"eval_type": "chat_objective",
"auto_eval_version": "ld_0122_oc_0524d49_v2",
"ocp_version": "fullbench_v2_0",
"subdataset": "{subdataset}",
"fast_infer": "true",
"output_dir": "{output_dir}",
"eval_only": "false",
"cli_extra": "",
"dataset_max_out_len": "128000",
"feishu_token": "{FEISHU_EVAL_WEBHOOK}",
"model_infer_config_base64": "{model_infer_config_base64}",
"infer_backend_config": {infer_backend_config}
}
}
```

## 10. Submit

Execute the curl command:

```bash
curl -s -X POST "${AUTO_EVAL_API_URL}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${AUTO_EVAL_TOKEN}" \
  -w '\nHTTP %{http_code}\n' \
  -d @- <<'JSON'
{payload}
JSON
```

Report the HTTP status and response body to the user.
6 changes: 2 additions & 4 deletions .gitignore
@@ -4,13 +4,10 @@ __pycache__/
*$py.class
.vscode/
.idea/
.cursor/
# C extensions
*.so

# skills
.cursor/
!.claude/skills/docker-build/
!.claude/skills/docker-build/SKILL.md

# Distribution / packaging
.Python
@@ -51,6 +48,7 @@ htmlcov/
.cache
*build*/
!builder/
!.claude/skills/docker-build/
lmdeploy/lib/
lmdeploy/bin/
dist/
149 changes: 149 additions & 0 deletions docs/superpowers/specs/2026-04-29-submit-eval-skill-design.md
@@ -0,0 +1,149 @@
# submit-eval Skill Design

## Overview

A Claude Code skill that submits a model evaluation task to the auto-eval platform. The API URL is read from `~/.eval/config`. It reads model and dataset config from user-maintained files, computes derived fields, optionally builds a Docker image via the `docker-build` skill, and submits the request via curl.

## User Inputs

| Input | Required | Example | Notes |
| ------------ | -------- | --------------------------------------- | --------------------------------------------------------------------------------- |
| `model_abbr` | Yes | `Qwen3.5-35B` | Key in `~/.eval/models/model.yaml`; padded with `-yyyymmdd-hhmmss` at submit time |
| `instances` | Yes | `3` | Number of inference instances; used to compute `end_num` |
| `datasets` | Yes | `mmlu_pro, ifeval, aime2026` | Comma-separated keys (looked up in `~/.eval/config`) |
| `image` | No | `<your-registry>/lmdeploy:main-abc1234` | Docker image. If omitted, triggers `docker-build` skill |

## Config Files

### `~/.eval/config` (KEY=VALUE)

```
AUTO_EVAL_TOKEN=<your-token>
FEISHU_EVAL_WEBHOOK=<your-feishu-webhook-url>
USER=lvhan
OPENAI_API_BASE=<your-api-base-url>
AUTO_EVAL_API_URL=<your-auto-eval-api-url>
mmlu_pro=*mmlu_pro_datasets
ifeval=*ifeval_datasets
aime2026=*aime2026_datasets
```
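Reading this KEY=VALUE format can be sketched with a minimal parser (hypothetical; assumes no multi-line values and treats everything after the first `=` as the value):

```python
def parse_eval_config(text: str) -> dict:
    """Parse the ~/.eval/config KEY=VALUE format into a dict,
    skipping blank lines and lines without an '='."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config
```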

### `~/.eval/models/model.yaml` (YAML)

```yaml
Qwen3.5-35B:
model_path: /mnt/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B/snapshots/b1fc3d59ae0ab1e4279e04a8dd0fc4dc361fc2b6
tp: 2
backend: turbomind
reasoning_parser: qwen-qwq
# tool_call_parser: ... (optional)

Qwen3-32B:
model_path: /mnt/shared-storage-gpfs2/.../snapshots/...
tp: 2
backend: turbomind
reasoning_parser: qwen-qwq
```

## Hardcoded Defaults

| Field | Default |
| ------------------- | ----------------------- |
| `job_name` | `api_eval_v4` |
| `cluster` | `yidian` |
| `workspace_id` | `evalservice_gpu` |
| `eval_type` | `chat_objective` |
| `auto_eval_version` | `ld_0122_oc_0524d49_v2` |
| `ocp_version` | `fullbench_v2_0` |
| `fast_infer` | `true` |
| `eval_only` | `false` |
| `parallelism` | `TP` |
| `infer_engine` | `lmdeploy` |
| `delete` | `false` |
| `start_infer` | `true` |
| `node_num` | `1` |
| `oc_cpu` | `1` |
| `oc_mem` | `4000` |
| `infer_worker_nums` | `8` |
| `eval_nums` | `15` |
| `llm_judger_config` | `""` |
| `cli_extra` | `""` |

## Derived Fields

| Field | Derivation |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `model_abbr` (padded) | `{model_abbr}-{yyyymmdd-hhmmss}` using current timestamp at submit time |
| `subdataset` | From `datasets` input: look up each key in `~/.eval/config`, combine values as `[*val1, *val2, ...]` |
| `infer_extra_params` | `--tp {tp} --backend {backend} --reasoning-parser {reasoning_parser}` + optional `--tool-call-parser {tool_call_parser}` |
| `gpu_num` | = `tp` |
| `cpu` | = `16 * tp` |
| `memory` | = `"128000 * tp"` (as string, e.g. tp=2 → `"256000"`) |
| `end_num` | = `instances + 1` |
| `tokenizer_path` | = `model_path` |
| `output_dir` | `./{user}/{model_abbr_padded}` |
| `model_infer_config` | Assembled dict string (see below) |
| `model_infer_config_base64` | `echo -n '{model_infer_config}' \| base64` |
| `infer_backend_config` | Assembled dict (see below) |

### model_infer_config structure

Assembled as a Python dict string with:

- `type`: `OpenAISDKStreaming`
- `key`: `sk-admin`
- `openai_api_base`: `['{OPENAI_API_BASE}']` (from `~/.eval/config`)
- `query_per_second`: `8`
- `batch_size`: `32`
- `max_workers`: `8`
- `temperature`: `1`
- `tokenizer_path`: `{model_path}`
- `retry`: `50`
- `max_out_len`: `128000`
- `max_seq_len`: `128000`
- `extra_body`: `{top_k: 20, repetition_penalty: 1.0, top_p: 0.95}`
  - If `reasoning_parser` is set, also include `chat_template_kwargs: {enable_thinking: True}` in `extra_body`; otherwise omit it
- `pred_postprocessor`: `{type: 'extract-non-reasoning-content'}` (only if `reasoning_parser` is set)
- `verbose`: `True`

### infer_backend_config structure

Assembled as a dict with:

- `end_num`: `{instances + 1}`
- `gpu_num`: `{tp}`
- `memory`: `"{128000 * tp}"`
- `cpu`: `{16 * tp}`
- `parallelism`: `TP`
- `oc_cpu`: `1`
- `oc_mem`: `4000`
- `model`: `{model_abbr_padded}`
- `model_path`: `{model_path}`
- `image`: `{image}`
- `infer_engine`: `lmdeploy`
- `infer_extra_params`: `{infer_extra_params}`
- `delete`: `false`
- `start_infer`: `true`
- `node_num`: `1`

## Flow

1. **Read `~/.eval/config`** — verify `AUTO_EVAL_TOKEN`, `FEISHU_EVAL_WEBHOOK`, `USER`, `OPENAI_API_BASE`, `AUTO_EVAL_API_URL` are present. Stop if missing.
2. **Gather inputs** — ask user for `model_abbr`, `instances`, `datasets`, and optionally `image`.
3. **Look up model** — read `~/.eval/models/model.yaml`, find the entry for `model_abbr`. Stop if not found.
4. **Resolve image** — if user provided `image`, use it; else invoke the `docker-build` skill to build and push an image from the current branch.
5. **Resolve datasets** — parse comma-separated `datasets` input, look up each key in `~/.eval/config`, combine into `subdataset` value.
6. **Compute derived fields** — pad `model_abbr` with timestamp, assemble `infer_extra_params`, `gpu_num`, `cpu`, `memory`, `end_num`, `model_infer_config`, `model_infer_config_base64`, `infer_backend_config`, `output_dir`.
7. **Assemble payload** — build the full JSON body with defaults + user inputs + computed fields.
8. **Submit** — execute `curl -X POST` to `AUTO_EVAL_API_URL` with the payload and `AUTO_EVAL_TOKEN` as Bearer token. Report the response.

## Skill File Location

`/workspace/lmdeploy/.claude/skills/submit-eval/SKILL.md`

## Error Handling

- Missing `~/.eval/config` or incomplete keys → stop and tell user to create/populate it
- Model not found in `~/.eval/models/model.yaml` → stop and list available models
- Dataset key not found in `~/.eval/config` → stop and list available dataset keys
- curl failure → report the HTTP status and response body