
Fix Structured Output for GPT-OSS Models#4386

Open
windreamer wants to merge 3 commits into InternLM:main from windreamer:fix_gpt_oss_guided_decoding

Conversation

@windreamer
Collaborator

Motivation

GPT-OSS models use the Harmony response format, which conflicts with Guided Decoding (a token-level JSON constraint) when response_format is specified. This causes:

  • Harmony parse errors
  • Request hangs
  • Empty message.parsed results

These failures break existing OpenAI SDK clients that use client.beta.chat.completions.parse().

Modification

Approach: Replace Guided Decoding with Harmony-native structured output.

  1. Detect GPT-OSS architecture with active response_format
  2. Inject JSON schema into system message under # Response Formats section
  3. Disable Guided Decoding by clearing response_format
  4. Create system message automatically if none exists
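
The four steps above can be sketched roughly as follows. This is an illustrative, dict-based sketch, not the PR's actual code; names like convert_response_format_to_harmony, GPT_OSS_ARCH, and the request layout are hypothetical stand-ins.

```python
import json

GPT_OSS_ARCH = 'GptOssForCausalLM'  # architecture name per the PR description


def convert_response_format_to_harmony(arch: str, request: dict) -> dict:
    """Replace token-level guided decoding with a Harmony prompt section."""
    response_format = request.get('response_format')
    # 1. Detect GPT-OSS architecture with an active (non-text) response_format.
    if arch != GPT_OSS_ARCH or not response_format or response_format.get('type') == 'text':
        return request

    # 2. Build the '# Response Formats' section from the JSON schema.
    schema = response_format.get('json_schema', {})
    format_body = ('# Response Formats\n\n'
                   f'## {schema.get("name", "response")}\n\n'
                   f'{json.dumps(schema.get("schema", {}))}')

    messages = request['messages']
    system = next((m for m in messages if m['role'] == 'system'), None)
    if system is not None:
        # Inject into the existing system message.
        system['content'] += '\n\n' + format_body
    else:
        # 4. Create a system message automatically if none exists.
        messages.insert(0, {'role': 'system', 'content': format_body})

    # 3. Disable Guided Decoding by clearing response_format.
    request['response_format'] = None
    return request
```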

closes: #4347

Copilot AI review requested due to automatic review settings March 2, 2026 06:23
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes structured output for GPT-OSS models by avoiding Guided Decoding (which conflicts with Harmony response parsing) and instead injecting the requested response schema into the prompt using Harmony’s native # Response Formats section.

Changes:

  • Detect GPT-OSS (arch == 'GptOssForCausalLM') requests with non-text response_format.
  • Inject the serialized response_format schema into the system message under # Response Formats (creating a system message if missing).
  • Disable guided decoding for GPT-OSS by clearing the local response_format passed into GenerationConfig.


Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread lmdeploy/serve/openai/api_server.py Outdated
@jingyibo123
Contributor

It's been a while since I compiled from source; does CUDA 12.1 + GCC 9.4 work?

@windreamer
Collaborator Author

It's been a while since I compiled from source; does CUDA 12.1 + GCC 9.4 work?

No need to recompile; you can just patch the Python part.

@windreamer windreamer requested a review from lvhan028 March 12, 2026 01:46
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from d3f847a to d4366e3 Compare March 24, 2026 03:22
Comment thread lmdeploy/serve/openai/api_server.py Outdated
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from d4366e3 to 8cb9ef2 Compare May 7, 2026 04:17
@windreamer windreamer marked this pull request as draft May 7, 2026 04:25
… Harmony/JSON mode conflict for GPT-OSS

Move the GPT-OSS guided decoding logic from api_server.py inline code into
GptOssResponseParser._convert_response_format_to_harmony(), following the
established ResponseParser pattern for model-specific request handling.

When the model architecture is GptOssForCausalLM and a structured
response_format is requested, the schema is now injected into the system
prompt as a '# Response Formats' section and response_format is cleared on
the request to avoid the conflict between Harmony-native mode and the
engine's built-in JSON/response-format mode.

In api_server.py, response_format extraction is moved after parser
instantiation so that the parser can modify the request first.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
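
The reordering described in the last paragraph can be sketched like this: the parser is instantiated and allowed to rewrite the request before response_format is read, so a cleared value is what reaches GenerationConfig. All names here are illustrative stand-ins (a dict takes the place of GenerationConfig), not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChatRequest:
    messages: list = field(default_factory=list)
    response_format: Optional[dict] = None


class GptOssResponseParser:
    """Model-specific request handling, per the ResponseParser pattern."""

    def prepare_request(self, request: ChatRequest) -> None:
        fmt = request.response_format
        if fmt and fmt.get('type') != 'text':
            # Schema injection into the system prompt would happen here;
            # afterwards the engine-level JSON constraint is disabled.
            request.response_format = None


def build_generation_config(request: ChatRequest) -> dict:
    parser = GptOssResponseParser()
    parser.prepare_request(request)              # parser rewrites the request first
    response_format = request.response_format    # extracted only afterwards
    return {'response_format': response_format}  # stand-in for GenerationConfig
```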
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from 8cb9ef2 to e991917 Compare May 7, 2026 06:37
@windreamer windreamer marked this pull request as ready for review May 7, 2026 06:38
@windreamer windreamer requested a review from Copilot May 7, 2026 06:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
…ormat conversion tests

- Build format_body without leading newlines; only prefix with \n\n when
  appending to an existing system message. This prevents a newly inserted
  system message from starting with blank lines that could interact poorly
  with downstream chat-template rendering.

- Add TestGptOssResponseFormatHarmonyConversion test class with 5 tests:
  1. response_format is cleared after conversion
  2. schema appended to existing system message
  3. schema inserted as new system message (no leading blank lines)
  4. text-type response_format is not converted
  5. no response_format leaves request unchanged
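
The newline handling from the first bullet can be sketched as below; the function name and message layout are hypothetical, chosen only to illustrate the separator logic.

```python
def inject_format_section(messages, format_body):
    """Append format_body to an existing system message, or insert one.

    format_body carries no leading newlines; the blank-line separator is
    added only when joining onto existing system content, so a freshly
    inserted system message never starts with blank lines.
    """
    for msg in messages:
        if msg['role'] == 'system':
            msg['content'] = msg['content'] + '\n\n' + format_body
            return messages
    messages.insert(0, {'role': 'system', 'content': format_body})
    return messages
```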
@windreamer windreamer requested a review from Copilot May 7, 2026 08:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
Comment thread tests/test_lmdeploy/serve/parsers/test_gpt_oss_parser.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
1. Guard model_copy() with hasattr check: extract _clear_response_format()
   helper that falls back to in-place mutation for non-Pydantic request
   objects (e.g. test sentinels). Prevents double-raise in the except path.

2. Use logger.exception() instead of logger.error(f'...{e}') so that
   stack traces are preserved in the log output.

3. Mark _patch_streamable_parser fixture as autouse=True and remove
   redundant monkeypatch.setattr calls from individual test methods.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



Development

Successfully merging this pull request may close these issues.

[Bug] GPT-OSS-120B + openai-python empty result from client.beta.chat.completions.parse with response_format

4 participants