
Fix Structured Output for GPT-OSS Models#4386

Open
windreamer wants to merge 3 commits into InternLM:main from windreamer:fix_gpt_oss_guided_decoding

Conversation

@windreamer
Collaborator

Motivation

GPT-OSS models use the Harmony response format, which conflicts with Guided Decoding (a token-level JSON constraint) when response_format is specified. This causes:

  • Harmony parse errors
  • Request hangs
  • Empty message.parsed results

These failures break existing OpenAI SDK clients that use client.beta.chat.completions.parse().

Modification

Approach: Replace Guided Decoding with Harmony-native structured output.

  1. Detect GPT-OSS architecture with active response_format
  2. Inject JSON schema into system message under # Response Formats section
  3. Disable Guided Decoding by clearing response_format
  4. Create system message automatically if none exists
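
The four steps above can be sketched roughly as follows. This is an illustrative, dict-based sketch, not the PR's actual code; names like convert_response_format_to_harmony, GPT_OSS_ARCH, and the request layout are hypothetical stand-ins.

```python
import json

GPT_OSS_ARCH = 'GptOssForCausalLM'  # architecture name per the PR description


def convert_response_format_to_harmony(arch: str, request: dict) -> dict:
    """Replace token-level guided decoding with a Harmony prompt section."""
    response_format = request.get('response_format')
    # 1. Detect GPT-OSS architecture with an active (non-text) response_format.
    if arch != GPT_OSS_ARCH or not response_format or response_format.get('type') == 'text':
        return request

    # 2. Build the '# Response Formats' section from the JSON schema.
    schema = response_format.get('json_schema', {})
    format_body = ('# Response Formats\n\n'
                   f'## {schema.get("name", "response")}\n\n'
                   f'{json.dumps(schema.get("schema", {}))}')

    messages = request['messages']
    system = next((m for m in messages if m['role'] == 'system'), None)
    if system is not None:
        # Inject into the existing system message.
        system['content'] += '\n\n' + format_body
    else:
        # 4. Create a system message automatically if none exists.
        messages.insert(0, {'role': 'system', 'content': format_body})

    # 3. Disable Guided Decoding by clearing response_format.
    request['response_format'] = None
    return request
```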

closes: #4347

Copilot AI review requested due to automatic review settings March 2, 2026 06:23
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes structured output for GPT-OSS models by avoiding Guided Decoding (which conflicts with Harmony response parsing) and instead injecting the requested response schema into the prompt using Harmony’s native # Response Formats section.

Changes:

  • Detect GPT-OSS (arch == 'GptOssForCausalLM') requests with non-text response_format.
  • Inject the serialized response_format schema into the system message under # Response Formats (creating a system message if missing).
  • Disable guided decoding for GPT-OSS by clearing the local response_format passed into GenerationConfig.


Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment thread lmdeploy/serve/openai/api_server.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread lmdeploy/serve/openai/api_server.py Outdated
@jingyibo123
Contributor

It's been a while since I compiled from source; does CUDA 12.1 + GCC 9.4 work?

@windreamer
Collaborator Author

It's been a while since I compiled from source; does CUDA 12.1 + GCC 9.4 work?

No need to recompile; you can just patch the Python part.

@windreamer windreamer requested a review from lvhan028 March 12, 2026 01:46
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from d3f847a to d4366e3 Compare March 24, 2026 03:22
Comment thread lmdeploy/serve/openai/api_server.py Outdated
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from d4366e3 to 8cb9ef2 Compare May 7, 2026 04:17
@windreamer windreamer marked this pull request as draft May 7, 2026 04:25
… Harmony/JSON mode conflict for GPT-OSS

Move the GPT-OSS guided decoding logic from api_server.py inline code into
GptOssResponseParser._convert_response_format_to_harmony(), following the
established ResponseParser pattern for model-specific request handling.

When the model architecture is GptOssForCausalLM and a structured
response_format is requested, the schema is now injected into the system
prompt as a '# Response Formats' section and response_format is cleared on
the request to avoid the conflict between Harmony-native mode and the
engine's built-in JSON/response-format mode.

In api_server.py, response_format extraction is moved after parser
instantiation so that the parser can modify the request first.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
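
The reordering described in the last paragraph can be sketched like this: the parser is instantiated and allowed to rewrite the request before response_format is read, so a cleared value is what reaches GenerationConfig. All names here are illustrative stand-ins (a dict takes the place of GenerationConfig), not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChatRequest:
    messages: list = field(default_factory=list)
    response_format: Optional[dict] = None


class GptOssResponseParser:
    """Model-specific request handling, per the ResponseParser pattern."""

    def prepare_request(self, request: ChatRequest) -> None:
        fmt = request.response_format
        if fmt and fmt.get('type') != 'text':
            # Schema injection into the system prompt would happen here;
            # afterwards the engine-level JSON constraint is disabled.
            request.response_format = None


def build_generation_config(request: ChatRequest) -> dict:
    parser = GptOssResponseParser()
    parser.prepare_request(request)              # parser rewrites the request first
    response_format = request.response_format    # extracted only afterwards
    return {'response_format': response_format}  # stand-in for GenerationConfig
```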
@windreamer windreamer force-pushed the fix_gpt_oss_guided_decoding branch from 8cb9ef2 to e991917 Compare May 7, 2026 06:37
@windreamer windreamer marked this pull request as ready for review May 7, 2026 06:38
@windreamer windreamer requested a review from Copilot May 7, 2026 06:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
…ormat conversion tests

- Build format_body without leading newlines; only prefix with \n\n when
  appending to an existing system message. This prevents a newly inserted
  system message from starting with blank lines that could interact poorly
  with downstream chat-template rendering.

- Add TestGptOssResponseFormatHarmonyConversion test class with 5 tests:
  1. response_format is cleared after conversion
  2. schema appended to existing system message
  3. schema inserted as new system message (no leading blank lines)
  4. text-type response_format is not converted
  5. no response_format leaves request unchanged
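
The newline handling from the first bullet can be sketched as below; the function name and message layout are hypothetical, chosen only to illustrate the separator logic.

```python
def inject_format_section(messages, format_body):
    """Append format_body to an existing system message, or insert one.

    format_body carries no leading newlines; the blank-line separator is
    added only when joining onto existing system content, so a freshly
    inserted system message never starts with blank lines.
    """
    for msg in messages:
        if msg['role'] == 'system':
            msg['content'] = msg['content'] + '\n\n' + format_body
            return messages
    messages.insert(0, {'role': 'system', 'content': format_body})
    return messages
```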
@windreamer windreamer requested a review from Copilot May 7, 2026 08:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
Comment thread tests/test_lmdeploy/serve/parsers/test_gpt_oss_parser.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Comment thread lmdeploy/serve/parsers/_openai_harmony.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread lmdeploy/serve/parsers/_openai_harmony.py Outdated
1. Guard model_copy() with hasattr check: extract _clear_response_format()
   helper that falls back to in-place mutation for non-Pydantic request
   objects (e.g. test sentinels). Prevents double-raise in the except path.

2. Use logger.exception() instead of logger.error(f'...{e}') so that
   stack traces are preserved in the log output.

3. Mark _patch_streamable_parser fixture as autouse=True and remove
   redundant monkeypatch.setattr calls from individual test methods.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



Development

Successfully merging this pull request may close these issues.

[Bug] GPT-OSS-120B + openai-python empty result from client.beta.chat.completions.parse with response_format

4 participants