
[Feature] Implement /v1/embeddings endpoint for OpenAI-compatible API#4550

Open
ZhijunLStudio wants to merge 1 commit into InternLM:main from ZhijunLStudio:feat/embeddings-endpoint

Conversation

@ZhijunLStudio (Contributor)

Motivation

The /v1/embeddings endpoint is a standard OpenAI API supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings. Currently, lmdeploy's /v1/embeddings is a stub that returns "Unsupported by turbomind".

The infrastructure to pass last_hidden_state through the pipeline already exists at the high level (Response, EngineOutput, GenOut all have the field), but the PyTorch engine's internal pipeline never populates it.

Modification

API layer

  • lmdeploy/serve/openai/protocol.py: Add encoding_format field to EmbeddingsRequest (supports float and base64)
  • lmdeploy/serve/openai/api_server.py: Replace stub with full implementation that calls engine with max_new_tokens=1 + output_last_hidden_state='all', applies mean pooling across input sequence, and returns EmbeddingsResponse
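For orientation, below is a minimal sketch of the handler flow this list describes. It assumes the engine hands back an already mean-pooled [hidden_dim] vector per input (as the agent.py change below arranges); the function name and session-id handling are illustrative, not the merged code.

import base64
import struct

from lmdeploy import GenerationConfig


async def create_embeddings(request, async_engine, session_id):
    # max_new_tokens=1 forces a single prefill pass; 'all' asks the engine
    # to return hidden states for the whole prompt (mean-pooled per sequence).
    gen_config = GenerationConfig(max_new_tokens=1, output_last_hidden_state='all')
    inputs = request.input if isinstance(request.input, list) else [request.input]
    data, prompt_tokens = [], 0
    for idx, text in enumerate(inputs):
        last_hidden_state = None
        async for res in async_engine.generate(text, session_id + idx, gen_config=gen_config):
            if res.last_hidden_state is not None:
                last_hidden_state = res.last_hidden_state
                prompt_tokens += res.input_token_len
        emb = last_hidden_state.tolist()  # already pooled: [hidden_dim]
        if request.encoding_format == 'base64':
            emb = base64.b64encode(struct.pack(f'<{len(emb)}f', *emb)).decode()
        data.append(dict(object='embedding', index=idx, embedding=emb))
    return data, prompt_tokens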

PyTorch engine pipeline (threading hidden states from model forward to API response)

  • lmdeploy/pytorch/messages.py: Add output_last_hidden_state field to SamplingParam, add return_last_hidden_states property to SchedulerSequence, replace unsupported warning with validation
  • lmdeploy/pytorch/engine/inputs_maker.py: Add __need_hidden_states check and pass return_last_hidden_states flag
  • lmdeploy/pytorch/engine/model_agent/agent.py: Add last_hidden_states to BatchedOutputs, capture full-sequence hidden states in _async_model_forward before postprocessing slices to last token, mean pool per-sequence
  • lmdeploy/pytorch/engine/engine.py: Add last_hidden_states field to InferOutput
  • lmdeploy/pytorch/engine/engine_loop.py: Thread hidden states through _send_resp and _make_infer_outputs
  • lmdeploy/pytorch/engine/engine_instance.py: Pass last_hidden_state to EngineOutput
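The plumbing above amounts to adding one optional tensor field at each hop and copying it along. Illustrative dataclass fragments (field names follow this list; surrounding fields are elided, so this is not the exact diff):

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class BatchedOutputs:
    ...  # existing fields elided
    last_hidden_states: Optional[torch.Tensor] = None  # [bs, hidden_dim], mean-pooled


@dataclass
class InferOutput:
    ...  # existing fields elided
    last_hidden_states: Optional[torch.Tensor] = None  # surfaced as EngineOutput.last_hidden_state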

Tested with

  • Qwen3-8B on PyTorch backend: single/multi input, cosine similarity (cat/cat-like=0.9754 > cat/stock=0.9478), empty input validation, base64 encoding
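The similarity check can be reproduced against a running server with a few lines of Python; the exact test strings below are assumed, not taken from the test suite:

import requests


def embed(text):
    resp = requests.post('http://localhost:23333/v1/embeddings',
                         json={'model': 'qwen3', 'input': text})
    return resp.json()['data'][0]['embedding']


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


# Related pairs should score higher than unrelated ones, mirroring
# cat/cat-like (0.9754) > cat/stock (0.9478) above.
assert cosine(embed('cat'), embed('kitten')) > cosine(embed('cat'), embed('stock market'))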

BC-breaking

No. The new endpoint is additive. Existing TurboMind output_last_hidden_state support is unchanged.

Use cases

# Start server
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch

# Get embeddings
curl -X POST http://localhost:23333/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "input": ["Hello", "World"]}'

Checklist

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  • The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028 lvhan028 requested review from Copilot and grimoire April 28, 2026 03:19
@lvhan028 lvhan028 added the enhancement New feature or request label Apr 28, 2026
@lvhan028 lvhan028 self-requested a review April 28, 2026 03:20
Copilot AI left a comment

Pull request overview

Implements an OpenAI-compatible /v1/embeddings endpoint by enabling the PyTorch backend to return (pooled) hidden states through the engine pipeline up to the API layer.

Changes:

  • Add encoding_format to the OpenAI embeddings request schema and return an EmbeddingsResponse from /v1/embeddings.
  • Thread output_last_hidden_state from GenerationConfig into the PyTorch scheduler/engine and propagate pooled hidden states through engine outputs.
  • Capture and mean-pool hidden states in the model-agent forward path and forward them through engine loop/instance plumbing.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

  • lmdeploy/serve/openai/protocol.py: Extends the embeddings request schema with encoding_format and exposes EmbeddingsResponse.
  • lmdeploy/serve/openai/api_server.py: Replaces the /v1/embeddings stub with an engine-backed implementation and optional base64 encoding.
  • lmdeploy/pytorch/messages.py: Adds output_last_hidden_state plumbing and exposes return_last_hidden_states on sequences.
  • lmdeploy/pytorch/engine/model_agent/agent.py: Captures full hidden states pre-postprocess and mean-pools per sequence for embeddings.
  • lmdeploy/pytorch/engine/inputs_maker.py: Adds hidden-state demand detection and forwards return_last_hidden_states for prefill.
  • lmdeploy/pytorch/engine/engine_loop.py: Includes hidden states in response payloads and maps them into InferOutput.
  • lmdeploy/pytorch/engine/engine_instance.py: Extracts last_hidden_states from response payloads and exposes them as EngineOutput.last_hidden_state.
  • lmdeploy/pytorch/engine/engine.py: Extends InferOutput with last_hidden_states.


gen_config=gen_config,
stream_response=True,
sequence_start=True,
sequence_end=True,

Copilot AI Apr 28, 2026

AsyncEngine.generate defaults to do_preprocess=True, which applies the configured chat template even when messages is a plain string (see MultimodalProcessor._get_text_prompt_input). For an OpenAI-compatible embeddings endpoint, this will change the text being embedded (e.g., adding user/assistant wrappers). Pass do_preprocess=False here (similar to how string prompts are handled in /v1/chat/completions).

Suggested change
sequence_end=True,
sequence_end=True,
do_preprocess=False,

Comment on lines +1029 to +1033
return EmbeddingsResponse(
data=embedding_data,
model=request.model or '',
usage=UsageInfo(prompt_tokens=prompt_tokens, total_tokens=prompt_tokens, completion_tokens=0),
)

Copilot AI Apr 28, 2026

prompt_tokens is overwritten per input and the final usage only reflects the last item. Also, when request.model is omitted the response currently returns an empty string, unlike other endpoints (e.g. /pooling) which default to async_engine.model_name. Consider summing prompt tokens across all inputs and defaulting model to the server model name when not provided.
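A sketch of both fixes, with request, async_engine, and the per-input token counts assumed from the handler context:

from lmdeploy.serve.openai.protocol import UsageInfo


def build_usage(per_input_tokens):
    # Sum across all inputs instead of keeping only the last item's count.
    total = sum(per_input_tokens)
    return UsageInfo(prompt_tokens=total, total_tokens=total, completion_tokens=0)


model_name = request.model or async_engine.model_name  # default like /pooling does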

Comment thread: lmdeploy/serve/openai/api_server.py (outdated)
emb_list = last_hidden_state.tolist()

if request.encoding_format == 'base64':
packed = struct.pack(f'{len(emb_list)}f', *emb_list)

Copilot AI Apr 28, 2026

For encoding_format='base64', struct.pack(f'{len(emb_list)}f', ...) uses native endianness. The OpenAI API expects embeddings encoded as little-endian float32. Use an explicit little-endian format (e.g. <{n}f) to avoid platform-dependent output.

Suggested change
packed = struct.pack(f'{len(emb_list)}f', *emb_list)
packed = struct.pack(f'<{len(emb_list)}f', *emb_list)
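On the client side, the inverse decode assumes the same little-endian float32 layout:

import base64
import struct


def decode_embedding(b64):
    raw = base64.b64decode(b64)
    return list(struct.unpack(f'<{len(raw) // 4}f', raw))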

Comment on lines +1004 to +1013
async for res in result_generator:
if res.last_hidden_state is not None:
last_hidden_state = res.last_hidden_state
prompt_tokens = res.input_token_len

if last_hidden_state is None:
return create_error_response(
HTTPStatus.INTERNAL_SERVER_ERROR,
'Model does not support hidden states output for embeddings.',
)

Copilot AI Apr 28, 2026

The loop ignores res.finish_reason / error frames from AsyncEngine.generate (e.g. when prefix caching is enabled with output_last_hidden_state='all', the generator yields a finish_reason='error' frame with an error message and no hidden states). Currently this falls through to a generic 500. Handle finish_reason=='error' (and possibly client disconnect) and surface the actual error message/status to the caller.
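One way to extend the quoted loop along these lines (the status code and message fallback are illustrative, not part of the PR):

async for res in result_generator:
    if res.finish_reason == 'error':
        # Surface the engine's own message instead of a generic 500.
        return create_error_response(HTTPStatus.BAD_REQUEST, res.response or 'engine error')
    if res.last_hidden_state is not None:
        last_hidden_state = res.last_hidden_state
        prompt_tokens = res.input_token_len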

Comment on lines +449 to +463
# capture full hidden states before postprocessing slices to last token
full_hidden_states = None
if return_last_hidden_states:
raw_hidden = ret['hidden_states']
raw_seq_length = ret.get('seq_length', inputs.seq_length)
# raw_hidden shape: [1, total_tokens, hidden_dim] or [total_tokens, hidden_dim]
if raw_hidden.dim() == 3:
raw_hidden = raw_hidden[0] # [total_tokens, hidden_dim]
# slice per-sequence and mean pool
if raw_seq_length.numel() == 1:
full_hidden_states = raw_hidden.mean(dim=0, keepdim=True) # [1, hidden_dim]
else:
parts = raw_hidden.split(raw_seq_length.tolist(), dim=0)
full_hidden_states = torch.stack([p.mean(dim=0) for p in parts], dim=0) # [bs, hidden_dim]


Copilot AI Apr 28, 2026

Mean pooling is computed per forward pass using raw_seq_length. For long-context chunking (inputs.is_chunk), intermediate chunks do not emit outputs (EngineLoop skips non-last chunks), so the embedding on the last chunk will only reflect that final chunk rather than the full input sequence. To make embeddings correct for chunked prefill, accumulate a weighted sum/count across chunks (e.g., store partial sums in the sequence state) and finalize the mean on the last chunk.
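A sketch of that accumulation, assuming hypothetical per-sequence fields hidden_sum / hidden_count kept on the scheduler sequence and a seqs list aligned with the batch (none of these names exist in the PR):

if return_last_hidden_states:
    parts = raw_hidden.split(raw_seq_length.tolist(), dim=0)
    pooled = []
    for seq, part in zip(seqs, parts):
        seq.hidden_sum = seq.hidden_sum + part.sum(dim=0)  # running sum across chunks
        seq.hidden_count += part.shape[0]                  # running token count
        if seq.is_last_chunk:                              # finalize only once the full prompt is seen
            pooled.append(seq.hidden_sum / seq.hidden_count)
    full_hidden_states = torch.stack(pooled, dim=0) if pooled else None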

@lvhan028 lvhan028 requested a review from CUHKSZzxy April 29, 2026 06:00
@CUHKSZzxy (Collaborator) left a comment

Thanks for your contribution. Can you resolve the merge conflict and add unit tests for /v1/embeddings?

Add support for the standard OpenAI embeddings endpoint that extracts
last hidden states from the model and applies mean pooling. This enables
downstream tools (LangChain, LlamaIndex, RAG pipelines) to use lmdeploy
for text embedding generation.

Changes:
- Replace stub /v1/embeddings with full implementation supporting
  float and base64 encoding formats
- Thread last_hidden_states through the PyTorch engine pipeline
  (BatchedOutputs -> InferOutput -> EngineOutput)
- Capture full-sequence hidden states before postprocessing slices
  to last token, and mean pool per-sequence in the engine
- Pass do_preprocess=False to avoid chat template being applied
- Sum prompt_tokens across all inputs instead of overwriting
- Default model to async_engine.model_name when not provided
- Use little-endian format for base64 encoding (< prefix)
- Handle finish_reason='error' frames from engine
- Add unit tests for v1/embeddings endpoint
@ZhijunLStudio force-pushed the feat/embeddings-endpoint branch from 84a2f95 to 654dde9 on May 7, 2026 08:07
@ZhijunLStudio (Contributor, Author)

Thanks for your contribution. Can you resolve the merge conflict and add unit tests for /v1/embeddings?

Thanks, conflicts resolved and unit tests added. PTAL.
