
[Feature] Implement /v1/embeddings endpoint for OpenAI-compatible API#4550

Open
ZhijunLStudio wants to merge 1 commit into InternLM:main from ZhijunLStudio:feat/embeddings-endpoint

Conversation

@ZhijunLStudio (Contributor)

Motivation

The /v1/embeddings endpoint is a standard OpenAI API supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings. Currently, lmdeploy's /v1/embeddings is a stub that returns "Unsupported by turbomind".

The infrastructure to pass last_hidden_state through the pipeline already exists at the high level (Response, EngineOutput, GenOut all have the field), but the PyTorch engine's internal pipeline never populates it.

Modification

API layer

  • lmdeploy/serve/openai/protocol.py: Add encoding_format field to EmbeddingsRequest (supports float and base64)
  • lmdeploy/serve/openai/api_server.py: Replace stub with full implementation that calls engine with max_new_tokens=1 + output_last_hidden_state='all', applies mean pooling across input sequence, and returns EmbeddingsResponse
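For orientation, below is a minimal sketch of the handler flow this list describes. It assumes the engine hands back an already mean-pooled [hidden_dim] vector per input (as the agent.py change below arranges); the function name and session-id handling are illustrative, not the merged code.

import base64
import struct

from lmdeploy import GenerationConfig


async def create_embeddings(request, async_engine, session_id):
    # max_new_tokens=1 forces a single prefill pass; 'all' asks the engine
    # to return hidden states for the whole prompt (mean-pooled per sequence).
    gen_config = GenerationConfig(max_new_tokens=1, output_last_hidden_state='all')
    inputs = request.input if isinstance(request.input, list) else [request.input]
    data, prompt_tokens = [], 0
    for idx, text in enumerate(inputs):
        last_hidden_state = None
        async for res in async_engine.generate(text, session_id + idx, gen_config=gen_config):
            if res.last_hidden_state is not None:
                last_hidden_state = res.last_hidden_state
                prompt_tokens += res.input_token_len
        emb = last_hidden_state.tolist()  # already pooled: [hidden_dim]
        if request.encoding_format == 'base64':
            emb = base64.b64encode(struct.pack(f'<{len(emb)}f', *emb)).decode()
        data.append(dict(object='embedding', index=idx, embedding=emb))
    return data, prompt_tokens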

PyTorch engine pipeline (threading hidden states from model forward to API response)

  • lmdeploy/pytorch/messages.py: Add output_last_hidden_state field to SamplingParam, add return_last_hidden_states property to SchedulerSequence, replace unsupported warning with validation
  • lmdeploy/pytorch/engine/inputs_maker.py: Add __need_hidden_states check and pass return_last_hidden_states flag
  • lmdeploy/pytorch/engine/model_agent/agent.py: Add last_hidden_states to BatchedOutputs, capture full-sequence hidden states in _async_model_forward before postprocessing slices to last token, mean pool per-sequence
  • lmdeploy/pytorch/engine/engine.py: Add last_hidden_states field to InferOutput
  • lmdeploy/pytorch/engine/engine_loop.py: Thread hidden states through _send_resp and _make_infer_outputs
  • lmdeploy/pytorch/engine/engine_instance.py: Pass last_hidden_state to EngineOutput
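The plumbing above amounts to adding one optional tensor field at each hop and copying it along. Illustrative dataclass fragments (field names follow this list; surrounding fields are elided, so this is not the exact diff):

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class BatchedOutputs:
    ...  # existing fields elided
    last_hidden_states: Optional[torch.Tensor] = None  # [bs, hidden_dim], mean-pooled


@dataclass
class InferOutput:
    ...  # existing fields elided
    last_hidden_states: Optional[torch.Tensor] = None  # surfaced as EngineOutput.last_hidden_state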

Tested with

  • Qwen3-8B on PyTorch backend: single/multi input, cosine similarity (cat/cat-like=0.9754 > cat/stock=0.9478), empty input validation, base64 encoding
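The similarity check can be reproduced against a running server with a few lines of Python; the exact test strings below are assumed, not taken from the test suite:

import requests


def embed(text):
    resp = requests.post('http://localhost:23333/v1/embeddings',
                         json={'model': 'qwen3', 'input': text})
    return resp.json()['data'][0]['embedding']


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


# Related pairs should score higher than unrelated ones, mirroring
# cat/cat-like (0.9754) > cat/stock (0.9478) above.
assert cosine(embed('cat'), embed('kitten')) > cosine(embed('cat'), embed('stock market'))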

BC-breaking

No. The new endpoint is additive. Existing TurboMind output_last_hidden_state support is unchanged.

Use cases

# Start server
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch

# Get embeddings
curl -X POST http://localhost:23333/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "input": ["Hello", "World"]}'

Checklist

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  • The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028 lvhan028 requested review from Copilot and grimoire April 28, 2026 03:19
@lvhan028 lvhan028 added the enhancement New feature or request label Apr 28, 2026
@lvhan028 lvhan028 self-requested a review April 28, 2026 03:20
Copilot AI left a comment

Pull request overview

Implements an OpenAI-compatible /v1/embeddings endpoint by enabling the PyTorch backend to return (pooled) hidden states through the engine pipeline up to the API layer.

Changes:

  • Add encoding_format to the OpenAI embeddings request schema and return an EmbeddingsResponse from /v1/embeddings.
  • Thread output_last_hidden_state from GenerationConfig into the PyTorch scheduler/engine and propagate pooled hidden states through engine outputs.
  • Capture and mean-pool hidden states in the model-agent forward path and forward them through engine loop/instance plumbing.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

  • lmdeploy/serve/openai/protocol.py: Extends the embeddings request schema with encoding_format and exposes EmbeddingsResponse.
  • lmdeploy/serve/openai/api_server.py: Replaces the /v1/embeddings stub with an engine-backed implementation and optional base64 encoding.
  • lmdeploy/pytorch/messages.py: Adds output_last_hidden_state plumbing and exposes return_last_hidden_states on sequences.
  • lmdeploy/pytorch/engine/model_agent/agent.py: Captures full hidden states pre-postprocess and mean-pools per sequence for embeddings.
  • lmdeploy/pytorch/engine/inputs_maker.py: Adds hidden-state demand detection and forwards return_last_hidden_states for prefill.
  • lmdeploy/pytorch/engine/engine_loop.py: Includes hidden states in response payloads and maps them into InferOutput.
  • lmdeploy/pytorch/engine/engine_instance.py: Extracts last_hidden_states from response payloads and exposes them as EngineOutput.last_hidden_state.
  • lmdeploy/pytorch/engine/engine.py: Extends InferOutput with last_hidden_states.


gen_config=gen_config,
stream_response=True,
sequence_start=True,
sequence_end=True,

Copilot AI Apr 28, 2026

AsyncEngine.generate defaults to do_preprocess=True, which applies the configured chat template even when messages is a plain string (see MultimodalProcessor._get_text_prompt_input). For an OpenAI-compatible embeddings endpoint, this will change the text being embedded (e.g., adding user/assistant wrappers). Pass do_preprocess=False here (similar to how string prompts are handled in /v1/chat/completions).

Suggested change
sequence_end=True,
sequence_end=True,
do_preprocess=False,

Comment on lines +1029 to +1033
return EmbeddingsResponse(
data=embedding_data,
model=request.model or '',
usage=UsageInfo(prompt_tokens=prompt_tokens, total_tokens=prompt_tokens, completion_tokens=0),
)

Copilot AI Apr 28, 2026

prompt_tokens is overwritten per input and the final usage only reflects the last item. Also, when request.model is omitted the response currently returns an empty string, unlike other endpoints (e.g. /pooling) which default to async_engine.model_name. Consider summing prompt tokens across all inputs and defaulting model to the server model name when not provided.
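A sketch of both fixes, with request, async_engine, and the per-input token counts assumed from the handler context:

from lmdeploy.serve.openai.protocol import UsageInfo


def build_usage(per_input_tokens):
    # Sum across all inputs instead of keeping only the last item's count.
    total = sum(per_input_tokens)
    return UsageInfo(prompt_tokens=total, total_tokens=total, completion_tokens=0)


model_name = request.model or async_engine.model_name  # default like /pooling does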

Comment thread: lmdeploy/serve/openai/api_server.py (outdated)
emb_list = last_hidden_state.tolist()

if request.encoding_format == 'base64':
packed = struct.pack(f'{len(emb_list)}f', *emb_list)

Copilot AI Apr 28, 2026

For encoding_format='base64', struct.pack(f'{len(emb_list)}f', ...) uses native endianness. The OpenAI API expects embeddings encoded as little-endian float32. Use an explicit little-endian format (e.g. <{n}f) to avoid platform-dependent output.

Suggested change
packed = struct.pack(f'{len(emb_list)}f', *emb_list)
packed = struct.pack(f'<{len(emb_list)}f', *emb_list)
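On the client side, the inverse decode assumes the same little-endian float32 layout:

import base64
import struct


def decode_embedding(b64):
    raw = base64.b64decode(b64)
    return list(struct.unpack(f'<{len(raw) // 4}f', raw))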

Comment on lines +1004 to +1013
async for res in result_generator:
if res.last_hidden_state is not None:
last_hidden_state = res.last_hidden_state
prompt_tokens = res.input_token_len

if last_hidden_state is None:
return create_error_response(
HTTPStatus.INTERNAL_SERVER_ERROR,
'Model does not support hidden states output for embeddings.',
)

Copilot AI Apr 28, 2026

The loop ignores res.finish_reason / error frames from AsyncEngine.generate (e.g. when prefix caching is enabled with output_last_hidden_state='all', the generator yields a finish_reason='error' frame with an error message and no hidden states). Currently this falls through to a generic 500. Handle finish_reason=='error' (and possibly client disconnect) and surface the actual error message/status to the caller.
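One way to extend the quoted loop along these lines (the status code and message fallback are illustrative, not part of the PR):

async for res in result_generator:
    if res.finish_reason == 'error':
        # Surface the engine's own message instead of a generic 500.
        return create_error_response(HTTPStatus.BAD_REQUEST, res.response or 'engine error')
    if res.last_hidden_state is not None:
        last_hidden_state = res.last_hidden_state
        prompt_tokens = res.input_token_len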

Comment on lines +449 to +463
# capture full hidden states before postprocessing slices to last token
full_hidden_states = None
if return_last_hidden_states:
raw_hidden = ret['hidden_states']
raw_seq_length = ret.get('seq_length', inputs.seq_length)
# raw_hidden shape: [1, total_tokens, hidden_dim] or [total_tokens, hidden_dim]
if raw_hidden.dim() == 3:
raw_hidden = raw_hidden[0] # [total_tokens, hidden_dim]
# slice per-sequence and mean pool
if raw_seq_length.numel() == 1:
full_hidden_states = raw_hidden.mean(dim=0, keepdim=True) # [1, hidden_dim]
else:
parts = raw_hidden.split(raw_seq_length.tolist(), dim=0)
full_hidden_states = torch.stack([p.mean(dim=0) for p in parts], dim=0) # [bs, hidden_dim]


Copilot AI Apr 28, 2026

Mean pooling is computed per forward pass using raw_seq_length. For long-context chunking (inputs.is_chunk), intermediate chunks do not emit outputs (EngineLoop skips non-last chunks), so the embedding on the last chunk will only reflect that final chunk rather than the full input sequence. To make embeddings correct for chunked prefill, accumulate a weighted sum/count across chunks (e.g., store partial sums in the sequence state) and finalize the mean on the last chunk.
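A sketch of that accumulation, assuming hypothetical per-sequence fields hidden_sum / hidden_count kept on the scheduler sequence and a seqs list aligned with the batch (none of these names exist in the PR):

if return_last_hidden_states:
    parts = raw_hidden.split(raw_seq_length.tolist(), dim=0)
    pooled = []
    for seq, part in zip(seqs, parts):
        seq.hidden_sum = seq.hidden_sum + part.sum(dim=0)  # running sum across chunks
        seq.hidden_count += part.shape[0]                  # running token count
        if seq.is_last_chunk:                              # finalize only once the full prompt is seen
            pooled.append(seq.hidden_sum / seq.hidden_count)
    full_hidden_states = torch.stack(pooled, dim=0) if pooled else None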

@lvhan028 lvhan028 requested a review from CUHKSZzxy April 29, 2026 06:00
@CUHKSZzxy (Collaborator) left a comment

Thanks for your contribution. Can you resolve the merge conflict and add unit tests for /v1/embeddings?

Add support for the standard OpenAI embeddings endpoint that extracts
last hidden states from the model and applies mean pooling. This enables
downstream tools (LangChain, LlamaIndex, RAG pipelines) to use lmdeploy
for text embedding generation.

Changes:
- Replace stub /v1/embeddings with full implementation supporting
  float and base64 encoding formats
- Thread last_hidden_states through the PyTorch engine pipeline
  (BatchedOutputs -> InferOutput -> EngineOutput)
- Capture full-sequence hidden states before postprocessing slices
  to last token, and mean pool per-sequence in the engine
- Pass do_preprocess=False to avoid chat template being applied
- Sum prompt_tokens across all inputs instead of overwriting
- Default model to async_engine.model_name when not provided
- Use little-endian format for base64 encoding (< prefix)
- Handle finish_reason='error' frames from engine
- Add unit tests for v1/embeddings endpoint
@ZhijunLStudio force-pushed the feat/embeddings-endpoint branch from 84a2f95 to 654dde9 on May 7, 2026 08:07
@ZhijunLStudio (Contributor, Author)

Thanks for your contribution. Can you resolve the merge conflict and add unit tests for /v1/embeddings?

Thanks, conflicts resolved and unit tests added. PTAL.
