Update turbomind modeling infrastructure#4557
lzhangzz wants to merge 6 commits into InternLM:main
Conversation
…t loading, and model loader (674 squashed commits): reorganize the turbomind directory structure, refactor weight loading to support heterogeneous weight data types, add a WeightFormat enum, replace BaseOutputModel/TextModelLoader with a unified ModelLoader, eliminate data_format threading from Linear, and remove dead code.
Pull request overview
This PR refactors TurboMind’s modeling + conversion stack by replacing the legacy “deploy/source_model + config dataclasses” pipeline with a spec/builder-driven module system, adding a C++ module registry and new weight-module types, and updating engine/model code to consume the new weight tree.
Changes:
- Introduces a registry-backed C++ `core::Module` infrastructure (plus `DataFormat`) and new modular weight classes (Linear/Norm/Attention/FFN/MoE/DeltaNet/ModelRoot/ModelWeight).
- Reworks the Python-side TurboMind converter to use `TextModelSpec` + builders/model loader, and removes the legacy `lmdeploy.turbomind.deploy` pipeline.
- Updates engine/model runtime plumbing (TurboMind API, Engine/SequenceManager, llama layers) to use the new module/weight tree.
Reviewed changes
Copilot reviewed 131 out of 131 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_converter.py | Removes legacy converter tests; leaves a remaining test that still references removed legacy modules. |
| tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py | Adjusts compressed-tensors tests but still imports removed legacy deploy modules. |
| tests/test_lmdeploy/test_converter.py | Adds tests for _deep_merge plus a logging capture fixture. |
| src/turbomind/utils/memory_utils.h | Declares dtype-cast kernel + in-place ensure-float-dtype helper. |
| src/turbomind/utils/memory_utils.cu | Implements dtype casting and EnsureFloatDtype. |
| src/turbomind/turbomind.h | Updates TurboMind API to accept EngineConfig and expose module roots + TP ranks. |
| src/turbomind/python/CMakeLists.txt | Ensures static registrars are linked into the Python extension via --whole-archive. |
| src/turbomind/models/output_processor.h | Refactors ctor signature to avoid ModelParam dependency. |
| src/turbomind/models/output_processor.cc | Implements updated OutputProcessor ctor signature. |
| src/turbomind/models/norm_weight.h | Adds a NormWeight module type. |
| src/turbomind/models/norm_weight.cc | Registers and prepares NormWeight (dtype ensure). |
| src/turbomind/models/moe_weight.h | Adds a modular MoeWeight definition/config. |
| src/turbomind/models/moe_weight.cc | Implements MoE expert linking into a fused block view. |
| src/turbomind/models/model_weight.h | Adds root ModelWeight module for full weight tree. |
| src/turbomind/models/model_weight.cc | Implements ModelWeight prepare/verify + derived metadata. |
| src/turbomind/models/model_root.h | Adds ModelRoot sentinel for stream/allocator ownership. |
| src/turbomind/models/model_root.cc | Implements ModelRoot runtime context + prepare checks. |
| src/turbomind/models/llama/unified_decoder.h | Updates decoder to consume ModelWeight/DecoderLayerWeight. |
| src/turbomind/models/llama/unified_attention_layer.h | Refactors attention layer to use new AttentionWeight and rope config. |
| src/turbomind/models/llama/moe_ffn_layer.h | Refactors MoE FFN layer to use MoeWeight. |
| src/turbomind/models/llama/llama_rope.h | Moves rope param helpers out (now in AttentionWeight impl). |
| src/turbomind/models/llama/llama_params.h | Replaces model/attn/moe params with EngineConfig-based EngineParam. |
| src/turbomind/models/llama/SequenceManager.h | Updates ctor signature to explicit scalar params (no ModelParam). |
| src/turbomind/models/llama/SequenceManager.cc | Implements updated SequenceManager state sizing and cache layout. |
| src/turbomind/models/llama/LlamaWeight.h | Removes old monolithic LlamaWeight. |
| src/turbomind/models/llama/LlamaWeight.cc | Removes old monolithic LlamaWeight implementation. |
| src/turbomind/models/llama/LlamaLinear.h | Switches linear ops to new LinearWeight. |
| src/turbomind/models/llama/LlamaLinear.cu | Implements GEMM path using LinearWeight formats/descriptors. |
| src/turbomind/models/llama/LlamaFfnLayer.h | Refactors FFN layer to consume FfnWeight. |
| src/turbomind/models/llama/LlamaFfnLayer.cc | Updates FFN forward path for new weight module layout. |
| src/turbomind/models/llama/LlamaDenseWeight.h | Removes old dense/attention/ffn weight structs. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.h | Removes old llama-specific decoder layer weight. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Removes old llama-specific decoder layer weight impl. |
| src/turbomind/models/llama/GatedDeltaNetWeight.h | Removes old DeltaNet weight module. |
| src/turbomind/models/llama/GatedDeltaNetWeight.cc | Removes old DeltaNet weight module impl. |
| src/turbomind/models/llama/GatedDeltaNetLayer.h | Updates GDN layer to consume DeltaNetWeight. |
| src/turbomind/models/llama/CMakeLists.txt | Adjusts llama static lib sources (legacy pieces removed). |
| src/turbomind/models/linear_weight.h | Adds new LinearWeight module and format helpers. |
| src/turbomind/models/language_model.h | Switches LanguageModel to accept ModelWeight. |
| src/turbomind/models/input_processor.h | Refactors ctor to avoid ModelParam dependency. |
| src/turbomind/models/input_processor.cc | Implements updated ctor; allocates embed buffers from explicit dims/dtype. |
| src/turbomind/models/ffn_weight.h | Adds FfnWeight module and config. |
| src/turbomind/models/ffn_weight.cc | Implements FfnWeight::prepare (epilogue + grouped flag propagation). |
| src/turbomind/models/delta_net_weight.h | Adds DeltaNetWeight module and config. |
| src/turbomind/models/delta_net_weight.cc | Implements DeltaNetWeight::prepare dtype enforcement. |
| src/turbomind/models/decoder_layer_weight.h | Adds architecture-independent DecoderLayerWeight composite. |
| src/turbomind/models/decoder_layer_weight.cc | Implements verify rules and registers the module. |
| src/turbomind/models/attention_weight.h | Adds AttentionWeight module and embedded RopeConfig. |
| src/turbomind/models/attention_weight.cc | Implements rope kernel param init and registers AttentionWeight. |
| src/turbomind/models/CMakeLists.txt | Adds new module sources to models library; removes legacy llama weight sources. |
| src/turbomind/kernels/quantization.cu | Makes QuantizeSymm dtype-dispatched. |
| src/turbomind/kernels/gemm/convert_v3.cu | Comment tweak for “no quantization” case. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Comments out legacy gemm test executables. |
| src/turbomind/engine/engine_config.h | Introduces EngineConfig struct (X-macro fields). |
| src/turbomind/engine/engine.h | Updates Engine ctor signature (now takes ModelWeight). |
| src/turbomind/engine/engine.cc | Refactors Engine to derive runtime fields from ModelWeight rather than ModelParam. |
| src/turbomind/core/test_data_format.cc | Adds Catch2 tests for DataFormat/ResolveLinearWeightFormat. |
| src/turbomind/core/registry.h | Adds module type registry + registration macro. |
| src/turbomind/core/registry.cc | Implements module registry. |
| src/turbomind/core/module.cc | Rewrites module base + ModuleList implementation and hooks up registry-based creation. |
| src/turbomind/core/data_format.h | Adds DataFormat + quant-param descriptors and helpers. |
| src/turbomind/core/data_format.cc | Implements DataFormat logic and ResolveLinearWeightFormat. |
| src/turbomind/core/CMakeLists.txt | Builds new core sources + adds data_format test. |
| src/turbomind/CMakeLists.txt | Adjusts turbomind link libs (removes yaml-cpp). |
| scripts/test_turbomind_model.py | Adds a CLI smoke-test script for TurboMind models. |
| lmdeploy/turbomind/supported_models.py | Narrows/updates supported arch mapping and simplifies checks. |
| lmdeploy/turbomind/spec.py | Adds TextModelSpec base (HF parsing → C++ configs + weight commits). |
| lmdeploy/turbomind/models/base.py | Introduces new INPUT_MODELS registry for spec classes. |
| lmdeploy/turbomind/models/__init__.py | Imports/registers available specs. |
| lmdeploy/turbomind/model_loader.py | Adds ModelLoader to bind runtime handles and load weights into TM. |
| lmdeploy/turbomind/loader.py | Adds all_items() API to loaders for spec-driven loading. |
| lmdeploy/turbomind/linear.py | Adds Linear bundle type and padding/concat helpers. |
| lmdeploy/turbomind/deploy/target_model/fp.py | Removes legacy deploy output model stub. |
| lmdeploy/turbomind/deploy/target_model/__init__.py | Removes legacy deploy target_model exports. |
| lmdeploy/turbomind/deploy/source_model/xcomposer2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/molmo.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/mixtral.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/minicpmv.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/llava.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internvl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internlm2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/gpt_oss.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek_vl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/base.py | Removes legacy deploy registries/base classes. |
| lmdeploy/turbomind/deploy/source_model/baichuan.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/__init__.py | Removes legacy deploy source_model imports. |
| lmdeploy/turbomind/deploy/policy.py | Removes legacy tensor processing policy helpers. |
| lmdeploy/turbomind/deploy/parameter.py | Removes legacy parameter export utilities. |
| lmdeploy/turbomind/deploy/config.py | Removes legacy turbomind model config dataclasses. |
| lmdeploy/turbomind/deploy/__init__.py | Removes legacy deploy package init. |
| lmdeploy/turbomind/builders/norm.py | Adds builder for Norm module commits. |
| lmdeploy/turbomind/builders/moe.py | Adds builder for MoE non-expert params and gate commits. |
| lmdeploy/turbomind/builders/module_list.py | Adds builder for ModuleList container commits. |
| lmdeploy/turbomind/builders/mla.py | Adds MLA fold/pad pipeline + builder. |
| lmdeploy/turbomind/builders/deltanet.py | Adds DeltaNet fusion helpers + builder. |
| lmdeploy/turbomind/builders/decoder_layer.py | Adds a decoder-layer container builder. |
| lmdeploy/turbomind/builders/attention.py | Adds attention fusion pipeline + builder. |
| lmdeploy/turbomind/builders/__init__.py | Exposes builder APIs. |
| lmdeploy/messages.py | Changes Response.__repr__ formatting. |
| lmdeploy/archs.py | Changes ImportError handling in backend auto-selection. |
```cpp
void invokeDtypeCast(
    void* dst, const void* src, size_t count, DataType dst_dtype, DataType src_dtype, cudaStream_t stream)
{
    const int block = 512;
    const int grid  = std::min((count + block - 1) / block, (size_t)8192);

    using half_t = turbomind::half_t;
    using bf16_t = turbomind::bfloat16_t;

    // fp32 -> fp16
    if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const float*)src, count);
    }
    // fp32 -> bf16
    else if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const float*)src, count);
    }
    // fp16 -> fp32
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp32
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const bf16_t*)src, count);
    }
    // fp16 -> bf16
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp16
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const bf16_t*)src, count);
    }
}
```
invokeDtypeCast can launch a kernel with grid==0 when count==0 (CUDA launch error) and silently does nothing for unsupported dtype pairs (no else/check). Add an early return when count==0 and add a failure path (e.g., TM_CHECK/error) for unsupported (src_dtype, dst_dtype); also consider checking/propagating CUDA launch errors for easier debugging.
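The guard pattern this comment asks for can be sketched on a simplified host-side dispatcher. `DType`, `cast_supported`, and `checked_cast` below are hypothetical stand-ins for turbomind's `DataType`, the kernel's supported-pair table, and `invokeDtypeCast` itself; the kernel launch is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for turbomind's DataType.
enum class DType { kFloat32, kFloat16, kBfloat16 };

// True only for the (src, dst) pairs the kernel handles: any cast
// between distinct fp32/fp16/bf16 types; same-type "casts" are not.
bool cast_supported(DType src, DType dst) {
    return src != dst;
}

// Mirrors the suggested control flow for invokeDtypeCast:
// early-return on count == 0 (avoids a grid==0 launch) and fail
// loudly on unsupported dtype pairs instead of silently doing nothing.
void checked_cast(std::size_t count, DType src, DType dst) {
    if (count == 0) {
        return;  // no work: skip the launch entirely
    }
    if (!cast_supported(src, dst)) {
        throw std::invalid_argument("invokeDtypeCast: unsupported dtype pair");
    }
    // ... launch dtype_cast_kernel<<<grid, block, 0, stream>>>(...)
    //     and check cudaGetLastError() here ...
}
```

In the real kernel a `TM_CHECK`-style macro would replace the exception, and a `cudaGetLastError()` check after the launch would surface launch failures at the call site.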
```python
def test_ffn_reader_kind_none():
    """FFN readers must handle kind=None (returns filter list, not tensors).
```
test_ffn_reader_kind_none still depends on the legacy lmdeploy.turbomind.deploy.source_model.* reader classes, but those deploy/source_model modules are deleted in this PR, so this test will now fail with ImportError. Update the test to exercise the new spec/builder codepaths (or remove it if the old reader API is intentionally removed).
```python
import torch

from lmdeploy.turbomind.deploy import converter
from lmdeploy.turbomind.deploy.parameter import QuantWeightOnly, pack_u4_row
from lmdeploy.turbomind.deploy.source_model.qwen import Qwen3_5ReaderMixin
```
This test still imports lmdeploy.turbomind.deploy.parameter and lmdeploy.turbomind.deploy.source_model.qwen, but the entire lmdeploy.turbomind.deploy package (including parameter.py and source_model/) is removed in this PR. As-is this file will fail on import; update it to the new module locations (or rewrite the test around the new weight-loading API).
```python
    from lmdeploy.turbomind.supported_models import is_supported as is_supported_turbomind
    turbomind_has = is_supported_turbomind(model_path)
except ImportError:
    raise
```
The except ImportError: block now immediately re-raises, making the subsequent fallback (is_turbomind_installed = False and the warning path) unreachable. Either remove the raise to preserve the intended fallback-to-pytorch behavior, or remove the dead code and let the ImportError propagate consistently.
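The fallback behavior the comment describes can be sketched as follows. This is an illustrative reconstruction, not the exact `lmdeploy.archs` code: `pick_backend` and its injectable `is_supported_turbomind` parameter are hypothetical names used to make the pattern testable.

```python
import logging

logger = logging.getLogger('lmdeploy')


def pick_backend(model_path, is_supported_turbomind=None):
    """Return 'turbomind' when importable and supported, else 'pytorch'."""
    try:
        if is_supported_turbomind is None:
            # The real code imports the checker here; an ImportError means
            # turbomind is not installed.
            from lmdeploy.turbomind.supported_models import \
                is_supported as is_supported_turbomind
        turbomind_has = is_supported_turbomind(model_path)
    except ImportError:
        # Fall back instead of re-raising, so the warning path stays reachable.
        logger.warning('turbomind is not installed, falling back to pytorch')
        return 'pytorch'
    return 'turbomind' if turbomind_has else 'pytorch'
```

With the bare `raise` removed, an environment without turbomind degrades to the pytorch backend with a warning rather than crashing at import time.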
```diff
 def __repr__(self):
-    return f'text={self.text!r}\n{self._format_none_text_fields()}'
+    return f'text={self.text}\n{self._format_none_text_fields()}'
```
Response.__repr__ no longer uses !r (and is now identical to __str__), which makes debugging/logging ambiguous (e.g., newlines/quotes in text are no longer escaped). Restore a proper __repr__ (e.g., use self.text!r and ideally include the class name) so repr(response) is unambiguous.
```diff
-    return f'text={self.text}\n{self._format_none_text_fields()}'
+    return f'{self.__class__.__name__}(text={self.text!r}\n{self._format_none_text_fields()})'
```
```python
@pytest.fixture(autouse=True)
def _caplog_lmdeploy(caplog):
    caplog.set_level(logging.WARNING, logger='lmdeploy')
    logger = logging.getLogger('lmdeploy')
    logger.propagate = True
    yield
    logger.propagate = False
```
The autouse fixture forces logger.propagate to True then unconditionally sets it to False afterwards, which can leak logging configuration into other tests (it doesn't restore the prior value). Capture the original propagate value before mutating it and restore that value in the teardown.
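The snapshot-and-restore pattern the comment suggests can be sketched as a plain context manager; in the test suite the same body would live inside the autouse pytest fixture alongside `caplog.set_level`.

```python
import logging
from contextlib import contextmanager


@contextmanager
def propagate_lmdeploy_logger():
    """Force propagation on for the 'lmdeploy' logger, restoring the
    caller's original setting on exit (the review's suggested fix)."""
    logger = logging.getLogger('lmdeploy')
    original = logger.propagate      # snapshot before mutating
    logger.propagate = True
    try:
        yield logger
    finally:
        logger.propagate = original  # restore the prior value, not a hard-coded False
```

Because the teardown restores whatever value was in effect before, tests that deliberately run with `propagate = True` are no longer silenced by this fixture's cleanup.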
- Move dequant/transform utilities from _base.py into linear.py as the canonical home for all Linear operations
- Unify _ensure_compatible_formats and dequant_mixed into a single dequant_mixed function that triggers on any format diversity
- Drop the 'Spec' suffix from all turbomind model classes and files (TextModelSpec → TextModel, Qwen3TextSpec → Qwen3TextModel, etc.)
- Extract TextModelBuilder from _base.py into builders/text_model.py
- Move model-specific qk_norm from TextModel to Qwen3 and Qwen3.5
- Fix .gitignore typo (trubomind → turbomind)
For the /nvme4/huggingface_hub/hub/models--Qwen--Qwen3.5-2B/snapshots/15852e8c16360a2fea060d615a32b45270f8a8fc/ model, the results on this branch differ from those of the main branch.
…anup Align all turbomind source models with Qwen3 conventions:
- Drop engine_cfg from model signatures, wire data_type via Context
- Add Context, ParallelGroup, make_moe_config, make_mla_config helpers
- Collapse make_*_config functions by removing per-function data_type
- Remove dead fields from C++ configs (has_bias, hidden_dim, etc.)
- Remove _layer_pattern, _embed_key, _norm_key from all models
- Unify FFN padding with group-based pad/round_up helpers
- Add TP padding for block-quantized formats and GEMM K-alignment
- Remove dead code: _pad_1d, _norm, pad_in_dim, _softmax_scale
- Add InternVL3.5, InternLM2/3, Llama turbomind support
- Rename fused_moe to is_expert, align Python/C++ config fields
- Use direct HF config access, Transformers type hints, all-params loader
- Clean up imports, docstrings, formatting across all model files
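A group-based pad/round_up helper of the kind this commit message describes can be sketched as below. The names and signatures (`round_up`, `padded_ffn_dim`) are illustrative, not lmdeploy's actual API.

```python
def round_up(x: int, align: int) -> int:
    """Smallest multiple of `align` that is >= x."""
    return (x + align - 1) // align * align


def padded_ffn_dim(inter_size: int, group_size: int, tp: int) -> int:
    """Pad the FFN intermediate size so each tensor-parallel shard
    stays aligned to the (quantization) group size."""
    per_rank = round_up(inter_size, tp) // tp   # even split across TP ranks
    return round_up(per_rank, group_size) * tp  # align each shard, rejoin
```

For example, an intermediate size of 1000 split across 2 ranks with 128-element quantization groups pads each 500-element shard up to 512, giving a padded total of 1024; sizes that already divide evenly are left unchanged.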
The raise in archs.py was a debugging leftover. The repr in messages.py needs !r to properly escape control characters.