Update turbomind modeling infrastructure#4557
lzhangzz wants to merge 6 commits into InternLM:main
Conversation
…t loading, and model loader (674 squashed commits): reorganize the turbomind directory structure, refactor weight loading to support heterogeneous weight data types, add a WeightFormat enum, replace BaseOutputModel/TextModelLoader with a unified ModelLoader, eliminate data_format threading from Linear, and remove dead code.
Pull request overview
This PR refactors TurboMind’s modeling + conversion stack by replacing the legacy “deploy/source_model + config dataclasses” pipeline with a spec/builder-driven module system, adding a C++ module registry and new weight-module types, and updating engine/model code to consume the new weight tree.
Changes:
- Introduces a registry-backed C++ `core::Module` infrastructure (plus `DataFormat`) and new modular weight classes (Linear/Norm/Attention/FFN/MoE/DeltaNet/ModelRoot/ModelWeight).
- Reworks the Python-side TurboMind converter to use `TextModelSpec` + builders/model loader, and removes the legacy `lmdeploy.turbomind.deploy` pipeline.
- Updates engine/model runtime plumbing (TurboMind API, Engine/SequenceManager, llama layers) to use the new module/weight tree.
Reviewed changes
Copilot reviewed 131 out of 131 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_converter.py | Removes legacy converter tests; leaves a remaining test that still references removed legacy modules. |
| tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py | Adjusts compressed-tensors tests but still imports removed legacy deploy modules. |
| tests/test_lmdeploy/test_converter.py | Adds tests for _deep_merge plus a logging capture fixture. |
| src/turbomind/utils/memory_utils.h | Declares dtype-cast kernel + in-place ensure-float-dtype helper. |
| src/turbomind/utils/memory_utils.cu | Implements dtype casting and EnsureFloatDtype. |
| src/turbomind/turbomind.h | Updates TurboMind API to accept EngineConfig and expose module roots + TP ranks. |
| src/turbomind/python/CMakeLists.txt | Ensures static registrars are linked into the Python extension via --whole-archive. |
| src/turbomind/models/output_processor.h | Refactors ctor signature to avoid ModelParam dependency. |
| src/turbomind/models/output_processor.cc | Implements updated OutputProcessor ctor signature. |
| src/turbomind/models/norm_weight.h | Adds a NormWeight module type. |
| src/turbomind/models/norm_weight.cc | Registers and prepares NormWeight (dtype ensure). |
| src/turbomind/models/moe_weight.h | Adds a modular MoeWeight definition/config. |
| src/turbomind/models/moe_weight.cc | Implements MoE expert linking into a fused block view. |
| src/turbomind/models/model_weight.h | Adds root ModelWeight module for full weight tree. |
| src/turbomind/models/model_weight.cc | Implements ModelWeight prepare/verify + derived metadata. |
| src/turbomind/models/model_root.h | Adds ModelRoot sentinel for stream/allocator ownership. |
| src/turbomind/models/model_root.cc | Implements ModelRoot runtime context + prepare checks. |
| src/turbomind/models/llama/unified_decoder.h | Updates decoder to consume ModelWeight/DecoderLayerWeight. |
| src/turbomind/models/llama/unified_attention_layer.h | Refactors attention layer to use new AttentionWeight and rope config. |
| src/turbomind/models/llama/moe_ffn_layer.h | Refactors MoE FFN layer to use MoeWeight. |
| src/turbomind/models/llama/llama_rope.h | Moves rope param helpers out (now in AttentionWeight impl). |
| src/turbomind/models/llama/llama_params.h | Replaces model/attn/moe params with EngineConfig-based EngineParam. |
| src/turbomind/models/llama/SequenceManager.h | Updates ctor signature to explicit scalar params (no ModelParam). |
| src/turbomind/models/llama/SequenceManager.cc | Implements updated SequenceManager state sizing and cache layout. |
| src/turbomind/models/llama/LlamaWeight.h | Removes old monolithic LlamaWeight. |
| src/turbomind/models/llama/LlamaWeight.cc | Removes old monolithic LlamaWeight implementation. |
| src/turbomind/models/llama/LlamaLinear.h | Switches linear ops to new LinearWeight. |
| src/turbomind/models/llama/LlamaLinear.cu | Implements GEMM path using LinearWeight formats/descriptors. |
| src/turbomind/models/llama/LlamaFfnLayer.h | Refactors FFN layer to consume FfnWeight. |
| src/turbomind/models/llama/LlamaFfnLayer.cc | Updates FFN forward path for new weight module layout. |
| src/turbomind/models/llama/LlamaDenseWeight.h | Removes old dense/attention/ffn weight structs. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.h | Removes old llama-specific decoder layer weight. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Removes old llama-specific decoder layer weight impl. |
| src/turbomind/models/llama/GatedDeltaNetWeight.h | Removes old DeltaNet weight module. |
| src/turbomind/models/llama/GatedDeltaNetWeight.cc | Removes old DeltaNet weight module impl. |
| src/turbomind/models/llama/GatedDeltaNetLayer.h | Updates GDN layer to consume DeltaNetWeight. |
| src/turbomind/models/llama/CMakeLists.txt | Adjusts llama static lib sources (legacy pieces removed). |
| src/turbomind/models/linear_weight.h | Adds new LinearWeight module and format helpers. |
| src/turbomind/models/language_model.h | Switches LanguageModel to accept ModelWeight. |
| src/turbomind/models/input_processor.h | Refactors ctor to avoid ModelParam dependency. |
| src/turbomind/models/input_processor.cc | Implements updated ctor; allocates embed buffers from explicit dims/dtype. |
| src/turbomind/models/ffn_weight.h | Adds FfnWeight module and config. |
| src/turbomind/models/ffn_weight.cc | Implements FfnWeight::prepare (epilogue + grouped flag propagation). |
| src/turbomind/models/delta_net_weight.h | Adds DeltaNetWeight module and config. |
| src/turbomind/models/delta_net_weight.cc | Implements DeltaNetWeight::prepare dtype enforcement. |
| src/turbomind/models/decoder_layer_weight.h | Adds architecture-independent DecoderLayerWeight composite. |
| src/turbomind/models/decoder_layer_weight.cc | Implements verify rules and registers the module. |
| src/turbomind/models/attention_weight.h | Adds AttentionWeight module and embedded RopeConfig. |
| src/turbomind/models/attention_weight.cc | Implements rope kernel param init and registers AttentionWeight. |
| src/turbomind/models/CMakeLists.txt | Adds new module sources to models library; removes legacy llama weight sources. |
| src/turbomind/kernels/quantization.cu | Makes QuantizeSymm dtype-dispatched. |
| src/turbomind/kernels/gemm/convert_v3.cu | Comment tweak for “no quantization” case. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Comments out legacy gemm test executables. |
| src/turbomind/engine/engine_config.h | Introduces EngineConfig struct (X-macro fields). |
| src/turbomind/engine/engine.h | Updates Engine ctor signature (now takes ModelWeight). |
| src/turbomind/engine/engine.cc | Refactors Engine to derive runtime fields from ModelWeight rather than ModelParam. |
| src/turbomind/core/test_data_format.cc | Adds Catch2 tests for DataFormat/ResolveLinearWeightFormat. |
| src/turbomind/core/registry.h | Adds module type registry + registration macro. |
| src/turbomind/core/registry.cc | Implements module registry. |
| src/turbomind/core/module.cc | Rewrites module base + ModuleList implementation and hooks up registry-based creation. |
| src/turbomind/core/data_format.h | Adds DataFormat + quant-param descriptors and helpers. |
| src/turbomind/core/data_format.cc | Implements DataFormat logic and ResolveLinearWeightFormat. |
| src/turbomind/core/CMakeLists.txt | Builds new core sources + adds data_format test. |
| src/turbomind/CMakeLists.txt | Adjusts turbomind link libs (removes yaml-cpp). |
| scripts/test_turbomind_model.py | Adds a CLI smoke-test script for TurboMind models. |
| lmdeploy/turbomind/supported_models.py | Narrows/updates supported arch mapping and simplifies checks. |
| lmdeploy/turbomind/spec.py | Adds TextModelSpec base (HF parsing → C++ configs + weight commits). |
| lmdeploy/turbomind/models/base.py | Introduces new INPUT_MODELS registry for spec classes. |
| lmdeploy/turbomind/models/__init__.py | Imports/registers available specs. |
| lmdeploy/turbomind/model_loader.py | Adds ModelLoader to bind runtime handles and load weights into TM. |
| lmdeploy/turbomind/loader.py | Adds all_items() API to loaders for spec-driven loading. |
| lmdeploy/turbomind/linear.py | Adds Linear bundle type and padding/concat helpers. |
| lmdeploy/turbomind/deploy/target_model/fp.py | Removes legacy deploy output model stub. |
| lmdeploy/turbomind/deploy/target_model/__init__.py | Removes legacy deploy target_model exports. |
| lmdeploy/turbomind/deploy/source_model/xcomposer2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/molmo.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/mixtral.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/minicpmv.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/llava.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internvl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internlm2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/gpt_oss.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek_vl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/base.py | Removes legacy deploy registries/base classes. |
| lmdeploy/turbomind/deploy/source_model/baichuan.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/__init__.py | Removes legacy deploy source_model imports. |
| lmdeploy/turbomind/deploy/policy.py | Removes legacy tensor processing policy helpers. |
| lmdeploy/turbomind/deploy/parameter.py | Removes legacy parameter export utilities. |
| lmdeploy/turbomind/deploy/config.py | Removes legacy turbomind model config dataclasses. |
| lmdeploy/turbomind/deploy/__init__.py | Removes legacy deploy package init. |
| lmdeploy/turbomind/builders/norm.py | Adds builder for Norm module commits. |
| lmdeploy/turbomind/builders/moe.py | Adds builder for MoE non-expert params and gate commits. |
| lmdeploy/turbomind/builders/module_list.py | Adds builder for ModuleList container commits. |
| lmdeploy/turbomind/builders/mla.py | Adds MLA fold/pad pipeline + builder. |
| lmdeploy/turbomind/builders/deltanet.py | Adds DeltaNet fusion helpers + builder. |
| lmdeploy/turbomind/builders/decoder_layer.py | Adds a decoder-layer container builder. |
| lmdeploy/turbomind/builders/attention.py | Adds attention fusion pipeline + builder. |
| lmdeploy/turbomind/builders/__init__.py | Exposes builder APIs. |
| lmdeploy/messages.py | Changes Response.__repr__ formatting. |
| lmdeploy/archs.py | Changes ImportError handling in backend auto-selection. |
```cpp
void invokeDtypeCast(
    void* dst, const void* src, size_t count, DataType dst_dtype, DataType src_dtype, cudaStream_t stream)
{
    const int block = 512;
    const int grid  = std::min((count + block - 1) / block, (size_t)8192);

    using half_t = turbomind::half_t;
    using bf16_t = turbomind::bfloat16_t;

    // fp32 -> fp16
    if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const float*)src, count);
    }
    // fp32 -> bf16
    else if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const float*)src, count);
    }
    // fp16 -> fp32
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp32
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const bf16_t*)src, count);
    }
    // fp16 -> bf16
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp16
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const bf16_t*)src, count);
    }
}
```
invokeDtypeCast can launch a kernel with grid==0 when count==0 (CUDA launch error) and silently does nothing for unsupported dtype pairs (no else/check). Add an early return when count==0 and add a failure path (e.g., TM_CHECK/error) for unsupported (src_dtype, dst_dtype); also consider checking/propagating CUDA launch errors for easier debugging.
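The guard pattern this comment asks for can be sketched on a simplified host-side dispatcher. `DType`, `cast_supported`, and `checked_cast` below are hypothetical stand-ins for turbomind's `DataType`, the kernel's supported-pair table, and `invokeDtypeCast` itself; the kernel launch is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for turbomind's DataType.
enum class DType { kFloat32, kFloat16, kBfloat16 };

// True only for the (src, dst) pairs the kernel handles: any cast
// between distinct fp32/fp16/bf16 types; same-type "casts" are not.
bool cast_supported(DType src, DType dst) {
    return src != dst;
}

// Mirrors the suggested control flow for invokeDtypeCast:
// early-return on count == 0 (avoids a grid==0 launch) and fail
// loudly on unsupported dtype pairs instead of silently doing nothing.
void checked_cast(std::size_t count, DType src, DType dst) {
    if (count == 0) {
        return;  // no work: skip the launch entirely
    }
    if (!cast_supported(src, dst)) {
        throw std::invalid_argument("invokeDtypeCast: unsupported dtype pair");
    }
    // ... launch dtype_cast_kernel<<<grid, block, 0, stream>>>(...)
    //     and check cudaGetLastError() here ...
}
```

In the real kernel a `TM_CHECK`-style macro would replace the exception, and a `cudaGetLastError()` check after the launch would surface launch failures at the call site.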
```python
def test_ffn_reader_kind_none():
    """FFN readers must handle kind=None (returns filter list, not tensors).
```
test_ffn_reader_kind_none still depends on the legacy lmdeploy.turbomind.deploy.source_model.* reader classes, but those deploy/source_model modules are deleted in this PR, so this test will now fail with ImportError. Update the test to exercise the new spec/builder codepaths (or remove it if the old reader API is intentionally removed).
```python
import torch

from lmdeploy.turbomind.deploy import converter
from lmdeploy.turbomind.deploy.parameter import QuantWeightOnly, pack_u4_row
from lmdeploy.turbomind.deploy.source_model.qwen import Qwen3_5ReaderMixin
```
This test still imports lmdeploy.turbomind.deploy.parameter and lmdeploy.turbomind.deploy.source_model.qwen, but the entire lmdeploy.turbomind.deploy package (including parameter.py and source_model/) is removed in this PR. As-is this file will fail on import; update it to the new module locations (or rewrite the test around the new weight-loading API).
```python
    from lmdeploy.turbomind.supported_models import is_supported as is_supported_turbomind
    turbomind_has = is_supported_turbomind(model_path)
except ImportError:
    raise
```
The except ImportError: block now immediately re-raises, making the subsequent fallback (is_turbomind_installed = False and the warning path) unreachable. Either remove the raise to preserve the intended fallback-to-pytorch behavior, or remove the dead code and let the ImportError propagate consistently.
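The fallback behavior the comment describes can be sketched as follows. This is an illustrative reconstruction, not the exact `lmdeploy.archs` code: `pick_backend` and its injectable `is_supported_turbomind` parameter are hypothetical names used to make the pattern testable.

```python
import logging

logger = logging.getLogger('lmdeploy')


def pick_backend(model_path, is_supported_turbomind=None):
    """Return 'turbomind' when importable and supported, else 'pytorch'."""
    try:
        if is_supported_turbomind is None:
            # The real code imports the checker here; an ImportError means
            # turbomind is not installed.
            from lmdeploy.turbomind.supported_models import \
                is_supported as is_supported_turbomind
        turbomind_has = is_supported_turbomind(model_path)
    except ImportError:
        # Fall back instead of re-raising, so the warning path stays reachable.
        logger.warning('turbomind is not installed, falling back to pytorch')
        return 'pytorch'
    return 'turbomind' if turbomind_has else 'pytorch'
```

With the bare `raise` removed, an environment without turbomind degrades to the pytorch backend with a warning rather than crashing at import time.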
```diff
 def __repr__(self):
-    return f'text={self.text!r}\n{self._format_none_text_fields()}'
+    return f'text={self.text}\n{self._format_none_text_fields()}'
```
Response.__repr__ no longer uses !r (and is now identical to __str__), which makes debugging/logging ambiguous (e.g., newlines/quotes in text are no longer escaped). Restore a proper __repr__ (e.g., use self.text!r and ideally include the class name) so repr(response) is unambiguous.
```diff
-    return f'text={self.text}\n{self._format_none_text_fields()}'
+    return f'{self.__class__.__name__}(text={self.text!r}\n{self._format_none_text_fields()})'
```
```python
@pytest.fixture(autouse=True)
def _caplog_lmdeploy(caplog):
    caplog.set_level(logging.WARNING, logger='lmdeploy')
    logger = logging.getLogger('lmdeploy')
    logger.propagate = True
    yield
    logger.propagate = False
```
The autouse fixture forces logger.propagate to True then unconditionally sets it to False afterwards, which can leak logging configuration into other tests (it doesn't restore the prior value). Capture the original propagate value before mutating it and restore that value in the teardown.
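The snapshot-and-restore pattern the comment suggests can be sketched as a plain context manager; in the test suite the same body would live inside the autouse pytest fixture alongside `caplog.set_level`.

```python
import logging
from contextlib import contextmanager


@contextmanager
def propagate_lmdeploy_logger():
    """Force propagation on for the 'lmdeploy' logger, restoring the
    caller's original setting on exit (the review's suggested fix)."""
    logger = logging.getLogger('lmdeploy')
    original = logger.propagate      # snapshot before mutating
    logger.propagate = True
    try:
        yield logger
    finally:
        logger.propagate = original  # restore the prior value, not a hard-coded False
```

Because the teardown restores whatever value was in effect before, tests that deliberately run with `propagate = True` are no longer silenced by this fixture's cleanup.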
- Move dequant/transform utilities from _base.py into linear.py as the canonical home for all Linear operations
- Unify _ensure_compatible_formats and dequant_mixed into a single dequant_mixed function that triggers on any format diversity
- Drop the 'Spec' suffix from all turbomind model classes and files (TextModelSpec → TextModel, Qwen3TextSpec → Qwen3TextModel, etc.)
- Extract TextModelBuilder from _base.py into builders/text_model.py
- Move model-specific qk_norm from TextModel to Qwen3 and Qwen3.5
- Fix .gitignore typo (trubomind → turbomind)
For the /nvme4/huggingface_hub/hub/models--Qwen--Qwen3.5-2B/snapshots/15852e8c16360a2fea060d615a32b45270f8a8fc/ model, the results on this branch differ from those of the main branch.
…anup Align all turbomind source models with Qwen3 conventions:
- Drop engine_cfg from model signatures, wire data_type via Context
- Add Context, ParallelGroup, make_moe_config, make_mla_config helpers
- Collapse make_*_config functions by removing per-function data_type
- Remove dead fields from C++ configs (has_bias, hidden_dim, etc.)
- Remove _layer_pattern, _embed_key, _norm_key from all models
- Unify FFN padding with group-based pad/round_up helpers
- Add TP padding for block-quantized formats and GEMM K-alignment
- Remove dead code: _pad_1d, _norm, pad_in_dim, _softmax_scale
- Add InternVL3.5, InternLM2/3, Llama turbomind support
- Rename fused_moe to is_expert, align Python/C++ config fields
- Use direct HF config access, Transformers type hints, all-params loader
- Clean up imports, docstrings, formatting across all model files
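A group-based pad/round_up helper of the kind this commit message describes can be sketched as below. The names and signatures (`round_up`, `padded_ffn_dim`) are illustrative, not lmdeploy's actual API.

```python
def round_up(x: int, align: int) -> int:
    """Smallest multiple of `align` that is >= x."""
    return (x + align - 1) // align * align


def padded_ffn_dim(inter_size: int, group_size: int, tp: int) -> int:
    """Pad the FFN intermediate size so each tensor-parallel shard
    stays aligned to the (quantization) group size."""
    per_rank = round_up(inter_size, tp) // tp   # even split across TP ranks
    return round_up(per_rank, group_size) * tp  # align each shard, rejoin
```

For example, an intermediate size of 1000 split across 2 ranks with 128-element quantization groups pads each 500-element shard up to 512, giving a padded total of 1024; sizes that already divide evenly are left unchanged.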
The raise in archs.py was a debugging leftover. The repr in messages.py needs !r to properly escape control characters.