Update turbomind modeling infrastructure #4557

Open
lzhangzz wants to merge 6 commits into InternLM:main from lzhangzz:modeling-1b

Conversation

@lzhangzz
Collaborator

No description provided.

…t loading, and model loader

674 squashed commits: reorganize turbomind directory structure, refactor
weight loading to support heterogeneous weight data types, add WeightFormat
enum, replace BaseOutputModel/TextModelLoader with unified ModelLoader,
eliminate data_format threading from Linear, and remove dead code.
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors TurboMind’s modeling + conversion stack by replacing the legacy “deploy/source_model + config dataclasses” pipeline with a spec/builder-driven module system, adding a C++ module registry and new weight-module types, and updating engine/model code to consume the new weight tree.

Changes:

  • Introduces a registry-backed C++ core::Module infrastructure (plus DataFormat) and new modular weight classes (Linear/Norm/Attention/FFN/MoE/DeltaNet/ModelRoot/ModelWeight).
  • Reworks the Python-side TurboMind converter to use TextModelSpec + builders/model loader, and removes the legacy lmdeploy.turbomind.deploy pipeline.
  • Updates engine/model runtime plumbing (TurboMind API, Engine/SequenceManager, llama layers) to use the new module/weight tree.
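For readers unfamiliar with the pattern behind the first bullet: a registry-backed module system typically pairs a name-to-factory map with a static registration macro, which is also why the build has to link the registrars with --whole-archive (see src/turbomind/python/CMakeLists.txt in the file table below). The sketch that follows is illustrative only; ModuleRegistry, TM_REGISTER_MODULE, and the trivial Module base are assumed names, not the actual core::Module / registry.h API introduced by this PR.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for the real core::Module base class.
struct Module {
    virtual ~Module() = default;
};

// Maps a module type name to a factory that creates an instance of it.
class ModuleRegistry {
public:
    using Factory = std::function<std::unique_ptr<Module>()>;

    static ModuleRegistry& Instance()
    {
        static ModuleRegistry registry;
        return registry;
    }

    void Register(const std::string& type, Factory factory)
    {
        factories_[type] = std::move(factory);
    }

    std::unique_ptr<Module> Create(const std::string& type) const
    {
        auto it = factories_.find(type);
        if (it == factories_.end()) {
            throw std::runtime_error("unknown module type: " + type);
        }
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};

// Registration runs from a static initializer, so merely linking the translation
// unit registers the type; this is why the Python extension is linked with
// --whole-archive to keep the registrars from being dropped.
#define TM_REGISTER_MODULE(Type)                                        \
    static const bool _registered_##Type = [] {                        \
        ModuleRegistry::Instance().Register(                            \
            #Type, [] { return std::unique_ptr<Module>(new Type{}); }); \
        return true;                                                    \
    }()

struct NormWeight: Module {};
TM_REGISTER_MODULE(NormWeight);  // NormWeight can now be created by name
```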

Reviewed changes

Copilot reviewed 131 out of 131 changed files in this pull request and generated 6 comments.

File Description
tests/test_lmdeploy/test_turbomind/test_converter.py Removes legacy converter tests; leaves a remaining test that still references removed legacy modules.
tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py Adjusts compressed-tensors tests but still imports removed legacy deploy modules.
tests/test_lmdeploy/test_converter.py Adds tests for _deep_merge plus a logging capture fixture.
src/turbomind/utils/memory_utils.h Declares dtype-cast kernel + in-place ensure-float-dtype helper.
src/turbomind/utils/memory_utils.cu Implements dtype casting and EnsureFloatDtype.
src/turbomind/turbomind.h Updates TurboMind API to accept EngineConfig and expose module roots + TP ranks.
src/turbomind/python/CMakeLists.txt Ensures static registrars are linked into the Python extension via --whole-archive.
src/turbomind/models/output_processor.h Refactors ctor signature to avoid ModelParam dependency.
src/turbomind/models/output_processor.cc Implements updated OutputProcessor ctor signature.
src/turbomind/models/norm_weight.h Adds a NormWeight module type.
src/turbomind/models/norm_weight.cc Registers and prepares NormWeight (dtype ensure).
src/turbomind/models/moe_weight.h Adds a modular MoeWeight definition/config.
src/turbomind/models/moe_weight.cc Implements MoE expert linking into a fused block view.
src/turbomind/models/model_weight.h Adds root ModelWeight module for full weight tree.
src/turbomind/models/model_weight.cc Implements ModelWeight prepare/verify + derived metadata.
src/turbomind/models/model_root.h Adds ModelRoot sentinel for stream/allocator ownership.
src/turbomind/models/model_root.cc Implements ModelRoot runtime context + prepare checks.
src/turbomind/models/llama/unified_decoder.h Updates decoder to consume ModelWeight/DecoderLayerWeight.
src/turbomind/models/llama/unified_attention_layer.h Refactors attention layer to use new AttentionWeight and rope config.
src/turbomind/models/llama/moe_ffn_layer.h Refactors MoE FFN layer to use MoeWeight.
src/turbomind/models/llama/llama_rope.h Moves rope param helpers out (now in AttentionWeight impl).
src/turbomind/models/llama/llama_params.h Replaces model/attn/moe params with EngineConfig-based EngineParam.
src/turbomind/models/llama/SequenceManager.h Updates ctor signature to explicit scalar params (no ModelParam).
src/turbomind/models/llama/SequenceManager.cc Implements updated SequenceManager state sizing and cache layout.
src/turbomind/models/llama/LlamaWeight.h Removes old monolithic LlamaWeight.
src/turbomind/models/llama/LlamaWeight.cc Removes old monolithic LlamaWeight implementation.
src/turbomind/models/llama/LlamaLinear.h Switches linear ops to new LinearWeight.
src/turbomind/models/llama/LlamaLinear.cu Implements GEMM path using LinearWeight formats/descriptors.
src/turbomind/models/llama/LlamaFfnLayer.h Refactors FFN layer to consume FfnWeight.
src/turbomind/models/llama/LlamaFfnLayer.cc Updates FFN forward path for new weight module layout.
src/turbomind/models/llama/LlamaDenseWeight.h Removes old dense/attention/ffn weight structs.
src/turbomind/models/llama/LlamaDecoderLayerWeight.h Removes old llama-specific decoder layer weight.
src/turbomind/models/llama/LlamaDecoderLayerWeight.cc Removes old llama-specific decoder layer weight impl.
src/turbomind/models/llama/GatedDeltaNetWeight.h Removes old DeltaNet weight module.
src/turbomind/models/llama/GatedDeltaNetWeight.cc Removes old DeltaNet weight module impl.
src/turbomind/models/llama/GatedDeltaNetLayer.h Updates GDN layer to consume DeltaNetWeight.
src/turbomind/models/llama/CMakeLists.txt Adjusts llama static lib sources (legacy pieces removed).
src/turbomind/models/linear_weight.h Adds new LinearWeight module and format helpers.
src/turbomind/models/language_model.h Switches LanguageModel to accept ModelWeight.
src/turbomind/models/input_processor.h Refactors ctor to avoid ModelParam dependency.
src/turbomind/models/input_processor.cc Implements updated ctor; allocates embed buffers from explicit dims/dtype.
src/turbomind/models/ffn_weight.h Adds FfnWeight module and config.
src/turbomind/models/ffn_weight.cc Implements FfnWeight::prepare (epilogue + grouped flag propagation).
src/turbomind/models/delta_net_weight.h Adds DeltaNetWeight module and config.
src/turbomind/models/delta_net_weight.cc Implements DeltaNetWeight::prepare dtype enforcement.
src/turbomind/models/decoder_layer_weight.h Adds architecture-independent DecoderLayerWeight composite.
src/turbomind/models/decoder_layer_weight.cc Implements verify rules and registers the module.
src/turbomind/models/attention_weight.h Adds AttentionWeight module and embedded RopeConfig.
src/turbomind/models/attention_weight.cc Implements rope kernel param init and registers AttentionWeight.
src/turbomind/models/CMakeLists.txt Adds new module sources to models library; removes legacy llama weight sources.
src/turbomind/kernels/quantization.cu Makes QuantizeSymm dtype-dispatched.
src/turbomind/kernels/gemm/convert_v3.cu Comment tweak for “no quantization” case.
src/turbomind/kernels/gemm/CMakeLists.txt Comments out legacy gemm test executables.
src/turbomind/engine/engine_config.h Introduces EngineConfig struct (X-macro fields).
src/turbomind/engine/engine.h Updates Engine ctor signature (now takes ModelWeight).
src/turbomind/engine/engine.cc Refactors Engine to derive runtime fields from ModelWeight rather than ModelParam.
src/turbomind/core/test_data_format.cc Adds Catch2 tests for DataFormat/ResolveLinearWeightFormat.
src/turbomind/core/registry.h Adds module type registry + registration macro.
src/turbomind/core/registry.cc Implements module registry.
src/turbomind/core/module.cc Rewrites module base + ModuleList implementation and hooks up registry-based creation.
src/turbomind/core/data_format.h Adds DataFormat + quant-param descriptors and helpers.
src/turbomind/core/data_format.cc Implements DataFormat logic and ResolveLinearWeightFormat.
src/turbomind/core/CMakeLists.txt Builds new core sources + adds data_format test.
src/turbomind/CMakeLists.txt Adjusts turbomind link libs (removes yaml-cpp).
scripts/test_turbomind_model.py Adds a CLI smoke-test script for TurboMind models.
lmdeploy/turbomind/supported_models.py Narrows/updates supported arch mapping and simplifies checks.
lmdeploy/turbomind/spec.py Adds TextModelSpec base (HF parsing → C++ configs + weight commits).
lmdeploy/turbomind/models/base.py Introduces new INPUT_MODELS registry for spec classes.
lmdeploy/turbomind/models/__init__.py Imports/registers available specs.
lmdeploy/turbomind/model_loader.py Adds ModelLoader to bind runtime handles and load weights into TM.
lmdeploy/turbomind/loader.py Adds all_items() API to loaders for spec-driven loading.
lmdeploy/turbomind/linear.py Adds Linear bundle type and padding/concat helpers.
lmdeploy/turbomind/deploy/target_model/fp.py Removes legacy deploy output model stub.
lmdeploy/turbomind/deploy/target_model/__init__.py Removes legacy deploy target_model exports.
lmdeploy/turbomind/deploy/source_model/xcomposer2.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/molmo.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/mixtral.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/minicpmv.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/llava.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/internvl.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/internlm2.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/gpt_oss.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/glm4.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/deepseek_vl.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/deepseek2.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/base.py Removes legacy deploy registries/base classes.
lmdeploy/turbomind/deploy/source_model/baichuan.py Removes legacy deploy reader/model.
lmdeploy/turbomind/deploy/source_model/__init__.py Removes legacy deploy source_model imports.
lmdeploy/turbomind/deploy/policy.py Removes legacy tensor processing policy helpers.
lmdeploy/turbomind/deploy/parameter.py Removes legacy parameter export utilities.
lmdeploy/turbomind/deploy/config.py Removes legacy turbomind model config dataclasses.
lmdeploy/turbomind/deploy/__init__.py Removes legacy deploy package init.
lmdeploy/turbomind/builders/norm.py Adds builder for Norm module commits.
lmdeploy/turbomind/builders/moe.py Adds builder for MoE non-expert params and gate commits.
lmdeploy/turbomind/builders/module_list.py Adds builder for ModuleList container commits.
lmdeploy/turbomind/builders/mla.py Adds MLA fold/pad pipeline + builder.
lmdeploy/turbomind/builders/deltanet.py Adds DeltaNet fusion helpers + builder.
lmdeploy/turbomind/builders/decoder_layer.py Adds a decoder-layer container builder.
lmdeploy/turbomind/builders/attention.py Adds attention fusion pipeline + builder.
lmdeploy/turbomind/builders/__init__.py Exposes builder APIs.
lmdeploy/messages.py Changes Response.__repr__ formatting.
lmdeploy/archs.py Changes ImportError handling in backend auto-selection.
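On the engine_config.h entry above: "X-macro fields" refers to the idiom of declaring the field list once as a macro and expanding it several times (member declarations, defaults, serialization, and so on). A minimal sketch of the idiom follows; the field names are purely illustrative and need not match the real EngineConfig.

```cpp
#include <iostream>

// Field list declared once; each expansion site supplies its own X().
// Field names below are illustrative only.
#define ENGINE_CONFIG_FIELDS(X)           \
    X(int, max_batch_size, 64)            \
    X(int, session_len, 4096)             \
    X(float, cache_max_entry_count, 0.8f)

struct EngineConfig {
    // Expansion 1: declare members with their defaults.
#define X(type, name, default_value) type name = default_value;
    ENGINE_CONFIG_FIELDS(X)
#undef X

    // Expansion 2: derive dump() from the same list, so the two never drift apart.
    void dump(std::ostream& os) const
    {
#define X(type, name, default_value) os << #name << " = " << name << "\n";
        ENGINE_CONFIG_FIELDS(X)
#undef X
    }
};

int main()
{
    EngineConfig config;
    config.dump(std::cout);  // prints each field and its current value
}
```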


Comment on lines +77 to +110
```cpp
void invokeDtypeCast(
    void* dst, const void* src, size_t count, DataType dst_dtype, DataType src_dtype, cudaStream_t stream)
{
    const int block = 512;
    const int grid  = std::min((count + block - 1) / block, (size_t)8192);

    using half_t = turbomind::half_t;
    using bf16_t = turbomind::bfloat16_t;

    // fp32 -> fp16
    if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const float*)src, count);
    }
    // fp32 -> bf16
    else if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const float*)src, count);
    }
    // fp16 -> fp32
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp32
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat32) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((float*)dst, (const bf16_t*)src, count);
    }
    // fp16 -> bf16
    else if (src_dtype == turbomind::kFloat16 && dst_dtype == turbomind::kBfloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((bf16_t*)dst, (const half_t*)src, count);
    }
    // bf16 -> fp16
    else if (src_dtype == turbomind::kBfloat16 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((half_t*)dst, (const bf16_t*)src, count);
    }
}
```

Copilot AI Apr 27, 2026


invokeDtypeCast can launch a kernel with grid==0 when count==0 (CUDA launch error) and silently does nothing for unsupported dtype pairs (no else/check). Add an early return when count==0 and add a failure path (e.g., TM_CHECK/error) for unsupported (src_dtype, dst_dtype); also consider checking/propagating CUDA launch errors for easier debugging.
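A minimal sketch of those guards, reusing the types and dtype_cast_kernel from the snippet above; TM_CHECK (or whichever assertion facility turbomind prefers) could replace the plain runtime_error used here:

```cpp
#include <algorithm>
#include <stdexcept>
#include <cuda_runtime.h>

void invokeDtypeCast(
    void* dst, const void* src, size_t count, DataType dst_dtype, DataType src_dtype, cudaStream_t stream)
{
    if (count == 0) {
        return;  // grid would be 0; launching such a kernel is a CUDA error
    }

    const int block = 512;
    const int grid  = static_cast<int>(std::min((count + block - 1) / block, (size_t)8192));

    if (src_dtype == turbomind::kFloat32 && dst_dtype == turbomind::kFloat16) {
        dtype_cast_kernel<<<grid, block, 0, stream>>>((turbomind::half_t*)dst, (const float*)src, count);
    }
    // ... remaining supported (src_dtype, dst_dtype) pairs as in the original dispatch ...
    else {
        // Fail loudly instead of silently returning for unsupported pairs.
        throw std::runtime_error("invokeDtypeCast: unsupported dtype pair");
    }

    // Surface launch errors at the call site for easier debugging.
    if (cudaError_t err = cudaGetLastError(); err != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(err));
    }
}
```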

Comment on lines 6 to 8
```python
def test_ffn_reader_kind_none():
    """FFN readers must handle kind=None (returns filter list, not tensors).
```


Copilot AI Apr 27, 2026


test_ffn_reader_kind_none still depends on the legacy lmdeploy.turbomind.deploy.source_model.* reader classes, but those deploy/source_model modules are deleted in this PR, so this test will now fail with ImportError. Update the test to exercise the new spec/builder codepaths (or remove it if the old reader API is intentionally removed).

Comment on lines 3 to 6
```python
import torch

from lmdeploy.turbomind.deploy import converter
from lmdeploy.turbomind.deploy.parameter import QuantWeightOnly, pack_u4_row
from lmdeploy.turbomind.deploy.source_model.qwen import Qwen3_5ReaderMixin
```


Copilot AI Apr 27, 2026


This test still imports lmdeploy.turbomind.deploy.parameter and lmdeploy.turbomind.deploy.source_model.qwen, but the entire lmdeploy.turbomind.deploy package (including parameter.py and source_model/) is removed in this PR. As-is this file will fail on import; update it to the new module locations (or rewrite the test around the new weight-loading API).

Comment thread lmdeploy/archs.py Outdated
```python
        from lmdeploy.turbomind.supported_models import is_supported as is_supported_turbomind
        turbomind_has = is_supported_turbomind(model_path)
    except ImportError:
        raise
```

Copilot AI Apr 27, 2026


The except ImportError: block now immediately re-raises, making the subsequent fallback (is_turbomind_installed = False and the warning path) unreachable. Either remove the raise to preserve the intended fallback-to-pytorch behavior, or remove the dead code and let the ImportError propagate consistently.

Suggested change
```diff
-        raise
```

Comment thread lmdeploy/messages.py Outdated

```diff
     def __repr__(self):
-        return f'text={self.text!r}\n{self._format_none_text_fields()}'
+        return f'text={self.text}\n{self._format_none_text_fields()}'
```

Copilot AI Apr 27, 2026


Response.__repr__ no longer uses !r (and is now identical to __str__), which makes debugging/logging ambiguous (e.g., newlines/quotes in text are no longer escaped). Restore a proper __repr__ (e.g., use self.text!r and ideally include the class name) so repr(response) is unambiguous.

Suggested change
```diff
-        return f'text={self.text}\n{self._format_none_text_fields()}'
+        return f'{self.__class__.__name__}(text={self.text!r}\n{self._format_none_text_fields()})'
```

Comment on lines +8 to +15
```python
@pytest.fixture(autouse=True)
def _caplog_lmdeploy(caplog):
    caplog.set_level(logging.WARNING, logger='lmdeploy')
    logger = logging.getLogger('lmdeploy')
    logger.propagate = True
    yield
    logger.propagate = False
```


Copilot AI Apr 27, 2026


The autouse fixture forces logger.propagate to True then unconditionally sets it to False afterwards, which can leak logging configuration into other tests (it doesn't restore the prior value). Capture the original propagate value before mutating it and restore that value in the teardown.

- Move dequant/transform utilities from _base.py into linear.py as the
  canonical home for all Linear operations
- Unify _ensure_compatible_formats and dequant_mixed into a single
  dequant_mixed function that triggers on any format diversity
- Drop 'Spec' suffix from all turbomind model classes and files
  (TextModelSpec → TextModel, Qwen3TextSpec → Qwen3TextModel, etc.)
- Extract TextModelBuilder from _base.py into builders/text_model.py
- Move model-specific qk_norm from TextModel to Qwen3 and Qwen3.5
- Fix .gitignore typo (trubomind → turbomind)
@irexyc
Collaborator

irexyc commented May 6, 2026

For the /nvme4/huggingface_hub/hub/models--Qwen--Qwen3.5-2B/snapshots/15852e8c16360a2fea060d615a32b45270f8a8fc/ model, the results differ from those of the main branch.

this branch

>>> pipe('hello')
text=Hello! I am an AI assistant based on the **Qwen** model. I am **Qwen**3.17************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************

main branch

>>> pipe('hello')
text='Hello! How can I help you today?'

lzhangzz added 2 commits May 6, 2026 09:07
…anup

Align all turbomind source models with Qwen3 conventions:
- Drop engine_cfg from model signatures, wire data_type via Context
- Add Context, ParallelGroup, make_moe_config, make_mla_config helpers
- Collapse make_*_config functions by removing per-function data_type
- Remove dead fields from C++ configs (has_bias, hidden_dim, etc.)
- Remove _layer_pattern, _embed_key, _norm_key from all models
- Unify FFN padding with group-based pad/round_up helpers
- Add TP padding for block-quantized formats and GEMM K-alignment
- Remove dead code: _pad_1d, _norm, pad_in_dim, _softmax_scale
- Add InternVL3.5, InternLM2/3, Llama turbomind support
- Rename fused_moe to is_expert, align Python/C++ config fields
- Use direct HF config access, Transformers type hints, all-params loader
- Clean up imports, docstrings, formatting across all model files
The raise in archs.py was a debugging leftover. The repr in
messages.py needs !r to properly escape control characters.