Add ViT Attention Plugin Support for Qwen, Mllama, and SigLIP Visual Models#4241
micwill755 wants to merge 5 commits into TensorRT-Edge-LLM under vitAttentionKernels
Conversation
narendasan left a comment:
Are we able to use https://huggingface.co/docs/transformers/v5.5.0/en/serialization#exporting-to-production to avoid most of the patching on the model side?
narendasan left a comment:
How does the new plugin operator get inserted into the graph?
```python
position_ids = torch.arange(input_embeds.shape[1]).unsqueeze(0).to(device)

use_fp32_acc = False
use_explicit_typing = False
```
Enabled precision is deprecated in TRT 10.16 and will be removed in the next version, so we don't need this code path.
Will do. I'll clean this up by removing that code path.
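For reference, the strongly typed path would look roughly like this (the input shape and flag values below are illustrative assumptions, not the PR's actual settings):

```python
import torch
import torch_tensorrt

# Sketch: with use_explicit_typing=True the network is strongly typed and
# honors the module's own dtypes, so enabled_precisions is no longer needed.
trt_visual = torch_tensorrt.compile(
    visual_model,  # hypothetical ViT module
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    use_explicit_typing=True,  # replaces the deprecated enabled_precisions path
    use_fp32_acc=True,         # accumulate matmuls in FP32 for accuracy
)
```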
It follows the same pattern as the existing AttentionPlugin integration. At a high level, we insert a Torch custom op into the Dynamo graph by wrapping/replacing the model attention module. That custom op is only a graph marker on the PyTorch side. During Torch-TensorRT conversion, the registered converter lowers that marker to the real TensorRT plugin layer by looking up the plugin creator and calling add_plugin_v2.
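Roughly, the pattern looks like this; the op name, plugin name, and signatures below are illustrative placeholders, not the exact ones in the PR:

```python
import torch
import tensorrt as trt
from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter

# Hypothetical graph-marker op: a stand-in on the PyTorch side that the
# converter below recognizes and lowers to the TensorRT plugin.
@torch.library.custom_op("vit::attention", mutates_args=())
def vit_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Eager fallback so the module still runs outside TensorRT.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

@vit_attention.register_fake
def _(q, k, v):
    # Shape/dtype propagation for tracing.
    return torch.empty_like(q)

@dynamo_tensorrt_converter(torch.ops.vit.attention.default)
def convert_vit_attention(ctx, target, args, kwargs, name):
    # Look up the plugin creator registered by the plugin library and
    # replace the marker op with the real plugin layer.
    registry = trt.get_plugin_registry()
    creator = registry.get_plugin_creator("ViTAttentionPlugin", "1", "")
    plugin = creator.create_plugin(name, trt.PluginFieldCollection([]))
    layer = ctx.net.add_plugin_v2(list(args), plugin)
    return layer.get_output(0)
```

The model-side wrapping then amounts to swapping the attention module's forward to call `torch.ops.vit.attention` so Dynamo captures the marker in the graph.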
… compile the vision tower through the ViT plugin path, compile the language model separately, insert both back into the VLM structure, and generation succeeds with sensible output. We also verified the vision path at several levels: the reconstructed PyTorch visual model matches the direct HF visual model, the individual attention/plugin checks pass, and semantic generation is mostly aligned, with small FP16 drift.
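For illustration, the visual parity check is along these lines (the call signature and tolerance are illustrative assumptions, not the exact test code):

```python
import torch

@torch.no_grad()
def check_visual_parity(hf_visual, reconstructed_visual, pixel_values, grid_thw):
    # Run both visual towers on the same pixel inputs and compare outputs;
    # the 0.999 cosine threshold is an assumed bound for FP16 drift.
    ref = hf_visual(pixel_values, grid_thw).float()
    out = reconstructed_visual(pixel_values, grid_thw).float()
    max_err = (ref - out).abs().max().item()
    cos = torch.nn.functional.cosine_similarity(
        ref.flatten(), out.flatten(), dim=0
    ).item()
    print(f"max abs err: {max_err:.4e}, cosine: {cos:.6f}")
    assert cos > 0.999, "visual towers diverge beyond expected FP16 drift"
```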
Summary
This PR adds ViT attention plugin integration and validation support to the TensorRT Dynamo examples/tooling path. It wires ViTAttentionPlugin conversion through the Torch-TensorRT/Dynamo flow, supports Qwen-style packed/windowed attention metadata via cu_seqlens and max_seq_len, and adds end-to-end visual model validation for Qwen2.5-VL, Llama 3.2 Vision/Mllama, and GR00T/Eagle/SigLIP-style models.
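For context, `cu_seqlens` follows the cumulative-sequence-length (varlen) convention used by Qwen-style packed attention; a minimal sketch of building that metadata (the segment lengths are made-up values):

```python
import torch

# Per-window patch counts; made-up values for illustration.
seq_lens = torch.tensor([64, 128, 256])

# Prefix-sum offsets: cu_seqlens[i] is where segment i starts in the packed
# sequence, so cu_seqlens == [0, 64, 192, 448] here.
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
max_seq_len = int(seq_lens.max())

# The plugin uses these offsets to restrict attention so tokens only attend
# within their own window/segment.
```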
Changes
Testing