Add Qwen3-Coder-Next contrib model (80B/3B hybrid DeltaNet+MoE) by jimburtoft · Pull Request #170 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-06-01T05:40:20Z

Summary

- Adds an NxD Inference implementation for Qwen/Qwen3-Coder-Next, a hybrid Gated DeltaNet + GQA + Sparse MoE model (80B total, ~3B active per token)
- Custom NKI kernels for the DeltaNet linear recurrence and for head_dim=256 flash attention
- Expert Parallelism (EP=1,2,4,8) validated, including a shared expert scaling fix
- vLLM serving integration included
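
For readers unfamiliar with the DeltaNet recurrence mentioned above, here is a minimal pure-Python sketch of one decode step of a gated delta rule, following the form commonly given in the Gated DeltaNet literature. It is not the NKI kernel from this PR. The function name `gated_delta_step`, the scalar gate `alpha`, the write strength `beta`, and the tiny dimensions are all illustrative assumptions. The point it illustrates is that the state is a fixed-size matrix updated once per token, so per-token decode cost does not grow with context length.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One decode step of a gated delta rule for a single head (illustrative).

    S is a d_v x d_k state matrix (list of rows); q and k have length d_k;
    v has length d_v. alpha in (0, 1] decays the old state, beta in [0, 1]
    scales the write. Keys are assumed to be unit-norm.
    """
    d_v, d_k = len(S), len(k)
    # Decay the old state.
    S = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # Read the value currently stored under key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: move the stored value toward v by a fraction beta.
    S = [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
         for i in range(d_v)]
    # Query the updated state.
    out = [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, out


# Hypothetical toy inputs: write v under key k, then overwrite it.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, out1 = gated_delta_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                            alpha=1.0, beta=1.0)
S2, out2 = gated_delta_step(S1, q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 1.0],
                            alpha=1.0, beta=1.0)
```

With `beta=1.0` the second write fully replaces the value stored under the same key, and the state keeps its 2 x 2 shape after every step.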

Model Architecture

Property	Value
Parameters	80B total, ~3B active/token
Layers	48 (36 DeltaNet + 12 GQA)
Experts	512 per layer, top-10 routing
Attention	head_dim=256, 16Q/2KV heads, partial RoPE (25%)
DeltaNet	Gated linear recurrence, head_dim=128, 16 QK / 32 V heads
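
A few quantities follow directly from the table above. The sketch below only does arithmetic on the listed values. The variable names are illustrative and are not taken from the model's config. It also says nothing about how the DeltaNet and GQA layers are interleaved, which the table does not state.

```python
# Values copied from the Model Architecture table.
NUM_LAYERS, DELTANET_LAYERS, GQA_LAYERS = 48, 36, 12
NUM_EXPERTS, TOP_K = 512, 10
HEAD_DIM, Q_HEADS, KV_HEADS = 256, 16, 2
ROPE_FRACTION = 0.25

# The two layer types should account for every layer.
assert DELTANET_LAYERS + GQA_LAYERS == NUM_LAYERS

q_per_kv = Q_HEADS // KV_HEADS                 # query heads sharing each KV head
rotary_dims = int(HEAD_DIM * ROPE_FRACTION)    # dims per head that receive RoPE
routed_fraction = TOP_K / NUM_EXPERTS          # share of routed experts used per token
deltanet_per_gqa = DELTANET_LAYERS // GQA_LAYERS  # DeltaNet layers per GQA layer

print(q_per_kv, rotary_dims, deltanet_per_gqa, f"{routed_fraction:.2%}")
```

So each KV head serves 8 query heads, 64 of the 256 head dimensions are rotated, and each token uses about 2% of the routed experts in a layer.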

Performance (trn2.48xlarge, TP=8, batch=1)

Metric	Value
Throughput	77 tok/s
TPOT	13.0 ms (constant, O(1) generation)
TTFT @ 128 tokens	1,235 ms
TTFT @ 512 tokens	3,471 ms
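
As a sanity check on the table above, the reported throughput is the reciprocal of TPOT, and the TTFT rows imply a prefill rate. The sketch below is plain arithmetic on the reported numbers. The helper `estimated_latency_ms` is hypothetical and assumes end-to-end latency is TTFT plus one TPOT for each token after the first, which the PR does not measure directly.

```python
TPOT_MS = 13.0
TTFT_MS = {128: 1235.0, 512: 3471.0}  # prompt length -> time to first token

# Decode throughput at batch=1 is the reciprocal of TPOT.
decode_tok_per_s = 1000.0 / TPOT_MS  # about 76.9, consistent with the reported 77

# Prefill rate implied by each TTFT measurement (prompt tokens per second).
prefill_tok_per_s = {n: n / (ms / 1000.0) for n, ms in TTFT_MS.items()}


def estimated_latency_ms(prompt_tokens, new_tokens):
    """Rough end-to-end estimate: TTFT, then one TPOT per additional token."""
    return TTFT_MS[prompt_tokens] + (new_tokens - 1) * TPOT_MS


print(round(decode_tok_per_s, 1))
print({n: round(r, 1) for n, r in prefill_tok_per_s.items()})
print(estimated_latency_ms(128, 100))
```

Under that assumption, a 128-token prompt followed by 100 generated tokens would take roughly 2.5 seconds.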

Validation

- 100% top-1 token match vs HF BF16 CPU reference (14/14 prompts, cosine similarity 0.9998)
- Expert Parallelism validated at EP=1,2,4,8 with identical accuracy
- Tested on SDK 2.30 (neuronx-cc 2.25.3371, NxDI 0.10.17970)
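
The two accuracy metrics quoted above can be computed along the following lines. This is a minimal sketch, not the code in `test/integration/test_model.py`. The function names are illustrative, and it assumes that top-1 match compares greedy token IDs position by position and that cosine similarity is taken between two logit vectors. The PR does not spell out either detail.

```python
import math


def top1_match_rate(ref_tokens, test_tokens):
    """Fraction of positions where the two greedy token sequences agree."""
    if len(ref_tokens) != len(test_tokens):
        raise ValueError("sequences must have the same length")
    matches = sum(r == t for r, t in zip(ref_tokens, test_tokens))
    return matches / len(ref_tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data standing in for reference and device outputs.
print(top1_match_rate([11, 42, 7], [11, 42, 7]))
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A per-prompt pass criterion such as "14/14 prompts" could then be expressed as requiring a match rate of 1.0 on every prompt.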

Files Added

contrib/models/Qwen3-Coder-Next/
├── README.md                           # Documentation & usage
├── src/
│   ├── modeling_qwen35_moe.py          # Model implementation (~3.7k lines)
│   ├── nki_deltanet.py                 # NKI kernel: DeltaNet recurrence
│   └── nki_flash_attn_d256_pipe.py     # NKI kernel: flash attention d=256
├── test/
│   └── integration/test_model.py       # Accuracy + performance tests
└── vllm/
    ├── register_model.py               # vLLM model registration
    ├── start_vllm_server.sh            # Server launch script
    └── test_vllm_client.py             # Client test script

Testing Instructions

# On trn2.48xlarge with SDK 2.30 DLAMI
source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
huggingface-cli download Qwen/Qwen3-Coder-Next --local-dir /mnt/models/Qwen3-Coder-Next/

MODEL_PATH=/mnt/models/Qwen3-Coder-Next \
COMPILED_PATH=/mnt/compiled_qwen3_test/ \
python test/integration/test_model.py

NxD Inference implementation for Qwen/Qwen3-Coder-Next, a hybrid Gated DeltaNet + GQA + Sparse MoE model (80B total, ~3B active/token).

Key features:
- Custom NKI kernels for DeltaNet recurrence and d=256 flash attention
- Expert Parallelism support (EP=1,2,4,8 validated)
- Shared expert EP scaling fix for correct world_group all-reduce
- vLLM integration for serving
- 77 tok/s at batch=1 on trn2.48xlarge (TP=8)
- 100% top-1 accuracy match vs HF CPU reference (14/14 prompts)
- Constant TPOT regardless of context length (DeltaNet O(1) generation)

Architecture: 48 layers (36 DeltaNet + 12 GQA), 512 experts top-10, head_dim=256, partial RoPE (25%), max context 1024 tokens.

Tested on: trn2.48xlarge, SDK 2.30, LNC=2

jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3-coder-next-pr
