
Add Qwen3-Coder-Next contrib model (80B/3B hybrid DeltaNet+MoE)#170

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3-coder-next-pr

Summary

  • Adds NxD Inference implementation for Qwen/Qwen3-Coder-Next, a hybrid Gated DeltaNet + GQA + Sparse MoE model (80B total, ~3B active per token)
  • Custom NKI kernels for DeltaNet linear recurrence and head_dim=256 flash attention
  • Expert Parallelism (EP=1,2,4,8) validated with shared expert scaling fix
  • vLLM serving integration included
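
The Expert Parallelism bullet can be illustrated with a small sketch. It shows how 512 routed experts would divide across the validated EP degrees, and one plausible reading of the shared expert scaling fix. The assumption (not confirmed by the PR text) is that the shared expert is replicated on every EP rank, so a sum all-reduce would count its output EP times unless each rank scales by 1/EP. All function names here are hypothetical and do not come from `modeling_qwen35_moe.py`.

```python
# Hypothetical illustration; not taken from modeling_qwen35_moe.py.
NUM_EXPERTS = 512          # routed experts per layer (from the PR description)
EP_DEGREES = (1, 2, 4, 8)  # expert-parallel degrees the PR reports as validated


def experts_per_rank(num_experts: int, ep: int) -> int:
    """Routed experts held by each EP rank, assuming an even split."""
    if num_experts % ep != 0:
        raise ValueError("num_experts must be divisible by ep")
    return num_experts // ep


def all_reduce_sum(per_rank_values):
    """Stand-in for a sum all-reduce: every rank ends up with the total."""
    return sum(per_rank_values)


def shared_expert_after_reduce(shared_out: float, ep: int, scale: bool) -> float:
    """ASSUMPTION: the shared expert is replicated on every EP rank, so a sum
    all-reduce adds its output `ep` times unless each rank pre-scales by 1/ep."""
    local = shared_out / ep if scale else shared_out
    return all_reduce_sum([local] * ep)


for ep in EP_DEGREES:
    print(
        f"EP={ep}: {experts_per_rank(NUM_EXPERTS, ep)} experts/rank, "
        f"unscaled shared output={shared_expert_after_reduce(1.0, ep, scale=False)}, "
        f"scaled={shared_expert_after_reduce(1.0, ep, scale=True)}"
    )
```

Under that assumption, the unscaled shared output grows linearly with the EP degree while the scaled one stays at the single-rank value, which would be consistent with the identical accuracy reported across EP=1,2,4,8.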

Model Architecture

| Property | Value |
| --- | --- |
| Parameters | 80B total, ~3B active/token |
| Layers | 48 (36 DeltaNet + 12 GQA) |
| Experts | 512 per layer, top-10 routing |
| Attention | head_dim=256, 16Q/2KV heads, partial RoPE (25%) |
| DeltaNet | Gated linear recurrence, head_dim=128, 16 QK / 32 V heads |
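
To make the "gated linear recurrence" entry concrete, here is a minimal pure-Python sketch of one decode step for a single head. It assumes the gated delta rule as commonly written, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t, and a unit-norm key. It is a reference illustration only, not the NKI kernel in `nki_deltanet.py`, and the function name is hypothetical.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step for a single head (illustrative, not the NKI kernel).

    S     : d_v x d_k state matrix (list of lists), fixed size for any context
    q, k  : length-d_k query and key vectors (k assumed unit-norm)
    v     : length-d_v value vector
    alpha : scalar decay gate in (0, 1]
    beta  : scalar write strength in [0, 1]
    """
    d_v, d_k = len(S), len(S[0])
    # u = S k: the value the current state associates with key k.
    u = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase beta of the old association, decay by alpha, then write beta * v k^T.
    S_new = [
        [alpha * (S[i][j] - beta * u[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new q.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o


# Tiny usage example with d_k = d_v = 2.
S = [[0.0, 0.0], [0.0, 0.0]]
S, o1 = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
S, o2 = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
print(o1, o2)
```

Each step touches only the fixed-size state `S`, never the earlier tokens, so per-token cost in these layers does not grow with context length. With `beta=1.0`, the second write to the same key replaces the stored value rather than accumulating on top of it.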

Performance (trn2.48xlarge, TP=8, batch=1)

| Metric | Value |
| --- | --- |
| Throughput | 77 tok/s |
| TPOT | 13.0 ms (constant, O(1) generation) |
| TTFT @ 128 tokens | 1,235 ms |
| TTFT @ 512 tokens | 3,471 ms |
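
The throughput and TPOT figures are two views of the same measurement, which the short sketch below checks. It also gives a rough end-to-end latency estimate, assuming (this is an assumption, not something the PR states) that TTFT covers the first generated token and every later token costs one TPOT. The helper names are hypothetical.

```python
# Figures copied from the performance table (trn2.48xlarge, TP=8, batch=1).
TPOT_MS = 13.0
TTFT_MS = {128: 1235.0, 512: 3471.0}  # prompt length in tokens -> TTFT in ms


def tokens_per_second(tpot_ms: float) -> float:
    """Decode throughput implied by a per-token latency."""
    return 1000.0 / tpot_ms


def estimated_latency_ms(prompt_tokens: int, new_tokens: int) -> float:
    """ASSUMPTION: TTFT yields the first token; each further token costs one TPOT."""
    return TTFT_MS[prompt_tokens] + (new_tokens - 1) * TPOT_MS


print(f"{tokens_per_second(TPOT_MS):.1f} tok/s")       # about 76.9, reported as 77
print(f"{estimated_latency_ms(128, 100):.0f} ms")      # 1235 + 99 * 13.0
```

The estimate only applies within the tested envelope, since the PR reports a maximum context of 1024 tokens.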

Validation

  • 100% top-1 token match vs HF BF16 CPU reference (14/14 prompts, cosine similarity 0.9998)
  • Expert Parallelism validated at EP=1,2,4,8 with identical accuracy
  • Tested on SDK 2.30 (neuronx-cc 2.25.3371, NxDI 0.10.17970)
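
The two accuracy metrics named above can be computed along these lines. This is a minimal sketch over plain Python lists with made-up logits; the actual logic in `test/integration/test_model.py` is not shown in this PR description and may differ, and the function names are hypothetical.

```python
import math


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1_match_rate(ref_logits, test_logits):
    """Fraction of positions where both models pick the same top-1 token."""
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return matches / len(ref_logits)


# Made-up logits for two positions over a 3-token vocabulary.
reference = [[0.10, 2.00, -1.00], [3.00, 0.50, 0.20]]   # e.g. CPU reference
candidate = [[0.12, 1.98, -1.01], [2.90, 0.60, 0.20]]   # e.g. device output

print(top1_match_rate(reference, candidate))
print(min(cosine_similarity(r, c) for r, c in zip(reference, candidate)))
```

Top-1 match checks that greedy decoding would choose the same token, while cosine similarity measures how closely the full logit vectors agree, so the two together catch both decision flips and broader numerical drift.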

Files Added

contrib/models/Qwen3-Coder-Next/
├── README.md                           # Documentation & usage
├── src/
│   ├── modeling_qwen35_moe.py          # Model implementation (~3.7k lines)
│   ├── nki_deltanet.py                 # NKI kernel: DeltaNet recurrence
│   └── nki_flash_attn_d256_pipe.py     # NKI kernel: flash attention d=256
├── test/
│   └── integration/test_model.py       # Accuracy + performance tests
└── vllm/
    ├── register_model.py               # vLLM model registration
    ├── start_vllm_server.sh            # Server launch script
    └── test_vllm_client.py             # Client test script

Testing Instructions

# On trn2.48xlarge with SDK 2.30 DLAMI
source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
huggingface-cli download Qwen/Qwen3-Coder-Next --local-dir /mnt/models/Qwen3-Coder-Next/

MODEL_PATH=/mnt/models/Qwen3-Coder-Next \
COMPILED_PATH=/mnt/compiled_qwen3_test/ \
python test/integration/test_model.py

NxD Inference implementation for Qwen/Qwen3-Coder-Next, a hybrid
Gated DeltaNet + GQA + Sparse MoE model (80B total, ~3B active/token).

Key features:
- Custom NKI kernels for DeltaNet recurrence and d=256 flash attention
- Expert Parallelism support (EP=1,2,4,8 validated)
- Shared expert EP scaling fix for correct world_group all-reduce
- vLLM integration for serving
- 77 tok/s at batch=1 on trn2.48xlarge (TP=8)
- 100% top-1 accuracy match vs HF CPU reference (14/14 prompts)
- Constant TPOT regardless of context length (DeltaNet O(1) generation)

Architecture: 48 layers (36 DeltaNet + 12 GQA), 512 experts top-10,
head_dim=256, partial RoPE (25%), max context 1024 tokens.

Tested on: trn2.48xlarge, SDK 2.30, LNC=2