Qwen3 235B A22B
Released April 29, 2025
About
Qwen3-235B-A22B is the flagship model in Alibaba's Qwen3 series: a 235-billion-parameter Mixture-of-Experts (MoE) model that activates only 22 billion parameters per forward pass. Released April 29, 2025 under Apache 2.0, it is Alibaba's largest open-weight model to date, competing directly with DeepSeek-R1, OpenAI's o1/o3-mini, Grok-3, and Gemini-2.5-Pro on reasoning benchmarks.
Why it matters
Qwen3-235B-A22B introduced hybrid reasoning to the Qwen series — the ability to seamlessly switch between a deep “thinking” mode for complex multi-step problems and a fast “non-thinking” mode for general conversation, all within a single model. This eliminates the need to route between separate chat and reasoning models (like switching between GPT-4o and o1), controlled via /think and /no_think commands or the enable_thinking parameter.
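As a sketch of how this looks in practice, the mode switch can be set per request when serving the model behind an OpenAI-compatible endpoint. The `chat_template_kwargs` field shown here is how servers such as vLLM forward `enable_thinking` to the chat template; the routing choice between prompts is purely illustrative:

```python
# Sketch: building chat requests that toggle Qwen3's thinking mode.
# Assumes an OpenAI-compatible server that forwards chat_template_kwargs
# to the tokenizer's chat template (as vLLM does); the example prompts
# and the caller's routing decision are illustrative.

def build_request(prompt: str, deep_reasoning: bool) -> dict:
    """Return a chat-completion payload with thinking mode set explicitly."""
    return {
        "model": "Qwen/Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": prompt}],
        # False skips the internal <think>...</think> phase for fast replies.
        "chat_template_kwargs": {"enable_thinking": deep_reasoning},
    }

fast = build_request("Summarize this paragraph.", deep_reasoning=False)
deep = build_request("Prove there are infinitely many primes.", deep_reasoning=True)
```

Appending /no_think inside a user message achieves the same effect for a single turn without touching the payload.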
The smaller MoE sibling, Qwen3-30B-A3B, outperforms QwQ-32B despite using 10x fewer activated parameters. Even the tiny Qwen3-4B rivals Qwen2.5-72B-Instruct performance — suggesting the training pipeline improvements matter more than raw scale.
Architecture
The model uses a sparse MoE architecture with 128 total experts, of which 8 are activated per token by the top-k router. Despite 235B total parameters, only ~22B are active during inference, making it substantially more efficient than a dense model of equivalent capability.
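The routing step can be illustrated with a minimal sketch (pure Python, standard library only). Real MoE layers run this per token over learned gate logits; the logits below are made up:

```python
import math

def route_topk(gate_logits: list[float], k: int = 8) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Mirrors top-k MoE routing: of 128 experts, only k contribute to a
    token's output, so only their parameters are touched at inference.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i])[-k:]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return {i: w / total for i, w in zip(top, exps)}

# 128 dummy gate logits; exactly 8 experts receive all the routing weight
weights = route_topk([math.sin(i) for i in range(128)], k=8)
```

The returned weights sum to 1 and scale each selected expert's output before the results are combined.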
| Spec | Value |
|---|---|
| Total Parameters | 235B |
| Active Parameters | 22B |
| Architecture | Qwen3MoeForCausalLM |
| Layers | 94 |
| Attention Heads (Q/KV) | 64 / 4 |
| Experts (Total / Active) | 128 / 8 |
| Hidden Dimension | 10,240 |
| Context Length | 128K tokens (32K default, YaRN-extended) |
| Vocabulary | 152K tokens |
| Normalization | RMSNorm |
| Position Encoding | RoPE with YaRN extension |
| Activation | SwiGLU |
| License | Apache 2.0 |
YaRN (Yet another RoPE extensioN) enables efficient context window extension from the 32K training length to 128K — requiring 10x fewer tokens and 2.5x fewer training steps than previous approaches.
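In practice, the Qwen3 model card documents enabling YaRN by adding a rope_scaling block to the model's config.json; a snippet along these lines (a factor of 4.0 takes the 32K training length to 128K):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```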
Training
Qwen3 was pretrained on a substantially larger and more diverse dataset than Qwen2.5, expanding multilingual support from 29 to 119 languages and dialects. The training pipeline includes four stages:
- Large-scale pretraining on broad web data for general knowledge
- Reasoning-focused pretraining on math, code, and logic-heavy corpora
- Thinking mode SFT — supervised fine-tuning on chain-of-thought reasoning data
- Reinforcement learning combining thinking and non-thinking modes into a unified model
The RL phase is what enables the hybrid mode switching — the model learns when to engage deep reasoning versus when to respond directly.
Benchmarks
Alibaba reports competitive performance against frontier models. The 235B MoE flagship matches or exceeds DeepSeek-R1 and o1 on coding, math, and general reasoning tasks, while the smaller Qwen3-30B-A3B punches well above its weight class.
Key benchmark context from the Open LLM Leaderboard and internal evaluations:
- Mathematics: Strong performance on MATH, GSM8K, and competition-level problems
- Code generation: Competitive on HumanEval, MBPP, and LiveCodeBench
- Reasoning: State-of-the-art among open models on multi-step logical deduction
- Multilingual: 119 languages with improved cross-lingual instruction following and translation
Deployment
Hardware requirements
Running the full model requires significant hardware. Community benchmarks provide guidance:
- Ollama Q4_K_M quantization: ~142 GB (the most common deployment format)
- Mac Studio M3 Ultra (512 GB RAM): 16 tok/s (GGUF) or 24 tok/s (MLX) with 4-bit quant
- LM Studio minimum: ~134 GB system memory
- FriendliAI benchmarks: 3× faster inference than standard vLLM, with 50%+ GPU cost reduction via online 4-bit/8-bit quantization
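Those figures can be sanity-checked with back-of-the-envelope arithmetic; this sketch covers weights only, and the gap to the observed totals is explained by quantizer overhead, the KV cache, and runtime buffers:

```python
# Rough weight-memory estimate for 235B parameters at 4-bit precision.
total_params = 235e9
bits_per_param = 4  # plain 4-bit; Q4_K_M stores roughly 4.5-5 bits effective
weight_gib = total_params * bits_per_param / 8 / 2**30

# ~109 GiB for the weights alone; scales, KV cache, and runtime overhead
# account for the ~134-142 GB seen in practice.
print(f"{weight_gib:.0f} GiB")
```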
Ollama
```shell
ollama run qwen3:235b-a22b
```
The Ollama build has 22.6M downloads and supports native tool calling and thinking mode.
vLLM / SGLang
```shell
# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-235B-A22B

# SGLang (recommended by the Qwen team)
python -m sglang.launch_server --model Qwen/Qwen3-235B-A22B
```
LM Studio
Available as GGUF and MLX quantizations. The 4-bit MLX variant runs on Apple Silicon Macs with sufficient memory, achieving ~24 tokens/second on M3 Ultra.
API providers
Available through OpenRouter ($0.455/M input, $1.82/M output), Together AI, Fireworks, SiliconFlow, Alibaba Cloud Model Studio, and NVIDIA NIM (now deprecated).
Tool calling and agents
Qwen3 has strong native tool calling support, achieving leading performance among open models on complex agent tasks. The Qwen-Agent framework provides:
- Function calling compatible with OpenAI function calling format
- MCP (Model Context Protocol) support
- Code interpreter for executable code generation
- RAG capabilities
- Browser extension for web interaction
Tool calling works in both thinking and non-thinking modes — the model can reason through complex tool use chains while maintaining structured output.
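A minimal sketch of a tool declaration in the OpenAI function-calling format the model consumes; the get_weather function and its schema are hypothetical:

```python
# Hypothetical tool schema in OpenAI function-calling format; Qwen3 and
# Qwen-Agent accept tools declared this way in both thinking modes.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Passed alongside the chat request, e.g. {"tools": [weather_tool], ...};
# the model then replies with a structured tool call naming the function
# and JSON arguments instead of free text.
```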
Thinking budget
A key innovation is the thinking budget mechanism, allowing users to control how much computation the model spends reasoning. The thinking mode supports up to 38K tokens of internal reasoning, and users can:
- Set `enable_thinking=True/False` globally
- Use `/think` or `/no_think` inline commands
- Allocate token budgets per query based on task complexity
This allows fine-grained latency/quality tradeoffs in production — simple queries get instant responses while complex problems get deep reasoning.
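One way to operationalize per-query budgets, sketched with an entirely made-up heuristic; the tier names and cutoffs are assumptions, and only the 38K ceiling comes from the text above:

```python
# Illustrative per-query thinking-budget allocator. The complexity tiers
# are invented for this sketch; 38_000 approximates the 38K reasoning cap.
MAX_THINKING_TOKENS = 38_000

BUDGETS = {
    "trivial": 0,          # answer directly, no thinking phase
    "moderate": 4_096,
    "hard": 16_384,
    "extreme": MAX_THINKING_TOKENS,
}

def thinking_budget(complexity: str) -> int:
    """Map a task-complexity label to a reasoning-token budget."""
    return min(BUDGETS.get(complexity, 4_096), MAX_THINKING_TOKENS)
```

A caller would pass the result as the generation limit for the thinking phase, so trivial queries skip reasoning entirely while hard ones get a deep budget.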
The Qwen3 family
The 235B-A22B is the flagship, but Qwen3 is a full model family:
| Model | Type | Parameters | Active | Context |
|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K |
| Qwen3-4B | Dense | 4B | 4B | 32K |
| Qwen3-8B | Dense | 8B | 8B | 128K |
| Qwen3-14B | Dense | 14B | 14B | 128K |
| Qwen3-32B | Dense | 32B | 32B | 128K |
| Qwen3-30B-A3B | MoE | 30B | 3B | 128K |
| Qwen3-235B-A22B | MoE | 235B | 22B | 128K |
All models are released under Apache 2.0 and available on Hugging Face, ModelScope, Kaggle, Ollama, and LM Studio.
Ecosystem
Broad framework support from the community:
- Fine-tuning: Unsloth, Llama-Factory, Axolotl, Swift, XTuner, Peft
- Quantization: AutoGPTQ, AutoAWQ, Neural Compressor
- Deployment: vLLM, SGLang, TensorRT-LLM, OpenVINO, TGI
- Local inference: Ollama, LM Studio, Jan, llama.cpp, MLX
- Agents/RAG: Qwen-Agent, Dify, LlamaIndex, CrewAI
- API providers: OpenRouter, Together, Fireworks, SiliconFlow, Alibaba Cloud
References
- Hugging Face: huggingface.co/Qwen/Qwen3-235B-A22B
- GitHub: github.com/QwenLM/Qwen3
- YaRN paper: arxiv.org/abs/2309.00071
- Qwen3 technical report: arxiv.org/abs/2505.09388
- Blog: qwenlm.github.io/blog/qwen3/
- Docs: qwen.readthedocs.io/en/latest/
- Grokipedia: grokipedia.com/page/Qwen3-235B-A22B