Qwen3 235B A22B
Released April 29, 2025
About
Qwen3-235B-A22B is the flagship model in Alibaba's Qwen3 series: a 235-billion-parameter Mixture-of-Experts (MoE) model that activates only 22 billion parameters per forward pass. Released April 29, 2025 under Apache 2.0, it is Alibaba's largest open-weight model to date, competing directly with DeepSeek-R1, OpenAI's o1/o3-mini, Grok-3, and Gemini-2.5-Pro on reasoning benchmarks.
Why it matters
Qwen3-235B-A22B introduced hybrid reasoning to the Qwen series — the ability to seamlessly switch between a deep “thinking” mode for complex multi-step problems and a fast “non-thinking” mode for general conversation, all within a single model. This eliminates the need to route between separate chat and reasoning models (like switching between GPT-4o and o1), controlled via /think and /no_think commands or the enable_thinking parameter.
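As a sketch of how this looks in practice, the mode switch can be set per request when serving the model behind an OpenAI-compatible endpoint. The `chat_template_kwargs` field shown here is how servers such as vLLM forward `enable_thinking` to the chat template; the routing choice between prompts is purely illustrative:

```python
# Sketch: building chat requests that toggle Qwen3's thinking mode.
# Assumes an OpenAI-compatible server that forwards chat_template_kwargs
# to the tokenizer's chat template (as vLLM does); the example prompts
# and the caller's routing decision are illustrative.

def build_request(prompt: str, deep_reasoning: bool) -> dict:
    """Return a chat-completion payload with thinking mode set explicitly."""
    return {
        "model": "Qwen/Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": prompt}],
        # False skips the internal <think>...</think> phase for fast replies.
        "chat_template_kwargs": {"enable_thinking": deep_reasoning},
    }

fast = build_request("Summarize this paragraph.", deep_reasoning=False)
deep = build_request("Prove there are infinitely many primes.", deep_reasoning=True)
```

Appending /no_think inside a user message achieves the same effect for a single turn without touching the payload.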
The smaller MoE sibling, Qwen3-30B-A3B, outperforms QwQ-32B despite using 10x fewer activated parameters. Even the tiny Qwen3-4B rivals Qwen2.5-72B-Instruct performance — suggesting the training pipeline improvements matter more than raw scale.
Architecture
The model uses a sparse MoE architecture with 128 total experts, of which 8 are activated per token by the top-k router. Despite 235B total parameters, only ~22B are active during inference, making it substantially more efficient than a dense model of equivalent capability.
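The routing step can be illustrated with a minimal sketch (pure Python, standard library only). Real MoE layers run this per token over learned gate logits; the logits below are made up:

```python
import math

def route_topk(gate_logits: list[float], k: int = 8) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Mirrors top-k MoE routing: of 128 experts, only k contribute to a
    token's output, so only their parameters are touched at inference.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i])[-k:]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return {i: w / total for i, w in zip(top, exps)}

# 128 dummy gate logits; exactly 8 experts receive all the routing weight
weights = route_topk([math.sin(i) for i in range(128)], k=8)
```

The returned weights sum to 1 and scale each selected expert's output before the results are combined.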
| Spec | Value |
|---|---|
| Total Parameters | 235B |
| Active Parameters | 22B |
| Architecture | Qwen3MoeForCausalLM |
| Layers | 94 |
| Attention Heads (Q/KV) | 64 / 4 |
| Experts (Total / Active) | 128 / 8 |
| Hidden Dimension | 10,240 |
| Context Length | 128K tokens (32K default, YaRN-extended) |
| Vocabulary | 152K tokens |
| Normalization | RMSNorm |
| Position Encoding | RoPE with YaRN extension |
| Activation | SwiGLU |
| License | Apache 2.0 |
YaRN (Yet another RoPE extensioN) enables efficient context window extension from the 32K training length to 128K — requiring 10x fewer tokens and 2.5x fewer training steps than previous approaches.
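In practice, the Qwen3 model card documents enabling YaRN by adding a rope_scaling block to the model's config.json; a snippet along these lines (a factor of 4.0 takes the 32K training length to 128K):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```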
Training
Qwen3 was pretrained on a substantially larger and more diverse dataset than Qwen2.5, expanding multilingual support from 29 to 119 languages and dialects. The training pipeline includes four stages:
- Large-scale pretraining on broad web data for general knowledge
- Reasoning-focused pretraining on math, code, and logic-heavy corpora
- Thinking mode SFT — supervised fine-tuning on chain-of-thought reasoning data
- Reinforcement learning combining thinking and non-thinking modes into a unified model
The RL phase is what enables the hybrid mode switching — the model learns when to engage deep reasoning versus when to respond directly.
Benchmarks
Alibaba reports competitive performance against frontier models. The 235B MoE flagship matches or exceeds DeepSeek-R1 and o1 on coding, math, and general reasoning tasks, while the smaller Qwen3-30B-A3B punches well above its weight class.
Key benchmark context from the Open LLM Leaderboard and internal evaluations:
- Mathematics: Strong performance on MATH, GSM8K, and competition-level problems
- Code generation: Competitive on HumanEval, MBPP, and LiveCodeBench
- Reasoning: State-of-the-art among open models on multi-step logical deduction
- Multilingual: 119 languages with improved cross-lingual instruction following and translation
Deployment
Hardware requirements
Running the full model requires significant hardware. Community benchmarks provide guidance:
- Ollama Q4_K_M quantization: ~142 GB (the most common deployment format)
- Mac Studio M3 Ultra (512 GB RAM): 16 tok/s (GGUF) or 24 tok/s (MLX) with 4-bit quant
- LM Studio minimum: ~134 GB system memory
- FriendliAI benchmarks: 3× faster inference than standard vLLM, with 50%+ GPU cost reduction via online 4-bit/8-bit quantization
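Those figures can be sanity-checked with back-of-the-envelope arithmetic; this sketch covers weights only, and the gap to the observed totals is explained by quantizer overhead, the KV cache, and runtime buffers:

```python
# Rough weight-memory estimate for 235B parameters at 4-bit precision.
total_params = 235e9
bits_per_param = 4  # plain 4-bit; Q4_K_M stores roughly 4.5-5 bits effective
weight_gib = total_params * bits_per_param / 8 / 2**30

# ~109 GiB for the weights alone; scales, KV cache, and runtime overhead
# account for the ~134-142 GB seen in practice.
print(f"{weight_gib:.0f} GiB")
```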
Ollama
```shell
ollama run qwen3:235b-a22b
```
The Ollama build has 22.6M downloads and supports native tool calling and thinking mode.
vLLM / SGLang
```shell
# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-235B-A22B

# SGLang (recommended by the Qwen team)
python -m sglang.launch_server --model Qwen/Qwen3-235B-A22B
```
LM Studio
Available as GGUF and MLX quantizations. The 4-bit MLX variant runs on Apple Silicon Macs with sufficient memory, achieving ~24 tokens/second on M3 Ultra.
API providers
Available through OpenRouter ($0.455/M input, $1.82/M output), Together AI, Fireworks, SiliconFlow, Alibaba Cloud Model Studio, and NVIDIA NIM (now deprecated).
Tool calling and agents
Qwen3 has strong native tool calling support, achieving leading performance among open models on complex agent tasks. The Qwen-Agent framework provides:
- Function calling compatible with OpenAI function calling format
- MCP (Model Context Protocol) support
- Code interpreter for executable code generation
- RAG capabilities
- Browser extension for web interaction
Tool calling works in both thinking and non-thinking modes — the model can reason through complex tool use chains while maintaining structured output.
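A minimal sketch of a tool declaration in the OpenAI function-calling format the model consumes; the get_weather function and its schema are hypothetical:

```python
# Hypothetical tool schema in OpenAI function-calling format; Qwen3 and
# Qwen-Agent accept tools declared this way in both thinking modes.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Passed alongside the chat request, e.g. {"tools": [weather_tool], ...};
# the model then replies with a structured tool call naming the function
# and JSON arguments instead of free text.
```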
Thinking budget
A key innovation is the thinking budget mechanism, allowing users to control how much computation the model spends reasoning. The thinking mode supports up to 38K tokens of internal reasoning, and users can:
- Set `enable_thinking=True/False` globally
- Use `/think` or `/no_think` inline commands
- Allocate token budgets per query based on task complexity
This allows fine-grained latency/quality tradeoffs in production — simple queries get instant responses while complex problems get deep reasoning.
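One way to operationalize per-query budgets, sketched with an entirely made-up heuristic; the tier names and cutoffs are assumptions, and only the 38K ceiling comes from the text above:

```python
# Illustrative per-query thinking-budget allocator. The complexity tiers
# are invented for this sketch; 38_000 approximates the 38K reasoning cap.
MAX_THINKING_TOKENS = 38_000

BUDGETS = {
    "trivial": 0,          # answer directly, no thinking phase
    "moderate": 4_096,
    "hard": 16_384,
    "extreme": MAX_THINKING_TOKENS,
}

def thinking_budget(complexity: str) -> int:
    """Map a task-complexity label to a reasoning-token budget."""
    return min(BUDGETS.get(complexity, 4_096), MAX_THINKING_TOKENS)
```

A caller would pass the result as the generation limit for the thinking phase, so trivial queries skip reasoning entirely while hard ones get a deep budget.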
The Qwen3 family
The 235B-A22B is the flagship, but Qwen3 is a full model family:
| Model | Type | Parameters | Active | Context |
|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K |
| Qwen3-4B | Dense | 4B | 4B | 32K |
| Qwen3-8B | Dense | 8B | 8B | 128K |
| Qwen3-14B | Dense | 14B | 14B | 128K |
| Qwen3-32B | Dense | 32B | 32B | 128K |
| Qwen3-30B-A3B | MoE | 30B | 3B | 128K |
| Qwen3-235B-A22B | MoE | 235B | 22B | 128K |
All models are released under Apache 2.0 and available on Hugging Face, ModelScope, Kaggle, Ollama, and LM Studio.
Ecosystem
Broad framework support from the community:
- Fine-tuning: Unsloth, Llama-Factory, Axolotl, Swift, XTuner, Peft
- Quantization: AutoGPTQ, AutoAWQ, Neural Compressor
- Deployment: vLLM, SGLang, TensorRT-LLM, OpenVINO, TGI
- Local inference: Ollama, LM Studio, Jan, llama.cpp, MLX
- Agents/RAG: Qwen-Agent, Dify, LlamaIndex, CrewAI
- API providers: OpenRouter, Together, Fireworks, SiliconFlow, Alibaba Cloud
References
- Hugging Face: huggingface.co/Qwen/Qwen3-235B-A22B
- GitHub: github.com/QwenLM/Qwen3
- YaRN paper: arxiv.org/abs/2309.00071
- Qwen3 technical report: arxiv.org/abs/2505.09388
- Blog: qwenlm.github.io/blog/qwen3/
- Docs: qwen.readthedocs.io/en/latest/
- Grokipedia: grokipedia.com/page/Qwen3-235B-A22B