Qwen2.5 7B Instruct
Released September 19, 2024
Benchmarks
Open LLM Leaderboard v2
Average: 35.20
About
Qwen2.5-7B-Instruct is a 7 billion parameter instruction-tuned language model from Alibaba’s Qwen team. It is part of the Qwen2.5 family — one of the largest coordinated open source model releases in history, spanning sizes from 0.5B to 72B parameters.
What makes it notable
The 7B Instruct model strikes a strong balance between capability and resource efficiency. Pretrained on 18 trillion tokens, Qwen2.5 models represent a significant leap over Qwen2 in knowledge (MMLU 85+), coding (HumanEval 85+), and mathematics (MATH 80+). Those benchmark figures are for the 72B variant; the 7B model shows correspondingly smaller but consistent gains.
Key improvements over Qwen2:
- 128K context window with up to 8K token generation, using YaRN positional encoding for efficient long-context handling
- Structured data comprehension — improved table understanding and structured output generation
- Reliable JSON output — critical for tool calling and API integration
- Diverse system prompt support — better role-play and condition-setting for chatbots
- 29+ languages including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic
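The "reliable JSON output" point matters most when you parse replies programmatically. As a minimal sketch (the `parse_json_reply` helper is hypothetical, not part of Qwen or any library), one common pattern is to strip the markdown fences a model sometimes wraps around its JSON before parsing:

```python
import json

def parse_json_reply(text: str) -> dict:
    """Parse a model reply expected to contain JSON, stripping the
    optional markdown code fences models sometimes emit around it."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line and the trailing fence.
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)

# A reply shaped like what a "respond in JSON" instruction tends to produce:
reply = '```json\n{"city": "Paris", "landmark": "Eiffel Tower"}\n```'
print(parse_json_reply(reply))
```

If a reply still fails to parse, re-prompting with the parse error appended is a common fallback.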
Architecture
Dense decoder-only transformer using the Qwen2 architecture. Key specs:
| Spec | Value |
|---|---|
| Parameters | 7.07B |
| Architecture | Qwen2ForCausalLM |
| Context Length | 128K tokens |
| Max Output | 8K tokens |
| Vocab Size | 152K |
| Precision | BF16 |
| License | Apache 2.0 |
Uses YaRN (Yet another RoPE extensioN) for efficient context window extension — a method that requires 10x fewer tokens and 2.5x fewer training steps than previous approaches to extend context beyond pretraining length.
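Note that the shipped config enables 32K context by default; per the Qwen documentation, processing longer inputs requires turning on YaRN scaling by adding a `rope_scaling` entry to the model's `config.json`, roughly:

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

Static YaRN scaling can slightly degrade quality on short inputs, so the docs suggest enabling it only when long-context handling is actually needed.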
Deployment
vLLM (recommended for production)
```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct
```
Supports built-in tool calling with vLLM ≥0.6:
```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
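With the server running (assumed here on the default `localhost:8000`), clients send tools in the OpenAI function-calling format. A sketch of the request body; `get_weather` is a made-up example tool, not part of Qwen or vLLM:

```python
# Hypothetical tool schema in the OpenAI function-calling format that
# vLLM's hermes parser maps Qwen2.5's tool calls onto.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather_tool],
    "tool_choice": "auto",
}
# POST this as JSON to http://localhost:8000/v1/chat/completions;
# any tool calls come back under choices[0].message.tool_calls.
print(request_body["tools"][0]["function"]["name"])
```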
Ollama
```shell
ollama run qwen2.5:7b-instruct
```
Supports Ollama’s native tool calling.
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```
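For inference, prompts should go through the tokenizer's chat template (`tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`). As a rough, dependency-free sketch of what that template produces for Qwen2.5 (the ChatML layout; the real template also injects a default system prompt and handles tools), `to_chatml` here is an illustrative stand-in, not a library function:

```python
def to_chatml(messages):
    """Approximate the ChatML layout Qwen2.5's chat template renders:
    each message wrapped in <|im_start|>role ... <|im_end|> markers,
    ending with the assistant generation prompt."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return rendered + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

In real code, prefer `apply_chat_template` so special tokens and template details stay in sync with the model.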
Tool calling
Qwen2.5 supports three tool calling approaches:
- vLLM / Ollama — uses a Hermes-style tool calling template, compatible with OpenAI function calling format
- Transformers — native HuggingFace tool calling support via chat templates
- Qwen-Agent — Qwen’s own agent framework, maintains backward compatibility with Qwen2’s tool calling template
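In the Hermes-style format, the model emits each tool call as a JSON object wrapped in `<tool_call>` tags; vLLM's parser extracts these for you, but when working with raw completions you can do it with a regex. A minimal sketch (the sample output string is fabricated for illustration):

```python
import json
import re

# Matches one <tool_call>{...}</tool_call> span in raw model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return the JSON payloads of all Hermes-style tool calls in `text`."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

sample = (
    "Let me check.\n<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
```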
Ecosystem
Broad framework support from the community:
- Fine-tuning: Unsloth, Llama-Factory, Axolotl, Swift, XTuner, Peft
- Quantization: AutoGPTQ, AutoAWQ, Neural Compressor
- Deployment: vLLM, SGLang, TensorRT-LLM, OpenVINO, TGI
- Local: Ollama, LM Studio, Jan, llama.cpp, MLX
- Agents/RAG: Dify, LlamaIndex, CrewAI
- API providers: Together, Fireworks, OpenRouter, SiliconFlow
References
- HuggingFace huggingface.co/Qwen/Qwen2.5-7B-Instruct
- GitHub github.com/QwenLM/Qwen2.5
- YaRN paper arxiv.org/abs/2309.00071
- Blog qwenlm.github.io/blog/qwen2.5/
- Docs qwen.readthedocs.io/en/latest/