Qwen (Alibaba) · 7B · Apache-2.0 · text-generation

Qwen2.5 7B Instruct

Released September 19, 2024

Context Window 131K tokens
≈ 98 pages of text
20.7M Downloads
1.1K Likes
15.2 GB Disk Size
26.8K GitHub ★

Pricing

Input  $0.04 per million tokens
Output $0.04 per million tokens
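At these rates, per-request cost is simple arithmetic. A minimal sketch, using the input and output prices listed above (the token counts are illustrative):

```python
# Rates taken from the pricing table above: $0.04 per million tokens
# for both input and output.
INPUT_RATE = 0.04 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.04 / 1_000_000  # dollars per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 10K-token prompt with a 1K-token reply:
print(round(estimate_cost(10_000, 1_000), 6))  # → 0.00044
```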

Benchmarks

Open LLM Leaderboard v2

IFEval 74.4
BBH 43.2
MATH Lvl 5 43.6
GPQA 10.6
MUSR 17.3
MMLU-PRO 22.1

Average: 35.20

About

Qwen2.5-7B-Instruct is a 7 billion parameter instruction-tuned language model from Alibaba’s Qwen team. It is part of the Qwen2.5 family — one of the largest coordinated open source model releases in history, spanning sizes from 0.5B to 72B parameters.

What makes it notable

The 7B Instruct model strikes a strong balance between capability and resource efficiency. Pretrained on 18 trillion tokens, Qwen2.5 models represent a significant leap over Qwen2 in knowledge (MMLU 85+), coding (HumanEval 85+), and mathematics (MATH 80+); those figures are measured on the 72B variant, with corresponding gains at 7B.

Key improvements over Qwen2:

  • 128K context window with up to 8K token generation, using YaRN positional encoding for efficient long-context handling
  • Structured data comprehension — improved table understanding and structured output generation
  • Reliable JSON output — critical for tool calling and API integration
  • Diverse system prompt support — better role-play and condition-setting for chatbots
  • 29+ languages including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic
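Even with improved JSON reliability, downstream code should still validate what the model returns. A minimal sketch, where the reply string is a hypothetical example rather than captured model output:

```python
import json

# Hypothetical model reply after prompting for JSON output; in practice
# this string would come from the model's generated text.
reply = '{"city": "Hangzhou", "temperature_c": 21}'

def parse_json_reply(text: str) -> dict:
    """Parse a JSON reply, surfacing a clear error if the model drifted."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"model did not return valid JSON: {err}") from err

data = parse_json_reply(reply)
print(data["city"])  # → Hangzhou
```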

Architecture

Dense decoder-only transformer using the Qwen2 architecture. Key specs:

Spec            Value
Parameters      7.07B
Architecture    Qwen2ForCausalLM
Context Length  128K tokens
Max Output      8K tokens
Vocab Size      152K
Precision       BF16
License         Apache 2.0

Uses YaRN (Yet another RoPE extensioN) for efficient context window extension — a method that requires 10x fewer tokens and 2.5x fewer training steps than previous approaches to extend context beyond pretraining length.
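Per the Qwen documentation, YaRN scaling for inputs beyond 32K tokens is enabled by adding a rope_scaling entry to the model's config.json. A sketch of that fragment (factor 4.0 extends 32K to roughly 128K; check the current model card for the exact recommended values):

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```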

Deployment

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct

Supports built-in tool calling with vLLM ≥0.6:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
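With the server running as above, tool calls go through the OpenAI-compatible chat completions endpoint. A minimal sketch of the request payload; the get_weather tool and the localhost URL are illustrative assumptions:

```python
import json

# Hermes/OpenAI-style tool definition; get_weather is a hypothetical tool
# used only to show the schema shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}

# POST this as JSON to http://localhost:8000/v1/chat/completions
# (any OpenAI-compatible client sends the same structure).
print(json.dumps(payload, indent=2))
```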

Ollama

ollama run qwen2.5:7b-instruct

Supports Ollama’s native tool calling.

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
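In practice you would format conversations with tokenizer.apply_chat_template rather than by hand. For illustration, here is a sketch of the ChatML structure that template produces for Qwen models, built manually; exact whitespace and special tokens may differ by template version:

```python
# Hand-built approximation of Qwen's ChatML prompt format, shown only to
# make the structure visible; use tokenizer.apply_chat_template in real code.
def to_chatml(messages: list[dict]) -> str:
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```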

Tool calling

Qwen2.5 supports three tool calling approaches:

  1. vLLM / Ollama — uses a Hermes-style tool calling template, compatible with OpenAI function calling format
  2. Transformers — native HuggingFace tool calling support via chat templates
  3. Qwen-Agent — Qwen’s own agent framework, maintains backward compatibility with Qwen2’s tool calling template
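With the Hermes-style template, the model emits each call as JSON wrapped in tool_call tags in its generated text; serving frameworks parse these for you, but a minimal extraction sketch looks like this (the sample string is illustrative, not captured output):

```python
import json
import re

# Each Hermes-style call is a JSON object between <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every {"name": ..., "arguments": ...} object found in text."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

sample = ('<tool_call>\n'
          '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
          '</tool_call>')
print(extract_tool_calls(sample)[0]["name"])  # → get_weather
```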

Ecosystem

Broad framework support from the community:

  • Fine-tuning: Unsloth, Llama-Factory, Axolotl, Swift, XTuner, Peft
  • Quantization: AutoGPTQ, AutoAWQ, Neural Compressor
  • Deployment: vLLM, SGLang, TensorRT-LLM, OpenVINO, TGI
  • Local: Ollama, LM Studio, Jan, llama.cpp, MLX
  • Agents/RAG: Dify, LlamaIndex, CrewAI
  • API providers: Together, Fireworks, OpenRouter, SiliconFlow

References

  • 🤗 HuggingFace huggingface.co/Qwen/Qwen2.5-7B-Instruct
  • ⌨️ GitHub github.com/QwenLM/Qwen2.5
  • 📄 Paper arxiv.org/abs/2309.00071
  • 📝 Blog qwenlm.github.io/blog/qwen2.5/
  • 📖 Docs qwen.readthedocs.io/en/latest/