DeepSeek R1
Released January 20, 2025
About
DeepSeek-R1 is a 671 billion parameter reasoning model released January 20, 2025 under the MIT license. Built on the DeepSeek-V3-Base backbone, it uses a Mixture-of-Experts architecture that activates only 37 billion parameters per token. The model supports a 128K token context window with generation up to 32,768 tokens. At 688.6 GB on disk, it is one of the largest open-weight models available, and one of the few to match OpenAI’s o1 on reasoning benchmarks while shipping open weights under a permissive license with explicit distillation rights.
Why it matters
DeepSeek-R1 matters for three reasons, each validated by the paper (arXiv:2501.12948):
Pure RL reasoning. The precursor model, R1-Zero, was trained exclusively with reinforcement learning — no supervised fine-tuning at all. It naturally developed self-verification, reflection, and extended chain-of-thought reasoning. On AIME 2024, R1-Zero’s pass@1 climbed from 15.6% to 77.9% during training, surpassing average human competitor performance. This was the first open research demonstrating that reasoning can be incentivized purely through RL, without human-labeled reasoning trajectories.
Frontier performance at $294K. The total training cost for R1 was approximately $294K worth of H800 GPU-hours, a fraction of what comparable proprietary models are believed to cost. R1 matches or exceeds OpenAI o1 on MATH-500 (97.3 vs 96.4), AIME 2024 (79.8 vs 79.2), and LiveCodeBench (65.9 vs 63.4), and scores 87.6 on AlpacaEval 2.0, for which no o1 score is reported.
Open distillation. The MIT license explicitly allows using R1’s outputs to train other models. The distilled R1-Distill-Qwen-32B outperforms OpenAI o1-mini on multiple benchmarks — a 32B open model beating a frontier proprietary reasoning system.
Architecture
| Spec | Value |
|---|---|
| Total Parameters | 671B |
| Activated per Token | 37B (MoE) |
| Base Model | DeepSeek-V3-Base |
| V3 Architecture | MoE with Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, Multi-Token Prediction (MTP) |
| V3-Base Pretraining | 14.8T tokens (plain web pages + e-books, no synthetic data) |
| Context Length | 128K tokens |
| Max Generation | 32,768 tokens |
| Languages | Primarily Chinese and English |
| Disk Size | 688.6 GB |
| License | MIT |
The MoE design activates only 37B of 671B parameters per token — providing the knowledge capacity of a massive model at a fraction of the inference cost. The architecture inherits DeepSeek-V3’s innovations: Multi-head Latent Attention for efficient inference, auxiliary-loss-free load balancing for stable expert routing, and Multi-Token Prediction for improved throughput.
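The per-token expert selection behind this efficiency can be illustrated with a toy top-k router. Everything here (expert count, dimensions, gating details) is an illustrative assumption, not R1's actual configuration; the point is only that each token runs a small subset of experts:

```python
import numpy as np

def route_token(hidden, gate_weights, k=8):
    """Toy MoE gating: score all experts for one token, keep the top-k,
    and softmax-normalize their weights. Only the selected experts run."""
    logits = hidden @ gate_weights            # (num_experts,) router scores
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                  # softmax over selected experts
    return topk, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal(16)              # toy hidden state
gate = rng.standard_normal((16, 64))          # toy router for 64 experts
experts, weights = route_token(hidden, gate, k=8)
# Only 8 of 64 toy experts run for this token -- the same principle that
# lets R1 activate 37B of its 671B parameters.
```

In the real model, the FLOPs per token scale with the activated parameters (37B), while the full 671B parameter set still determines total memory footprint, which is why R1 is cheap to run per token but expensive to host.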
Training
The R1-Zero experiment (pure RL, no SFT)
Before R1, DeepSeek trained R1-Zero by applying GRPO (Group Relative Policy Optimization) directly to the V3-Base model with zero supervised fine-tuning.
| R1-Zero Detail | Value |
|---|---|
| Algorithm | GRPO (eliminates PPO’s value model/critic) |
| Hardware | 64×8 H800 GPUs |
| Training Time | ~198 hours |
| Steps | 10,400 (1.6 epochs) |
| Samples per Question | 16 |
| Batch Size | 512 (32 questions × 16) |
| Max Length | 32,768 tokens (→ 65,536 at step 8.2K) |
| Temperature | 1.0 |
| Learning Rate | 3e-6 |
| KL Coefficient | 0.001 |
| Clip Ratio (ε) | 10 |
| Reward | Rule-based only (accuracy + format). No neural reward models. |
The reward was deliberately simple: accuracy (is the answer correct?) and format (did the model use `<think>` and `<answer>` tags?). No neural reward models were used; the paper notes these are susceptible to reward hacking at scale.
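A toy version of this two-part rule-based reward might look as follows. The paper's actual answer-matching logic (e.g. for math expressions) is more involved; this sketch only checks tag structure and exact string equality:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Sketch of the rule-based reward: +1 if the completion follows the
    <think>...</think><answer>...</answer> format, +1 if the extracted
    answer matches the reference. No learned reward model is involved."""
    fmt_ok = bool(re.fullmatch(
        r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion))
    m = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    answer = m.group(1).strip() if m else ""
    acc_ok = answer == gold_answer.strip()
    return float(fmt_ok) + float(acc_ok)

print(rule_based_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 2.0
```

Because both checks are deterministic rules, there is no neural reward model for the policy to exploit, which is the paper's stated reason for this design.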
R1-Zero’s AIME 2024 performance rose from 15.6% to 77.9% pass@1 during training. With consensus@16 decoding, it reached 86.7%. The model spontaneously developed an “aha moment” — a sudden increase in using the word “wait” during reflections, marking a shift in reasoning patterns. However, R1-Zero exhibited endless repetition, language mixing (English/Chinese), and poor readability.
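The core of GRPO, as used above, is that advantages are computed relative to the group of samples for the same question, which is what eliminates PPO's learned value model. A minimal sketch of that group-relative normalization (the standardization form is the commonly published one and an assumption here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each of the G sampled responses to one
    question is scored against the group's own mean and std, so no critic
    network is needed to estimate a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 16 samples per question (as in the table above), rule-based rewards in {0,1,2}
adv = grpo_advantages([2, 0, 1, 2, 0, 0, 1, 2, 2, 0, 1, 1, 0, 2, 1, 0])
# Correct, well-formatted samples get positive advantage; the rest negative.
```

These advantages then weight a clipped policy-gradient objective, as in PPO, but the baseline comes for free from the 16-sample group rather than from a trained value model.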
R1 pipeline (4 stages)
Stage 1 — Cold-Start SFT: Thousands of examples (not millions) with conversational, human-aligned thinking processes. Principles: concise paragraphs, conversational tone, no markdown formatting, understand complete user context. Human annotators verified accuracy.
Stage 2 — RL Stage 1 (Reasoning): Same GRPO hyperparameters as R1-Zero. Adds a language consistency reward (proportion of target-language words in the chain-of-thought). Only reasoning prompts with rule-based rewards.
Stage 3 — SFT on 800K samples: Rejection sampling from the Stage 2 checkpoint. ~600K reasoning + ~200K non-reasoning samples.
| Domain | Samples | Avg Tokens |
|---|---|---|
| Math | 395,285 | 6,094 |
| Code | 211,129 | 7,436 |
| STEM | 10,124 | 4,929 |
| Logic | 10,395 | 2,739 |
| General | 177,812 | 1,420 |
Reasoning data was filtered to remove mixed languages, long paragraphs, and code blocks within chain-of-thought. Non-reasoning data reused portions of the DeepSeek-V3 SFT dataset plus software engineering data.
Stage 4 — RL Stage 2 (Alignment): 1,700 total steps at temperature 0.7 (higher caused incoherent generation). Rule-based rewards for reasoning, model-based rewards for general data. General instruction data and preference rewards introduced only in the final 400 steps to limit reward hacking.
Reward models
- Helpful RM: 66,000 preference pairs, pairwise loss, batch 256, LR 6e-6, 1 epoch
- Safety RM: 106,000 prompts annotated safe/unsafe, point-wise classification
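The pairwise loss for the Helpful RM is presumably the standard Bradley-Terry form used for preference reward models; the paper specifies "pairwise loss" but not the exact formula, so this is a sketch of the conventional choice:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response well
    above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

At zero margin the loss is ln 2; it falls toward zero as the chosen response's score pulls ahead, so training pushes the 66,000 preference pairs apart in score space.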
Training cost
| Phase | H800 GPU-hours | Cost ($2/GPU-hr) |
|---|---|---|
| R1-Zero | 101,000 | $202K |
| SFT data creation | 5,000 | $10K |
| R1 | 41,000 | $82K |
| Total | 147,000 | ~$294K |
R1-Zero: 64×8 H800 GPUs for ~198 hours. R1: same cluster for ~80 hours (about 3.3 days).
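The table's arithmetic is easy to verify (the $2/GPU-hour rate is the table's own assumption):

```python
# Reproduce the training-cost table at an assumed $2 per H800 GPU-hour.
phases = {"R1-Zero": 101_000, "SFT data creation": 5_000, "R1": 41_000}
rate = 2  # dollars per GPU-hour (assumption stated in the table header)

total_hours = sum(phases.values())   # 147,000 GPU-hours
total_cost = total_hours * rate      # $294,000

# R1's 41,000 GPU-hours on the 512-GPU cluster (64 nodes x 8 H800s)
# corresponds to roughly 80 wall-clock hours.
wall_clock = phases["R1"] / (64 * 8)
```

Note that 80 wall-clock hours is about 3.3 days, not the 4 days sometimes quoted.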
Benchmarks
All scores from the paper. Temperature 0.6, top-p 0.95, 64 responses per query for pass@1.
| Benchmark | DeepSeek R1 | OpenAI o1 | Claude 3.5 Sonnet | GPT-4o |
|---|---|---|---|---|
| MATH-500 | 97.3 | 96.4 | 78.3 | 74.6 |
| AIME 2024 | 79.8 | 79.2 | 16.0 | 9.3 |
| MMLU | 90.8 | 91.8 | 88.3 | 87.2 |
| MMLU-Pro | 84.0 | — | 78.0 | 72.6 |
| GPQA Diamond | 71.5 | 75.7 | 65.0 | 49.9 |
| LiveCodeBench | 65.9 | 63.4 | 33.8 | 34.2 |
| Codeforces Rating | 2029 | 2061 | 717 | 759 |
| SWE Verified | 49.2 | 48.9 | 50.8 | 38.8 |
| AlpacaEval 2.0 | 87.6 | — | 52.0 | 51.1 |
| ArenaHard | 92.3 | — | 85.2 | 80.4 |
| DROP (3-shot F1) | 92.2 | 90.2 | 88.3 | 83.7 |
| SimpleQA | 30.1 | 47.0 | 28.4 | 38.2 |
| IFEval | 83.3 | — | 86.5 | 84.3 |
| Aider-Polyglot | 53.3 | 61.7 | 45.3 | 16.0 |
R1 beats o1 on MATH-500, AIME, LiveCodeBench, DROP, and SWE Verified. o1 wins on MMLU, GPQA Diamond, Codeforces, SimpleQA (factual recall), and Aider-Polyglot. The overall picture: R1 is a genuine peer to o1 with a different strength profile, stronger on math competitions and general coding benchmarks, weaker on factual recall and multi-language coding.
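The evaluation protocol above (pass@1 averaged over 64 sampled responses, plus the cons@16 decoding reported earlier for R1-Zero) can be sketched as:

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 estimated as the mean correctness over k sampled responses,
    matching the protocol above (64 responses per query, temperature 0.6,
    top-p 0.95). Averaging many samples reduces estimator variance."""
    return sum(correct_flags) / len(correct_flags)

def consensus(final_answers):
    """cons@k (majority-vote) decoding: the most frequent final answer
    wins, as in the cons@16 figure reported for R1-Zero on AIME."""
    return Counter(final_answers).most_common(1)[0][0]

print(pass_at_1([1, 0, 1, 1]))     # 0.75
print(consensus(["4", "4", "5"]))  # 4
```

This is why pass@1 scores from different papers are only comparable when sampling settings match: the estimate depends on temperature and sample count, not just the model.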
Distillation
Using the same 800K samples from Stage 3, DeepSeek fine-tuned six smaller dense models for 2–3 epochs with cosine-decay learning rate (to 1/10 of initial), max context 32,768 tokens, batch size 64.
| Model | Base | LR | AIME 2024 | MATH-500 | GPQA Diamond | Codeforces |
|---|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1e-4 | 28.9 | 83.9 | 33.8 | 954 |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 8e-5 | 55.5 | 92.8 | 49.1 | 1189 |
| R1-Distill-Llama-8B | Llama-3.1-8B | 5e-5 | 50.4 | 89.1 | 49.0 | 1205 |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 7e-5 | 69.7 | 93.9 | 59.1 | 1481 |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 6e-5 | 72.6 | 94.3 | 62.1 | 1691 |
| R1-Distill-Llama-70B | Llama-3.3-70B | 2e-5 | 70.0 | 94.5 | 65.2 | 1633 |
The Qwen-32B distill is the standout: it outperforms OpenAI o1-mini on AIME (72.6 vs 63.6) and MATH-500 (94.3 vs 90.0), though it trails on Codeforces (1691 vs 1820). The paper demonstrates that distilling reasoning patterns from a large model produces better results than discovering them through RL directly on small models.
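The cosine-decay schedule used for the distill SFT runs ("to 1/10 of initial") can be sketched as follows; the exact schedule shape is an assumption, and warmup is omitted for simplicity:

```python
import math

def cosine_decay_lr(step: int, total_steps: int, lr_init: float) -> float:
    """Cosine decay from lr_init down to lr_init / 10 over total_steps,
    matching the distillation setup described above."""
    lr_min = lr_init / 10
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos

# e.g. the Qwen-32B distill starts at 6e-5 and ends at 6e-6
```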
Deployment
Inference: HuggingFace Transformers is not directly supported. Use vLLM or SGLang:

```shell
# Full model (688.6 GB of weights -- needs a deployment with enough
# aggregate GPU memory, e.g. 8x H200 or a multi-node H100 cluster)
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8

# Distilled models (standard Qwen/Llama architectures)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768
```
Usage recommendations (from paper):
- Temperature: 0.5–0.7 (0.6 recommended) to avoid repetitive or incoherent output
- No system prompt: all instructions go in the user message
- Force thinking by prefilling the assistant turn with `<think>\n`
- Zero-shot only: few-shot prompting consistently degrades performance
- For math: include “Please reason step by step, and put your final answer within \boxed{}”
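The recommendations above can be bundled into a request builder for an OpenAI-compatible endpoint (the payload shape targets a vLLM/SGLang server; field support for assistant-prefill continuation varies by server, so treat this as a sketch):

```python
def build_r1_request(question: str, *, math: bool = False) -> dict:
    """Build a chat request following the usage recommendations above:
    no system prompt, zero-shot, temperature 0.6 / top-p 0.95, and a
    <think> prefill on the assistant turn to force reasoning."""
    if math:
        question += ("\nPlease reason step by step, and put your final "
                     "answer within \\boxed{}.")
    return {
        "model": "deepseek-ai/DeepSeek-R1",
        "temperature": 0.6,
        "top_p": 0.95,
        "messages": [
            {"role": "user", "content": question},          # no system message
            {"role": "assistant", "content": "<think>\n"},  # prefill to force thinking
        ],
    }
```

Everything, including behavioral instructions, goes into the single user message, since the model was not trained to follow a separate system prompt.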
API: Available on OpenRouter ($0.70/M input, $2.50/M output) and DeepSeek’s own platform at platform.deepseek.com. Chat interface at chat.deepseek.com with “DeepThink” toggle.
Limitations
From the paper’s own assessment:
- No tool use — structured output is suboptimal; the model cannot leverage search engines, calculators, or other tools. The paper notes this “will be addressed in the next version.”
- Token inefficiency — overthinks simple questions, generating excessive reasoning tokens where brief answers suffice.
- Language mixing — optimized for Chinese and English only. Other languages trigger English reasoning even when the query is in another language.
- Prompt sensitivity — few-shot prompting consistently degrades performance. Use zero-shot with direct problem descriptions.
- Limited SWE gains — long evaluation times prevented large-scale RL on software engineering tasks, so R1 shows limited improvement over V3 on SWE benchmarks.
- Reward hacking — model-based preference rewards are susceptible to exploitation; the paper limited preference RL to only 400 steps to mitigate this.
Community
- 91.9K GitHub stars — among the most starred AI repositories ever
- 1M+ HuggingFace downloads, 13,111 likes
- MIT license with explicit distillation permission spawned an ecosystem of community quantizations, fine-tunes, and derivative models
- The release triggered a measurable dip in AI-related stock indices, demonstrating competitive impact
References
- HuggingFace huggingface.co/deepseek-ai/DeepSeek-R1
- GitHub github.com/deepseek-ai/DeepSeek-R1
- Paper arxiv.org/abs/2501.12948