DeepSeek R1
Released January 20, 2025
About
DeepSeek-R1 is a 671 billion parameter reasoning model released January 20, 2025 under the MIT license. Built on the DeepSeek-V3-Base backbone, it uses a Mixture-of-Experts architecture that activates only 37 billion parameters per token. The model supports a 128K token context window with generation up to 32,768 tokens. At 688.6 GB on disk, it is one of the largest open-weight models available, and one of the few to match OpenAI’s o1 on reasoning benchmarks while shipping open weights under a permissive license with explicit distillation rights.
Why it matters
DeepSeek-R1 matters for three reasons, each validated by the paper (arXiv:2501.12948):
Pure RL reasoning. The precursor model, R1-Zero, was trained exclusively with reinforcement learning — no supervised fine-tuning at all. It naturally developed self-verification, reflection, and extended chain-of-thought reasoning. On AIME 2024, R1-Zero’s pass@1 climbed from 15.6% to 77.9% during training, surpassing average human competitor performance. This was the first open research demonstrating that reasoning can be incentivized purely through RL, without human-labeled reasoning trajectories.
Frontier performance at $294K. The total training cost for R1 was approximately $294K worth of H800 GPU-hours, a fraction of what comparable proprietary models are believed to cost. R1 matches or exceeds OpenAI o1 on MATH-500 (97.3 vs 96.4), AIME 2024 (79.8 vs 79.2), and LiveCodeBench (65.9 vs 63.4), and scores 87.6 on AlpacaEval 2.0, for which no o1 score is reported.
Open distillation. The MIT license explicitly allows using R1’s outputs to train other models. The distilled R1-Distill-Qwen-32B outperforms OpenAI o1-mini on multiple benchmarks — a 32B open model beating a frontier proprietary reasoning system.
Architecture
| Spec | Value |
|---|---|
| Total Parameters | 671B |
| Activated per Token | 37B (MoE) |
| Base Model | DeepSeek-V3-Base |
| V3 Architecture | MoE with Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, Multi-Token Prediction (MTP) |
| V3-Base Pretraining | 14.8T tokens (plain web pages + e-books, no synthetic data) |
| Context Length | 128K tokens |
| Max Generation | 32,768 tokens |
| Languages | Primarily Chinese and English |
| Disk Size | 688.6 GB |
| License | MIT |
The MoE design activates only 37B of 671B parameters per token — providing the knowledge capacity of a massive model at a fraction of the inference cost. The architecture inherits DeepSeek-V3’s innovations: Multi-head Latent Attention for efficient inference, auxiliary-loss-free load balancing for stable expert routing, and Multi-Token Prediction for improved throughput.
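The per-token expert selection behind this efficiency can be illustrated with a toy top-k router. Everything here (expert count, dimensions, gating details) is an illustrative assumption, not R1's actual configuration; the point is only that each token runs a small subset of experts:

```python
import numpy as np

def route_token(hidden, gate_weights, k=8):
    """Toy MoE gating: score all experts for one token, keep the top-k,
    and softmax-normalize their weights. Only the selected experts run."""
    logits = hidden @ gate_weights            # (num_experts,) router scores
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                  # softmax over selected experts
    return topk, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal(16)              # toy hidden state
gate = rng.standard_normal((16, 64))          # toy router for 64 experts
experts, weights = route_token(hidden, gate, k=8)
# Only 8 of 64 toy experts run for this token -- the same principle that
# lets R1 activate 37B of its 671B parameters.
```

In the real model, the FLOPs per token scale with the activated parameters (37B), while the full 671B parameter set still determines total memory footprint, which is why R1 is cheap to run per token but expensive to host.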
Training
The R1-Zero experiment (pure RL, no SFT)
Before R1, DeepSeek trained R1-Zero by applying GRPO (Group Relative Policy Optimization) directly to the V3-Base model with zero supervised fine-tuning.
| R1-Zero Detail | Value |
|---|---|
| Algorithm | GRPO (eliminates PPO’s value model/critic) |
| Hardware | 64×8 H800 GPUs |
| Training Time | ~198 hours |
| Steps | 10,400 (1.6 epochs) |
| Samples per Question | 16 |
| Batch Size | 512 (32 questions × 16) |
| Max Length | 32,768 tokens (→ 65,536 at step 8.2K) |
| Temperature | 1.0 |
| Learning Rate | 3e-6 |
| KL Coefficient | 0.001 |
| Clip Ratio (ε) | 10 |
| Reward | Rule-based only (accuracy + format). No neural reward models. |
The reward was deliberately simple: accuracy (is the answer correct?) and format (did the model use `<think>` and `<answer>` tags?). No neural reward models were used; the paper notes these are susceptible to reward hacking at scale.
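A toy version of this two-part rule-based reward might look as follows. The paper's actual answer-matching logic (e.g. for math expressions) is more involved; this sketch only checks tag structure and exact string equality:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Sketch of the rule-based reward: +1 if the completion follows the
    <think>...</think><answer>...</answer> format, +1 if the extracted
    answer matches the reference. No learned reward model is involved."""
    fmt_ok = bool(re.fullmatch(
        r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion))
    m = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    answer = m.group(1).strip() if m else ""
    acc_ok = answer == gold_answer.strip()
    return float(fmt_ok) + float(acc_ok)

print(rule_based_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 2.0
```

Because both checks are deterministic rules, there is no neural reward model for the policy to exploit, which is the paper's stated reason for this design.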
R1-Zero’s AIME 2024 performance rose from 15.6% to 77.9% pass@1 during training. With consensus@16 decoding, it reached 86.7%. The model spontaneously developed an “aha moment” — a sudden increase in using the word “wait” during reflections, marking a shift in reasoning patterns. However, R1-Zero exhibited endless repetition, language mixing (English/Chinese), and poor readability.
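The core of GRPO, as used above, is that advantages are computed relative to the group of samples for the same question, which is what eliminates PPO's learned value model. A minimal sketch of that group-relative normalization (the standardization form is the commonly published one and an assumption here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each of the G sampled responses to one
    question is scored against the group's own mean and std, so no critic
    network is needed to estimate a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 16 samples per question (as in the table above), rule-based rewards in {0,1,2}
adv = grpo_advantages([2, 0, 1, 2, 0, 0, 1, 2, 2, 0, 1, 1, 0, 2, 1, 0])
# Correct, well-formatted samples get positive advantage; the rest negative.
```

These advantages then weight a clipped policy-gradient objective, as in PPO, but the baseline comes for free from the 16-sample group rather than from a trained value model.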
R1 pipeline (4 stages)
Stage 1 — Cold-Start SFT: Thousands of examples (not millions) with conversational, human-aligned thinking processes. Principles: concise paragraphs, conversational tone, no markdown formatting, understand complete user context. Human annotators verified accuracy.
Stage 2 — RL Stage 1 (Reasoning): Same GRPO hyperparameters as R1-Zero. Adds a language consistency reward (proportion of target-language words in the chain-of-thought). Only reasoning prompts with rule-based rewards.
Stage 3 — SFT on 800K samples: Rejection sampling from the Stage 2 checkpoint. ~600K reasoning + ~200K non-reasoning samples.
| Domain | Samples | Avg Tokens |
|---|---|---|
| Math | 395,285 | 6,094 |
| Code | 211,129 | 7,436 |
| STEM | 10,124 | 4,929 |
| Logic | 10,395 | 2,739 |
| General | 177,812 | 1,420 |
Reasoning data was filtered to remove mixed languages, long paragraphs, and code blocks within chain-of-thought. Non-reasoning data reused portions of the DeepSeek-V3 SFT dataset plus software engineering data.
Stage 4 — RL Stage 2 (Alignment): 1,700 total steps at temperature 0.7 (higher caused incoherent generation). Rule-based rewards for reasoning, model-based rewards for general data. General instruction data and preference rewards introduced only in the final 400 steps to limit reward hacking.
Reward models
- Helpful RM: 66,000 preference pairs, pairwise loss, batch 256, LR 6e-6, 1 epoch
- Safety RM: 106,000 prompts annotated safe/unsafe, point-wise classification
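The pairwise loss for the Helpful RM is presumably the standard Bradley-Terry form used for preference reward models; the paper specifies "pairwise loss" but not the exact formula, so this is a sketch of the conventional choice:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response well
    above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

At zero margin the loss is ln 2; it falls toward zero as the chosen response's score pulls ahead, so training pushes the 66,000 preference pairs apart in score space.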
Training cost
| Phase | H800 GPU-hours | Cost ($2/GPU-hr) |
|---|---|---|
| R1-Zero | 101,000 | $202K |
| SFT data creation | 5,000 | $10K |
| R1 | 41,000 | $82K |
| Total | 147,000 | ~$294K |
R1-Zero: 64×8 H800 GPUs for ~198 hours. R1: same cluster for ~80 hours (about 3.3 days).
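The table's arithmetic is easy to verify (the $2/GPU-hour rate is the table's own assumption):

```python
# Reproduce the training-cost table at an assumed $2 per H800 GPU-hour.
phases = {"R1-Zero": 101_000, "SFT data creation": 5_000, "R1": 41_000}
rate = 2  # dollars per GPU-hour (assumption stated in the table header)

total_hours = sum(phases.values())   # 147,000 GPU-hours
total_cost = total_hours * rate      # $294,000

# R1's 41,000 GPU-hours on the 512-GPU cluster (64 nodes x 8 H800s)
# corresponds to roughly 80 wall-clock hours.
wall_clock = phases["R1"] / (64 * 8)
```

Note that 80 wall-clock hours is about 3.3 days, not the 4 days sometimes quoted.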
Benchmarks
All scores from the paper. Temperature 0.6, top-p 0.95, 64 responses per query for pass@1.
| Benchmark | DeepSeek R1 | OpenAI o1 | Claude 3.5 Sonnet | GPT-4o |
|---|---|---|---|---|
| MATH-500 | 97.3 | 96.4 | 78.3 | 74.6 |
| AIME 2024 | 79.8 | 79.2 | 16.0 | 9.3 |
| MMLU | 90.8 | 91.8 | 88.3 | 87.2 |
| MMLU-Pro | 84.0 | — | 78.0 | 72.6 |
| GPQA Diamond | 71.5 | 75.7 | 65.0 | 49.9 |
| LiveCodeBench | 65.9 | 63.4 | 33.8 | 34.2 |
| Codeforces Rating | 2029 | 2061 | 717 | 759 |
| SWE Verified | 49.2 | 48.9 | 50.8 | 38.8 |
| AlpacaEval 2.0 | 87.6 | — | 52.0 | 51.1 |
| ArenaHard | 92.3 | — | 85.2 | 80.4 |
| DROP (3-shot F1) | 92.2 | 90.2 | 88.3 | 83.7 |
| SimpleQA | 30.1 | 47.0 | 28.4 | 38.2 |
| IFEval | 83.3 | — | 86.5 | 84.3 |
| Aider-Polyglot | 53.3 | 61.7 | 45.3 | 16.0 |
R1 beats o1 on MATH-500, AIME, LiveCodeBench, DROP, and SWE Verified. o1 wins on MMLU, GPQA Diamond, Codeforces, SimpleQA (factual recall), and Aider-Polyglot. The overall picture: R1 is a genuine peer to o1 with a different strength profile, stronger on math competitions and general coding benchmarks, weaker on factual recall and multi-language coding.
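The evaluation protocol above (pass@1 averaged over 64 sampled responses, plus the cons@16 decoding reported earlier for R1-Zero) can be sketched as:

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 estimated as the mean correctness over k sampled responses,
    matching the protocol above (64 responses per query, temperature 0.6,
    top-p 0.95). Averaging many samples reduces estimator variance."""
    return sum(correct_flags) / len(correct_flags)

def consensus(final_answers):
    """cons@k (majority-vote) decoding: the most frequent final answer
    wins, as in the cons@16 figure reported for R1-Zero on AIME."""
    return Counter(final_answers).most_common(1)[0][0]

print(pass_at_1([1, 0, 1, 1]))     # 0.75
print(consensus(["4", "4", "5"]))  # 4
```

This is why pass@1 scores from different papers are only comparable when sampling settings match: the estimate depends on temperature and sample count, not just the model.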
Distillation
Using the same 800K samples from Stage 3, DeepSeek fine-tuned six smaller dense models for 2–3 epochs with cosine-decay learning rate (to 1/10 of initial), max context 32,768 tokens, batch size 64.
| Model | Base | LR | AIME 2024 | MATH-500 | GPQA Diamond | Codeforces |
|---|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1e-4 | 28.9 | 83.9 | 33.8 | 954 |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 8e-5 | 55.5 | 92.8 | 49.1 | 1189 |
| R1-Distill-Llama-8B | Llama-3.1-8B | 5e-5 | 50.4 | 89.1 | 49.0 | 1205 |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 7e-5 | 69.7 | 93.9 | 59.1 | 1481 |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 6e-5 | 72.6 | 94.3 | 62.1 | 1691 |
| R1-Distill-Llama-70B | Llama-3.3-70B | 2e-5 | 70.0 | 94.5 | 65.2 | 1633 |
The Qwen-32B distill is the standout: it outperforms OpenAI o1-mini on AIME (72.6 vs 63.6) and MATH-500 (94.3 vs 90.0), though it trails on Codeforces (1691 vs 1820). The paper demonstrates that distilling reasoning patterns from a large model produces better results than discovering them through RL directly on small models.
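The cosine-decay schedule used for the distill SFT runs ("to 1/10 of initial") can be sketched as follows; the exact schedule shape is an assumption, and warmup is omitted for simplicity:

```python
import math

def cosine_decay_lr(step: int, total_steps: int, lr_init: float) -> float:
    """Cosine decay from lr_init down to lr_init / 10 over total_steps,
    matching the distillation setup described above."""
    lr_min = lr_init / 10
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos

# e.g. the Qwen-32B distill starts at 6e-5 and ends at 6e-6
```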
Deployment
Inference: HuggingFace Transformers is not directly supported. Use vLLM or SGLang:

```shell
# Full model (688.6 GB of weights -- needs a deployment with enough
# aggregate GPU memory, e.g. 8x H200 or a multi-node H100 cluster)
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8

# Distilled models (standard Qwen/Llama architectures)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768
```
Usage recommendations (from paper):
- Temperature: 0.5–0.7 (0.6 recommended) to avoid repetitive or incoherent output
- No system prompt: all instructions go in the user message
- Force thinking by prefilling the assistant turn with `<think>\n`
- Zero-shot only: few-shot prompting consistently degrades performance
- For math: include “Please reason step by step, and put your final answer within \boxed{}”
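The recommendations above can be bundled into a request builder for an OpenAI-compatible endpoint (the payload shape targets a vLLM/SGLang server; field support for assistant-prefill continuation varies by server, so treat this as a sketch):

```python
def build_r1_request(question: str, *, math: bool = False) -> dict:
    """Build a chat request following the usage recommendations above:
    no system prompt, zero-shot, temperature 0.6 / top-p 0.95, and a
    <think> prefill on the assistant turn to force reasoning."""
    if math:
        question += ("\nPlease reason step by step, and put your final "
                     "answer within \\boxed{}.")
    return {
        "model": "deepseek-ai/DeepSeek-R1",
        "temperature": 0.6,
        "top_p": 0.95,
        "messages": [
            {"role": "user", "content": question},          # no system message
            {"role": "assistant", "content": "<think>\n"},  # prefill to force thinking
        ],
    }
```

Everything, including behavioral instructions, goes into the single user message, since the model was not trained to follow a separate system prompt.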
API: Available on OpenRouter ($0.70/M input, $2.50/M output) and DeepSeek’s own platform at platform.deepseek.com. Chat interface at chat.deepseek.com with “DeepThink” toggle.
Limitations
From the paper’s own assessment:
- No tool use — structured output is suboptimal; the model cannot leverage search engines, calculators, or other tools. The paper notes this “will be addressed in the next version.”
- Token inefficiency — overthinks simple questions, generating excessive reasoning tokens where brief answers suffice.
- Language mixing — optimized for Chinese and English only. Other languages trigger English reasoning even when the query is in another language.
- Prompt sensitivity — few-shot prompting consistently degrades performance. Use zero-shot with direct problem descriptions.
- Limited SWE gains — long evaluation times prevented large-scale RL on software engineering tasks, so R1 shows limited improvement over V3 on SWE benchmarks.
- Reward hacking — model-based preference rewards are susceptible to exploitation; the paper limited preference RL to only 400 steps to mitigate this.
Community
- 91.9K GitHub stars — among the most starred AI repositories ever
- 1M+ HuggingFace downloads, 13,111 likes
- MIT license with explicit distillation permission spawned an ecosystem of community quantizations, fine-tunes, and derivative models
- The release triggered a measurable dip in AI-related stock indices, demonstrating competitive impact
References
- HuggingFace huggingface.co/deepseek-ai/DeepSeek-R1
- GitHub github.com/deepseek-ai/DeepSeek-R1
- Paper arxiv.org/abs/2501.12948