DeepSeek · 684B · MIT · text-generation

DeepSeek R1

Released January 20, 2025

Context Window 128K tokens
≈ 96 pages of text
1.0M Downloads
13.1K Likes
688.6 GB Disk Size
91.9K GitHub ★

Pricing

Input $0.70 per million tokens
Output $2.50 per million tokens

About

DeepSeek-R1 is a 671 billion parameter reasoning model released January 20, 2025 under the MIT license. Built on the DeepSeek-V3-Base backbone, it uses a Mixture-of-Experts architecture that activates only 37 billion parameters per token. The model supports a 128K token context window with generation up to 32,768 tokens. At 688.6 GB on disk, it is one of the largest open-weight models available — and one of the few to match OpenAI’s o1 on reasoning benchmarks while being fully open-source with explicit distillation rights.

Why it matters

DeepSeek-R1 matters for three reasons, each validated by the paper (arXiv:2501.12948):

Pure RL reasoning. The precursor model, R1-Zero, was trained exclusively with reinforcement learning — no supervised fine-tuning at all. It naturally developed self-verification, reflection, and extended chain-of-thought reasoning. On AIME 2024, R1-Zero’s pass@1 climbed from 15.6% to 77.9% during training, surpassing average human competitor performance. This was the first open research demonstrating that reasoning can be incentivized purely through RL, without human-labeled reasoning trajectories.

Frontier performance at $294K. The total training cost for R1 was approximately $294K in H800 GPU-hours, a fraction of what comparable proprietary models are estimated to cost. R1 matches or exceeds OpenAI o1 on MATH-500 (97.3 vs 96.4), AIME 2024 (79.8 vs 79.2), and LiveCodeBench (65.9 vs 63.4), and scores 87.6 on AlpacaEval 2.0, for which no o1 score is reported.

Open distillation. The MIT license explicitly allows using R1’s outputs to train other models. The distilled R1-Distill-Qwen-32B outperforms OpenAI o1-mini on multiple benchmarks — a 32B open model beating a frontier proprietary reasoning system.

Architecture

| Spec | Value |
| --- | --- |
| Total Parameters | 671B |
| Activated per Token | 37B (MoE) |
| Base Model | DeepSeek-V3-Base |
| V3 Architecture | MoE with Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, Multi-Token Prediction (MTP) |
| V3-Base Pretraining | 14.8T tokens (plain web pages + e-books, no synthetic data) |
| Context Length | 128K tokens |
| Max Generation | 32,768 tokens |
| Languages | Primarily Chinese and English |
| Disk Size | 688.6 GB |
| License | MIT |

The MoE design activates only 37B of 671B parameters per token — providing the knowledge capacity of a massive model at a fraction of the inference cost. The architecture inherits DeepSeek-V3’s innovations: Multi-head Latent Attention for efficient inference, auxiliary-loss-free load balancing for stable expert routing, and Multi-Token Prediction for improved throughput.
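
Sparse activation comes from routing each token to a small subset of experts. The sketch below is a generic top-k softmax router for illustration only; DeepSeek-V3's actual router uses sigmoid affinity scores with bias-adjusted, auxiliary-loss-free balancing, so treat the details here as assumptions.

```python
import math
import random

def top_k_route(router_logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Simplified illustration of MoE routing; the real DeepSeek-V3 router uses
    sigmoid scoring with bias-based (auxiliary-loss-free) load balancing.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Only k experts run per token, so compute scales with k, not the expert count.
logits = [random.gauss(0, 1) for _ in range(256)]
routes = top_k_route(logits, k=8)
assert len(routes) == 8
assert abs(sum(w for _, w in routes) - 1.0) < 1e-9
```

This is the source of the 37B-of-671B figure: parameter count buys knowledge capacity, while per-token FLOPs track only the activated experts.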

Training

The R1-Zero experiment (pure RL, no SFT)

Before R1, DeepSeek trained R1-Zero by applying GRPO (Group Relative Policy Optimization) directly to the V3-Base model with zero supervised fine-tuning.
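
GRPO's key move is a critic-free advantage: each of the G answers sampled for a question is scored against the group's own mean reward. A minimal sketch of that advantage computation (the full objective adds the clipped importance ratio and KL penalty, omitted here):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: score each of G sampled answers to the same
    question against the group mean, normalized by the group std.
    No learned value model (critic) is needed, unlike PPO."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# 16 samples per question, as in the R1-Zero setup: correct answers (reward 1)
# get positive advantage, incorrect ones (reward 0) negative.
adv = grpo_advantages([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0])
assert max(adv) > 0 > min(adv)
```

Because the baseline is the group mean rather than a value network's estimate, the critic and its training cost disappear entirely.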

| R1-Zero Detail | Value |
| --- | --- |
| Algorithm | GRPO (eliminates PPO's value model/critic) |
| Hardware | 64×8 H800 GPUs |
| Training Time | ~198 hours |
| Steps | 10,400 (1.6 epochs) |
| Samples per Question | 16 |
| Batch Size | 512 (32 questions × 16) |
| Max Length | 32,768 tokens (→ 65,536 at step 8.2K) |
| Temperature | 1.0 |
| Learning Rate | 3e-6 |
| KL Coefficient | 0.001 |
| Clip Ratio (ε) | 10 |
| Reward | Rule-based only (accuracy + format). No neural reward models. |

The reward was deliberately simple: accuracy (is the answer correct?) and format (did the model use <think> and <answer> tags?). No neural reward models — the paper notes these are susceptible to reward hacking at scale.
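
A hypothetical sketch of those two rule-based signals; the exact answer-extraction logic and the weighting between the two rewards are not specified in the paper, so both are assumptions here:

```python
import re

def rule_based_reward(completion, reference_answer):
    """Hypothetical sketch of R1-Zero's two rule-based reward signals:
    format (did the model use <think>/<answer> tags?) and accuracy
    (does the extracted answer match?). No neural reward model involved."""
    fmt = re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*",
                       completion)
    format_reward = 1.0 if fmt else 0.0
    m = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    accuracy_reward = 1.0 if m and m.group(1).strip() == reference_answer else 0.0
    return format_reward + accuracy_reward

good = "<think>2+2 is 4</think><answer>4</answer>"
assert rule_based_reward(good, "4") == 2.0
assert rule_based_reward("just 4", "4") == 0.0
```

Because both checks are deterministic string operations, there is no learned reward model for the policy to exploit, which is exactly the reward-hacking concern the paper cites.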

R1-Zero’s AIME 2024 performance rose from 15.6% to 77.9% pass@1 during training. With consensus@16 decoding, it reached 86.7%. The model spontaneously developed an “aha moment” — a sudden increase in using the word “wait” during reflections, marking a shift in reasoning patterns. However, R1-Zero exhibited endless repetition, language mixing (English/Chinese), and poor readability.
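
Consensus@16 is plain majority voting over sampled answers. A minimal sketch, assuming final answers have already been extracted from each completion:

```python
from collections import Counter

def consensus_vote(answers):
    """Majority voting over multiple sampled answers (cons@16 in the paper):
    sample N completions, extract each final answer, return the most common."""
    return Counter(answers).most_common(1)[0][0]

# 16 samples of the same question; the modal answer wins even when any single
# sample (pass@1) is unreliable.
samples = ["336"] * 9 + ["112"] * 4 + ["84"] * 3
assert consensus_vote(samples) == "336"
```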

R1 pipeline (4 stages)

Stage 1 — Cold-Start SFT: Thousands of examples (not millions) with conversational, human-aligned thinking processes. Principles: concise paragraphs, conversational tone, no markdown formatting, understand complete user context. Human annotators verified accuracy.

Stage 2 — RL Stage 1 (Reasoning): Same GRPO hyperparameters as R1-Zero. Adds a language consistency reward (proportion of target-language words in the chain-of-thought). Only reasoning prompts with rule-based rewards.
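
The language consistency reward is just a fraction. The sketch below assumes word-level tokenization and a caller-supplied language checker; the paper does not specify either, and the pure-ASCII heuristic in the usage example is illustrative only:

```python
def language_consistency_reward(cot_tokens, target_lang_checker):
    """Sketch of the Stage-2 language consistency reward: the fraction of
    chain-of-thought words in the target language. The exact tokenization
    and weighting used in the paper are not specified here."""
    if not cot_tokens:
        return 0.0
    in_target = sum(1 for tok in cot_tokens if target_lang_checker(tok))
    return in_target / len(cot_tokens)

# Hypothetical checker: treat pure-ASCII tokens as English.
def is_english(tok):
    return tok.isascii()

assert language_consistency_reward(["wait", "the", "answer"], is_english) == 1.0
assert language_consistency_reward(["wait", "答案", "is", "4"], is_english) == 0.75
```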

Stage 3 — SFT on 800K samples: Rejection sampling from the Stage 2 checkpoint. ~600K reasoning + ~200K non-reasoning samples.

| Domain | Samples | Avg Tokens |
| --- | --- | --- |
| Math | 395,285 | 6,094 |
| Code | 211,129 | 7,436 |
| STEM | 10,124 | 4,929 |
| Logic | 10,395 | 2,739 |
| General | 177,812 | 1,420 |

Reasoning data was filtered to remove mixed languages, long paragraphs, and code blocks within chain-of-thought. Non-reasoning data reused portions of the DeepSeek-V3 SFT dataset plus software engineering data.
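
A hypothetical rejection filter mirroring those three criteria. The thresholds and the ASCII-ratio heuristic for mixed language are my assumptions, not values from the paper:

```python
CODE_FENCE = "`" * 3  # markdown code-fence marker, built up to stay readable here

def keep_reasoning_sample(cot, ascii_ratio=0.95, max_paragraph_chars=2000):
    """Hypothetical Stage-3 rejection filter: drop chains-of-thought that
    contain code blocks, mixed languages, or very long paragraphs.
    Thresholds are illustrative, not from the paper."""
    if CODE_FENCE in cot:                               # code block inside CoT
        return False
    ascii_chars = sum(1 for c in cot if c.isascii())
    if ascii_chars / max(len(cot), 1) < ascii_ratio:    # crude mixed-language check
        return False
    if any(len(p) > max_paragraph_chars for p in cot.split("\n\n")):
        return False
    return True

assert keep_reasoning_sample("First, factor the expression.\n\nSo x = 3.")
assert not keep_reasoning_sample("Mixed: " + "答案是三" * 5)
```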

Stage 4 — RL Stage 2 (Alignment): 1,700 total steps at temperature 0.7 (higher caused incoherent generation). Rule-based rewards for reasoning, model-based rewards for general data. General instruction data and preference rewards introduced only in the final 400 steps to limit reward hacking.

Reward models

  • Helpful RM: 66,000 preference pairs, pairwise loss, batch 256, LR 6e-6, 1 epoch
  • Safety RM: 106,000 prompts annotated safe/unsafe, point-wise classification

Training cost

| Phase | H800 GPU-hours | Cost ($2/GPU-hr) |
| --- | --- | --- |
| R1-Zero | 101,000 | $202K |
| SFT data creation | 5,000 | $10K |
| R1 | 41,000 | $82K |
| Total | 147,000 | ~$294K |

R1-Zero: 64×8 H800 GPUs for ~198 hours. R1: same cluster for ~80 hours (just over three days).
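
The cost arithmetic can be checked directly; the flat $2/GPU-hour rate and the 512-GPU (64 × 8) cluster size are taken from the figures above:

```python
# Reproducing the cost table's arithmetic at the stated $2/H800-GPU-hour rate.
RATE = 2  # dollars per GPU-hour, assumed flat across all phases
phases = {"R1-Zero": 101_000, "SFT data creation": 5_000, "R1": 41_000}

total_hours = sum(phases.values())
total_cost = total_hours * RATE
assert total_hours == 147_000
assert total_cost == 294_000

# Wall-clock time on a 512-GPU (64 nodes x 8 GPUs) cluster:
assert round(phases["R1-Zero"] / 512) == 197   # ~198 hours for R1-Zero
assert round(phases["R1"] / 512) == 80         # ~80 hours for R1
```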

Benchmarks

All scores from the paper. Temperature 0.6, top-p 0.95, 64 responses per query for pass@1.

| Benchmark | DeepSeek R1 | OpenAI o1 | Claude 3.5 Sonnet | GPT-4o |
| --- | --- | --- | --- | --- |
| MATH-500 | 97.3 | 96.4 | 78.3 | 74.6 |
| AIME 2024 | 79.8 | 79.2 | 16.0 | 9.3 |
| MMLU | 90.8 | 91.8 | 88.3 | 87.2 |
| MMLU-Pro | 84.0 | — | 78.0 | 72.6 |
| GPQA Diamond | 71.5 | 75.7 | 65.0 | 49.9 |
| LiveCodeBench | 65.9 | 63.4 | 33.8 | 34.2 |
| Codeforces Rating | 2029 | 2061 | 717 | 759 |
| SWE Verified | 49.2 | 48.9 | 50.8 | 38.8 |
| AlpacaEval 2.0 | 87.6 | — | 52.0 | 51.1 |
| ArenaHard | 92.3 | — | 85.2 | 80.4 |
| DROP (3-shot F1) | 92.2 | 90.2 | 88.3 | 83.7 |
| SimpleQA | 30.1 | 47.0 | 28.4 | 38.2 |
| IFEval | 83.3 | — | 86.5 | 84.3 |
| Aider-Polyglot | 53.3 | 61.7 | 45.3 | 16.0 |

R1 beats o1 on MATH-500, AIME 2024, LiveCodeBench, DROP, and SWE Verified; o1 wins on MMLU, GPQA Diamond, Codeforces, SimpleQA (factual recall), and Aider-Polyglot. The overall picture: R1 is a genuine peer to o1 with complementary strengths, stronger on competition math and general code generation, weaker on factual recall and multi-language coding.

Distillation

Using the same 800K samples from Stage 3, DeepSeek fine-tuned six smaller dense models for 2–3 epochs with cosine-decay learning rate (to 1/10 of initial), max context 32,768 tokens, batch size 64.

| Model | Base | LR | AIME 2024 | MATH-500 | GPQA Diamond | Codeforces |
| --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1e-4 | 28.9 | 83.9 | 33.8 | 954 |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 8e-5 | 55.5 | 92.8 | 49.1 | 1189 |
| R1-Distill-Llama-8B | Llama-3.1-8B | 5e-5 | 50.4 | 89.1 | 49.0 | 1205 |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 7e-5 | 69.7 | 93.9 | 59.1 | 1481 |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 6e-5 | 72.6 | 94.3 | 62.1 | 1691 |
| R1-Distill-Llama-70B | Llama-3.3-70B | 2e-5 | 70.0 | 94.5 | 65.2 | 1633 |

The Qwen-32B distill is the standout: it beats OpenAI o1-mini on AIME 2024 (72.6 vs 63.6) and MATH-500 (94.3 vs 90.0), though o1-mini keeps an edge on Codeforces (1820 vs 1691). The paper's broader finding is that distilling reasoning patterns from a large model produces better small-model results than running RL on the small models directly.
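
The distillation runs decay the learning rate to 1/10 of its peak on a cosine schedule. A minimal sketch of that schedule; warmup and the exact step counts are not specified in the source and are omitted here:

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr, final_ratio=0.1):
    """Cosine learning-rate decay from peak_lr down to final_ratio * peak_lr,
    matching the distillation setup's decay to 1/10 of the initial LR."""
    floor = peak_lr * final_ratio
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return floor + (peak_lr - floor) * cos

peak = 6e-5  # the R1-Distill-Qwen-32B setting from the table above
assert abs(cosine_decay_lr(0, 1000, peak) - peak) < 1e-12
assert abs(cosine_decay_lr(1000, 1000, peak) - peak * 0.1) < 1e-12
```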

Deployment

Inference: HuggingFace Transformers is not directly supported. Use vLLM or SGLang:

# Full model (requires 8x A100-80GB or equivalent)
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8

# Distilled models (standard Qwen/Llama architectures)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768

Usage recommendations (from paper):

  • Temperature: 0.5–0.7 (0.6 recommended) — settings outside this range cause endless repetition or incoherent output
  • No system prompt — all instructions go in the user message
  • Force thinking by prefilling assistant with <think>\n
  • Zero-shot only — few-shot prompting consistently degrades performance
  • For math: include “Please reason step by step, and put your final answer within \boxed{}”
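
Assuming an OpenAI-compatible chat endpoint (such as the one vLLM exposes), the recommendations translate into a request payload like this sketch. The model name is illustrative, and prefilling via a trailing assistant message only works if the server supports continuing a partial assistant turn:

```python
# Building a request that follows the usage recommendations: no system prompt,
# temperature 0.6, the boxed-answer instruction for math, and a <think> prefill
# on the assistant turn.
question = ("What is 12 * 7? Please reason step by step, "
            "and put your final answer within \\boxed{}.")

payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "temperature": 0.6,
    "top_p": 0.95,
    "messages": [
        {"role": "user", "content": question},          # everything goes in the user turn
        {"role": "assistant", "content": "<think>\n"},  # prefill to force thinking
    ],
}

assert not any(m["role"] == "system" for m in payload["messages"])
assert payload["messages"][-1]["content"].startswith("<think>")
```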

API: Available on OpenRouter ($0.70/M input, $2.50/M output) and DeepSeek’s own platform at platform.deepseek.com. Chat interface at chat.deepseek.com with “DeepThink” toggle.

Limitations

From the paper’s own assessment:

  1. No tool use — structured output is suboptimal; the model cannot leverage search engines, calculators, or other tools. The paper notes this “will be addressed in the next version.”
  2. Token inefficiency — overthinks simple questions, generating excessive reasoning tokens where brief answers suffice.
  3. Language mixing — R1 is optimized for Chinese and English only; queries in other languages often trigger English reasoning and mixed-language output.
  4. Prompt sensitivity — few-shot prompting consistently degrades performance. Use zero-shot with direct problem descriptions.
  5. Limited SWE gains — long evaluation times prevented large-scale RL on software engineering tasks, so R1 shows limited improvement over V3 on SWE benchmarks.
  6. Reward hacking — model-based preference rewards are susceptible to exploitation; the paper limited preference RL to only 400 steps to mitigate this.

Community

  • 91.9K GitHub stars — among the most starred AI repositories ever
  • 1M+ HuggingFace downloads, 13,111 likes
  • MIT license with explicit distillation permission spawned an ecosystem of community quantizations, fine-tunes, and derivative models
  • The release triggered a measurable dip in AI-related stock indices, demonstrating competitive impact

References

  • 🤗 HuggingFace huggingface.co/deepseek-ai/DeepSeek-R1
  • ⌨️ GitHub github.com/deepseek-ai/DeepSeek-R1
  • 📄 Paper arxiv.org/abs/2501.12948