Phi-4
Released December 12, 2024
About
Phi-4 is a 14 billion parameter language model from Microsoft Research, released December 12, 2024 under the MIT license. The paper (arXiv:2412.08905) frames the model around a single thesis: data quality can substitute for model scale. Phi-4 surpasses its own teacher model GPT-4o on GPQA (56.1 vs 50.6) and MATH (80.4 vs 74.6) — a 14B student exceeding a frontier teacher on STEM reasoning, which the paper argues goes beyond simple distillation. It was pretrained on ~9.8T tokens with 40% synthetic data, trained on 1,920 H100 GPUs for 21 days.
Why it matters
Phi-4’s performance on reasoning benchmarks is disproportionate to its size:
- GPQA (graduate-level STEM): 56.1% — beats GPT-4o (50.6%), GPT-4o-mini (40.9%), Qwen 2.5 14B (42.9%), and even Llama-3.3-70B (49.1%)
- MATH: 80.4% — beats GPT-4o (74.6%), GPT-4o-mini (73.0%), and its predecessor phi-3 (44.6%)
- MMLU: 84.8% — approaches Llama-3.3-70B (86.3%) and Qwen 2.5 72B (85.3%) despite being 5× smaller
- AMC 10/12 November 2024 (post-cutoff, contamination-proof): phi-4 outperforms GPT-4o, Gemini, and Llama-3.3-70B on fresh math competition data
The tradeoff is explicit. SimpleQA (factual recall) is 3.0% — the model deliberately learns to not attempt factual questions rather than hallucinate. The paper tracks this through post-training: the “not attempted” rate goes from 3.2% (base) to 81.1% (final), while incorrect answers drop from 90% to 15.8%. This is a design choice, not a failure.
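The abstention shift can be checked with quick arithmetic. The rates below are the percentages reported in the paper; the "correct" rate derived here is my own calculation from them, not a reported metric:

```python
# SimpleQA response breakdown (percent) before and after post-training,
# figures from the phi-4 paper's analysis
base = {"not_attempted": 3.2, "incorrect": 90.0}
final = {"not_attempted": 81.1, "incorrect": 15.8}

def correct_rate(r):
    # whatever is neither skipped nor wrong was answered correctly
    return 100.0 - r["not_attempted"] - r["incorrect"]

round(correct_rate(base), 1)   # base model: ~6.8% correct, 90% wrong
round(correct_rate(final), 1)  # final model: ~3.1% correct, 15.8% wrong
```

The correct-answer rate barely moves; almost all of the 90% of wrong answers are converted into abstentions rather than into correct answers.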
At 29.3 GB in BF16 with an MIT license (no restrictions on commercial use, redistribution, or modification), it is one of the most capable models that, once quantized, fits on a single consumer GPU.
Architecture
| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer (Phi3ForCausalLM) |
| Parameters | 14B |
| Tokenizer | tiktoken (100,352 padded vocab, multilingual support) |
| Context Length | 4K (pretraining) → 16K (after midtraining) |
| Attention | Full attention (no sliding window, unlike phi-3-medium) |
| Prompt Format | ChatML variant (<|im_start|>, <|im_end|>, <|im_sep|>) |
| Disk Size | 29.3 GB |
| License | MIT |
Architecture closely follows phi-3-medium with two changes: tiktoken tokenizer for better multilingual support, and full attention replacing the 2K sliding window.
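The ChatML-variant prompt format can be sketched as a small template function. The special-token names come from the table above; the exact whitespace handling is an assumption, and the tokenizer's own chat template is authoritative:

```python
def phi4_prompt(messages):
    # phi-4's ChatML variant: role follows <|im_start|>,
    # content follows <|im_sep|>, each turn closes with <|im_end|>
    parts = [
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
        for m in messages
    ]
    parts.append("<|im_start|>assistant<|im_sep|>")  # generation prompt
    return "".join(parts)

phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
```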
Training
Pretraining
| Detail | Value |
|---|---|
| Tokens | ~9.8T |
| Hardware | 1,920 H100-80GB GPUs |
| Duration | 21 days |
| Peak Learning Rate | 0.0003 |
| Weight Decay | 0.1 |
| Global Batch Size | 5,760 |
| Schedule | Linear warmup and decay |
| Data Cutoff | June 2024 |
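A linear warmup-and-decay schedule with the table's peak learning rate can be sketched as follows; the warmup length is a hypothetical parameter, since the paper does not specify it:

```python
PEAK_LR = 3e-4  # peak learning rate from the table above

def lr_at(step, total_steps, warmup_steps=2000):
    # linear ramp from 0 up to the peak, then linear decay back to 0
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * (total_steps - step) / (total_steps - warmup_steps)
```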
Data mixture (from paper Table 5)
| Source | % of Training | Unique Tokens | Epochs |
|---|---|---|---|
| Synthetic | 40% | 290B | 13.8 |
| Code | 20% | 820B | 2.4 |
| Web | 15% | 1.3T | 1.2 |
| Web rewrites | 15% | 290B | 5.2 |
| Acquired sources | 10% | 580B | 1.7 |
The most striking detail: synthetic data accounts for only 290B unique tokens but is seen 13.8 times. The paper shows that additional epochs on synthetic data outperform adding fresh web tokens (Figure 2); synthetic data doesn't plateau the way organic data does. The paper also tested a 13B model trained entirely on synthetic data: it improved on most benchmarks but dropped 14.8 points on TriviaQA (knowledge), confirming that synthetic data builds reasoning at the expense of factual breadth.
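The epoch counts in Table 5 follow directly from the mixture shares and unique-token counts. A quick check, with token counts in billions and the approximate ~9.8T total:

```python
TOTAL = 9800  # ~9.8T pretraining tokens, in billions

mixture = {  # source: (share of training, unique tokens in billions)
    "synthetic":    (0.40, 290),
    "code":         (0.20, 820),
    "web":          (0.15, 1300),
    "web_rewrites": (0.15, 290),
    "acquired":     (0.10, 580),
}

# epochs = tokens allocated to the source / unique tokens it contains
epochs = {k: TOTAL * share / unique for k, (share, unique) in mixture.items()}
# e.g. synthetic: 9800 * 0.40 / 290 ≈ 13.5 epochs, matching the table's
# 13.8 up to rounding in the ~9.8T total
```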
The team created 50 broad types of synthetic datasets using:
- Multi-agent prompting
- Self-revision workflows (model critiques and improves its own outputs)
- Instruction reversal (take existing code → generate instructions → pair together)
- Seeds curated from web pages, code repos, Q&A platforms, books, arXiv, PubMed Central
- Plurality-based difficulty filtering (discard questions where all or no model answers agree)
- Code validated through execution loops and tests
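The plurality-based difficulty filter from the list above can be sketched as follows; the agreement logic is my reading of the paper's description, not its exact implementation:

```python
from collections import Counter

def keep_question(model_answers):
    # discard questions that are trivial (all models agree) or
    # hopeless/ambiguous (no two models agree); keep the middle ground
    top_count = Counter(model_answers).most_common(1)[0][1]
    return 1 < top_count < len(model_answers)

keep_question(["A", "A", "A", "A"])  # all agree: too easy, drop
keep_question(["A", "B", "C", "D"])  # none agree: drop
keep_question(["A", "A", "B", "C"])  # partial agreement: keep
```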
Midtraining (context extension)
Context extended from 4K to 16K tokens via:
- RoPE base frequency increased to 250,000
- Learning rate dropped 10× from pretraining
- 250B tokens: 30% long-context curated data + 70% from pretraining distribution
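The effect of raising the RoPE base frequency shows up in the per-dimension rotation wavelengths; longer wavelengths give positions past 4K distinct phases. A sketch (head dimension 128 is an assumption for illustration, not stated in this section):

```python
import math

def rope_wavelengths(base, head_dim=128):
    # wavelength (in tokens) of each rotary frequency pair:
    # lambda_i = 2*pi * base**(2i / head_dim)
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

short = rope_wavelengths(10_000)   # a common default base
long_ = rope_wavelengths(250_000)  # phi-4's midtraining base
# the fastest pair is unchanged, while the slowest pair's
# wavelength grows by roughly 24x
```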
Post-training (3 stages)
Stage 1 — SFT: ~8B tokens at learning rate 1e-6, covering math, coding, reasoning, conversation, model identity, safety, and 40 languages.
Stage 2 — DPO with Pivotal Token Search (novel): The paper introduces PTS, which identifies individual tokens in a response where the next token choice has an outsized effect on solution correctness. Instead of creating DPO pairs from full responses, PTS creates pairs targeting single pivotal tokens. Example from the paper: in a math solution, the choice between “cross-multiplying” and “multiplying both sides by” can shift the success probability from 0.42 to 0.93. PTS data: 133K multiple-choice Q&A, 77K math, 16K Python, 22K other code, 3K safety.
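The pivotal-token idea can be sketched as: estimate the success probability before and after each token, and flag tokens where the probability jumps. The estimator below is a toy stand-in mirroring the paper's worked example; in the paper, the probability is estimated by sampling completions from the model and checking the final answers:

```python
def find_pivotal_tokens(tokens, p_success, threshold=0.2):
    # p_success(prefix) estimates P(correct final answer | prefix)
    pivots = []
    p_before = p_success([])
    for i, tok in enumerate(tokens):
        p_after = p_success(tokens[: i + 1])
        if abs(p_after - p_before) >= threshold:
            pivots.append((i, tok, p_before, p_after))
        p_before = p_after
    return pivots

# toy estimator mirroring the paper's 0.42 -> 0.93 example
def toy_p(prefix):
    return 0.93 if "cross-multiplying" in prefix else 0.42

find_pivotal_tokens(["By", "cross-multiplying", "we", "get"], toy_p)
```

Each flagged token yields a DPO pair contrasting the pivotal token with an alternative continuation, rather than contrasting two full responses.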
Stage 3 — Judge-guided DPO: ~850K pairs where responses are generated by GPT-4o, GPT-4t, and phi-4 itself, then scored by GPT-4o on accuracy, style, and detail.
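Pair construction for this stage can be sketched as: score each candidate response with the judge, then pair the best against the worst. The judge here is any scoring callable; the actual rubric (accuracy, style, detail) and its weighting live inside the GPT-4o prompt:

```python
def build_dpo_pair(prompt, responses, judge_score):
    # judge_score(prompt, response) -> float, standing in for the
    # GPT-4o judge's combined rating
    ranked = sorted(responses,
                    key=lambda r: judge_score(prompt, r),
                    reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```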
Benchmarks
From the paper’s Table 1 (OpenAI simple-evals framework, temperature 0.5):
| Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 14B | GPT-4o-mini | GPT-4o |
|---|---|---|---|---|---|
| MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 88.1 |
| GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 50.6 |
| MATH | 80.4 | 44.6 | 75.6 | 73.0 | 74.6 |
| HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 90.6 |
| MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 90.4 |
| SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 39.4 |
| DROP | 75.5 | 68.3 | 85.5 | 79.3 | 80.9 |
| MMLU-Pro | 70.4 | 51.3 | 63.2 | 63.4 | 73.0 |
| ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 75.6 |
| IFEval | 63.0 | 57.9 | 78.7 | 80.0 | 84.8 |
The phi-3 → phi-4 jump at the same parameter count: +35.8 on MATH, +24.9 on GPQA, +27.1 on MGSM, +19.1 on MMLU-Pro. These gains come entirely from the data pipeline and post-training innovations, not architectural changes.
Open LLM Leaderboard (community eval)
| Benchmark | Score |
|---|---|
| Average | 30.4 |
| BBH | 52.4 |
| MMLU-Pro | 47.6 |
| MATH | 31.6 |
| MUSR | 23.8 |
| GPQA | 20.8 |
| IFEval | 5.9 |
The Open LLM Leaderboard scores are substantially lower than simple-evals, particularly IFEval (5.9 vs 63.0). This discrepancy likely reflects differences in evaluation harness formatting — phi-4 was optimized for simple-evals prompting.
Deployment
Hardware
| Format | Size | Minimum Hardware |
|---|---|---|
| Full (BF16) | 29.3 GB | A100-40GB (exceeds 24 GB consumer cards) |
| 8-bit | ~15 GB | RTX 3090/4080 |
| 4-bit | ~8 GB | RTX 3070 (8 GB) |
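The table's sizes follow from parameter count times bytes per weight. Quantized formats add some overhead for scales and embedding layers, so these are lower bounds, and KV cache and activations come on top:

```python
PARAMS = 14e9  # 14B parameters

def weight_gb(bits):
    # raw weight storage only, in GB (decimal)
    return PARAMS * bits / 8 / 1e9

weight_gb(16)  # 28.0 GB for BF16
weight_gb(8)   # 14.0 GB for 8-bit
weight_gb(4)   #  7.0 GB for 4-bit
```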
Transformers
import transformers

# load with automatic dtype selection and device placement
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
]

outputs = pipeline(messages, max_new_tokens=512)
# generated_text holds the full conversation; the last message is the reply
print(outputs[0]["generated_text"][-1]["content"])
API
Available on OpenRouter ($0.06/M input, $0.14/M output) — one of the cheapest capable models available via API.
Limitations
- Factual recall: SimpleQA 3.0% — the model is trained to abstain rather than guess, but this means it can’t answer many factual questions
- Context window: 16K tokens — limiting for document-level tasks where competitors offer 128K+
- Language: Primarily English. Multilingual data is ~8% of pretraining; SFT covers 40 languages but English dominates
- IFEval: 63.0% — instruction following is notably weaker than Qwen 2.5 14B (78.7%) and GPT-4o-mini (80.0%)
- Code: Strong in Python, less reliable in other languages
The Phi series
| Model | Parameters | Key Innovation |
|---|---|---|
| Phi-1 (2023) | 1.3B | "Textbooks Are All You Need" — synthetic data for code |
| Phi-2 (2023) | 2.7B | Scaled synthetic data approach |
| Phi-3 (2024) | 3.8B–14B | Extended to general reasoning |
| Phi-4 (2024) | 14B | 40% synthetic pretraining, Pivotal Token Search, surpasses teacher on STEM |
The trajectory validates Microsoft's thesis: high-quality synthetic data lets a 14B model match models with 5× more parameters on reasoning tasks. The Pivotal Token Search technique for DPO, which targets individual decision-point tokens rather than full responses, is a contribution that extends beyond phi-4 itself.
References
- HuggingFace huggingface.co/microsoft/phi-4
- GitHub github.com/microsoft/Phi-3CookBook
- Paper arxiv.org/abs/2412.08905