Microsoft · 14B · MIT · text-generation

Phi-4

Released December 12, 2024

Context Window 16K tokens
≈ 12 pages of text
880.1K Downloads
2.2K Likes
29.3 GB Disk Size
3.7K GitHub ★

Pricing

Input $0.06 per million tokens
Output $0.14 per million tokens

Benchmarks

Open LLM Leaderboard v2

IFEval 5.9
BBH 52.4
MATH Lvl 5 31.6
GPQA 20.8
MUSR 23.8
MMLU-PRO 47.6

Average: 30.40

About

Phi-4 is a 14-billion-parameter language model from Microsoft Research, released December 12, 2024 under the MIT license. The paper (arXiv:2412.08905) frames the model around a single thesis: data quality can substitute for model scale. Phi-4 surpasses its own teacher model GPT-4o on GPQA (56.1 vs 50.6) and MATH (80.4 vs 74.6) — a 14B student exceeding a frontier teacher on STEM reasoning, which the paper argues goes beyond simple distillation. It was pretrained on ~9.8T tokens (40% synthetic) over 21 days on 1,920 H100 GPUs.

Why it matters

Phi-4’s performance on reasoning benchmarks is disproportionate to its size:

  • GPQA (graduate-level STEM): 56.1% — beats GPT-4o (50.6%), GPT-4o-mini (40.9%), Qwen 2.5 14B (42.9%), and even Llama-3.3-70B (49.1%)
  • MATH: 80.4% — beats GPT-4o (74.6%), GPT-4o-mini (73.0%), and its predecessor phi-3 (44.6%)
  • MMLU: 84.8% — approaches Llama-3.3-70B (86.3%) and Qwen 2.5 72B (85.3%) despite being 5× smaller
  • AMC 10/12 November 2024 (post-cutoff, contamination-proof): phi-4 outperforms GPT-4o, Gemini, and Llama-3.3-70B on fresh math competition data

The tradeoff is explicit. SimpleQA (factual recall) is 3.0% — the model is trained to abstain from factual questions rather than hallucinate. The paper tracks this through post-training: the “not attempted” rate rises from 3.2% (base) to 81.1% (final), while incorrect answers drop from 90% to 15.8%. This is a design choice, not a failure.
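
Assuming the three SimpleQA outcomes (correct, incorrect, not attempted) partition the eval, the implied correct-answer rates can be back-computed from the two rates the paper reports — a sanity check, not a figure from the paper:

```python
# Sanity check: SimpleQA outcomes should sum to 100%, so the
# "correct" rate is implied by the two rates the paper reports.
def implied_correct(not_attempted_pct: float, incorrect_pct: float) -> float:
    return round(100.0 - not_attempted_pct - incorrect_pct, 1)

base = implied_correct(3.2, 90.0)    # base model, before post-training
final = implied_correct(81.1, 15.8)  # final model
print(base, final)
```

The implied final correct rate (~3.1%) lines up with the reported SimpleQA score of 3.0 — post-training trades a few correct guesses for a large drop in confidently wrong answers.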

At 29.3 GB with an MIT license (no restrictions on commercial use, redistribution, or modification), it’s one of the most capable openly licensed models available; quantized, it fits on a single consumer GPU.

Architecture

Spec           | Value
Architecture   | Dense decoder-only Transformer (Phi3ForCausalLM)
Parameters     | 14B
Tokenizer      | tiktoken (100,352 padded vocab, multilingual support)
Context Length | 4K (pretraining) → 16K (after midtraining)
Attention      | Full attention (no sliding window, unlike phi-3-medium)
Prompt Format  | ChatML variant (<|im_start|>, <|im_end|>, <|im_sep|>)
Disk Size      | 29.3 GB
License        | MIT

Architecture closely follows phi-3-medium with two changes: tiktoken tokenizer for better multilingual support, and full attention replacing the 2K sliding window.
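
The special tokens above translate into a simple prompt layout. A minimal sketch of assembling a phi-4 prompt by hand — in practice the tokenizer's chat template should be used, and the exact token placement here follows the ChatML-variant convention, so treat it as illustrative:

```python
# Build a phi-4 prompt from chat messages using the ChatML-variant
# special tokens (<|im_start|>, <|im_sep|>, <|im_end|>).
def build_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
])
print(prompt)
```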

Training

Pretraining

Detail             | Value
Tokens             | ~9.8T
Hardware           | 1,920 H100-80GB GPUs
Duration           | 21 days
Peak Learning Rate | 0.0003
Weight Decay       | 0.1
Global Batch Size  | 5,760
Schedule           | Linear warmup and decay
Data Cutoff        | June 2024

Data mixture (from paper Table 5)

Source           | % of Training | Unique Tokens | Epochs
Synthetic        | 40%           | 290B          | 13.8
Code             | 20%           | 820B          | 2.4
Web              | 15%           | 1.3T          | 1.2
Web rewrites     | 15%           | 290B          | 5.2
Acquired sources | 10%           | 580B          | 1.7

The most striking detail: synthetic data accounts for only 290B unique tokens but is seen 13.8 times — the paper shows that repeating epochs on synthetic data outperforms adding fresh web tokens (Figure 2). Synthetic data doesn’t plateau the way organic data does. The paper also tested a 13B model trained entirely on synthetic data: it improved on most benchmarks but dropped 14.8 points on TriviaQA (knowledge), confirming that synthetic data builds reasoning at the expense of factual breadth.
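
The mixture numbers are internally consistent: unique tokens × epochs roughly equals each source's share of the ~9.8T total. A quick check:

```python
# Verify: unique_tokens * epochs ≈ share * 9.8T for each mixture row
# from the paper's Table 5.
TOTAL = 9.8e12
mixture = {                     # (share, unique tokens, epochs)
    "synthetic":    (0.40, 290e9, 13.8),
    "code":         (0.20, 820e9, 2.4),
    "web":          (0.15, 1.3e12, 1.2),
    "web rewrites": (0.15, 290e9, 5.2),
    "acquired":     (0.10, 580e9, 1.7),
}
for name, (share, unique, epochs) in mixture.items():
    effective = unique * epochs
    print(f"{name:12s} {effective / 1e12:.2f}T vs {share * TOTAL / 1e12:.2f}T")
```

Every row lands within a few percent of its nominal share, which is what you'd expect if the table values are rounded.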

50 broad types of synthetic datasets were created using:

  • Multi-agent prompting
  • Self-revision workflows (model critiques and improves its own outputs)
  • Instruction reversal (take existing code → generate instructions → pair together)
  • Seeds curated from web pages, code repos, Q&A platforms, books, arXiv, PubMed Central
  • Plurality-based difficulty filtering (discard questions where all or no model answers agree)
  • Code validated through execution loops and tests
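
The plurality-based filter is simple to state: keep a question only when independent model answers partially agree. A minimal sketch — the paper's actual sample counts and thresholds are not given here:

```python
from collections import Counter

# Plurality-based difficulty filter: a question is kept only when model
# answers neither all agree (too easy) nor all disagree (too hard or
# ambiguous). The plurality answer can then serve as a pseudo-label.
def keep_question(answers: list[str]) -> bool:
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 < top_count < len(answers)

print(keep_question(["4", "4", "4"]))  # unanimous: too easy, drop
print(keep_question(["4", "5", "6"]))  # no agreement: drop
print(keep_question(["4", "4", "7"]))  # partial plurality: keep
```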

Midtraining (context extension)

Context extended from 4K to 16K tokens via:

  • RoPE base frequency increased to 250,000
  • Learning rate dropped 10× from pretraining
  • 250B tokens: 30% long-context curated data + 70% from pretraining distribution
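
Raising the RoPE base stretches the rotary wavelengths so that distant positions remain distinguishable at 16K. The effect is easy to see numerically — a sketch, where the pretraining base of 10,000 and the head dimension of 128 are assumptions, not values stated in this card:

```python
import math

# Rotary embedding frequencies: theta_i = base^(-2i/d), so channel i has
# wavelength 2*pi*base^(2i/d). A larger base slows the low-frequency
# channels, extending the unambiguous position range.
def rope_wavelengths(base: float, dim: int = 128) -> list[float]:
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

short = rope_wavelengths(10_000)[-1]   # longest wavelength at base 10k
long_ = rope_wavelengths(250_000)[-1]  # longest wavelength at base 250k
print(f"{short:,.0f} vs {long_:,.0f} positions")
```

With base 250,000 the longest wavelength comfortably exceeds the 16K context, which is the point of the change.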

Post-training (3 stages)

Stage 1 — SFT: ~8B tokens at learning rate 1e-6, covering math, coding, reasoning, conversation, model identity, safety, and 40 languages.

Stage 2 — DPO with Pivotal Token Search (novel): The paper introduces PTS, which identifies individual tokens in a response where the next token choice has an outsized effect on solution correctness. Instead of creating DPO pairs from full responses, PTS creates pairs targeting single pivotal tokens. Example from the paper: in a math solution, the choice between “cross-multiplying” and “multiplying both sides by” can shift the success probability from 0.42 to 0.93. PTS data: 133K multiple-choice Q&A, 77K math, 16K Python, 22K other code, 3K safety.
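
In sketch form, PTS walks a response token by token, estimates p(correct | prefix) by sampling completions, and flags tokens where that estimate jumps. A minimal mock — `estimate_success` stands in for real model rollouts, and the 0.42 → 0.93 numbers echo the paper's example:

```python
# Pivotal Token Search (sketch): a token is "pivotal" when appending it
# shifts the estimated probability of reaching a correct final answer by
# more than a threshold. In the real pipeline, `estimate_success` would
# sample N completions from the model and check answers; here it is a
# lookup table standing in for those rollouts.
def find_pivotal_tokens(tokens, estimate_success, threshold=0.2):
    pivotal = []
    p_prev = estimate_success(())
    for i in range(len(tokens)):
        p_cur = estimate_success(tuple(tokens[: i + 1]))
        if abs(p_cur - p_prev) >= threshold:
            pivotal.append((i, tokens[i], p_prev, p_cur))
        p_prev = p_cur
    return pivotal

# Mocked estimates mirroring the paper's example: choosing
# "cross-multiplying" lifts success probability from ~0.4 to 0.93.
probs = {(): 0.42, ("We",): 0.42, ("We", "try"): 0.45,
         ("We", "try", "cross-multiplying"): 0.93}
hits = find_pivotal_tokens(["We", "try", "cross-multiplying"], probs.get)
print(hits)
```

Each flagged token then becomes a DPO pair (preferred vs dispreferred continuation at that exact position) rather than a full-response comparison.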

Stage 3 — Judge-guided DPO: ~850K pairs where responses are generated by GPT-4o, GPT-4t, and phi-4 itself, then scored by GPT-4o on accuracy, style, and detail.

Benchmarks

From the paper’s Table 1 (OpenAI simple-evals framework, temperature 0.5):

Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 14B | GPT-4o-mini | GPT-4o
MMLU      | 84.8 | 77.9 | 79.9 | 81.8 | 88.1
GPQA      | 56.1 | 31.2 | 42.9 | 40.9 | 50.6
MATH      | 80.4 | 44.6 | 75.6 | 73.0 | 74.6
HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 90.6
MGSM      | 80.6 | 53.5 | 79.6 | 86.5 | 90.4
SimpleQA  | 3.0  | 7.6  | 5.4  | 9.9  | 39.4
DROP      | 75.5 | 68.3 | 85.5 | 79.3 | 80.9
MMLU-Pro  | 70.4 | 51.3 | 63.2 | 63.4 | 73.0
ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 75.6
IFEval    | 63.0 | 57.9 | 78.7 | 80.0 | 84.8

The phi-3 → phi-4 jump at the same parameter count: +35.8 on MATH, +24.9 on GPQA, +27.1 on MGSM, +19.1 on MMLU-Pro. These gains come entirely from the data pipeline and post-training innovations, not architectural changes.

Open LLM Leaderboard (community eval)

Benchmark | Score
Average   | 30.4
BBH       | 52.4
MMLU-Pro  | 47.6
MATH      | 31.6
MUSR      | 23.8
GPQA      | 20.8
IFEval    | 5.9

The Open LLM Leaderboard scores are substantially lower than simple-evals, particularly IFEval (5.9 vs 63.0). This discrepancy likely reflects differences in evaluation harness formatting — phi-4 was optimized for simple-evals prompting.

Deployment

Hardware

Format      | Size    | Minimum Hardware
Full (BF16) | 29.3 GB | A100-40GB (too large for a 24 GB RTX 4090)
8-bit       | ~15 GB  | RTX 3090/4080
4-bit       | ~8 GB   | RTX 3070 (8 GB)
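
These sizes follow from multiplying the parameter count (~14.7B per the HuggingFace model card) by bytes per weight; the KV cache and activations add overhead on top of this back-of-envelope figure:

```python
# Approximate weight-only memory for phi-4 (~14.7B parameters).
PARAMS = 14.7e9
for name, bytes_per_weight in [("BF16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")
```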

Transformers

import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
]
outputs = pipeline(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1]["content"])  # assistant reply

API

Available on OpenRouter ($0.06/M input, $0.14/M output) — one of the cheapest capable models available via API.

Limitations

  • Factual recall: SimpleQA 3.0% — the model is trained to abstain rather than guess, but this means it can’t answer many factual questions
  • Context window: 16K tokens — limiting for document-level tasks where competitors offer 128K+
  • Language: Primarily English. Multilingual data is ~8% of pretraining; SFT covers 40 languages but English dominates
  • IFEval: 63.0% — instruction following is notably weaker than Qwen 2.5 14B (78.7%) and GPT-4o-mini (80.0%)
  • Code: Strong in Python, less reliable in other languages

The Phi series

Model        | Parameters | Key Innovation
Phi-1 (2023) | 1.3B       | “Textbooks Are All You Need” — synthetic data for code
Phi-2 (2023) | 2.7B       | Scaled synthetic data approach
Phi-3 (2024) | 3.8B–14B   | Extended to general reasoning
Phi-4 (2024) | 14B        | 40% synthetic pretraining, Pivotal Token Search, surpasses teacher on STEM

The trajectory validates Microsoft’s thesis: synthetic data quality can compensate for 5× fewer parameters on reasoning tasks. The novel Pivotal Token Search technique for DPO — targeting individual decision-point tokens rather than full responses — is a contribution that extends beyond phi-4 itself.

References

  • 🤗 HuggingFace huggingface.co/microsoft/phi-4
  • ⌨️ GitHub github.com/microsoft/Phi-3CookBook
  • 📄 Paper arxiv.org/abs/2412.08905