Microsoft · 14B · MIT · text-generation

Phi-4

Released December 12, 2024

Context Window 16K tokens
≈ 12 pages of text
880.1K Downloads
2.2K Likes
29.3 GB Disk Size
3.7K GitHub ★

Pricing

Input $0.06 per million tokens
Output $0.14 per million tokens

Benchmarks

Open LLM Leaderboard v2

IFEval 5.9
BBH 52.4
MATH Lvl 5 31.6
GPQA 20.8
MUSR 23.8
MMLU-PRO 47.6

Average: 30.40

About

Phi-4 is a 14-billion-parameter language model from Microsoft Research, released December 12, 2024 under the MIT license. The paper (arXiv:2412.08905) frames the model around a single thesis: data quality can substitute for model scale. Phi-4 surpasses its own teacher model GPT-4o on GPQA (56.1 vs 50.6) and MATH (80.4 vs 74.6) — a 14B student exceeding a frontier teacher on STEM reasoning, which the paper argues goes beyond simple distillation. It was pretrained on ~9.8T tokens (40% synthetic) over 21 days on 1,920 H100 GPUs.

Why it matters

Phi-4’s performance on reasoning benchmarks is disproportionate to its size:

  • GPQA (graduate-level STEM): 56.1% — beats GPT-4o (50.6%), GPT-4o-mini (40.9%), Qwen 2.5 14B (42.9%), and even Llama-3.3-70B (49.1%)
  • MATH: 80.4% — beats GPT-4o (74.6%), GPT-4o-mini (73.0%), and its predecessor phi-3 (44.6%)
  • MMLU: 84.8% — approaches Llama-3.3-70B (86.3%) and Qwen 2.5 72B (85.3%) despite being 5× smaller
  • AMC 10/12 November 2024 (post-cutoff, contamination-proof): phi-4 outperforms GPT-4o, Gemini, and Llama-3.3-70B on fresh math competition data

The tradeoff is explicit. SimpleQA (factual recall) is 3.0% — the model is trained to abstain from factual questions rather than hallucinate. The paper tracks this through post-training: the “not attempted” rate rises from 3.2% (base) to 81.1% (final), while incorrect answers drop from 90% to 15.8%. This is a design choice, not a failure.
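
Assuming the three SimpleQA outcomes (correct, incorrect, not attempted) partition the eval, the implied correct-answer rates can be back-computed from the two rates the paper reports — a sanity check, not a figure from the paper:

```python
# Sanity check: SimpleQA outcomes should sum to 100%, so the
# "correct" rate is implied by the two rates the paper reports.
def implied_correct(not_attempted_pct: float, incorrect_pct: float) -> float:
    return round(100.0 - not_attempted_pct - incorrect_pct, 1)

base = implied_correct(3.2, 90.0)    # base model, before post-training
final = implied_correct(81.1, 15.8)  # final model
print(base, final)
```

The implied final correct rate (~3.1%) lines up with the reported SimpleQA score of 3.0 — post-training trades a few correct guesses for a large drop in confidently wrong answers.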

At 29.3 GB with an MIT license (no restrictions on commercial use, redistribution, or modification), it’s one of the most capable openly licensed models available; quantized, it fits on a single consumer GPU.

Architecture

Spec           | Value
Architecture   | Dense decoder-only Transformer (Phi3ForCausalLM)
Parameters     | 14B
Tokenizer      | tiktoken (100,352 padded vocab, multilingual support)
Context Length | 4K (pretraining) → 16K (after midtraining)
Attention      | Full attention (no sliding window, unlike phi-3-medium)
Prompt Format  | ChatML variant (<|im_start|>, <|im_end|>, <|im_sep|>)
Disk Size      | 29.3 GB
License        | MIT

Architecture closely follows phi-3-medium with two changes: tiktoken tokenizer for better multilingual support, and full attention replacing the 2K sliding window.
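
The special tokens above translate into a simple prompt layout. A minimal sketch of assembling a phi-4 prompt by hand — in practice the tokenizer's chat template should be used, and the exact token placement here follows the ChatML-variant convention, so treat it as illustrative:

```python
# Build a phi-4 prompt from chat messages using the ChatML-variant
# special tokens (<|im_start|>, <|im_sep|>, <|im_end|>).
def build_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
])
print(prompt)
```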

Training

Pretraining

Detail             | Value
Tokens             | ~9.8T
Hardware           | 1,920 H100-80GB GPUs
Duration           | 21 days
Peak Learning Rate | 0.0003
Weight Decay       | 0.1
Global Batch Size  | 5,760
Schedule           | Linear warmup and decay
Data Cutoff        | June 2024

Data mixture (from paper Table 5)

Source           | % of Training | Unique Tokens | Epochs
Synthetic        | 40%           | 290B          | 13.8
Code             | 20%           | 820B          | 2.4
Web              | 15%           | 1.3T          | 1.2
Web rewrites     | 15%           | 290B          | 5.2
Acquired sources | 10%           | 580B          | 1.7

The most striking detail: synthetic data accounts for only 290B unique tokens but is seen 13.8 times — the paper shows that repeating epochs on synthetic data outperforms adding fresh web tokens (Figure 2). Synthetic data doesn’t plateau the way organic data does. The paper also tested a 13B model trained entirely on synthetic data: it improved on most benchmarks but dropped 14.8 points on TriviaQA (knowledge), confirming that synthetic data builds reasoning at the expense of factual breadth.
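
The mixture numbers are internally consistent: unique tokens × epochs roughly equals each source's share of the ~9.8T total. A quick check:

```python
# Verify: unique_tokens * epochs ≈ share * 9.8T for each mixture row
# from the paper's Table 5.
TOTAL = 9.8e12
mixture = {                     # (share, unique tokens, epochs)
    "synthetic":    (0.40, 290e9, 13.8),
    "code":         (0.20, 820e9, 2.4),
    "web":          (0.15, 1.3e12, 1.2),
    "web rewrites": (0.15, 290e9, 5.2),
    "acquired":     (0.10, 580e9, 1.7),
}
for name, (share, unique, epochs) in mixture.items():
    effective = unique * epochs
    print(f"{name:12s} {effective / 1e12:.2f}T vs {share * TOTAL / 1e12:.2f}T")
```

Every row lands within a few percent of its nominal share, which is what you'd expect if the table values are rounded.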

50 broad types of synthetic datasets were created using:

  • Multi-agent prompting
  • Self-revision workflows (model critiques and improves its own outputs)
  • Instruction reversal (take existing code → generate instructions → pair together)
  • Seeds curated from web pages, code repos, Q&A platforms, books, arXiv, PubMed Central
  • Plurality-based difficulty filtering (discard questions where all or no model answers agree)
  • Code validated through execution loops and tests
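
The plurality-based filter is simple to state: keep a question only when independent model answers partially agree. A minimal sketch — the paper's actual sample counts and thresholds are not given here:

```python
from collections import Counter

# Plurality-based difficulty filter: a question is kept only when model
# answers neither all agree (too easy) nor all disagree (too hard or
# ambiguous). The plurality answer can then serve as a pseudo-label.
def keep_question(answers: list[str]) -> bool:
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 < top_count < len(answers)

print(keep_question(["4", "4", "4"]))  # unanimous: too easy, drop
print(keep_question(["4", "5", "6"]))  # no agreement: drop
print(keep_question(["4", "4", "7"]))  # partial plurality: keep
```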

Midtraining (context extension)

Context extended from 4K to 16K tokens via:

  • RoPE base frequency increased to 250,000
  • Learning rate dropped 10× from pretraining
  • 250B tokens: 30% long-context curated data + 70% from pretraining distribution
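
Raising the RoPE base stretches the rotary wavelengths so that distant positions remain distinguishable at 16K. The effect is easy to see numerically — a sketch, where the pretraining base of 10,000 and the head dimension of 128 are assumptions, not values stated in this card:

```python
import math

# Rotary embedding frequencies: theta_i = base^(-2i/d), so channel i has
# wavelength 2*pi*base^(2i/d). A larger base slows the low-frequency
# channels, extending the unambiguous position range.
def rope_wavelengths(base: float, dim: int = 128) -> list[float]:
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

short = rope_wavelengths(10_000)[-1]   # longest wavelength at base 10k
long_ = rope_wavelengths(250_000)[-1]  # longest wavelength at base 250k
print(f"{short:,.0f} vs {long_:,.0f} positions")
```

With base 250,000 the longest wavelength comfortably exceeds the 16K context, which is the point of the change.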

Post-training (3 stages)

Stage 1 — SFT: ~8B tokens at learning rate 1e-6, covering math, coding, reasoning, conversation, model identity, safety, and 40 languages.

Stage 2 — DPO with Pivotal Token Search (novel): The paper introduces PTS, which identifies individual tokens in a response where the next token choice has an outsized effect on solution correctness. Instead of creating DPO pairs from full responses, PTS creates pairs targeting single pivotal tokens. Example from the paper: in a math solution, the choice between “cross-multiplying” and “multiplying both sides by” can shift the success probability from 0.42 to 0.93. PTS data: 133K multiple-choice Q&A, 77K math, 16K Python, 22K other code, 3K safety.
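
In sketch form, PTS walks a response token by token, estimates p(correct | prefix) by sampling completions, and flags tokens where that estimate jumps. A minimal mock — `estimate_success` stands in for real model rollouts, and the 0.42 → 0.93 numbers echo the paper's example:

```python
# Pivotal Token Search (sketch): a token is "pivotal" when appending it
# shifts the estimated probability of reaching a correct final answer by
# more than a threshold. In the real pipeline, `estimate_success` would
# sample N completions from the model and check answers; here it is a
# lookup table standing in for those rollouts.
def find_pivotal_tokens(tokens, estimate_success, threshold=0.2):
    pivotal = []
    p_prev = estimate_success(())
    for i in range(len(tokens)):
        p_cur = estimate_success(tuple(tokens[: i + 1]))
        if abs(p_cur - p_prev) >= threshold:
            pivotal.append((i, tokens[i], p_prev, p_cur))
        p_prev = p_cur
    return pivotal

# Mocked estimates mirroring the paper's example: choosing
# "cross-multiplying" lifts success probability from ~0.4 to 0.93.
probs = {(): 0.42, ("We",): 0.42, ("We", "try"): 0.45,
         ("We", "try", "cross-multiplying"): 0.93}
hits = find_pivotal_tokens(["We", "try", "cross-multiplying"], probs.get)
print(hits)
```

Each flagged token then becomes a DPO pair (preferred vs dispreferred continuation at that exact position) rather than a full-response comparison.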

Stage 3 — Judge-guided DPO: ~850K pairs where responses are generated by GPT-4o, GPT-4t, and phi-4 itself, then scored by GPT-4o on accuracy, style, and detail.

Benchmarks

From the paper’s Table 1 (OpenAI simple-evals framework, temperature 0.5):

Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 14B | GPT-4o-mini | GPT-4o
MMLU      | 84.8 | 77.9 | 79.9 | 81.8 | 88.1
GPQA      | 56.1 | 31.2 | 42.9 | 40.9 | 50.6
MATH      | 80.4 | 44.6 | 75.6 | 73.0 | 74.6
HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 90.6
MGSM      | 80.6 | 53.5 | 79.6 | 86.5 | 90.4
SimpleQA  | 3.0  | 7.6  | 5.4  | 9.9  | 39.4
DROP      | 75.5 | 68.3 | 85.5 | 79.3 | 80.9
MMLU-Pro  | 70.4 | 51.3 | 63.2 | 63.4 | 73.0
ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 75.6
IFEval    | 63.0 | 57.9 | 78.7 | 80.0 | 84.8

The phi-3 → phi-4 jump at the same parameter count: +35.8 on MATH, +24.9 on GPQA, +27.1 on MGSM, +19.1 on MMLU-Pro. These gains come entirely from the data pipeline and post-training innovations, not architectural changes.

Open LLM Leaderboard (community eval)

Benchmark | Score
Average   | 30.4
BBH       | 52.4
MMLU-Pro  | 47.6
MATH      | 31.6
MUSR      | 23.8
GPQA      | 20.8
IFEval    | 5.9

The Open LLM Leaderboard scores are substantially lower than simple-evals, particularly IFEval (5.9 vs 63.0). This discrepancy likely reflects differences in evaluation harness formatting — phi-4 was optimized for simple-evals prompting.

Deployment

Hardware

Format      | Size    | Minimum Hardware
Full (BF16) | 29.3 GB | A100-40GB (too large for a 24 GB RTX 4090)
8-bit       | ~15 GB  | RTX 3090/4080
4-bit       | ~8 GB   | RTX 3070 (8 GB)
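
These sizes follow from multiplying the parameter count (~14.7B per the HuggingFace model card) by bytes per weight; the KV cache and activations add overhead on top of this back-of-envelope figure:

```python
# Approximate weight-only memory for phi-4 (~14.7B parameters).
PARAMS = 14.7e9
for name, bytes_per_weight in [("BF16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")
```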

Transformers

import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
]
outputs = pipeline(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1]["content"])  # assistant reply

API

Available on OpenRouter ($0.06/M input, $0.14/M output) — one of the cheapest capable models available via API.

Limitations

  • Factual recall: SimpleQA 3.0% — the model is trained to abstain rather than guess, but this means it can’t answer many factual questions
  • Context window: 16K tokens — limiting for document-level tasks where competitors offer 128K+
  • Language: Primarily English. Multilingual data is ~8% of pretraining; SFT covers 40 languages but English dominates
  • IFEval: 63.0% — instruction following is notably weaker than Qwen 2.5 14B (78.7%) and GPT-4o-mini (80.0%)
  • Code: Strong in Python, less reliable in other languages

The Phi series

Model        | Parameters | Key Innovation
Phi-1 (2023) | 1.3B       | “Textbooks Are All You Need” — synthetic data for code
Phi-2 (2023) | 2.7B       | Scaled synthetic data approach
Phi-3 (2024) | 3.8B–14B   | Extended to general reasoning
Phi-4 (2024) | 14B        | 40% synthetic pretraining, Pivotal Token Search, surpasses teacher on STEM

The trajectory validates Microsoft’s thesis: synthetic data quality can compensate for 5× fewer parameters on reasoning tasks. The novel Pivotal Token Search technique for DPO — targeting individual decision-point tokens rather than full responses — is a contribution that extends beyond phi-4 itself.

References

  • 🤗 HuggingFace huggingface.co/microsoft/phi-4
  • ⌨️ GitHub github.com/microsoft/Phi-3CookBook
  • 📄 Paper arxiv.org/abs/2412.08905