gpt-oss-20b

OpenAI · 20.9B params · Apache-2.0 · text-generation

Released August 4, 2025
Context window: 131K tokens (≈ 98 pages of text)
7.0M downloads · 4.4K likes · 13.8 GB on disk

Pricing

Input: $0.03 per million tokens
Output: $0.14 per million tokens

About

gpt-oss-20b is OpenAI’s smaller open-weight reasoning model, released August 2025 under Apache 2.0. It has 20.9 billion total parameters with 3.6 billion active per token via MoE routing across 32 experts. The checkpoint is quantized to MXFP4 at 12.8 GiB — small enough to run on systems with as little as 16 GB memory. Along with its 120B sibling, it is OpenAI’s first release of open model weights, and the paper (arXiv:2508.10925) describes it as “surprisingly competitive” despite being 6× smaller than gpt-oss-120b.

Why it matters

This is OpenAI releasing open-weight reasoning models under Apache 2.0 with full chain-of-thought access. That alone is significant. But the performance is the real story:

  • AIME 2025 (with tools): 98.7% at high reasoning, beating gpt-oss-120b (97.9%) and approaching o4-mini (99.5%)
  • SWE-Bench Verified: 60.7%, exceeding o3-mini (49.3%) on software engineering
  • Codeforces (with tools): 2516 Elo, well above o3-mini's 2073
  • GPQA Diamond: 71.5%, strong for a model with only 3.6B active parameters

The model demonstrates smooth test-time scaling: accuracy improves log-linearly as reasoning effort increases from low to high. On AIME 2024, it jumps from 42.1% (low) to 92.1% (high). The paper notes gpt-oss-20b uses 20K+ CoT tokens per problem on AIME at high reasoning — it’s doing serious thinking.

At $0.03/M input and $0.14/M output on OpenRouter, this is one of the cheapest reasoning-capable models available.

Architecture

Spec                 Value
-------------------  ------------------------------------------------
Total parameters     20.9B
Active per token     3.6B
Architecture         MoE Transformer (GPT-2/GPT-3 family)
Layers               24
Experts              32, top-4 routing
Residual dimension   2,880
Query heads          64 (head dim 64)
KV heads             8 (GQA)
Attention pattern    Alternating banded window (128 tokens) and dense
Position encoding    RoPE with YaRN extension to 131K
Activation           Gated SwiGLU (with clamping + residual)
Normalization        RMSNorm, pre-LN placement
Tokenizer            o200k_harmony (BPE, 201,088 tokens)
Quantization         MXFP4 (4.25 bits/param on MoE weights)
Context length       131,072 tokens (128K)
Checkpoint size      12.8 GiB
License              Apache 2.0

Parameter breakdown (from paper Table 1)

Component        Parameters
---------------  ----------
MLP (MoE)        19.12B
Attention        0.64B
Embed + Unembed  1.16B
Active           3.61B
Total            20.91B
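The parameter breakdown lets us sanity-check the reported 12.8 GiB checkpoint size. A short back-of-the-envelope calculation, assuming the non-MoE weights (attention, embeddings) are stored in bf16 at 2 bytes each (a storage format this page does not state, so treat it as an assumption):

```python
# Sanity-check the 12.8 GiB checkpoint size from the parameter breakdown.
# Assumption: non-MoE weights (attention, embed/unembed) are stored in bf16.
GIB = 2**30

moe_params = 19.12e9             # MLP (MoE), MXFP4-quantized
other_params = 0.64e9 + 1.16e9   # attention + embed/unembed

moe_bytes = moe_params * 4.25 / 8  # 4.25 bits per parameter
other_bytes = other_params * 2     # bf16: 2 bytes per parameter

total_gib = (moe_bytes + other_bytes) / GIB
print(f"{total_gib:.1f} GiB")  # -> 12.8 GiB, matching the reported checkpoint
```

The MoE weights come to about 9.5 GiB and the bf16 remainder about 3.4 GiB, which reproduces the published figure almost exactly.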

The architecture introduces a learned bias in the attention softmax denominator, enabling the model to “pay no attention” to any tokens when appropriate. MoE weights are quantized to MXFP4 (4.25 bits per parameter), which covers 90%+ of total parameters and is what enables the 12.8 GiB checkpoint.
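The effect of that learned denominator bias can be sketched in a few lines of NumPy. The name sink_logit and the exact formulation below are illustrative rather than taken from the released code; the point is that the extra term lets attention weights sum to less than 1:

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator includes one extra learned logit.

    The sink term absorbs probability mass, so the attention weights
    over real tokens can sum to less than 1 -- the head can effectively
    "pay no attention" to any token.
    """
    # Stabilize jointly over the token logits and the sink logit.
    m = max(scores.max(), sink_logit)
    e = np.exp(scores - m)
    return e / (np.exp(sink_logit - m) + e.sum())

scores = np.array([1.0, 2.0, 3.0])
w = softmax_with_sink(scores, sink_logit=5.0)
print(w.sum())  # well below 1: most mass went to the sink

# With the sink disabled, this reduces to an ordinary softmax.
w0 = softmax_with_sink(scores, sink_logit=-np.inf)
```

With a large sink logit almost all mass is absorbed; with the sink at negative infinity the function is exactly the standard softmax.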

Training

Pretraining

  • Data: Text-only, trillions of tokens, focus on STEM, coding, general knowledge
  • Safety: CBRN pre-training filters reused from GPT-4o
  • Hardware: NVIDIA H100 GPUs with PyTorch + Triton kernels
  • Compute: ~210K H100-hours (gpt-oss-120b needed 2.1M, 10× more)
  • Knowledge cutoff: June 2024

Post-training

Post-training uses CoT RL techniques similar to OpenAI o3 — the same reinforcement learning approach that powers OpenAI’s proprietary reasoning models. The training teaches:

  1. Reasoning via chain-of-thought on problems from coding, math, science
  2. Variable effort: three reasoning levels (low, medium, high), configured via system prompt keyword "Reasoning: low/medium/high"
  3. Tool use: browsing (web search), Python (stateful Jupyter), and developer-defined functions

The Harmony chat format

gpt-oss models require the harmony chat format — they will not work correctly without it. Key details:

  • Special tokens mark message boundaries, each tagged with a role (System, Developer, User, Assistant, Tool)
  • Role hierarchy for conflict resolution: System > Developer > User > Assistant > Tool
  • Channels: analysis (CoT tokens), commentary (tool calling), final (user-facing answers)
  • Critical: previous turns’ reasoning traces must be stripped in multi-turn conversations
  • Supports interleaving CoT, function calls, function responses, intermediate messages, and final answers
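As a rough illustration, a rendered harmony prompt looks something like the sketch below. This is a simplified mock-up, not a full renderer; real deployments should use OpenAI's reference tooling (e.g. the openai-harmony library), and details such as channel tagging on assistant messages are omitted here:

```python
# Illustrative sketch of a rendered harmony prompt (simplified; use
# OpenAI's reference renderer in real deployments).

def render(role, content):
    # Each message is delimited by special tokens and tagged with a role.
    # Assistant output additionally carries a channel, e.g.
    # <|start|>assistant<|channel|>analysis<|message|>...<|end|>
    return f"<|start|>{role}<|message|>{content}<|end|>"

prompt = (
    render("system", "Reasoning: high")          # variable-effort keyword
    + render("developer", "Answer concisely.")   # Developer outranks User
    + render("user", "What is 127 * 9?")
    + "<|start|>assistant"                       # generation begins here
)
print(prompt)
```

Note that the prompt ends with the opening of an assistant message: the model continues from there, emitting its analysis-channel reasoning before the final-channel answer.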

Benchmarks

All results from the paper (Table 3 and Figures 1-2). gpt-oss-20b at each reasoning level:

Core reasoning

Benchmark                  Low    Medium  High
-------------------------  -----  ------  -----
AIME 2024 (no tools)       42.1%  80.0%   92.1%
AIME 2024 (with tools)     61.2%  86.0%   96.0%
AIME 2025 (no tools)       37.1%  72.1%   91.7%
AIME 2025 (with tools)     57.5%  90.4%   98.7%
GPQA Diamond (no tools)    56.8%  66.0%   71.5%
GPQA Diamond (with tools)  58.0%  67.1%   74.2%
HLE (no tools)             4.2%   7.0%    10.9%
HLE (with tools)           6.3%   8.8%    17.3%
MMLU                       80.4%  84.0%   85.3%

Coding and tool use

Benchmark                Low    Medium  High
-----------------------  -----  ------  -----
Codeforces (no tools)    1366   1998    2230
Codeforces (with tools)  1251   2064    2516
SWE-Bench Verified       37.4%  53.2%   60.7%
Tau-Bench Retail         35.0%  47.3%   54.8%

Against OpenAI proprietary models (all at high reasoning)

Benchmark           gpt-oss-20b  gpt-oss-120b  o3-mini  o4-mini  o3
------------------  -----------  ------------  -------  -------  -----
AIME 2025 (tools)   98.7         97.9          —        99.5     87.3
GPQA Diamond        71.5         80.1          77.0     81.4     83.3
MMLU                85.3         90.0          86.5     93.0     93.4
SWE-Bench           60.7         62.4          49.3     69.1     68.1
Codeforces (tools)  2516         2622          2073     2706     2719
HLE (tools)         17.3         19.0          13.4     17.7     24.9

gpt-oss-20b exceeds o3-mini on nearly every benchmark and approaches o4-mini on math and coding tasks. The AIME 2025 result of 98.7% is remarkable — it actually beats the 120b sibling.

Multilingual (MMMLU, 14 languages)

Level   Average
------  -------
Low     67.0%
Medium  73.5%
High    75.7%

Health (HealthBench)

Benchmark              Low   Medium  High
---------------------  ----  ------  ----
HealthBench            40.4  41.8    42.5
HealthBench Hard       9.0   12.9    10.8
HealthBench Consensus  84.9  83.0    82.6

The 20b model performs slightly better than OpenAI o1 on HealthBench despite being significantly smaller.

Deployment

Hardware

The MXFP4 quantized checkpoint is 12.8 GiB, fitting on:

  • Single GPU with 16+ GB of VRAM (RTX 4080/4090, A100, H100)
  • Consumer hardware: any system with at least 16 GB of GPU memory

Critical: Harmony format required

The model will not work correctly without the harmony chat format. Do not use standard ChatML or Llama chat templates. OpenAI provides reference implementations and harnesses for proper deployment.

Tool harnesses

Reference implementations are provided for:

  • Browsing: search and open functions
  • Python: stateful Jupyter notebook environment
  • Custom functions: developer-defined schemas in Developer messages

API

Available on OpenRouter ($0.03/M input, $0.14/M output) and through OpenAI’s Responses API as gpt-oss-20b.
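A minimal sketch of calling the model through OpenRouter's OpenAI-compatible chat endpoint. The model slug openai/gpt-oss-20b and the unified reasoning-effort parameter are assumptions to verify against the OpenRouter catalog and docs; an OPENROUTER_API_KEY environment variable is required for the request itself:

```python
import json
import os
import urllib.request

payload = {
    "model": "openai/gpt-oss-20b",  # assumed OpenRouter slug; verify in the catalog
    "messages": [
        {"role": "user", "content": "Explain MXFP4 quantization in two sentences."}
    ],
    "reasoning": {"effort": "high"},  # low / medium / high
}

def chat(payload):
    # POST to OpenRouter's OpenAI-compatible chat completions endpoint.
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if os.environ.get("OPENROUTER_API_KEY"):
    print(chat(payload))
```

Because the endpoint is OpenAI-compatible, any OpenAI SDK pointed at the OpenRouter base URL works the same way.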

The gpt-oss family

Model         Total Params  Active  Layers  Experts  Checkpoint  Target
------------  ------------  ------  ------  -------  ----------  ------------------------------
gpt-oss-120b  116.8B        5.1B    36      128      60.8 GiB    Single 80 GB GPU (H100/MI300X)
gpt-oss-20b   20.9B         3.6B    24      32       12.8 GiB    Consumer GPU (16 GB+)

Both models share the same harmony format, variable reasoning, and tool-use capabilities. The 120b version is targeted at production workloads; the 20b at local and specialized use cases.

Significance

gpt-oss represents a strategic shift for OpenAI. After years of proprietary-only releases, these are:

  • First open weights from OpenAI — under Apache 2.0 with no copyleft restrictions
  • Full CoT access — complete chain-of-thought visible (not intended for end users, but available for debugging)
  • Same RL techniques as o3 — the post-training pipeline that powers OpenAI’s frontier reasoning models
  • 7M+ HuggingFace downloads — rapid community adoption

The safety analysis in the paper is notably extensive (17 pages covering adversarial fine-tuning, biosecurity, cybersecurity, instruction hierarchy), reflecting OpenAI’s concern about open-weight risk profiles.

Limitations

  • Text only — no image, audio, or video inputs
  • Harmony format required — won’t work with standard chat templates
  • Knowledge cutoff June 2024 — no knowledge of events after this date without tool use
  • Safety — open weights mean determined actors can fine-tune away safety refusals. OpenAI recommends system-level safeguards for production deployments.

References

  • 🤗 HuggingFace huggingface.co/openai/gpt-oss-20b
  • 📄 Paper arxiv.org/abs/2508.10925