gpt-oss-20b

OpenAI · 20.9B params · Apache-2.0 · text-generation

Released August 4, 2025
Context window: 131K tokens (≈ 98 pages of text)
7.0M downloads · 4.4K likes · 13.8 GB on disk

Pricing

Input: $0.03 per million tokens
Output: $0.14 per million tokens

About

gpt-oss-20b is OpenAI’s smaller open-weight reasoning model, released August 2025 under Apache 2.0. It has 20.9 billion total parameters with 3.6 billion active per token via MoE routing across 32 experts. The checkpoint is quantized to MXFP4 at 12.8 GiB — small enough to run on systems with as little as 16 GB memory. Along with its 120B sibling, it is OpenAI’s first release of open model weights, and the paper (arXiv:2508.10925) describes it as “surprisingly competitive” despite being 6× smaller than gpt-oss-120b.

Why it matters

This is OpenAI releasing open-weight reasoning models under Apache 2.0 with full chain-of-thought access. That alone is significant. But the performance is the real story:

  • AIME 2025 (with tools): 98.7% at high reasoning, beating gpt-oss-120b (97.9%) and approaching o4-mini (99.5%)
  • SWE-Bench Verified: 60.7%, exceeding o3-mini (49.3%) on software engineering
  • Codeforces (with tools): 2516 Elo, well above o3-mini's 2073
  • GPQA Diamond: 71.5%, strong for a model with only 3.6B active parameters

The model demonstrates smooth test-time scaling: accuracy improves log-linearly as reasoning effort increases from low to high. On AIME 2024, it jumps from 42.1% (low) to 92.1% (high). The paper notes gpt-oss-20b uses 20K+ CoT tokens per problem on AIME at high reasoning — it’s doing serious thinking.

At $0.03/M input and $0.14/M output on OpenRouter, this is one of the cheapest reasoning-capable models available.

Architecture

Spec                 Value
-------------------  ------------------------------------------------
Total parameters     20.9B
Active per token     3.6B
Architecture         MoE Transformer (GPT-2/GPT-3 family)
Layers               24
Experts              32, top-4 routing
Residual dimension   2,880
Query heads          64 (head dim 64)
KV heads             8 (GQA)
Attention pattern    Alternating banded window (128 tokens) and dense
Position encoding    RoPE with YaRN extension to 131K
Activation           Gated SwiGLU (with clamping + residual)
Normalization        RMSNorm, pre-LN placement
Tokenizer            o200k_harmony (BPE, 201,088 tokens)
Quantization         MXFP4 (4.25 bits/param on MoE weights)
Context length       131,072 tokens (128K)
Checkpoint size      12.8 GiB
License              Apache 2.0

Parameter breakdown (from paper Table 1)

Component        Parameters
---------------  ----------
MLP (MoE)        19.12B
Attention        0.64B
Embed + Unembed  1.16B
Active           3.61B
Total            20.91B
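The parameter breakdown lets us sanity-check the reported 12.8 GiB checkpoint size. A short back-of-the-envelope calculation, assuming the non-MoE weights (attention, embeddings) are stored in bf16 at 2 bytes each (a storage format this page does not state, so treat it as an assumption):

```python
# Sanity-check the 12.8 GiB checkpoint size from the parameter breakdown.
# Assumption: non-MoE weights (attention, embed/unembed) are stored in bf16.
GIB = 2**30

moe_params = 19.12e9             # MLP (MoE), MXFP4-quantized
other_params = 0.64e9 + 1.16e9   # attention + embed/unembed

moe_bytes = moe_params * 4.25 / 8  # 4.25 bits per parameter
other_bytes = other_params * 2     # bf16: 2 bytes per parameter

total_gib = (moe_bytes + other_bytes) / GIB
print(f"{total_gib:.1f} GiB")  # -> 12.8 GiB, matching the reported checkpoint
```

The MoE weights come to about 9.5 GiB and the bf16 remainder about 3.4 GiB, which reproduces the published figure almost exactly.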

The architecture introduces a learned bias in the attention softmax denominator, enabling the model to “pay no attention” to any tokens when appropriate. MoE weights are quantized to MXFP4 (4.25 bits per parameter), which covers 90%+ of total parameters and is what enables the 12.8 GiB checkpoint.
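The effect of that learned denominator bias can be sketched in a few lines of NumPy. The name sink_logit and the exact formulation below are illustrative rather than taken from the released code; the point is that the extra term lets attention weights sum to less than 1:

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator includes one extra learned logit.

    The sink term absorbs probability mass, so the attention weights
    over real tokens can sum to less than 1 -- the head can effectively
    "pay no attention" to any token.
    """
    # Stabilize jointly over the token logits and the sink logit.
    m = max(scores.max(), sink_logit)
    e = np.exp(scores - m)
    return e / (np.exp(sink_logit - m) + e.sum())

scores = np.array([1.0, 2.0, 3.0])
w = softmax_with_sink(scores, sink_logit=5.0)
print(w.sum())  # well below 1: most mass went to the sink

# With the sink disabled, this reduces to an ordinary softmax.
w0 = softmax_with_sink(scores, sink_logit=-np.inf)
```

With a large sink logit almost all mass is absorbed; with the sink at negative infinity the function is exactly the standard softmax.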

Training

Pretraining

  • Data: Text-only, trillions of tokens, focus on STEM, coding, general knowledge
  • Safety: CBRN pre-training filters reused from GPT-4o
  • Hardware: NVIDIA H100 GPUs with PyTorch + Triton kernels
  • Compute: ~210K H100-hours (gpt-oss-120b needed 2.1M, 10× more)
  • Knowledge cutoff: June 2024

Post-training

Post-training uses CoT RL techniques similar to OpenAI o3 — the same reinforcement learning approach that powers OpenAI’s proprietary reasoning models. The training teaches:

  1. Reasoning via chain-of-thought on problems from coding, math, science
  2. Variable effort: three reasoning levels (low, medium, high), configured via system prompt keyword "Reasoning: low/medium/high"
  3. Tool use: browsing (web search), Python (stateful Jupyter), and developer-defined functions

The Harmony chat format

gpt-oss models require the harmony chat format — they will not work correctly without it. Key details:

  • Special tokens mark message boundaries, each tagged with a role (System, Developer, User, Assistant, Tool)
  • Role hierarchy for conflict resolution: System > Developer > User > Assistant > Tool
  • Channels: analysis (CoT tokens), commentary (tool calling), final (user-facing answers)
  • Critical: previous turns’ reasoning traces must be stripped in multi-turn conversations
  • Supports interleaving CoT, function calls, function responses, intermediate messages, and final answers
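As a rough illustration, a rendered harmony prompt looks something like the sketch below. This is a simplified mock-up, not a full renderer; real deployments should use OpenAI's reference tooling (e.g. the openai-harmony library), and details such as channel tagging on assistant messages are omitted here:

```python
# Illustrative sketch of a rendered harmony prompt (simplified; use
# OpenAI's reference renderer in real deployments).

def render(role, content):
    # Each message is delimited by special tokens and tagged with a role.
    # Assistant output additionally carries a channel, e.g.
    # <|start|>assistant<|channel|>analysis<|message|>...<|end|>
    return f"<|start|>{role}<|message|>{content}<|end|>"

prompt = (
    render("system", "Reasoning: high")          # variable-effort keyword
    + render("developer", "Answer concisely.")   # Developer outranks User
    + render("user", "What is 127 * 9?")
    + "<|start|>assistant"                       # generation begins here
)
print(prompt)
```

Note that the prompt ends with the opening of an assistant message: the model continues from there, emitting its analysis-channel reasoning before the final-channel answer.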

Benchmarks

All results from the paper (Table 3 and Figures 1-2). gpt-oss-20b at each reasoning level:

Core reasoning

Benchmark                  Low    Medium  High
-------------------------  -----  ------  -----
AIME 2024 (no tools)       42.1%  80.0%   92.1%
AIME 2024 (with tools)     61.2%  86.0%   96.0%
AIME 2025 (no tools)       37.1%  72.1%   91.7%
AIME 2025 (with tools)     57.5%  90.4%   98.7%
GPQA Diamond (no tools)    56.8%  66.0%   71.5%
GPQA Diamond (with tools)  58.0%  67.1%   74.2%
HLE (no tools)             4.2%   7.0%    10.9%
HLE (with tools)           6.3%   8.8%    17.3%
MMLU                       80.4%  84.0%   85.3%

Coding and tool use

Benchmark                Low    Medium  High
-----------------------  -----  ------  -----
Codeforces (no tools)    1366   1998    2230
Codeforces (with tools)  1251   2064    2516
SWE-Bench Verified       37.4%  53.2%   60.7%
Tau-Bench Retail         35.0%  47.3%   54.8%

Against OpenAI proprietary models (all at high reasoning)

Benchmark           gpt-oss-20b  gpt-oss-120b  o3-mini  o4-mini  o3
------------------  -----------  ------------  -------  -------  -----
AIME 2025 (tools)   98.7         97.9          —        99.5     87.3
GPQA Diamond        71.5         80.1          77.0     81.4     83.3
MMLU                85.3         90.0          86.5     93.0     93.4
SWE-Bench           60.7         62.4          49.3     69.1     68.1
Codeforces (tools)  2516         2622          2073     2706     2719
HLE (tools)         17.3         19.0          13.4     17.7     24.9

gpt-oss-20b exceeds o3-mini on nearly every benchmark and approaches o4-mini on math and coding tasks. The AIME 2025 result of 98.7% is remarkable — it actually beats the 120b sibling.

Multilingual (MMMLU, 14 languages)

Level   Average
------  -------
Low     67.0%
Medium  73.5%
High    75.7%

Health (HealthBench)

Benchmark              Low   Medium  High
---------------------  ----  ------  ----
HealthBench            40.4  41.8    42.5
HealthBench Hard       9.0   12.9    10.8
HealthBench Consensus  84.9  83.0    82.6

The 20b model performs slightly better than OpenAI o1 on HealthBench despite being significantly smaller.

Deployment

Hardware

The MXFP4 quantized checkpoint is 12.8 GiB, fitting on:

  • Single GPU with 16+ GB of VRAM (RTX 4080/4090, A100, H100)
  • Consumer hardware: any system with at least 16 GB of GPU memory

Critical: Harmony format required

The model will not work correctly without the harmony chat format. Do not use standard ChatML or Llama chat templates. OpenAI provides reference implementations and harnesses for proper deployment.

Tool harnesses

Reference implementations are provided for:

  • Browsing: search and open functions
  • Python: stateful Jupyter notebook environment
  • Custom functions: developer-defined schemas in Developer messages

API

Available on OpenRouter ($0.03/M input, $0.14/M output) and through OpenAI’s Responses API as gpt-oss-20b.
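A minimal sketch of calling the model through OpenRouter's OpenAI-compatible chat endpoint. The model slug openai/gpt-oss-20b and the unified reasoning-effort parameter are assumptions to verify against the OpenRouter catalog and docs; an OPENROUTER_API_KEY environment variable is required for the request itself:

```python
import json
import os
import urllib.request

payload = {
    "model": "openai/gpt-oss-20b",  # assumed OpenRouter slug; verify in the catalog
    "messages": [
        {"role": "user", "content": "Explain MXFP4 quantization in two sentences."}
    ],
    "reasoning": {"effort": "high"},  # low / medium / high
}

def chat(payload):
    # POST to OpenRouter's OpenAI-compatible chat completions endpoint.
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if os.environ.get("OPENROUTER_API_KEY"):
    print(chat(payload))
```

Because the endpoint is OpenAI-compatible, any OpenAI SDK pointed at the OpenRouter base URL works the same way.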

The gpt-oss family

Model         Total Params  Active  Layers  Experts  Checkpoint  Target
------------  ------------  ------  ------  -------  ----------  ------------------------------
gpt-oss-120b  116.8B        5.1B    36      128      60.8 GiB    Single 80 GB GPU (H100/MI300X)
gpt-oss-20b   20.9B         3.6B    24      32       12.8 GiB    Consumer GPU (16 GB+)

Both models share the same harmony format, variable reasoning, and tool-use capabilities. The 120b version is targeted at production workloads; the 20b at local and specialized use cases.

Significance

gpt-oss represents a strategic shift for OpenAI. After years of proprietary-only releases, these are:

  • First open weights from OpenAI — under Apache 2.0 with no copyleft restrictions
  • Full CoT access — complete chain-of-thought visible (not intended for end users, but available for debugging)
  • Same RL techniques as o3 — the post-training pipeline that powers OpenAI’s frontier reasoning models
  • 7M+ HuggingFace downloads — rapid community adoption

The safety analysis in the paper is notably extensive (17 pages covering adversarial fine-tuning, biosecurity, cybersecurity, instruction hierarchy), reflecting OpenAI’s concern about open-weight risk profiles.

Limitations

  • Text only — no image, audio, or video inputs
  • Harmony format required — won’t work with standard chat templates
  • Knowledge cutoff June 2024 — no knowledge of events after this date without tool use
  • Safety — open weights mean determined actors can fine-tune away safety refusals. OpenAI recommends system-level safeguards for production deployments.

References

  • 🤗 HuggingFace huggingface.co/openai/gpt-oss-20b
  • 📄 Paper arxiv.org/abs/2508.10925