
Llama 3.1 Nemotron Nano 8B v1

Released March 16, 2025

Context Window 131K tokens
≈ 98 pages of text
212.2K Downloads
218 Likes
16.1 GB Disk Size
13.0K GitHub ★

About

Llama-3.1-Nemotron-Nano-8B-v1 is the smallest model in NVIDIA’s Llama-Nemotron family — a dense 8B parameter reasoning model derived from Meta’s Llama-3.1-8B-Instruct. Released March 2025 under the commercially permissive NVIDIA Open Model License, it supports a 128K token context window and introduces a dynamic reasoning toggle that switches between standard chat and chain-of-thought reasoning via a system prompt. At 16.1 GB on disk, it fits on a single consumer GPU.

Why it matters

Nemotron Nano is one of the first open models to support a dynamic reasoning toggle controlled entirely by the system prompt. Setting "detailed thinking on" activates chain-of-thought reasoning; "detailed thinking off" runs in standard chat mode. The performance difference is dramatic:

  • MATH-500: 36.6% (off) → 95.4% (on) — a 2.6× improvement from the same weights
  • AIME 2024: 3.0% (off) → 61.3% (on)
  • GPQA-Diamond: 39.4% (off) → 54.1% (on)
  • MBPP 0-shot: 66.1% (off) → 84.6% (on)

This means a single 8B model can serve both fast chat responses and deep reasoning — no separate deployment needed.
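The toggle requires no special API, only the system prompt. A minimal sketch of building a request for each mode (the helper function is illustrative; the prompt strings and sampling settings follow the model card's recommendations):

```python
def build_request(user_msg: str, thinking: bool = True):
    """Build a chat request for Nemotron Nano's reasoning toggle.

    The system prompt alone switches modes. Sampling follows the model
    card: temperature 0.6 / top-p 0.95 with reasoning on, greedy otherwise.
    """
    mode = "detailed thinking on" if thinking else "detailed thinking off"
    messages = [
        {"role": "system", "content": mode},
        {"role": "user", "content": user_msg},
    ]
    sampling = {"temperature": 0.6, "top_p": 0.95} if thinking else {"temperature": 0.0}
    return messages, sampling
```

The resulting messages list can be fed to any chat-template-aware runtime, e.g. transformers' `pipeline("text-generation", model="nvidia/Llama-3.1-Nemotron-Nano-8B-v1")`.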

Notably, the paper (arXiv:2505.00949) finds that RL yields suboptimal results for smaller models compared to distillation. Nano therefore relies on SFT distillation from strong teachers (primarily DeepSeek-R1), while only the 253B Ultra model receives large-scale reasoning RL. That distillation alone produces these results at 8B is the headline: with reasoning on, Nano scores 95.4% on MATH-500 versus DeepSeek-R1-Distill-Llama-8B's 89.1%.

Architecture

Spec                 Value
Architecture         Dense decoder-only Transformer (LlamaForCausalLM)
Parameters           8B
Base Model           Meta Llama-3.1-8B-Instruct
Context Length       128K tokens
Evaluation Context   32K tokens
Position Encoding    RoPE
Activation           SwiGLU
Normalization        RMSNorm
Disk Size            16.1 GB
License              NVIDIA Open Model License (commercially permissive)

Unlike the larger siblings (Super 49B and Ultra 253B), Nano does not use NVIDIA’s Puzzle neural architecture search framework. It’s a standard Llama 3.1 architecture — all improvements come from the post-training pipeline.

Training

Nano’s post-training consists of a 3-stage SFT pipeline followed by 2 rounds of offline RPO (Reward-aware Preference Optimization).

SFT Pipeline

All stages use global batch size 256 with sequence packing at effective length 32K tokens.

Stage 1 — Reasoning-only SFT: Fine-tuned exclusively on reasoning data from code, math, and science domains at learning rate 1e-4 for 4 epochs. The paper notes this prevents failure modes like repetitive completions.

Stage 2 — Mixed data: Non-reasoning data introduced alongside reasoning samples. This stage teaches the model to respond to the reasoning toggle — generating chain-of-thought when the system prompt says detailed thinking on and direct answers when it says detailed thinking off.

Stage 3 — Chat and tool calling: A smaller blend focused on chat, instruction following, and tool calling to round out the model’s general capabilities.
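"Sequence packing at effective length 32K" means concatenating short samples into shared 32K-token sequences so batches stay dense. The paper does not specify the packing algorithm, so the following is an illustrative first-fit sketch:

```python
def pack_sequences(lengths, max_len=32768):
    """Greedy first-fit-decreasing packing of sample lengths into bins
    of at most max_len tokens. Returns lists of sample indices per bin.
    (Illustrative; the actual packing algorithm is not specified.)"""
    bins = []  # each bin: [total_length, [sample indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        for b in bins:
            if b[0] + n <= max_len:   # fits in an existing bin
                b[0] += n
                b[1].append(idx)
                break
        else:                          # open a new bin
            bins.append([n, [idx]])
    return [b[1] for b in bins]
```

For example, samples of 20K, 15K, 10K, and 5K tokens pack into two 32K bins instead of four separate sequences.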

Synthetic training data

The Llama-Nemotron family shares a training dataset of 33 million samples (also open-sourced as Llama-Nemotron-Post-Training-Dataset):

Domain                  Samples      % of Total   Reasoning On   Reasoning Off
Math                    22,066,397   66.8%        2,225,427      19,840,970
Code                    10,108,883   30.6%        991,706        9,117,177
Science                 708,920      2.1%         708,920        0
Chat                    39,792       0.12%        8,574          31,218
Instruction Following   56,339       0.17%
Safety                  31,426       0.10%
Total                   33,011,757

How this data was generated:

  • Math: Problems from Art of Problem Solving (AoPS) forums. Solutions generated by DeepSeek-R1 (16 generations per problem) and Qwen2.5-Math-7B-Instruct (64 generations per problem). Filtered by correctness, decontaminated against benchmarks.
  • Code: 28,904 unique competitive programming questions from TACO, APPS, CodeContests, and Codeforces. Solutions from DeepSeek-R1 using nucleus sampling (temp 0.6, top-p 0.95). Yielded ~488K Python samples after filtering.
  • Science: Synthetic MCQs generated by Qwen2.5 models across physics, biology, chemistry topics. Solutions by DeepSeek-R1, decontaminated against GPQA, MMLU, MMLU-Pro.
  • General: Synthetic prompts via NVIDIA’s pipeline (Nemotron-4-340B-Instruct), responses by DeepSeek-R1 with rejection sampling using Llama-3.1-Nemotron-70B reward model.
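The math pipeline's "filtered by correctness" step can be sketched as follows. Extracting a final \boxed{...} answer is a common convention for R1-style solutions; the exact matching logic NVIDIA used is not given here, so the helper names and regex below are illustrative assumptions:

```python
import re

def extract_boxed(solution: str):
    """Return the last \\boxed{...} answer in a chain-of-thought solution,
    or None if no boxed answer is present. (Assumed convention.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def correctness_filter(generations, gold_answer):
    """Keep only generations whose final answer matches the reference."""
    return [g for g in generations if extract_boxed(g) == gold_answer]
```

With 16 generations per problem from DeepSeek-R1, a filter like this discards incorrect chains before they enter the SFT blend.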

RPO (Preference Optimization)

After SFT, two rounds of offline RPO with on-policy data:

  1. Round 1: Mixture of reasoning and non-reasoning data with appropriate system prompts to improve reasoning control
  2. Round 2: On-policy generations targeting instruction following improvements

Each round: up to 400 steps, learning rate 7e-7, KL penalty β=3e-2, batch size 512. The paper notes RPO “mainly targeted IFEval accuracy improvement” — IFEval jumped from 69.9 (SFT only) to 79.29 after RPO.

Benchmarks

All evaluations at 32K context length, up to 16 completions per prompt, average pass@1. From paper Table 3.
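"Average pass@1" over up to 16 completions means: for each prompt, take the fraction of completions that pass, then average across prompts. A minimal sketch:

```python
def avg_pass_at_1(results):
    """results: one list of booleans per prompt (one entry per completion).
    Returns the mean over prompts of the per-prompt pass fraction."""
    per_prompt = [sum(r) / len(r) for r in results]
    return sum(per_prompt) / len(per_prompt)
```

For example, a prompt passing 1 of 2 completions and another passing 2 of 2 yields an average pass@1 of 0.75.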

Reasoning ON vs comparable 8B-class models

Benchmark       LN-Nano   DeepSeek-R1-Distill-Llama-8B   Llama-3.1-8B-Instruct   DeepSeek-R1-Distill-Qwen-7B
GPQA-Diamond    54.1      49.0                           25.3                    49.1
AIME 2024       61.3      50.4                           10.0                    55.6
AIME 2025-I     47.1      40.0                           10.0                    41.7
MATH-500        95.4      89.1                           50.4                    92.8
BFCL V2 Live    63.9      37.8                           44.3                    39.2
LiveCodeBench   46.6      39.6                           11.8                    37.6
IFEval          79.3      73.4                           81.8                    67.6

Nano leads every benchmark except IFEval, where the base Llama-3.1-8B-Instruct still wins. The BFCL V2 (tool calling) score of 63.9 is a standout — nearly double the distilled models — showing the benefit of dedicated tool-calling data in Stage 3 SFT.

Reasoning ON vs OFF

Benchmark       Reasoning ON   Reasoning OFF
GPQA-Diamond    54.1           39.4
AIME 2024       61.3           3.0
AIME 2025-I     47.1           0.0
MATH-500        95.4           36.6
BFCL V2 Live    63.9           63.6
IFEval          79.3           82.1

Tool calling (BFCL V2) is stable across modes — the toggle doesn’t interfere with function-calling. IFEval is slightly better with reasoning off, consistent with the observation that the model can overthink simple instructions.

Deployment

Hardware

  • FP16: Any GPU with ≥16 GB VRAM (RTX 4090, A100, H100)
  • 8-bit: ~8 GB VRAM (RTX 3070/3080)
  • 4-bit: ~5 GB VRAM
  • Edge: Jetson AGX Thor with JetPack 6.0
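The VRAM tiers above follow from weights-only arithmetic (parameters × bytes per parameter); real usage adds KV cache, activations, and quantization scales, which is why the 4-bit figure is ~5 GB rather than a flat 4:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory estimate; excludes KV cache, activations,
    and quantization overhead such as per-group scales."""
    return params_billion * bits_per_param / 8
```

For an 8B model this gives 16 GB at FP16, 8 GB at 8-bit, and 4 GB at 4-bit before overhead.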

Inference settings

Reasoning ON:

System prompt: "detailed thinking on"
Temperature: 0.6, Top-p: 0.95

Reasoning OFF:

System prompt: "detailed thinking off"
Greedy decoding (temperature 0)

Software

  • NeMo 24.12 — NVIDIA’s training and inference framework
  • TensorRT-LLM — Optimized serving on NVIDIA GPUs
  • vLLM — Compatible via LlamaForCausalLM
  • Transformers — Standard HuggingFace pipeline
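Because the model uses the standard LlamaForCausalLM architecture, it can be served with vLLM's OpenAI-compatible server. A sketch of a launch command (flags vary by vLLM version; consult the vLLM docs):

```shell
# Serve the model at the full 131K context (weights download from HuggingFace)
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-Nemotron-Nano-8B-v1 \
  --max-model-len 131072
```

The reasoning toggle then works through the normal OpenAI chat API by setting the system message to "detailed thinking on" or "detailed thinking off".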

The Llama-Nemotron family

Model      Parameters   Base                                Target Hardware     Key Difference
LN-Nano    8B           Llama 3.1 8B                        Single GPU / edge   SFT + RPO only
LN-Super   49B          Llama 3.3 70B (NAS compressed)      Single H100         Puzzle NAS → 5× throughput over 70B at TP1
LN-Ultra   253B         Llama 3.1 405B (NAS + FFN Fusion)   8×H100 node         Reasoning RL (GRPO, ~140K H100-hours)

Only Ultra receives reasoning RL training. The paper finds distillation is more effective than RL for smaller models. Super uses Puzzle NAS to compress 70B → 49B with 5× throughput improvement. Ultra uses NAS + FFN Fusion to compress 405B → 253B with 1.71× latency improvement.

Languages

English (primary), coding languages, plus German, French, Italian, Portuguese, Hindi, Spanish, and Thai — inherited from Llama 3.1.

References

  • 🤗 HuggingFace huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
  • ⌨️ GitHub github.com/NVIDIA/TensorRT-LLM
  • 📄 Paper 1 arxiv.org/abs/2505.00949
  • 📄 Paper 2 arxiv.org/abs/2502.00203