Llama 3.1 Nemotron Nano 8B v1
Released March 16, 2025
About
Llama-3.1-Nemotron-Nano-8B-v1 is the smallest model in NVIDIA’s Llama-Nemotron family — a dense 8B parameter reasoning model derived from Meta’s Llama-3.1-8B-Instruct. Released March 2025 under the commercially permissive NVIDIA Open Model License, it supports a 128K token context window and introduces a dynamic reasoning toggle that switches between standard chat and chain-of-thought reasoning via a system prompt. At 16.1 GB on disk, it fits on a single consumer GPU.
Why it matters
Nemotron Nano is one of the first open models to support a dynamic reasoning toggle controlled entirely by the system prompt. Setting `detailed thinking on` activates chain-of-thought reasoning; `detailed thinking off` runs standard chat mode. The performance difference is dramatic:
- MATH-500: 36.6% (off) → 95.4% (on) — a 2.6× improvement from the same weights
- AIME 2024: 3.0% (off) → 61.3% (on)
- GPQA-Diamond: 39.4% (off) → 54.1% (on)
- MBPP 0-shot: 66.1% (off) → 84.6% (on)
This means a single 8B model can serve both fast chat responses and deep reasoning — no separate deployment needed.
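Concretely, the toggle is nothing more than a system message. A minimal sketch using the standard HuggingFace chat-message format (the generation call is shown as a comment, since it requires downloading the 16 GB checkpoint):

```python
# Minimal sketch: the reasoning toggle is just a system prompt.
# Messages follow the standard HF chat schema.
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    mode = "on" if reasoning else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]

# Usage with transformers (weights download required):
# from transformers import pipeline
# pipe = pipeline("text-generation", model="nvidia/Llama-3.1-Nemotron-Nano-8B-v1")
# out = pipe(build_messages("Prove that sqrt(2) is irrational.", reasoning=True))
```

The same deployed model serves both modes; only the first message changes per request.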
Notably, the paper (arXiv:2505.00949) finds that RL yields suboptimal results for smaller models compared to distillation. Nano relies on SFT distillation from strong teachers (primarily DeepSeek-R1), while only the 253B Ultra model gets large-scale reasoning RL. The fact that distillation alone produces these results at 8B is the story — Nano with reasoning on scores 95.4% on MATH-500 compared to DeepSeek-R1-Distill-Llama-8B’s 89.1%.
Architecture
| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer (LlamaForCausalLM) |
| Parameters | 8B |
| Base Model | Meta Llama-3.1-8B-Instruct |
| Context Length | 128K tokens |
| Evaluation Context | 32K tokens |
| Position Encoding | RoPE |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Disk Size | 16.1 GB |
| License | NVIDIA Open Model License (commercially permissive) |
Unlike the larger siblings (Super 49B and Ultra 253B), Nano does not use NVIDIA’s Puzzle neural architecture search framework. It’s a standard Llama 3.1 architecture — all improvements come from the post-training pipeline.
Training
Nano’s post-training consists of a 3-stage SFT pipeline followed by 2 rounds of offline RPO (Reward-aware Preference Optimization).
SFT Pipeline
All stages use global batch size 256 with sequence packing at effective length 32K tokens.
Stage 1 — Reasoning-only SFT: Fine-tuned exclusively on reasoning data from code, math, and science domains at learning rate 1e-4 for 4 epochs. The paper notes this prevents failure modes like repetitive completions.
Stage 2 — Mixed data: Non-reasoning data introduced alongside reasoning samples. This stage teaches the model to respond to the reasoning toggle — generating chain-of-thought when the system prompt says `detailed thinking on` and direct answers when it says `detailed thinking off`.
Stage 3 — Chat and tool calling: A smaller blend focused on chat, instruction following, and tool calling to round out the model’s general capabilities.
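Sequence packing (mentioned above at an effective length of 32K tokens) concatenates short samples into fixed-size bins so batch capacity isn't wasted on padding. The paper does not publish its packer, so this greedy first-fit sketch is illustrative only:

```python
# Greedy first-fit sequence packing: group sample indices into bins whose
# total token count stays within max_len, minimizing padding waste.
def pack_sequences(lengths: list[int], max_len: int = 32768) -> list[list[int]]:
    bins: list[list[int]] = []
    bin_totals: list[int] = []
    for i, n in enumerate(lengths):
        for b, total in enumerate(bin_totals):
            if total + n <= max_len:   # fits in an existing bin
                bins[b].append(i)
                bin_totals[b] += n
                break
        else:                          # no bin fits: open a new one
            bins.append([i])
            bin_totals.append(n)
    return bins
```

Each bin is then trained as one packed sequence of up to 32K tokens, with attention masked so samples don't attend across boundaries.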
Synthetic training data
The Llama-Nemotron family shares a training dataset of 33 million samples (also open-sourced as Llama-Nemotron-Post-Training-Dataset):
| Domain | Samples | % of Total | Reasoning On | Reasoning Off |
|---|---|---|---|---|
| Math | 22,066,397 | 66.8% | 2,225,427 | 19,840,970 |
| Code | 10,108,883 | 30.6% | 991,706 | 9,117,177 |
| Science | 708,920 | 2.1% | 708,920 | 0 |
| Chat | 39,792 | 0.12% | 8,574 | 31,218 |
| Instruction Following | 56,339 | 0.17% | — | — |
| Safety | 31,426 | 0.10% | — | — |
| Total | 33,011,757 | 100% | — | — |
How this data was generated:
- Math: Problems from Art of Problem Solving (AoPS) forums. Solutions generated by DeepSeek-R1 (16 generations per problem) and Qwen2.5-Math-7B-Instruct (64 generations per problem). Filtered by correctness, decontaminated against benchmarks.
- Code: 28,904 unique competitive programming questions from TACO, APPS, CodeContests, and Codeforces. Solutions from DeepSeek-R1 using nucleus sampling (temp 0.6, top-p 0.95). Yielded ~488K Python samples after filtering.
- Science: Synthetic MCQs generated by Qwen2.5 models across physics, biology, chemistry topics. Solutions by DeepSeek-R1, decontaminated against GPQA, MMLU, MMLU-Pro.
- General: Synthetic prompts via NVIDIA’s pipeline (Nemotron-4-340B-Instruct), responses by DeepSeek-R1 with rejection sampling using Llama-3.1-Nemotron-70B reward model.
RPO (Preference Optimization)
After SFT, two rounds of offline RPO with on-policy data:
- Round 1: Mixture of reasoning and non-reasoning data with appropriate system prompts to improve reasoning control
- Round 2: On-policy generations targeting instruction following improvements
Each round: up to 400 steps, learning rate 7e-7, KL penalty β=3e-2, batch size 512. The paper notes RPO "mainly targeted IFEval accuracy improvement"; IFEval rose from 69.9 (SFT only) to 79.3 after RPO.
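RPO extends DPO-style preference optimization by making the loss aware of the *magnitude* of the reward gap, not just which response wins. The following is a simplified single-pair formulation for illustration (a squared-error distance between implicit and true reward gaps; the paper's exact distance function may differ):

```python
# Illustrative RPO-style loss for one preference pair (y_w preferred over y_l).
# logp_* are sequence log-probs under the policy; ref_logp_* under the frozen
# reference model; reward_* come from a reward model.
def rpo_loss_sketch(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    reward_w, reward_l, beta=0.03, eta=1.0):
    # DPO-style implicit reward gap, scaled by the KL penalty beta.
    implicit_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Unlike DPO, the regression target carries the size of the true gap.
    target_gap = eta * (reward_w - reward_l)
    return (implicit_gap - target_gap) ** 2
```

The `beta=0.03` default mirrors the KL penalty β=3e-2 quoted above; `eta` is a scaling hyperparameter assumed here for illustration.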
Benchmarks
All evaluations at 32K context length, up to 16 completions per prompt, average pass@1. From paper Table 3.
Reasoning ON vs comparable 8B-class models
| Benchmark | LN-Nano | DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Qwen-7B |
|---|---|---|---|---|
| GPQA-Diamond | 54.1 | 49.0 | 25.3 | 49.1 |
| AIME 2024 | 61.3 | 50.4 | 10.0 | 55.6 |
| AIME 2025-I | 47.1 | 40.0 | 10.0 | 41.7 |
| MATH-500 | 95.4 | 89.1 | 50.4 | 92.8 |
| BFCL V2 Live | 63.9 | 37.8 | 44.3 | 39.2 |
| LiveCodeBench | 46.6 | 39.6 | 11.8 | 37.6 |
| IFEval | 79.3 | 73.4 | 81.8 | 67.6 |
Nano leads every benchmark except IFEval, where the base Llama-3.1-8B-Instruct still wins. The BFCL V2 Live (tool calling) score of 63.9 stands out at roughly 1.7× the distilled models, showing the benefit of dedicated tool-calling data in Stage 3 SFT.
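The evaluation protocol above (up to 16 completions per prompt, average pass@1) reduces per prompt to the fraction of sampled completions that are correct, averaged across prompts. A sketch:

```python
# Average pass@1: results[i] holds per-completion correctness flags
# (up to 16 samples) for prompt i.
def avg_pass_at_1(results: list[list[bool]]) -> float:
    per_prompt = [sum(r) / len(r) for r in results]  # fraction correct per prompt
    return sum(per_prompt) / len(per_prompt)          # mean over prompts
```

Sampling multiple completions and averaging reduces variance compared with a single greedy run, which matters for small benchmarks like AIME (30 problems).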
Reasoning ON vs OFF
| Benchmark | Reasoning ON | Reasoning OFF |
|---|---|---|
| GPQA-Diamond | 54.1 | 39.4 |
| AIME 2024 | 61.3 | 3.0 |
| AIME 2025-I | 47.1 | 0.0 |
| MATH-500 | 95.4 | 36.6 |
| BFCL V2 Live | 63.9 | 63.6 |
| IFEval | 79.3 | 82.1 |
Tool calling (BFCL V2) is stable across modes — the toggle doesn’t interfere with function-calling. IFEval is slightly better with reasoning off, consistent with the observation that the model can overthink simple instructions.
Deployment
Hardware
- FP16: Any GPU with ≥16 GB VRAM (RTX 4090, A100, H100)
- 8-bit: ~8 GB VRAM (RTX 3070/3080)
- 4-bit: ~5 GB VRAM
- Edge: Jetson AGX Thor with JetPack 6.0
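The VRAM figures above follow directly from parameter count times bytes per weight; real usage adds overhead for activations and a KV cache that grows with context length. A rough weights-only estimate:

```python
# Weights-only VRAM estimate: parameters x bytes per parameter.
def weight_vram_gb(n_params_b: float = 8.0, bits_per_weight: int = 16) -> float:
    return n_params_b * 1e9 * (bits_per_weight / 8) / 1e9

# FP16: 8e9 params x 2 bytes   = 16 GB  -> needs a >=16 GB card
# 8-bit: 8e9 params x 1 byte   =  8 GB
# 4-bit: 8e9 params x 0.5 byte =  4 GB  (~5 GB with quantization overhead)
```

The ~1 GB gap between the 4-bit estimate and the ~5 GB figure quoted above covers quantization scales, activations, and cache.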
Inference settings
Reasoning ON:
- System prompt: `detailed thinking on`
- Temperature 0.6, top-p 0.95

Reasoning OFF:
- System prompt: `detailed thinking off`
- Greedy decoding (temperature 0)
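The recommended settings can be folded into one helper that pairs each mode with its sampling parameters. Parameter names follow the HuggingFace `generate` API:

```python
# Map reasoning mode to the model card's recommended decoding parameters.
def generation_config(reasoning: bool) -> dict:
    if reasoning:
        # Nucleus sampling for chain-of-thought generation.
        return {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    # Greedy decoding (temperature effectively 0) for standard chat.
    return {"do_sample": False}
```

These dicts can be passed directly as keyword arguments to `model.generate(**generation_config(True), ...)`.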
Software
- NeMo 24.12 — NVIDIA’s training and inference framework
- TensorRT-LLM — Optimized serving on NVIDIA GPUs
- vLLM — Compatible via LlamaForCausalLM
- Transformers — Standard HuggingFace pipeline
The Llama-Nemotron family
| Model | Parameters | Base | Target Hardware | Key Difference |
|---|---|---|---|---|
| LN-Nano | 8B | Llama 3.1 8B | Single GPU / edge | SFT + RPO only |
| LN-Super | 49B | Llama 3.3 70B (NAS compressed) | Single H100 | Puzzle NAS → 5× throughput over 70B at TP1 |
| LN-Ultra | 253B | Llama 3.1 405B (NAS + FFN Fusion) | 8×H100 node | Reasoning RL (GRPO, ~140K H100-hours) |
Only Ultra receives reasoning RL training. The paper finds distillation is more effective than RL for smaller models. Super uses Puzzle NAS to compress 70B → 49B with 5× throughput improvement. Ultra uses NAS + FFN Fusion to compress 405B → 253B with 1.71× latency improvement.
Languages
English (primary), coding languages, plus German, French, Italian, Portuguese, Hindi, Spanish, and Thai — inherited from Llama 3.1.
References
- Model card: huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
- TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM
- Paper 1 (Llama-Nemotron): arxiv.org/abs/2505.00949
- Paper 2: arxiv.org/abs/2502.00203