
Llama 3.1 Nemotron Nano 8B v1

Released March 16, 2025

Context Window 131K tokens
≈ 98 pages of text
212.2K Downloads
218 Likes
16.1 GB Disk Size
13.0K GitHub ★

About

Llama-3.1-Nemotron-Nano-8B-v1 is the smallest model in NVIDIA’s Llama-Nemotron family — a dense 8B parameter reasoning model derived from Meta’s Llama-3.1-8B-Instruct. Released March 2025 under the commercially permissive NVIDIA Open Model License, it supports a 128K token context window and introduces a dynamic reasoning toggle that switches between standard chat and chain-of-thought reasoning via a system prompt. At 16.1 GB on disk, it fits on a single consumer GPU.

Why it matters

Nemotron Nano is one of the first open models to support a dynamic reasoning toggle controlled entirely by the system prompt. Setting "detailed thinking on" activates chain-of-thought reasoning; "detailed thinking off" runs in standard chat mode. The performance difference is dramatic:

  • MATH-500: 36.6% (off) → 95.4% (on) — a 2.6× improvement from the same weights
  • AIME 2024: 3.0% (off) → 61.3% (on)
  • GPQA-Diamond: 39.4% (off) → 54.1% (on)
  • MBPP 0-shot: 66.1% (off) → 84.6% (on)

This means a single 8B model can serve both fast chat responses and deep reasoning — no separate deployment needed.
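The toggle requires no special API, only the system prompt. A minimal sketch of building a request for each mode (the helper function is illustrative; the prompt strings and sampling settings follow the model card's recommendations):

```python
def build_request(user_msg: str, thinking: bool = True):
    """Build a chat request for Nemotron Nano's reasoning toggle.

    The system prompt alone switches modes. Sampling follows the model
    card: temperature 0.6 / top-p 0.95 with reasoning on, greedy otherwise.
    """
    mode = "detailed thinking on" if thinking else "detailed thinking off"
    messages = [
        {"role": "system", "content": mode},
        {"role": "user", "content": user_msg},
    ]
    sampling = {"temperature": 0.6, "top_p": 0.95} if thinking else {"temperature": 0.0}
    return messages, sampling
```

The resulting messages list can be fed to any chat-template-aware runtime, e.g. transformers' `pipeline("text-generation", model="nvidia/Llama-3.1-Nemotron-Nano-8B-v1")`.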

Notably, the paper (arXiv:2505.00949) finds that RL yields suboptimal results for smaller models compared to distillation. Nano therefore relies on SFT distillation from strong teachers (primarily DeepSeek-R1), while only the 253B Ultra model receives large-scale reasoning RL. That distillation alone produces these results at 8B is the headline: with reasoning on, Nano scores 95.4% on MATH-500 versus DeepSeek-R1-Distill-Llama-8B's 89.1%.

Architecture

Spec                 Value
Architecture         Dense decoder-only Transformer (LlamaForCausalLM)
Parameters           8B
Base Model           Meta Llama-3.1-8B-Instruct
Context Length       128K tokens
Evaluation Context   32K tokens
Position Encoding    RoPE
Activation           SwiGLU
Normalization        RMSNorm
Disk Size            16.1 GB
License              NVIDIA Open Model License (commercially permissive)

Unlike the larger siblings (Super 49B and Ultra 253B), Nano does not use NVIDIA’s Puzzle neural architecture search framework. It’s a standard Llama 3.1 architecture — all improvements come from the post-training pipeline.

Training

Nano’s post-training consists of a 3-stage SFT pipeline followed by 2 rounds of offline RPO (Reward-aware Preference Optimization).

SFT Pipeline

All stages use global batch size 256 with sequence packing at effective length 32K tokens.

Stage 1 — Reasoning-only SFT: Fine-tuned exclusively on reasoning data from code, math, and science domains at learning rate 1e-4 for 4 epochs. The paper notes this prevents failure modes like repetitive completions.

Stage 2 — Mixed data: Non-reasoning data introduced alongside reasoning samples. This stage teaches the model to respond to the reasoning toggle — generating chain-of-thought when the system prompt says detailed thinking on and direct answers when it says detailed thinking off.

Stage 3 — Chat and tool calling: A smaller blend focused on chat, instruction following, and tool calling to round out the model’s general capabilities.
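"Sequence packing at effective length 32K" means concatenating short samples into shared 32K-token sequences so batches stay dense. The paper does not specify the packing algorithm, so the following is an illustrative first-fit sketch:

```python
def pack_sequences(lengths, max_len=32768):
    """Greedy first-fit-decreasing packing of sample lengths into bins
    of at most max_len tokens. Returns lists of sample indices per bin.
    (Illustrative; the actual packing algorithm is not specified.)"""
    bins = []  # each bin: [total_length, [sample indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        for b in bins:
            if b[0] + n <= max_len:   # fits in an existing bin
                b[0] += n
                b[1].append(idx)
                break
        else:                          # open a new bin
            bins.append([n, [idx]])
    return [b[1] for b in bins]
```

For example, samples of 20K, 15K, 10K, and 5K tokens pack into two 32K bins instead of four separate sequences.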

Synthetic training data

The Llama-Nemotron family shares a training dataset of 33 million samples (also open-sourced as Llama-Nemotron-Post-Training-Dataset):

Domain                  Samples      % of Total   Reasoning On   Reasoning Off
Math                    22,066,397   66.8%        2,225,427      19,840,970
Code                    10,108,883   30.6%        991,706        9,117,177
Science                 708,920      2.1%         708,920        0
Chat                    39,792       0.12%        8,574          31,218
Instruction Following   56,339       0.17%
Safety                  31,426       0.10%
Total                   33,011,757

How this data was generated:

  • Math: Problems from Art of Problem Solving (AoPS) forums. Solutions generated by DeepSeek-R1 (16 generations per problem) and Qwen2.5-Math-7B-Instruct (64 generations per problem). Filtered by correctness, decontaminated against benchmarks.
  • Code: 28,904 unique competitive programming questions from TACO, APPS, CodeContests, and Codeforces. Solutions from DeepSeek-R1 using nucleus sampling (temp 0.6, top-p 0.95). Yielded ~488K Python samples after filtering.
  • Science: Synthetic MCQs generated by Qwen2.5 models across physics, biology, chemistry topics. Solutions by DeepSeek-R1, decontaminated against GPQA, MMLU, MMLU-Pro.
  • General: Synthetic prompts via NVIDIA’s pipeline (Nemotron-4-340B-Instruct), responses by DeepSeek-R1 with rejection sampling using Llama-3.1-Nemotron-70B reward model.
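The math pipeline's "filtered by correctness" step can be sketched as follows. Extracting a final \boxed{...} answer is a common convention for R1-style solutions; the exact matching logic NVIDIA used is not given here, so the helper names and regex below are illustrative assumptions:

```python
import re

def extract_boxed(solution: str):
    """Return the last \\boxed{...} answer in a chain-of-thought solution,
    or None if no boxed answer is present. (Assumed convention.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def correctness_filter(generations, gold_answer):
    """Keep only generations whose final answer matches the reference."""
    return [g for g in generations if extract_boxed(g) == gold_answer]
```

With 16 generations per problem from DeepSeek-R1, a filter like this discards incorrect chains before they enter the SFT blend.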

RPO (Preference Optimization)

After SFT, two rounds of offline RPO with on-policy data:

  1. Round 1: Mixture of reasoning and non-reasoning data with appropriate system prompts to improve reasoning control
  2. Round 2: On-policy generations targeting instruction following improvements

Each round: up to 400 steps, learning rate 7e-7, KL penalty β=3e-2, batch size 512. The paper notes RPO “mainly targeted IFEval accuracy improvement” — IFEval jumped from 69.9 (SFT only) to 79.29 after RPO.

Benchmarks

All evaluations at 32K context length, up to 16 completions per prompt, average pass@1. From paper Table 3.
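"Average pass@1" over up to 16 completions means: for each prompt, take the fraction of completions that pass, then average across prompts. A minimal sketch:

```python
def avg_pass_at_1(results):
    """results: one list of booleans per prompt (one entry per completion).
    Returns the mean over prompts of the per-prompt pass fraction."""
    per_prompt = [sum(r) / len(r) for r in results]
    return sum(per_prompt) / len(per_prompt)
```

For example, a prompt passing 1 of 2 completions and another passing 2 of 2 yields an average pass@1 of 0.75.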

Reasoning ON vs comparable 8B-class models

Benchmark       LN-Nano   DeepSeek-R1-Distill-Llama-8B   Llama-3.1-8B-Instruct   DeepSeek-R1-Distill-Qwen-7B
GPQA-Diamond    54.1      49.0                           25.3                    49.1
AIME 2024       61.3      50.4                           10.0                    55.6
AIME 2025-I     47.1      40.0                           10.0                    41.7
MATH-500        95.4      89.1                           50.4                    92.8
BFCL V2 Live    63.9      37.8                           44.3                    39.2
LiveCodeBench   46.6      39.6                           11.8                    37.6
IFEval          79.3      73.4                           81.8                    67.6

Nano leads every benchmark except IFEval, where the base Llama-3.1-8B-Instruct still wins. The BFCL V2 (tool calling) score of 63.9 is a standout — nearly double the distilled models — showing the benefit of dedicated tool-calling data in Stage 3 SFT.

Reasoning ON vs OFF

Benchmark       Reasoning ON   Reasoning OFF
GPQA-Diamond    54.1           39.4
AIME 2024       61.3           3.0
AIME 2025-I     47.1           0.0
MATH-500        95.4           36.6
BFCL V2 Live    63.9           63.6
IFEval          79.3           82.1

Tool calling (BFCL V2) is stable across modes — the toggle doesn’t interfere with function-calling. IFEval is slightly better with reasoning off, consistent with the observation that the model can overthink simple instructions.

Deployment

Hardware

  • FP16: Any GPU with ≥16 GB VRAM (RTX 4090, A100, H100)
  • 8-bit: ~8 GB VRAM (RTX 3070/3080)
  • 4-bit: ~5 GB VRAM
  • Edge: Jetson AGX Thor with JetPack 6.0
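The VRAM tiers above follow from weights-only arithmetic (parameters × bytes per parameter); real usage adds KV cache, activations, and quantization scales, which is why the 4-bit figure is ~5 GB rather than a flat 4:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory estimate; excludes KV cache, activations,
    and quantization overhead such as per-group scales."""
    return params_billion * bits_per_param / 8
```

For an 8B model this gives 16 GB at FP16, 8 GB at 8-bit, and 4 GB at 4-bit before overhead.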

Inference settings

Reasoning ON:

System prompt: "detailed thinking on"
Temperature: 0.6, Top-p: 0.95

Reasoning OFF:

System prompt: "detailed thinking off"
Greedy decoding (temperature 0)

Software

  • NeMo 24.12 — NVIDIA’s training and inference framework
  • TensorRT-LLM — Optimized serving on NVIDIA GPUs
  • vLLM — Compatible via LlamaForCausalLM
  • Transformers — Standard HuggingFace pipeline
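Because the model uses the standard LlamaForCausalLM architecture, it can be served with vLLM's OpenAI-compatible server. A sketch of a launch command (flags vary by vLLM version; consult the vLLM docs):

```shell
# Serve the model at the full 131K context (weights download from HuggingFace)
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-Nemotron-Nano-8B-v1 \
  --max-model-len 131072
```

The reasoning toggle then works through the normal OpenAI chat API by setting the system message to "detailed thinking on" or "detailed thinking off".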

The Llama-Nemotron family

Model      Parameters   Base                                Target Hardware     Key Difference
LN-Nano    8B           Llama 3.1 8B                        Single GPU / edge   SFT + RPO only
LN-Super   49B          Llama 3.3 70B (NAS compressed)      Single H100         Puzzle NAS → 5× throughput over 70B at TP1
LN-Ultra   253B         Llama 3.1 405B (NAS + FFN Fusion)   8×H100 node         Reasoning RL (GRPO, ~140K H100-hours)

Only Ultra receives reasoning RL training. The paper finds distillation is more effective than RL for smaller models. Super uses Puzzle NAS to compress 70B → 49B with 5× throughput improvement. Ultra uses NAS + FFN Fusion to compress 405B → 253B with 1.71× latency improvement.

Languages

English (primary), coding languages, plus German, French, Italian, Portuguese, Hindi, Spanish, and Thai — inherited from Llama 3.1.

References

  • 🤗 HuggingFace huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
  • ⌨️ GitHub github.com/NVIDIA/TensorRT-LLM
  • 📄 Paper 1 arxiv.org/abs/2505.00949
  • 📄 Paper 2 arxiv.org/abs/2502.00203