Llama 3.3 70B Instruct
Released December 6, 2024
About
Llama-3.3-70B-Instruct is Meta’s 70 billion parameter instruction-tuned model, released December 6, 2024 under the Llama 3.3 Community License (commercial use allowed). The headline: this 70B model matches or exceeds the 405B Llama 3.1 on several key benchmarks — IFEval (92.1 vs 88.6), MATH (77.0 vs 73.8), and GPQA Diamond (50.5 vs 49.0). It uses Grouped-Query Attention, supports 128K token context, and was pretrained on 15T+ tokens with a December 2023 knowledge cutoff.
Why it matters
A 70B model outperforming a 405B model on reasoning benchmarks represents a significant shift in the efficiency frontier. The numbers from Meta’s own evaluation:
- MATH (CoT, 0-shot): 77.0 vs Llama 3.1 405B’s 73.8 — the smaller model wins
- IFEval: 92.1 vs 88.6 — +3.5 points on instruction following
- GPQA Diamond: 50.5 vs 49.0 — +1.5 on graduate-level science questions
- MGSM: 91.1 vs 91.6 — within 0.5 points on multilingual math
The improvements over the previous generation (Llama 3.1 70B) are equally dramatic: +9.0 on MATH, +7.9 on HumanEval, +4.6 on IFEval, +4.2 on MGSM. Same parameter count, substantially better post-training.
The practical upshot is roughly 6× cheaper inference (70B vs 405B parameters) for equivalent or better quality on math, reasoning, and instruction following, and the model fits on a single node with 2× A100-80GB. With 666K+ downloads and the Llama 3.3 Community License permitting commercial use, it quickly became one of the most deployed open models in production.
Architecture
| Spec | Value |
|---|---|
| Architecture | Auto-regressive Transformer (decoder-only) |
| Parameters | 70B |
| Context Length | 128K tokens |
| Attention | Grouped-Query Attention (GQA) |
| Tokenizer | BPE (Llama 3 family) |
| Disk Size | 141.1 GB |
| License | Llama 3.3 Community License |
GQA reduces key/value heads while maintaining query heads, cutting memory bandwidth requirements. This makes the 128K context window practical without proportional memory cost growth.
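A back-of-the-envelope sketch makes the savings concrete. The figures below assume the commonly reported Llama 3 70B attention configuration (80 layers, 64 query heads, 8 KV heads, head dimension 128) and a BF16 cache; real deployments will differ in the details, but the ratio is what matters:

```python
# KV-cache size per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

full_attn = kv_cache_gb(128_000, kv_heads=64)  # MHA: one KV head per query head
gqa = kv_cache_gb(128_000)                     # GQA: 8 shared KV heads

print(f"128K-token cache -- MHA: {full_attn:.0f} GB, GQA: {gqa:.0f} GB "
      f"({full_attn / gqa:.0f}x smaller)")
```

With 8 KV heads instead of 64, the full 128K-token cache drops from roughly 336 GB to about 42 GB, which is what makes long contexts feasible on a 2-GPU node.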
Training
| Aspect | Detail |
|---|---|
| Pretraining Data | 15T+ tokens from publicly available sources |
| Knowledge Cutoff | December 2023 |
| Fine-tuning Data | 25M+ synthetically generated examples + public instruction data |
| Alignment | SFT + RLHF (human preferences for helpfulness and safety) |
| Compute | 7.0M GPU-hours on H100-80GB (700W TDP) |
| Emissions (location-based) | 2,040 tCO2eq |
| Emissions (market-based) | 0 tCO2eq (100% renewable energy) |
The 7.0M GPU-hours figure includes both pretraining and fine-tuning. The training pipeline combines supervised fine-tuning on instruction data with reinforcement learning from human feedback. Meta’s safety approach includes borderline and adversarial prompts in training data, with attention to reducing false refusals on benign prompts.
Benchmarks
Meta’s official evaluation
| Benchmark | Llama 3.3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| MMLU (CoT, 0-shot) | 86.0 | 86.0 | 88.6 |
| MMLU-Pro (CoT, 5-shot) | 68.9 | 66.4 | 73.3 |
| IFEval | **92.1** | 87.5 | 88.6 |
| GPQA Diamond (CoT, 0-shot) | **50.5** | 48.0 | 49.0 |
| HumanEval (0-shot) | 88.4 | 80.5 | 89.0 |
| MBPP EvalPlus (0-shot) | 87.6 | 86.0 | 88.6 |
| MATH (CoT, 0-shot) | **77.0** | 68.0 | 73.8 |
| BFCL v2 (tool use) | 77.3 | 77.5 | 81.1 |
| Nexus (function calling) | 49.4 | 38.7 | 58.7 |
| MGSM (multilingual) | 91.1 | 86.9 | 91.6 |

Bold marks the benchmarks where Llama 3.3 70B beats the 405B.
Open LLM Leaderboard
| Benchmark | Score |
|---|---|
| Average | 44.8 |
| IFEval | 90.0 |
| BBH | 56.6 |
| MMLU-Pro | 48.1 |
| MATH | 48.3 |
| MUSR | 15.6 |
| GPQA | 10.5 |
Tool use
Llama 3.3 supports native function calling through the Transformers chat template system with a tool role:
```python
from transformers import AutoTokenizer

def get_current_temperature(location: str) -> float:
    """Get the current temperature at a location."""
    return 22.0

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
messages = [{"role": "user", "content": "What is the temperature in Paris?"}]

# The chat template renders the function's signature and docstring into the prompt
inputs = tokenizer.apply_chat_template(
    messages, tools=[get_current_temperature],
    add_generation_prompt=True,
)
```
Llama 3.3 70B scores 77.3 on BFCL v2, roughly level with Llama 3.1 70B (77.5), while Nexus function calling improved significantly, from 38.7 to 49.4.
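After the model emits a tool call, the result is fed back as a `tool`-role message and generation is re-run so the model can answer with the returned value. A sketch of the conversation structure, following the Transformers chat-template convention (the exact field names here reflect that convention, not anything specific to this document):

```python
# Hypothetical round trip for the get_current_temperature tool above.
messages = [
    {"role": "user", "content": "What's the temperature in Paris?"},
    # The model's emitted tool call, appended back into the history:
    {"role": "assistant", "tool_calls": [{
        "type": "function",
        "function": {"name": "get_current_temperature",
                     "arguments": {"location": "Paris"}},
    }]},
    # The tool's result, returned under the dedicated "tool" role:
    {"role": "tool", "name": "get_current_temperature", "content": "22.0"},
]

roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant', 'tool']
```

Re-applying the chat template to this history and generating again yields the model's final natural-language answer.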
Deployment
Hardware
| Format | Size | Minimum Hardware |
|---|---|---|
| BF16 | 141.1 GB | 2× A100-80GB |
| 8-bit | ~71 GB | A100-80GB |
| 4-bit | ~35 GB | 48 GB VRAM (e.g., 2× RTX 4090 or 1× A6000) |
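The sizes in the table follow directly from parameters × bytes per weight. A quick sanity check (weights only; the KV cache and activations add further overhead on top of these figures):

```python
def weight_gb(params_b=70, bits=16):
    # params_b is in billions; bits per weight -> bytes -> GB
    return params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits=bits):.0f} GB")
```

This gives 140 GB, 70 GB, and 35 GB, matching the table (the BF16 checkpoint's 141.1 GB on disk includes a small amount of non-weight data).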
Serving
```python
import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain grouped-query attention."},
]
outputs = pipeline(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])  # the assistant's reply message
```
Or serve an OpenAI-compatible endpoint with vLLM:

```shell
vllm serve meta-llama/Llama-3.3-70B-Instruct
```
API
Available on OpenRouter ($0.10/M input, $0.32/M output), Together AI, Fireworks, Groq, and Amazon Bedrock.
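For budgeting, cost at the listed OpenRouter rates is a simple linear function of token counts. A sketch (the example workload of 10M input / 2M output tokens is illustrative, not from the source):

```python
IN_RATE, OUT_RATE = 0.10, 0.32  # USD per 1M tokens (OpenRouter rates above)

def cost_usd(input_tokens, output_tokens):
    return input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE

print(f"${cost_usd(10e6, 2e6):.2f}")  # $1.64 for 10M in / 2M out
```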
Safety ecosystem
Meta provides companion safety tools designed for use with Llama 3.3:
- Llama Guard 3 — content-moderation classifier for filtering both prompts and model outputs
- Prompt Guard — prompt injection detection and mitigation
- Code Shield — static analysis to identify insecure patterns in generated code
Languages
Officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
References
- HuggingFace huggingface.co/meta-llama/Llama-3.3-70B-Instruct
- GitHub github.com/meta-llama/llama
- Paper arxiv.org/abs/2407.21783