Llama 3.3 70B Instruct
Released December 6, 2024
About
Llama-3.3-70B-Instruct is Meta’s 70 billion parameter instruction-tuned model, released December 6, 2024 under the Llama 3.3 Community License (commercial use allowed). The headline: this 70B model matches or exceeds the 405B Llama 3.1 on several key benchmarks — IFEval (92.1 vs 88.6), MATH (77.0 vs 73.8), and GPQA Diamond (50.5 vs 49.0). It uses Grouped-Query Attention, supports 128K token context, and was pretrained on 15T+ tokens with a December 2023 knowledge cutoff.
Why it matters
A 70B model outperforming a 405B model on reasoning benchmarks represents a significant shift in the efficiency frontier. The numbers from Meta’s own evaluation:
- MATH (CoT, 0-shot): 77.0 vs Llama 3.1 405B’s 73.8 — the smaller model wins
- IFEval: 92.1 vs 88.6 — +3.5 points on instruction following
- GPQA Diamond: 50.5 vs 49.0 — +1.5 on graduate-level science questions
- MGSM: 91.1 vs 91.6 — within 0.5 points on multilingual math
The improvements over the previous generation (Llama 3.1 70B) are equally dramatic: +9.0 on MATH, +7.9 on HumanEval, +4.6 on IFEval, +4.2 on MGSM. Same parameter count, substantially better post-training.
The practical upshot is roughly 6× cheaper inference (70B vs 405B parameters) for equivalent or better quality on math, reasoning, and instruction following, and the model fits on a single node with 2× A100-80GB. With 666K+ downloads and the Llama 3.3 Community License permitting commercial use, it quickly became one of the most deployed open models in production.
Architecture
| Spec | Value |
|---|---|
| Architecture | Auto-regressive Transformer (decoder-only) |
| Parameters | 70B |
| Context Length | 128K tokens |
| Attention | Grouped-Query Attention (GQA) |
| Tokenizer | BPE (Llama 3 family) |
| Disk Size | 141.1 GB |
| License | Llama 3.3 Community License |
GQA reduces key/value heads while maintaining query heads, cutting memory bandwidth requirements. This makes the 128K context window practical without proportional memory cost growth.
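A back-of-the-envelope sketch makes the savings concrete. The figures below assume the commonly reported Llama 3 70B attention configuration (80 layers, 64 query heads, 8 KV heads, head dimension 128) and a BF16 cache; real deployments will differ in the details, but the ratio is what matters:

```python
# KV-cache size per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

full_attn = kv_cache_gb(128_000, kv_heads=64)  # MHA: one KV head per query head
gqa = kv_cache_gb(128_000)                     # GQA: 8 shared KV heads

print(f"128K-token cache -- MHA: {full_attn:.0f} GB, GQA: {gqa:.0f} GB "
      f"({full_attn / gqa:.0f}x smaller)")
```

With 8 KV heads instead of 64, the full 128K-token cache drops from roughly 336 GB to about 42 GB, which is what makes long contexts feasible on a 2-GPU node.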
Training
| Aspect | Detail |
|---|---|
| Pretraining Data | 15T+ tokens from publicly available sources |
| Knowledge Cutoff | December 2023 |
| Fine-tuning Data | 25M+ synthetically generated examples + public instruction data |
| Alignment | SFT + RLHF (human preferences for helpfulness and safety) |
| Compute | 7.0M GPU-hours on H100-80GB (700W TDP) |
| Emissions (location-based) | 2,040 tCO2eq |
| Emissions (market-based) | 0 tCO2eq (100% renewable energy) |
The 7.0M GPU-hours figure includes both pretraining and fine-tuning. The training pipeline combines supervised fine-tuning on instruction data with reinforcement learning from human feedback. Meta’s safety approach includes borderline and adversarial prompts in training data, with attention to reducing false refusals on benign prompts.
Benchmarks
Meta’s official evaluation
| Benchmark | Llama 3.3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| MMLU (CoT, 0-shot) | 86.0 | 86.0 | 88.6 |
| MMLU-Pro (CoT, 5-shot) | 68.9 | 66.4 | 73.3 |
| IFEval | **92.1** | 87.5 | 88.6 |
| GPQA Diamond (CoT, 0-shot) | **50.5** | 48.0 | 49.0 |
| HumanEval (0-shot) | 88.4 | 80.5 | 89.0 |
| MBPP EvalPlus (0-shot) | 87.6 | 86.0 | 88.6 |
| MATH (CoT, 0-shot) | **77.0** | 68.0 | 73.8 |
| BFCL v2 (tool use) | 77.3 | 77.5 | 81.1 |
| Nexus (function calling) | 49.4 | 38.7 | 58.7 |
| MGSM (multilingual) | 91.1 | 86.9 | 91.6 |

Bold marks the benchmarks where Llama 3.3 70B beats the 405B.
Open LLM Leaderboard
| Benchmark | Score |
|---|---|
| Average | 44.8 |
| IFEval | 90.0 |
| BBH | 56.6 |
| MMLU-Pro | 48.1 |
| MATH | 48.3 |
| MUSR | 15.6 |
| GPQA | 10.5 |
Tool use
Llama 3.3 supports native function calling through the Transformers chat template system with a tool role:
```python
from transformers import AutoTokenizer

def get_current_temperature(location: str) -> float:
    """Get the current temperature at a location."""
    return 22.0

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
messages = [{"role": "user", "content": "What is the temperature in Paris?"}]

# The chat template renders the function's signature and docstring into the prompt
inputs = tokenizer.apply_chat_template(
    messages, tools=[get_current_temperature],
    add_generation_prompt=True,
)
```
Llama 3.3 70B scores 77.3 on BFCL v2, roughly level with Llama 3.1 70B (77.5), while Nexus function calling improved significantly, from 38.7 to 49.4.
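After the model emits a tool call, the result is fed back as a `tool`-role message and generation is re-run so the model can answer with the returned value. A sketch of the conversation structure, following the Transformers chat-template convention (the exact field names here reflect that convention, not anything specific to this document):

```python
# Hypothetical round trip for the get_current_temperature tool above.
messages = [
    {"role": "user", "content": "What's the temperature in Paris?"},
    # The model's emitted tool call, appended back into the history:
    {"role": "assistant", "tool_calls": [{
        "type": "function",
        "function": {"name": "get_current_temperature",
                     "arguments": {"location": "Paris"}},
    }]},
    # The tool's result, returned under the dedicated "tool" role:
    {"role": "tool", "name": "get_current_temperature", "content": "22.0"},
]

roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant', 'tool']
```

Re-applying the chat template to this history and generating again yields the model's final natural-language answer.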
Deployment
Hardware
| Format | Size | Minimum Hardware |
|---|---|---|
| BF16 | 141.1 GB | 2× A100-80GB |
| 8-bit | ~71 GB | A100-80GB |
| 4-bit | ~35 GB | 48 GB VRAM (e.g., 2× RTX 4090 or 1× A6000) |
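The sizes in the table follow directly from parameters × bytes per weight. A quick sanity check (weights only; the KV cache and activations add further overhead on top of these figures):

```python
def weight_gb(params_b=70, bits=16):
    # params_b is in billions; bits per weight -> bytes -> GB
    return params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits=bits):.0f} GB")
```

This gives 140 GB, 70 GB, and 35 GB, matching the table (the BF16 checkpoint's 141.1 GB on disk includes a small amount of non-weight data).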
Serving
```python
import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain grouped-query attention."},
]
outputs = pipeline(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])  # the assistant's reply message
```
Or serve an OpenAI-compatible endpoint with vLLM:

```shell
vllm serve meta-llama/Llama-3.3-70B-Instruct
```
API
Available on OpenRouter ($0.10/M input, $0.32/M output), Together AI, Fireworks, Groq, and Amazon Bedrock.
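For budgeting, cost at the listed OpenRouter rates is a simple linear function of token counts. A sketch (the example workload of 10M input / 2M output tokens is illustrative, not from the source):

```python
IN_RATE, OUT_RATE = 0.10, 0.32  # USD per 1M tokens (OpenRouter rates above)

def cost_usd(input_tokens, output_tokens):
    return input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE

print(f"${cost_usd(10e6, 2e6):.2f}")  # $1.64 for 10M in / 2M out
```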
Safety ecosystem
Meta provides companion safety tools designed for use with Llama 3.3:
- Llama Guard 3 — content-moderation classifier for filtering both prompts and model outputs
- Prompt Guard — prompt injection detection and mitigation
- Code Shield — static analysis to identify insecure patterns in generated code
Languages
Officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
References
- HuggingFace huggingface.co/meta-llama/Llama-3.3-70B-Instruct
- GitHub github.com/meta-llama/llama
- Paper arxiv.org/abs/2407.21783