
Llama 3.3 70B Instruct

Released December 6, 2024

Context Window 131K tokens
≈ 98 pages of text
666.7K Downloads
2.7K Likes
141.1 GB Disk Size
59.2K GitHub ★

Pricing

Input $0.10 per million tokens
Output $0.32 per million tokens

Benchmarks

Open LLM Leaderboard v2

IFEval 90.0
BBH 56.6
MATH Lvl 5 48.3
GPQA 10.5
MUSR 15.6
MMLU-PRO 48.1

Average: 44.80

About

Llama-3.3-70B-Instruct is Meta’s 70 billion parameter instruction-tuned model, released December 6, 2024 under the Llama 3.3 Community License (commercial use allowed). The headline: this 70B model matches or exceeds the 405B Llama 3.1 on several key benchmarks — IFEval (92.1 vs 88.6), MATH (77.0 vs 73.8), and GPQA Diamond (50.5 vs 49.0). It uses Grouped-Query Attention, supports 128K token context, and was pretrained on 15T+ tokens with a December 2023 knowledge cutoff.

Why it matters

A 70B model outperforming a 405B model on reasoning benchmarks represents a significant shift in the efficiency frontier. The numbers from Meta’s own evaluation:

  • MATH (CoT, 0-shot): 77.0 vs Llama 3.1 405B’s 73.8 — the smaller model wins
  • IFEval: 92.1 vs 88.6 — +3.5 points on instruction following
  • GPQA Diamond: 50.5 vs 49.0 — +1.5 on graduate-level science questions
  • MGSM: 91.1 vs 91.6 — within 0.5 points on multilingual math

The improvements over the previous generation (Llama 3.1 70B) are equally dramatic: +9.0 on MATH, +7.9 on HumanEval, +4.6 on IFEval, +4.2 on MGSM. Same parameter count, substantially better post-training.

This means 6× cheaper inference (70B vs 405B) for equivalent or better quality on math, reasoning, and instruction following. Single-node deployable on 2× A100-80GB. With 666K+ downloads and the Llama 3.3 Community License permitting commercial use, it quickly became one of the most deployed open models in production.

Architecture

Spec            Value
Architecture    Auto-regressive Transformer (decoder-only)
Parameters      70B
Context Length  128K tokens
Attention       Grouped-Query Attention (GQA)
Tokenizer       BPE (Llama 3 family)
Disk Size       141.1 GB
License         Llama 3.3 Community License

GQA reduces key/value heads while maintaining query heads, cutting memory bandwidth requirements. This makes the 128K context window practical without proportional memory cost growth.
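The saving is easy to estimate from the KV-cache formula. A back-of-envelope sketch, assuming the published Llama 3 70B shape (80 layers, 64 query heads, 8 KV heads, head dimension 128 — treat these as assumptions here):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size in GB for one sequence: K and V, all layers, bf16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Assumed Llama 3 70B config: 80 layers, 8 KV heads (vs 64 query heads), head_dim 128
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=131_072)
mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=131_072)
print(f"GQA: {gqa:.0f} GB vs full MHA: {mha:.0f} GB per 128K-token sequence")
```

With 8 KV heads instead of 64, the cache for a full 128K-token sequence drops roughly 8×, from ~344 GB to ~43 GB, which is what makes long-context serving on a small node plausible at all.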

Training

Aspect                      Detail
Pretraining Data            15T+ tokens from publicly available sources
Knowledge Cutoff            December 2023
Fine-tuning Data            25M+ synthetically generated examples + public instruction data
Alignment                   SFT + RLHF (human preferences for helpfulness and safety)
Compute                     7.0M GPU-hours on H100-80GB (700W TDP)
Emissions (location-based)  2,040 tCO2eq
Emissions (market-based)    0 tCO2eq (100% renewable energy)

The 7.0M GPU-hours figure includes both pretraining and fine-tuning. The training pipeline combines supervised fine-tuning on instruction data with reinforcement learning from human feedback. Meta’s safety approach includes borderline and adversarial prompts in training data, with attention to reducing false refusals on benign prompts.
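The energy and emissions figures are internally consistent, which a quick sanity check shows (TDP-based wattage is an upper bound on actual draw, so the implied grid intensity is a rough estimate):

```python
gpu_hours = 7.0e6       # total GPU-hours, pretraining + fine-tuning
tdp_kw = 0.7            # 700 W TDP per H100-80GB

energy_gwh = gpu_hours * tdp_kw / 1e6                     # kWh -> GWh
intensity = 2_040 * 1000 / (gpu_hours * tdp_kw)           # kg CO2eq per kWh
print(f"~{energy_gwh:.1f} GWh; implied grid intensity ~{intensity:.2f} kg CO2eq/kWh")
```

That is about 4.9 GWh of compute energy, and the 2,040 tCO2eq location-based figure implies a grid intensity of roughly 0.42 kg CO2eq/kWh.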

Benchmarks

Meta’s official evaluation

Benchmark                   Llama 3.3 70B   Llama 3.1 70B   Llama 3.1 405B
MMLU (CoT, 0-shot)          86.0            86.0            88.6
MMLU-Pro (CoT, 5-shot)      68.9            66.4            73.3
IFEval                      92.1            87.5            88.6
GPQA Diamond (CoT, 0-shot)  50.5            48.0            49.0
HumanEval (0-shot)          88.4            80.5            89.0
MBPP EvalPlus (0-shot)      87.6            86.0            88.6
MATH (CoT, 0-shot)          77.0            68.0            73.8
BFCL v2 (tool use)          77.3            77.5            81.1
Nexus (function calling)    49.4            38.7            58.7
MGSM (multilingual)         91.1            86.9            91.6

Llama 3.3 70B beats the 405B on IFEval, GPQA Diamond, and MATH.

Open LLM Leaderboard

Benchmark   Score
Average     44.8
IFEval      90.0
BBH         56.6
MMLU-Pro    48.1
MATH        48.3
MUSR        15.6
GPQA        10.5

Tool use

Llama 3.3 supports native function calling through the Transformers chat template system with a tool role:

from transformers import AutoTokenizer

def get_current_temperature(location: str) -> float:
    """Get the current temperature at a location."""
    return 22.0

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
messages = [{"role": "user", "content": "What's the temperature in Paris?"}]

# The tool's signature and docstring are converted to a JSON schema the model sees.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
)

Llama 3.3 scores 77.3 on BFCL v2; Nexus function calling improved significantly, from 38.7 (Llama 3.1 70B) to 49.4.
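After the model emits a tool call, the result goes back in through the tool role. A minimal sketch of the message structure, following the Transformers tool-use message schema (the call arguments and result value here are illustrative):

```python
messages = [
    {"role": "user", "content": "What's the temperature in Paris right now?"},
    # 1. The model emitted a tool call; append it as an assistant turn.
    {"role": "assistant", "tool_calls": [{"type": "function", "function": {
        "name": "get_current_temperature",
        "arguments": {"location": "Paris, France"},
    }}]},
    # 2. Run the function yourself, then feed the result back via the tool role.
    {"role": "tool", "name": "get_current_temperature", "content": "22.0"},
]
# A second apply_chat_template + generate pass then produces the final answer.
```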

Deployment

Hardware

Format   Size       Minimum Hardware
BF16     141.1 GB   2× A100-80GB
8-bit    ~71 GB     A100-80GB
4-bit    ~35 GB     ~48 GB VRAM (e.g., A6000, or 2× RTX 4090)

Serving

import transformers, torch

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain grouped-query attention."},
]
outputs = pipeline(messages, max_new_tokens=256)

Alternatively, serve the model with vLLM:

vllm serve meta-llama/Llama-3.3-70B-Instruct

API

Available on OpenRouter ($0.10/M input, $0.32/M output), Together AI, Fireworks, Groq, and Amazon Bedrock.
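At those rates, per-request cost is straightforward to estimate (the token counts below are made-up examples, and other providers price differently):

```python
INPUT_PER_M, OUTPUT_PER_M = 0.10, 0.32  # USD per million tokens (OpenRouter rates)

def request_cost(input_tokens, output_tokens):
    """Cost in USD for a single request at the listed per-million-token rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g., a 2,000-token prompt with a 500-token completion
print(f"${request_cost(2_000, 500):.6f}")
```

A request of that shape costs a few hundredths of a cent, which is the practical upshot of the 70B-vs-405B efficiency argument above.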

Safety ecosystem

Meta provides companion safety tools designed for use with Llama 3.3:

  • Llama Guard 3 — content filter for blocking disallowed outputs
  • Prompt Guard — prompt injection detection and mitigation
  • Code Shield — static analysis to identify insecure patterns in generated code

Languages

Officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

References

  • 🤗 HuggingFace huggingface.co/meta-llama/Llama-3.3-70B-Instruct
  • ⌨️ GitHub github.com/meta-llama/llama
  • 📄 Paper arxiv.org/abs/2407.21783